COS 333 Assignment 1: Regular Expression Matching

Due midnight, Friday, February 12

Fri Feb 5 15:05:11 EST 2010

Updated 2/5 to fix last example test case.

In ancient Unix times, filename regular expression matching ("wildcards") was done not in the shell itself, but in a separate program called glob. This lives on in various nooks and crannies -- csh and tcsh have a shell built-in called glob that does this operation -- and a variety of programming languages have functions or operators that match regular expressions of this class (for instance, Like in Visual Basic and SQL, each with its own syntax).

This assignment has two parts. The first is to modify the regular expression code in Section 9.2 of The Practice of Programming to process regular expressions of this class. The second part is to develop a compact specification of a set of test cases, and a program that will run the tests from the specification.

The detailed structure of the RE code in TPOP and the interface it presents to the rest of the program is quite specialized -- the goal was to present the essence of pattern-matching as compactly as possible, not to make it easy to generalize. So although the basic algorithm is fine for our purposes here, the packaging and some of the details have to be different. As always, we want to hide the implementation details, so they can be readily changed without affecting the rest of the program.

1. Modify the regular expression code

To begin, read the description in the book carefully and be sure you understand it. Then download the code and modify match and the other functions so they match a different class of regular expressions:

You have to modify match to handle this class of regular expressions. int match(char *re, char *s) returns 1 if the regular expression re matches the string s, and 0 if not; it returns -1 if there is any kind of error, such as an illegal regular expression. match does not print anything ever, nor does it ever exit or abort.

You do not need to handle pathological regular expressions; with this definition, there aren't many anyway. If you make only straightforward modifications to the existing code, odd cases like a**b will just work.

The original TPOP code interprets the input regular expression as matching proceeds. For these extensions, you may find it helpful to separate compilation from execution: make a pass over the regular expression to create an internal representation that is convenient for subsequent matching. If you do, you'll also have to modify the code to use that representation as it does matching. For example, a shorthand like \D is longer than a single character and it might (or might not) be easier to represent that as a single token. It's your choice how you do this, but users of your match() function must not know or care which choice you made.

For this part, you will have to submit a single C file regexp.c that contains only your function definitions. The only name that should be visible outside the file is match, so everything else must be declared static. We will test your code by compiling and linking your regexp.c with our own driver, using this declaration only:

	int match(char *regexp, char *text); 
and this compile command:
	gcc -pedantic -ansi -Wall our_grep.c your_regexp.c 

Your code must compile and link without change and without warning messages.

For calibration, my regexp.c is at most 60 lines longer than the original matching functions, and my grep.c is almost unchanged; if your solution is a lot longer than this, you are almost surely off on the wrong track.

Don't get wrapped up in language lawyering; the task is to convert the original code to our definition, not to invent your own definition. Follow the original code. Do not create your own matching algorithm or reinterpret what a regular expression is!

2. Create tests and run them

For the second part of this assignment, we will use a simple variant of the testing language for Awk that will be described in class. Each test case consists of one line with three tab-separated fields. The first field is -1, 0 or 1. The second field is a regular expression. The third field is a text string. If the first field is 1, the regular expression matches the string; if 0, it does not match the string; and if -1, the second field is an invalid RE and the third field is ignored. Blank lines will be ignored so you can use them to set off groups of related tests.

If you surround any string by "...", we will strip off the quotes before using the string. Thus a field consisting of "" will be treated as empty and " " as a single blank. No other processing will be applied, so don't nest quotes, don't use interior quotes, don't expect \ processing within quotes, etc.

Thus a typical set of lines might read

-1	\	ignored
1	x	x
0	x	""
0	x	y
0	x	xx
0	x	xy
1	*	""
1	*	x
1	*	"now is the time"
1	x*	x
1	x*	xy
0	x*	""
0	x*	a
0	x*	ax
1	x*y	xay
0	x*y	""
0	x*y	xx
1	\d	1
0	\D	2
0	\s	""
1	\s	" "
Be sure you understand these tests. Then create a bunch of tests of your own and use them to mechanically verify that your program works properly.

It's up to you how you arrange to run all your tests so long as it's automated. I wrote a program called retest.c (about 50 lines of C) that reads through a file of tests and reports when something unexpected happens; it prints no output unless something goes wrong. You must submit your testing program so we can look at it, but we won't grade it. Make sure it has at least a few comments that tell us what it's supposed to do.

We will test your code by compiling and linking your regexp.c with our own driver, using only the definition of match above, and using our test cases in addition to yours. We will be reasonable about borderline syntax cases, so be careful but not compulsive; correct behavior on the basic functionality is the important part of this assignment.

3. Advice

It is always best to attack such a task in stages, making sure each one works before the next one is started. I started with leftmost longest, then did literals, then the metacharacters * and ?, then isolated matching into regexp.c, then converted to a suitable representation, then added \, then finally added character class shorthands, testing after each stage. Your approach may differ, but do not try to do it all at once.

Define suitable functions rather than writing everything in one routine. Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. If things go wrong, use printf or master a debugger, or both. Here's my one-page cheat sheet on gdb.

Talking over the precise meaning of the assignment with friends and the TAs is encouraged. You must write your code yourself, however, so once you start coding, you may no longer discuss programming issues with other students. You may run your test cases through anyone else's implementation, but only by using his or her object file (regexp.o) and you have to write your own tests, not copy theirs. It's also OK to tell someone that a specific test appears to fail on their implementation. That is, this is OK:

  • Jack compiles his regexp.c and sends his regexp.o to Jill.
  • Jill runs her tests using her driver and Jack's regexp.o.
  • If one of Jill's tests fails, and she is convinced that it should have worked, she can tell Jack "This apparently valid test case doesn't work right with your implementation."
  • Jack and Jill together decide whether the test is valid. Both can submit it.

    This is meant to be a sort of arm's-length interoperability check, not a collaboration. We're not going to be up tight about this, but code and test cases really have to be fundamentally your own work.

    Prepare good test cases ahead of time; your tests ought to cover essentially all the lines of code in your program. Test early, test often, know what you are testing. Automate the testing as much as you can. Steve McConnell's Code Complete has a good chapter on testing. (Advertisement: so does TPOP. I also wrote a short note on how AWK is tested; you might skim it to see if it suggests anything useful.)

    4. Submission

    When you are finished, submit regexp.c, retests.txt and your_retester, using the CS Dropbox system. (your_retester can be written in any language; you don't have to use C.) You can submit multiple times; we'll grade the latest one.

    We'll grade this out of 100 somewhat along these lines:

    As you can see, core dumps are discouraged. Be sure to check arrays for potential overflow, and if you use malloc or the like, check its return value. We also care about style; you should follow the guidelines and general format of Chapter 1 of The Practice of Programming, and your code should look more or less like that.

    Please follow the rules on what to submit. It will be a help if you get the filenames right and submit exactly what's asked for. Thanks.