COS 333 Assignment 3: Size Matters

Due midnight, Friday, Feb 26

Mon Feb 15 20:11:48 EST 2010

This assignment is an exercise in using scripting languages to explore and manipulate two largish text files. Your job is to compress these files (separately) as much as you can, using standard Unix tools and scripting languages like Awk, Perl, Python, PHP, and so on; you can't use general-purpose languages like C, C++, or Java. Here are help files for Awk, Perl, Python and PHP that might get you started.

There are many Unix tools that will help too: gzip and bzip2 do general-purpose compression; grep, sed, sort, uniq and wc let you find, count and rearrange things; and of course shells and scripting languages let you do your own processing and combine other pieces.

Your compression must be lossless, so the original data can be reconstituted exactly. If you remove something redundant, you have to be able to recreate it. Similarly, if you rearrange any part of the data, for example by sorting, you have to be able to recover what you started with. Your techniques should also be expected to achieve similar compression on an input file that is similar.

1. What's in a Name?

The file names.txt (3.1 MB) contains nearly 89,000 surnames from the 1990 census, sorted by frequency of occurrence. The original data and background information can be found at the Census Bureau.

Compress this file as much as you can, using only standard Unix tools and scripting languages. You have to also provide the procedure for uncompressing. For calibration, I squeezed it by a factor of nearly 10 with only 6-8 lines of Awk, a factor of 12 is well within range, and the best ever was over 14.

It's probably easiest to do your experiments by focusing first on a compression script, but always keeping in mind the process by which you will uncompress. Once you think the compression is good enough, collect all the compression operations into a script that can be run by a single command, and all the uncompression operations into another script.

You must submit two scripts c1 and u1 that compress and uncompress respectively. c1 creates a single compressed file from a single input file, and u1 creates an uncompressed file from a single compressed input file. We will test your work by running just those two scripts on the census data. After

	c1 names.txt compressed.form
	u1 compressed.form newnames.txt 
newnames.txt must match names.txt exactly.

2. Past Performance is No Guarantee of Future Results

The file dji.txt (20433 lines, 1.16 MB, from Yahoo Finance) contains the complete history of the Dow Jones Industrial Average since its inception in October, 1928. Each line lists the date; the opening, high, low and closing prices; the number of shares traded; and the "adjusted" price, which includes adjustments like dividend payouts. The first and last lines of the file are
	2010-02-12,10137.23,10140.18,9962.13,10099.14,4160680000,10099.14
	...
	1928-10-01,239.43,242.46,238.24,240.01,3500000,240.01
The last field says that the original index value was 240.01 and on 2/12/10 the index was 10099.14.

Your job is to write a script c2 to compress this file as much as you can. Again, c2 must be accompanied by an uncompression script u2 that will restore the file to its original state. Again, the compression script creates a single output file; the decompression script reads a single input file. For calibration, a factor of 6 is not too hard.

3. Advice

There are no right answers for this assignment. We're looking for a combination of straightforward ideas and an occasional flash of insight, expressed as an orderly set of programs that do the job automatically. There are many possible approaches that you can mix and match, so play with it over a period of time rather than doing it in a single sitting; this gives your subconscious a chance to contribute. The basic idea of compression is to remove redundancy -- things that are repeated or similar to others -- and to use fewer bits per item for items that occur often. Bzip2 does a great job of that. What you can do that bzip2 can't is take advantage of semantic relationships, and you can also rearrange the input so long as you do it reversibly.

We will compute the compression factor approximately like this:

c1 names.txt compressed.form
factor = bytes in names.txt / (bytes in compressed.form + all your submitted files)
It's probably unkind of us, but comments, blank lines, etc., in your programs count as part of the size. It won't be a big effect, however. The "Check" buttons in the submission scripts at Dropbox will tell you what your compression factor is if your scripts work.

There's no credit for gaming the system, for example by hiding the file contents somewhere and then magically finding them, since we might well try them on analogous but different data. You are trying to compress the file to send it to someone else. You have to include in the transmission all the information necessary to recreate it at the other end. So if you write decompression scripts or build data tables of your own, those must be included in the size. Standard tools like sort, bzip2, Awk, Perl, etc., can be assumed to exist everywhere and are not included in any size computation.

The credit will be divided more or less evenly among the two files. You should be able to get a factor of 12 for names.txt and 6 for dji.txt without undo strain. Try to do better than that; there won't be enormous credit for doing worse than my sleazy hacks. Along with fleeting fame and a good grade, there will be some utterly useless prizes for the best couple of compressions. Don't kill yourself, however -- this assignment is meant to be a bit of a change of pace after the first few. Enjoy!

4. Submission

You can mix and match Unix commands and scripts in scripting languages as you like; the result must be runnable by invoking a single command. This command might be self-contained or it might invoke other files; that's your choice. It doesn't have to be robust; it just has to work if it's used properly. It should run at a reasonable speed, however -- if it takes more than a couple of minutes, something's wrong.

c1 and u1 are like the cp command: one input, one output. We will test correctness by

	c1 names.txt compressed.form
	u1 compressed.form newnames
	cmp names.txt newnames
in a directory containing nothing but your submission and names.txt.

The scripts for dji.txt are to behave the same, but must be named c2 and u2.

To make this more concrete, here are working versions of c1 and u1, to show the general format. This achieves a factor of 5.25 merely by discarding multiple blanks; for contrast, bzip2 compresses this file by about 4.95.

	# c1: compress names.txt
	awk '{print $1,$2,$3,$4}' $1 | bzip2 >$2

	# u1: uncompress for names.txt
	mv $1 $1.bz2
	bunzip2 -c $1.bz2 | awk '{printf("%-14s %5s %6s %6d\n", $1,$2,$3,$4)}' >$2

We will test your code by running in a directory that contains only your code and either names.txt or dji.txt. You have to submit all the pieces necessary to make this just plain work when we run it. Submit using this Dropbox link for all files related to names.txt and this Dropbox link for all files related to dji.txt.

Please follow the rules on what to submit. Don't include standard Unix tools or the data files. It will help your grade if you get the filenames right and submit exactly what's asked for. Thanks.