COS 333 Assignment 5: Size Matters (Spring 2018)

Due midnight, Friday, March 30 (after break)

Sun Mar 4 09:57:39 EST 2018

Introduction

This assignment is an exercise in using scripting languages to explore and manipulate two largish text files. Your task is to compress these files (separately) as much as you can, while managing one important criterion: your compression must be lossless, so the original data can be reconstituted exactly. The assignment is also an exercise in exploratory data analysis -- your job is to explore a variety of approaches to see which ones combine to produce the best outcomes.

Your tools for this job are scripting languages like AWK, Perl, Python, PHP, Ruby, and so on. Importantly, you can't use general-purpose languages like C, C++, or Java. As a reminder, here are help files for AWK and Python from previous assignments. You might also reference the lecture notes on Bash, AWK, and Python.

There are many Unix tools that might also help: gzip, bzip2 and xz do general-purpose compression; grep, sed, sort, uniq and wc let you find, count and rearrange things; and of course shells and scripting languages let you do your own processing and combine other pieces. You can use any tool that's found on the CS Dropbox, which is essentially the same as on the CS cycles computers or the Nobel cluster at OIT.

Part 1: A 168 By Any Other Name Would Smell As 1111

The file names.txt (5.5 MB) contains nearly 152,000 surnames from the 2010 census, sorted by frequency of occurrence. Background information about the data can be found at the Census Bureau.
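
Before writing any compression code, it helps to get a feel for the data. A quick, non-committal look (nothing here is specific to any particular approach) might be:

# peek at the first few lines, and check how regular the file is
head -5 names.txt
wc -lc names.txt                              # line count and byte count
awk '{print NF}' names.txt | sort | uniq -c   # how many fields per line?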

Your task is to write programs for compression and uncompression in which the former compresses this file as much as possible, while the latter can still recover the original file exactly. As noted above, you can only use standard Unix tools and scripting languages.

For calibration of what "as much as you can" entails, my straightforward 14-line script, mostly AWK, compresses names.txt by a factor of 11.5, and a 23-line script, also mostly AWK, recovers the original. The highest compression factor obtained last year for this file was 12.01, while the 90th percentile was 11.92. (This year's numbers will be different, of course -- hopefully better, since you have a record to aim for!) We will set the midline, used in calculations below, at 11.5, which corresponds to just under the 50th percentile last year.

You must submit two scripts c1 and u1 that compress and uncompress respectively: c1 creates a single compressed file, written to standard output, from a single input file given as a command line argument, and u1 creates an uncompressed file, written to standard output, from a single compressed input file given as a command line argument. We will test your work by running just those two scripts. The result of u1 must match the original input exactly (for instance, as might be determined by the cmp command).

To make this more concrete, here are working versions of c1 and u1, to show the general format. This c1 achieves a factor of 5.48 merely by squeezing out runs of blanks before handing the result to bzip2; for comparison, bzip2 alone compresses this file by a factor of about 5.17.

# c1: compress names.txt
awk '{print $1,$2,$3,$4}' $1 | bzip2 -c
# u1: uncompress for names.txt
# read the compressed file on standard input, so the input file is left untouched
bunzip2 -c <$1 | awk '{printf("%-15s %5s %6s %6d\n", $1,$2,$3,$4)}'
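
To try these out (here invoked via sh; you could equally make the scripts executable), a round trip looks like this, with scratch filenames chosen arbitrarily:

sh c1 names.txt >names.z        # compress
sh u1 names.z >names.out        # uncompress
cmp names.txt names.out         # prints nothing if the round trip is exact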

Part 2: Rank Hath Its Privileges

The file crank.txt (143K lines, 4.95 MB) contains the hourly Amazon rank of The C Programming Language from June 14, 2001 to March 3, 2018. Each line gives the rank and the date in standard Unix form. There is a tab between the rank and the date, and the date is in this format, with exactly one space between each date field:

5,669	Sat Mar 3 08:33:58 EST 2018 
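
Since the rank and the date are separated by a tab, an AWK fragment like the following (purely illustrative) pulls the pieces apart; the date itself splits into six space-separated fields:

# split each line into rank (before the tab) and date (after it),
# then split the date into weekday, month, day, time, zone, year
awk -F'\t' '{split($2, d, " "); print $1, d[1], d[2], d[3], d[4], d[5], d[6]}' crank.txt | head -3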

You must submit two scripts c2 and u2 that compress and uncompress respectively: c2 creates a single compressed file, written to standard output, from a single input file given as a command line argument, and u2 creates an uncompressed file, written to standard output, from a single compressed input file given as a command line argument. We will test your work by running just those two scripts. The result of u2 must match the original input exactly.

For calibration here, my simple 12-line script, mostly AWK, compresses crank.txt by a factor of 19.4, and a 20-line script, also mostly AWK, recovers the original. The highest compression factor obtained for last year's version of this file was 22.365, while the 90th percentile was 21.6. This year's numbers will be different, of course, partly because there is more data, but you should expect something similar. We will set the midline (used in calculations below) at 20.00, which corresponds to just under the 50th percentile last year.

General Requirements for Both Parts

You can mix and match Unix commands and scripts in scripting languages any way you like; each operation must be runnable by invoking a single command (one of your four scripts, {c,u}{1,2}). This command must be self-contained. It doesn't have to be robust; it just has to work if it's used properly. It should run at a reasonable speed, however. We will judge this by the Dropbox timeouts: if it takes more than a minute, something's wrong, and your script will be killed. If a tool doesn't run on the CS Dropbox, you can't use it no matter how useful it might be; the same goes for Python modules and the like. Use the dropbox as a reasonable test of whether your tools are valid in this environment.

Your compression and uncompression techniques should work correctly, and should achieve roughly the same compression, on an input file that has the same structure as the original but somewhat different data values. Examples might be the equivalent census data from 2000 or 2020, or the equivalent Amazon data if it were brought up to date.

Advice

The first piece of advice is one that is often heart-wrenching to Princeton students: there are no "right answers" for this assignment! Instead, success comes from a combination of straightforward ideas, iteration and perseverance across experiments with multiple techniques, and perhaps an occasional flash of insight. There are many possible approaches that you can mix and match, so play with the problem over a period of time rather than trying to finish it in a single sitting; that gives your subconscious a chance to contribute.

The basic idea of compression is to remove redundancy -- things that are repeated or that can be derived from others -- and to use fewer bits per item for items that occur often. Bzip2 and other general-purpose compression programs do a great job of that, and it is a waste of your time to try to invent special mechanisms that are subsumed by what they already do. We cannot stress this point too much: do not try to create a better bzip. What you can do that bzip2 can't is take advantage of semantic relationships, and you can also rearrange the input so long as you do it reversibly.
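
For a flavor of what exploiting a semantic relationship can look like, here is a deliberately simplified sketch (the field positions are hypothetical, not a description of either file): if some column is just 1, 2, 3, ... in line order, the compressor never needs to store it, because the uncompressor can regenerate it from the line number.

# compress side: omit a hypothetical second column that is simply the line number
awk '{print $1, $3, $4}' $1 | bzip2 -c
# uncompress side: regenerate that column from the line number (NR)
bunzip2 -c <$1 | awk '{print $1, NR, $2, $3}'

Restoring the exact spacing of the original is still up to you, as in the c1/u1 example above.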

It is probably easiest to do your experiments by focusing first on a compression script, but always keeping in mind the process by which you will uncompress. If you remove something redundant, you have to be able to recreate it. Similarly, if you rearrange any part of the data, for example by sorting, you have to be able to recover what you started with. Once you think the compression is good enough, collect all the compression operations into a script that can be run by a single command, and all the uncompression operations into another script.
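
To make "rearrange reversibly" concrete, here is one sketch (not necessarily a win for these particular files): tag each line with its original line number, sort by content, and later sort on the tag to restore the original order. The tags cost bytes, so this only helps if the sorted data compresses enough better to pay for them.

# forward: remember each line's position, then sort by content
awk '{printf "%d\t%s\n", NR, $0}' $1 | sort -k2 | bzip2 -c
# reverse: restore the original order and strip the tags
bunzip2 -c <$1 | sort -n | cut -f2-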

The rank data has been massaged to remove some real glitches, like the time in March 2012 when the rank went to 5,448,067 for a single hour. If you find something really weird in the data, let us know; we're not trying to trick you with flaky data. But real data is often flaky, and you have to cope with it. This data comes from nearly 17 years of scraping, so do not expect data for every hour (in fact, over the past year Amazon has gotten much more aggressive about refusing to serve requests that appear to come from programs). Changes between EST and EDT can lead to surprises, and there will be other oddities as well. Your job is to compress and uncompress the data in spite of such glitches.
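
A quick tally of how many hourly samples landed in each year (the year is the last space-separated field on every line) is one easy way to see how ragged the coverage is:

# count the number of samples per year
awk '{count[$NF]++} END {for (y in count) print y, count[y]}' crank.txt | sort -n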

And before you ask: the fact that the cumulative percentage in names.txt doesn't match exactly is not a bug, it's a feature of rounding, and you must deal with it.
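
One general way to handle a derived column that almost matches: recompute it, and record only the lines where rounding makes the recomputed value disagree. Purely as a sketch, with made-up field positions (pretend field 3 is a per-name value and field 4 its running total), something like this counts how many exceptions you would have to store:

# how often does a naive running total fail to reproduce the stored column?
awk '{sum += $3; if (sprintf("%.2f", sum) != $4) bad++} END {print bad+0, "mismatches out of", NR, "lines"}' names.txt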

Submission and Evaluation

Submit your files c1, u1, c2 and u2 using the dropbox link https://dropbox.cs.princeton.edu/COS333_S2018/asgn5.

The dropbox script will do a straightforward check of your code when you submit. We will test your code by running in a directory that contains only your code, names.txt and crank.txt. Please be careful that your scripts do not alter the input test files. The "Check All Submitted Files" button on Dropbox will report your compression factors if your scripts are correct (lossless).

We will test correctness with a sequence of commands such as these (but do not assume particular filenames for either the input or output files):

c1 names.txt >compressed.form
u1 compressed.form >newnames
cmp names.txt newnames
c2 crank.txt >compressed.form
u2 compressed.form >newcrank
cmp crank.txt newcrank

We will compute the compression factor like this:

c1 names.txt >compressed.form
factor = (bytes in names.txt) / (bytes in compressed.form + bytes in c1 + bytes in u1)
c2 crank.txt >compressed.form
factor = (bytes in crank.txt) / (bytes in compressed.form + bytes in c2 + bytes in u2)
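
Spelled out in shell, the arithmetic for part 1 might look like this (just a sketch; the compressed filename is arbitrary):

orig=$(wc -c <names.txt)
comp=$(wc -c <compressed.form)
scripts=$(cat c1 u1 | wc -c)
awk -v o="$orig" -v c="$comp" -v s="$scripts" 'BEGIN {printf "factor = %.2f\n", o/(c+s)}'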

Unfortunately, this means that comments, spaces, and so on count as part of the compressed size. In practice this won't have a big effect, but if you want to squeeze out every last byte, we strongly suggest developing well-commented, readable code and only minifying it as a separate, final step.
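
For example, a crude minifying step (a sketch only; it assumes your comments sit on lines of their own, and c1.dev is a made-up name for your readable version) might keep the first line and drop comment-only and blank lines:

# produce a stripped-down c1 from a commented, readable c1.dev
{ head -1 c1.dev; tail -n +2 c1.dev | grep -v '^[[:space:]]*#' | grep -v '^[[:space:]]*$'; } >c1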

The credit will be divided evenly between the two parts of the assignment. For each part, the score breakdown is as follows, with exact details described below:

40% (20 points): Submission and Correctness
30% (15 points): Baseline-to-Midline Performance
30% (15 points): Midline-to-Highline Performance

Incorrect programs receive no performance points; your program must be correct before it can receive any performance points. If you do not reach the baseline (bzip2), you will receive no performance points.

If you reach the midline specified in each part above, you will receive all 15 of the first set of performance points. If you do not reach the midline, we will scale your score linearly between the baseline (bzip2) and the midline. So, for example, if you got 7 where the baseline was 5 and the midline was 11, you would earn ((7-5)/(11-5))*15, or 5 of the 15 points in the first set of performance points.

Beyond the midline, the second set of performance points will be measured against the best compression ratios among this semester's submissions. To avoid fostering cutthroat behavior, while still giving you an incentive to push your ratio as high as you can, we will define "best" as "among the top 10%". Thus, if your compression ratio is in the top 10%, you will earn all the points in the second set. Otherwise, your score will be scaled between the midline and the lowest ratio in the top 10%. So, for example, if you got 20 when the midline was 19 and the 90th percentile was 21, then even if the highest compression score in the class were 50 (not likely!), you would still earn ((20-19)/(21-19))*15 = 7.5 of the 15 points in the second set of performance points.