COS 226 Programming Assignment Checklist: Burrows-Wheeler Data Compression

Frequently Asked Questions

What program should I use for reading characters? You must use CharStdIn.java. This reads in bytes, whereas StdIn.java and In.java read in Unicode characters.

My programs don't work properly with binary data. Why not? System.out.print(ch) translates the character ch into one or more bytes according to the platform's default character encoding. This matters for non-ASCII characters (128 - 255). To print it as a byte, use System.out.write(ch), and flush the stream at end of your program with System.out.flush().

How do I read and write data for the move-to-front encoder and decoder?

while (!CharStdIn.isEmpty()) {
   char ch = CharStdIn.readChar();
   ...
   System.out.write( /* a char */ );
}
System.out.flush();

How do I read and write data for the Burrows-Wheeler encoder?

// read data
String s = CharStdIn.readAll();

...

// write data
System.out.println( /* an int */ );
for (int i = 0; i < N; i++) {
   ...
   System.out.write( /* a char */ );
}
System.out.flush();

How do I read and write data for the Burrows-Wheeler decoder?

// read data
int first = Integer.parseInt(CharStdIn.readLine());
String s  = CharStdIn.readAll();

...

// write data
for (int i = 0; i < N; i++) {
   ...
   System.out.write( /* a char */ );
}
System.out.flush();

How do I read and write data for the Huffman encoder

// read data
String s = CharStdIn.readAll();

...

// write preorder traversal by calling System.out.write(c)
System.out.println();

// write number of characters in input
System.out.println(numberOfChars);

// write sequence of bytes by calling System.out.write(c)
System.out.flush();

How do I read and write data for the Huffman decoder

// read preorder traversal by calling CharStdIn.readChar()
// see lecture notes on Huffman Decoder -- but you'll need to modify that code
CharStdIn.readLine();   // to eat up newline

// read number of characters to be printed
int numberOfChars = Integer.parseInt(CharStdIn.readLine());

// read bytes using CharStdIn.readChar() and write using System.out.write()
System.out.flush();

When I run the Huffman encoding algorithm on the input abracadabra! I get a different output for the bytes? Different operating systems may use different character encodings for extended ASCII symbols. As a result, the acute accent and the no-break space may be represented differently on your system (or may appear as a ? if your terminal font does not support it). For example, using IBM DOS extended ASCII, the resulting symbols should be |⊣|á, i.e., the vertical bar, the left tack, the vertical bar, and a acute.

How should I determine the size of the compressed file? On OS X, Linux, or Unix, compare "wc -c input.txt" with "java BWTE < input.txt | java MTFE | hufe | wc -c". On Windows, type "java BWTE < input.txt | java MTFE | hufe > compressed.txt" and use the File Manager to inspect the file sizes.

How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit, so you may need to use the -Xmx option when executing. You must use space linear in the input size N. (Industrial strength Burrows-Wheeler compression algorithms typically use a fixed blocksize, and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.)

Why not use 8-bit bytes instead of 16-bit chars? You certainly can, but if you do, be careful to note that bytes are signed integers (whereas chars are unsigned). Also, it's convenient to be able to manipulate a sequence of bytes using the built-in String data type.

Where can I learn about bit-whacking operators in Java Here's one resource.

How do I use gzip and bzip on Windows? It's fine to use pkzip instead.

How can I compare the contents of two files (to check that the decoded version equals the original)? On Linux or OS X, use the command diff file1 file2; on Windows use the command fc file 1 file2.

I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.

When I try to run compressit.csh and timeit.csh, I get a "permission denied" message. What can I do to activate these scripts? Use chmod to add execute privileges:

chmod +x compressit.csh
chmod +x timeit.csh

When I try to execute compressit.csh or timeit.csh, it gives the error: "command not found". Is the directory containing compressit.csh or timeit.csh in your path? If not and the scripts are in the current directory, invoke them using:

./compressit.csh  ...
./timeit.csh

When I try to execute timeit.csh it give me the error: "could not openprintf: warning: ignoring excess arguments, starting with `abra.txt'" Open up the script timeit.csh in an editor. You'll see this line:

# adjust as needed
set inputpath = "/u/cos226/ftp/textfiles"

Change the path "/u/cos226/ftp/textfiles" so it ends in the directory where your textfiles are. Do something similar in compressit.csh

Input, Output, and Testing

Input. Here are some sample input files. To fully test your program, you should also try to compress and uncompress binary files (e.g., .class or .jpg files). But don't kill yourself over this if it doesn't work - you will receive only a minor deduction if your program works on text only.

Reference solutions. For reference, we have provided the output of compressing aesop.txt and us.gif. We have also provided the results of applying each of the three encoding algorithms in isolation. Note that the GIF file is a binary file and is already compressed.

Compression script. Linux/OS X users may wish to use the scripts compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, gzip, and bzip2.

compressit.csh burrows cat HUFE gzip226 bzip226

Note that burrows is a script that combines the three phases of the Burrow Wheeler data compression algorithm. The scripts gzip226 and bzip226 run the standard Unix gzip and bzip programs and are only included for reference.

Timing script. The timeit.csh script measures execution times of the compression programs.

timeit.csh cat HUFE HUFD MTFE MTFD BWTD BWTE burrows gzip226 bzip226

The scripts MTFE, MTFD, BWTD, and BWTE each take a command line argument and call the corresponding Java program. Be careful when using /usr/bin/time with pipes - it only measures the running time of the first command. Gluing the commands together in the script burrows enables you to account for the total execution time.

Submission

readme.txt. Use the following readme file template. Besides providing details about your implementation which are not clear from the comments within your source code, your readme.txt file should also:

Report on the compression ratios for various test files.
Analyze the asymptotic running time of your encoding and decoding algorithms.

What to submit. Submit BWTE.java, BWTD.java, MTFE.java, MTFD.java, HUFE.java, HUFD.java and any other files needed by your programs. Do not submit CharStdIn.java.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

Download the directory burrows to your system. It contains StdCharIn.java and some testing data.
Implement the move-to-front encoding and decoding algorithms. Not including comments and declarations, our solutions are about 10 lines of code each. When debugging, cast each character to an integer so that you will see its ASCII value instead of a (potentially) unprintable character. Once you're convinced that your code works, print it out as a character using System.write.
Implement the binary version of the Huffman coding and decoding algorithms.
For the Burrows-Wheeler transform, you can use the method substring in the string library to form the suffixes. Recall that it does not explicitly form the suffixes, so you can build the N suffixes in linear time. Hint: to make the strings appear as if they were cyclic, consider string concatenation.
The Burrows-Wheeler decoding is the trickiest part, but it is very little code once you understand how it works. (Not including declarations and input, our solution about 10 lines of code.) You may find the key-indexed counting algorithm from the radix sorting lecture to be useful.