What program should I use for reading and writing the data? You must use BinaryStdIn.java and BinaryStdOut.java. These read and write sequences of bytes, whereas StdIn.java and StdOut.java read and write sequences of Unicode characters. These are in stdlib.jar.
My programs don't work properly with binary data. Why not? Be absolutely sure that you use only BinaryStdIn.java and BinaryStdOut.java for input and output. Also, be sure that you call BinaryStdOut.flush() or BinaryStdOut.close() after you are done writing—see Huffman.expand() for an example.
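To see why the final flush matters, here is a minimal sketch that uses java.io.BufferedOutputStream as a stand-in. It assumes BinaryStdOut buffers its output in a similar way; the class and method names (FlushDemo, sizesBeforeAndAfterFlush) are made up for illustration and are not part of stdlib.jar.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class FlushDemo {
    // Writes data through a buffered stream and reports how many bytes have
    // reached the underlying sink before and after flush().
    // Assumption: data is smaller than the default buffer (8192 bytes).
    static int[] sizesBeforeAndAfterFlush(byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            BufferedOutputStream out = new BufferedOutputStream(sink);
            out.write(data, 0, data.length);
            int before = sink.size();   // bytes are still sitting in the buffer
            out.flush();
            int after = sink.size();    // now they have reached the sink
            return new int[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = sizesBeforeAndAfterFlush("ABRACADABRA!".getBytes());
        System.out.println("before flush: " + sizes[0] + ", after flush: " + sizes[1]);
    }
}
```

If the program exits without the flush, the buffered bytes never reach the output, which typically shows up as a truncated or empty file.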
Why does BinaryStdIn return the 8 bits as a (16-bit unsigned) char instead of as an 8-bit byte? The primitive type byte is awkward in Java because it is signed: when you operate on a byte, it is typically promoted to an int with sign extension. E.g., to convert a byte b to a char c, you must write c = (char) (b & 0xff) instead of c = (char) b. By using char, we avoid the hassle.
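The difference between the two casts can be checked with a small sketch. The class and method names (ByteToChar, signExtended, masked) are invented for this illustration.

```java
final class ByteToChar {
    // Direct cast: the byte is sign-extended to int first, so values
    // 0x80..0xFF land in the range 0xFF80..0xFFFF.
    static char signExtended(byte b) {
        return (char) b;
    }

    // Masking with 0xff keeps only the low 8 bits, giving 0x00..0xFF.
    static char masked(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                        // stored as -23
        System.out.println((int) signExtended(b));   // prints 65513 (0xFFE9)
        System.out.println((int) masked(b));         // prints 233   (0xE9)
    }
}
```

For bytes in the range 00 to 7F the two casts agree; they differ only when the high bit is set.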
For the Burrows-Wheeler encoder, in which order do I sort the suffixes? The input is a sequence of extended ASCII characters (00 to FF), which you can read in with BinaryStdIn.readString(). You should sort the suffixes according to extended ASCII order, which is the natural order of the String data type.
For the Burrows-Wheeler decoder, does next[0] always equal first? And wouldn't this mean that the index first is redundant? No, this is just a coincidence with the input string "ABRACADABRA!". Consider any two input strings that are cyclic rotations of one another, e.g., "ABRACADABRA!" and "CADABRA!ABRA". They will have the same sorted suffixes and t[] array—their only difference will be in the index first.
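The claim about cyclic rotations can be checked with a brute-force sketch. It builds every rotation as a separate String and sorts them in natural String order, so it uses quadratic space and is for experimenting on short strings only. It does not meet the assignment's memory requirement, and the names (RotationDemo, sortedRotations, lastColumn, first) are invented for this example.

```java
import java.util.Arrays;

final class RotationDemo {
    // All cyclic rotations of s, sorted in natural String order
    // (which matches extended ASCII order for chars 00 to FF).
    static String[] sortedRotations(String s) {
        int n = s.length();
        String doubled = s + s;
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = doubled.substring(i, i + n);
        }
        Arrays.sort(rotations);
        return rotations;
    }

    // The t[] array: last character of each sorted rotation.
    static String lastColumn(String s) {
        StringBuilder t = new StringBuilder();
        for (String r : sortedRotations(s)) {
            t.append(r.charAt(r.length() - 1));
        }
        return t.toString();
    }

    // Row of the sorted rotations in which the original string appears.
    static int first(String s) {
        String[] rotations = sortedRotations(s);
        for (int i = 0; i < rotations.length; i++) {
            if (rotations[i].equals(s)) return i;
        }
        return -1; // reached only for the empty string
    }

    public static void main(String[] args) {
        for (String s : new String[] { "ABRACADABRA!", "CADABRA!ABRA" }) {
            System.out.println(first(s) + " " + lastColumn(s));
        }
        // prints:
        // 3 ARD!RCAAAABB
        // 8 ARD!RCAAAABB
    }
}
```

Both inputs produce the same t[] array, so without first the decoder could not tell which rotation was the original message.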
How should I download the sample input files and reference solutions? Be careful to download them as binary files—some browsers will corrupt them if you view the file and use File -> Save. Do not edit them in a text editor—some editors will corrupt them by inserting bogus newline characters.
Can I assume that the decode() method in BurrowsWheeler receives only valid inputs (that were created by a call to the encode() method)? Yes.
How can I compare the contents of two files (to check that the decoded version equals the original)? On OS X and Linux, use the command diff file1 file2; on Windows, use the command fc file1 file2.
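If neither command is handy, a byte-for-byte comparison can also be sketched in Java. This is an optional alternative, not part of the assignment. It reads both files fully into memory, so it suits the sample inputs rather than very large files, and the names (SameBytes, sameBytes) are invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class SameBytes {
    // True if the two files have identical contents, byte for byte.
    static boolean sameBytes(Path a, Path b) {
        try {
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java SameBytes file1 file2
        boolean same = sameBytes(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "identical" : "different");
    }
}
```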
How can I view the contents of a binary file? Use HexDump.java, as in the assignment. The command-line argument specifies the number of bytes per line to print.
How do I determine the sizes of the original and compressed files? Use HexDump.java, as in the assignment. Use a command-line argument of 0 to suppress all output except for the number of bytes.
How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit, so you may need to use the -Xmx option when executing. You must use space linear in the input size N and alphabet size R. Therefore, depending on your operating system and configuration, there may be some very large files for which your program will not have enough memory even with the -Xmx option. (Industrial-strength Burrows-Wheeler compression algorithms typically use a fixed block size and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.)
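To check that an -Xmx setting actually took effect, one option is to print the heap limit the JVM reports. This is a small sketch using Runtime.maxMemory(); the class name MaxHeap and its method are invented for this example.

```java
final class MaxHeap {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Example: java -Xmx2g MaxHeap
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

The reported value may be somewhat lower than the number passed to -Xmx, depending on the JVM and garbage collector.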
How do I use gzip and bzip2 on Windows? It's fine to use pkzip or 7-zip instead.
I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by Shannon-Fano (an entropy encoder similar to Huffman). The Unix utility gzip uses a variation of LZ77 (similar to LZW) followed by Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.
Input. Here are some sample input files. To fully test your program, you should also try to compress and uncompress binary files (e.g., .class or .jpg files).
Timing Your Program. To time your program in Linux or Mac OS, simply use the time command, e.g. time java BurrowsWheeler - < mobydick.txt | java MoveToFront - | java Huffman - > mobyDickOutputFileName. You want to record the "real" value.
In Windows, the process is a bit more involved. We recommend using a simple batch file. The process for creating and running a batch file is as follows:
echo %time%
java BurrowsWheeler - < mobydick.txt | java MoveToFront - | java Huffman - > mobyDickOutputFileName
echo %time%
Timing using gzip, bzip2, 7zip, etc.
In Linux or Mac OS X, this is easy: simply use the time command as above. In Windows, there is no (easy) way to compress a file from the command line. We recommend downloading the free 7-zip program. After installing 7-zip, create a new batch file (any filename ending in .bat) with the following text:
echo %time%
7za a -tzip mobyDickOutputFileName.zip mobydick.txt
echo %time%
This creates a file in .zip format (the same used natively by Windows for compression). To test unzipping time, use the following:
echo %time%
7za e mobyDickOutputFileName.zip
echo %time%
If you're interested in testing against other compression formats, then see this page.
Reference solutions. For reference, we have provided the output of compressing aesop.txt and us.gif. We have also provided the results of applying each of the three encoding algorithms in isolation. Note that the .gif file is a binary file and is already compressed.
These are purely suggestions for how you might make progress. You do not have to follow these steps.