|
What program should I use for reading characters? You must use CharStdIn.java. This reads in bytes, whereas StdIn.java and In.java read in Unicode characters.
My programs don't work properly with binary data. Why not? System.out.print(ch) translates the character ch into one or more bytes according to the platform's default character encoding. This matters for non-ASCII characters (128 - 255). To print it as a byte, use System.out.write(ch), and flush the stream at end of your program with System.out.flush().
How do I read in the data for the Burrows-Wheeler encoder?
String s = CharStdIn.readAll();
How do I write data for the Burrows-Wheeler encoder?
System.out.println( /* an int */ ); for (int i = 0; i < N; i++) { ... System.out.write( /* a char */ ); } System.out.flush();
How do I read in the data for the Burrows-Wheeler decoder?
int first = Integer.parseInt(CharStdIn.readLine()); String s = CharStdIn.readAll();
How do I write data for the Burrows-Wheeler decoder?
for (int i = 0; i < N; i++) { ... System.out.write( /* a char */ ); } System.out.flush();
How do I read and write data for the move-to-front encoder and decoder?
while (!CharStdIn.isEmpty()) { char ch = CharStdIn.readChar(); ... System.out.write( /* a char */ ); } System.out.flush();
When I run the Huffman encoding algorithm on the input abracadabra! I get a different output for the bytes? Different operating systems may use different character encodings for extended ASCII symbols. As a result, the acute accent and the no-break space may be represented differently on your system (or may appear as a ? if your terminal font does not support it). For example, using IBM DOS extended ASCII, the resulting symbols should be |⊣|á, i.e., the vertical bar, the left tack, the vertical bar, and a acute.
How should I determine the size of the compressed file? On OS X, Linux, or Unix, compare "wc -c input.txt" with "java BWTE < input.txt | java MTFE | hufe | wc -c". On Windows, type "java BWTE < input.txt | java MTFE | hufe > compressed.txt" and use the File Manager to inspect the file sizes.
How much memory can my program consume? The Burrows-Wheeler encoder may use quite a bit, so you may need to use the -Xmx option when executing. You must use space linear in the input size N. (Industrial strength Burrows-Wheeler compression algorithms typically use a fixed blocksize, and encode the message in these smaller chunks. This reduces the memory requirements, at the expense of some loss in compression ratio.)
Why not use 8-bit bytes instead of 16-bit chars? You certainly can, but if you do, be careful to note that bytes are signed integers (whereas chars are unsigned). Also, it's convenient to be able to manipulate a sequence of bytes using the built-in String data type.
Where can I learn about bit-whacking operators in Java Here's one resource.
How do I use gzip and bzip on Windows? It's fine to use pkzip instead.
How can I compare the contents of two files (to check that the decoded version equals the original)? On Linux or OS X, use the command diff file1 file2; on Windows use the command fc file 1 file2.
I'm curious. What compression algorithm is used in PKZIP? In gzip? In bzip2? PKZIP uses LZW compression followed by the Shannon-Fano trees algorithm (an entropy encoder similar to Huffman). The Unix utility gzip combines a variation of LZ77 (similar to LZW) and Huffman coding. The program bzip2 combines the Burrows-Wheeler transform, Huffman coding, and a (fancier) move-to-front style rule.
|
Input. Here are some sample input files. To fully test your program, you should also try to compress binary files (e.g., .class or .jpg files). But don't kill yourself over this if it doesn't work - you will receive only a minor deduction if your program works on text only.
Reference solutions. For reference, we have provided the output of compressing aesop.txt and us.gif. We have also provided the results of applying each of the three encoding algorithms in isolation. Note that the GIF file is a binary file and is already compressed.
Compression script. Linux/OS X users may wish to use the scripts compressit.csh to automate the process of determining the compressed file sizes. The following command tests your implementation against cat, Huffman encoding alone, gzip, and bzip2.
Note that burrows is a script that combines the three phases of the Burrow Wheeler data compression algorithm. The scripts gzip226 and bzip226 run the standard Unix gzip and bzip programs and are only included for reference.compressit.csh burrows cat HUFE gzip226 bzip226
Timing script. The timeit.csh script measures execution times of the compression programs.
The scripts MTFE, MTFD, BWTD, and BWTE each take a command line argument and call the corresponding Java program. Be careful when using /usr/bin/time with pipes - it only measures the running time of the first command. Gluing the commands together in the script burrows enables you to account for the total execution time.timeit.csh cat HUFE HUFD MTFE MTFD BWTD BWTE burrows gzip226 bzip226
|
readme.txt. Use the following readme file template. Besides providing details about your implementation which are not clear from the comments within your source code, your readme.txt file should also:
What to submit. Submit BWTE.java, BWTD.java, MTFE.java, MTFD.java, HUFE.java, HUFD.java and any other files needed by your programs. Do not submit CharStdIn.java.
|
These are purely suggestions for how you might make progress. You do not have to follow these steps.