Due Thursday February 28 at 10:00 pm
Sun Feb 17 13:55:02 EST 2019
Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. A Base64 encoder reads an input stream of arbitrary 8-bit bytes. It processes 6 bits at a time, mapping each 6-bit chunk into an 8-bit printable letter, digit or other character that is guaranteed to travel end-to-end safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. A decoder reverses the process, converting a stream of 8-bit ASCII characters back into properly packed bytes that reproduce the original input exactly.
This assignment is an exercise in writing code to a published standard; in theory, and we hope in practice, your implementation will interoperate properly with anyone else's implementation. It's also an exercise in bit-whacking: manipulating bits as raw bits, not numbers or characters. That kind of code is often necessary for writing device drivers, processing network packets, and other tasks that require packing bits into words or extracting them.
Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.
The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding. Begin by taking a quick look at the standard to see how such things are organized and written, then carefully read the excerpt that describes Base64 excoding.
Your task is to:
Your programs must read from stdin and write to stdout. The encoder output must end with a newline unless the input is empty. The programs must not throw exceptions or print any other output. Your encoder must work on all inputs. Your decoder must work on valid inputs like those produced by openssl, and should ignore whitespace in encoded Base64. Since there is no clear definition of what a decoder should do with invalid input, we will only test your decoder on invalid input to make sure it doesn't crash; it doesn't matter what its output is.
For testing, create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read about testing in Chapter 6 of The Practice of Programming.
The testing script must be called tester.sh; running it as
tester.sh [list of test files]runs all the tests. See below for a sample implementation.
Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.
The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. You will probably want to do the encoder and decoder first in Java, presuming that you know it better. You should also complete your testing suite to help with your Java implementation.
Once all the kinks are worked out with the algorithm and its corner cases, it should be straightforward to transliterate your Java into Python. Here's a Python help file and the official Python 2 documentation.
This is the version of Python installed on the CS servers and the Tigerfile computer. There are significant incompatibilities between Python 2 and Python 3. You must use Python 2.
$ python -V Python 2.7.5 $ python Python 2.7.5 (default, Jul 3 2018, 19:30:05) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster. For calibration, my Python encoder and decoder are about 50-60 lines each, and my Java versions are about 80 lines each. If your solutions are a lot bigger, you are probably off on the wrong track.
Keep clearly in mind that your programs must handle arbitrary bytes, not just printable ASCII characters.
For both languages, find library functions that will let you read and write raw bytes, perhaps one at a time, analogous to getchar and putchar in C. The more levels of functions that are trying to interpret the input, the harder it is to see what the input really was. Use only standard Java libraries; do not use local Princeton libraries from earlier courses. For Python, you should not need to import anything beyond sys.
The Unix tool od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xxd, where the -b flag shows you the output in bits.
You can use the program openssl as a reference implementation:
openssl enc -e -base64 <input >base64output # encode openssl enc -d -base64 <base64input >output # decode
The standard says "The encoded output stream must be represented in lines of no more than 76 characters each." The openssl encoder outputs a newline more frequently, after every 64 characters. You will find it easier to do direct comparisons if your encoder also outputs a newline after every 64 characters. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like a bad implementation; your encoder must put a newline at the end of its output unless the input is empty.
Prepare good test cases ahead of time: this standard is quite clear and compact, but still requires careful attention. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).
There are lots of Base64 libraries and implementations floating around. Do not use or consult them. The purpose of this assignment is for you to write your own code to implement a common standard, using only that standard as a reference. Referencing any other material or implementations related to that standard is a grievous breach of academic integrity.
When you are finished, submit your four source files encode.java, decode.java, encode.py, decode.py, your testing program tester.sh, and a tarball called tests.tar that contains all of your test files. Use the CS Dropbox link tigerfile.cs.princeton.edu/COS333_S2019/asgn3. You can put your properly-named test files in a tar file with this command:
tar cf tests.tar test??
We will test your code by a process somewhat like this:
tar xf tests.tar javac encode.java javac decode.java for i in $(ls test?? ourtests*) do python encode.py <$i >x0 # your Python encoder java decode <x0 >y0 # your Java decoder cmp $i y0 # compare to original java encode <$i >x1 # your Java encoder openssl enc -d -base64 <x1 >y1 # reference decoder cmp $i y1 # compare to original openssl enc -e -base64 <$i >x2 # reference encoder python decode.py <x2 >y2 # your Python decoder cmp $i y2 # compare to original # etc. for other combinations done
You can use this as a model for your own tester.sh if you wish.
Note that it is not sufficient that your encoders and decoders work together; each must work independently.