COS 333 Assignment 3 (Spring 2018)

COS 333 Assignment 3 (Spring 2018): All Your Base64 Are Belong To Us

Due midnight, Friday, March 2

Tue Feb 20 07:53:15 EST 2018

Base64 encoding provides a way to encode arbitrary binary data into printable ASCII for mailing or shipment to a web browser, then decode it back into exactly its original form. The encoding takes the input 6 bits at a time and maps each 6-bit chunk into an 8-bit letter, digit or other character that is guaranteed to travel end-to-end safely through all systems. This process expands the input by only 33%, so it is acceptably efficient. Decoding reverses the process, taking 8-bit character input back into properly packed bytes that reproduce the original input exactly.

This assignment is an exercise in writing code to a published standard; in theory, and we hope in practice, your implementation will interoperate properly with anyone else's implementation. It's also an exercise in bit-whacking: manipulating bits as raw bits, not numbers or characters. That kind of code is often necessary for writing device drivers, processing network packets, and other tasks that require packing bits into words or extracting them.

Keep clearly in mind that your programs must handle arbitrary bytes, not just ASCII characters.

Specification

The file base64.html contains an excerpt (most of Section 6.8) from RFC 2045, the standard for MIME (Multipurpose Internet Mail Extensions) encoding. Begin by taking a quick look at standard to see how such things are organized and written, then carefully read the excerpt that describes Base64 excoding.

Your task is to:

write encoders in Java and Python that encode any input stream into Base64 as specified by the standard.
write decoders in Java and Python that decode valid Base64 back into its original form.
create test files that will exercise your code thoroughly.
write a testing script that will run all your tests with a single command.

Your programs must read from stdin and write to stdout. The encoder output must end with a newline unless the input is empty. The programs must not throw exceptions or print any other output. Your encoder must work on all inputs. Your decoder must work on valid inputs like those produced by openssl, and should ignore whitespace in encoded Base64. Since there is no clear definition of what a decoder should do with invalid input, we will only test your decoder on invalid input to make sure it doesn't crash; it doesn't matter what its output is.

For testing, create at least 15 short test files named test01, test02, test03, etc., to be used as inputs. These tests should thoroughly explore critical boundary conditions and other potential trouble spots of the standard and your code. No submitted test file should be longer than about 1K bytes, though of course you should test your program on long files as well. You might find it helpful to read Chapter 6 of The Practice of Programming on testing.

The testing script must be called tester.sh; running it as

tester.sh [list of test files]

runs all the tests. See below for a sample implementation.

Keep clearly in mind that your programs must handle arbitrary bytes, not just ASCII characters.

Advice

The main complexities of implementation are mapping 3 bytes of input to 4 bytes of output, and vice versa, with some tricky boundary conditions. You will probably want to do the encoder and decoder first in Java, presuming that you know it better. You should also complete your testing suite to help with your Java implementation.

Use only standard Java libraries; do not use local Princeton libraries from earlier courses.

Once all the kinks are worked out with the algorithm and its corner cases, it should be straightforward to transliterate your Java into Python. Here's a Python help file and the official Python 2.7 documentation.

This is the version of Python installed on the CS servers and the dropbox computer. Note that there are significant incompatibilities between Python 2 and Python 3. You must use Python 2.

	$ python
	Python 2.7.5 (default, Aug  9 2017, 01:27:27)
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2

Keep it simple: this program does not need to be fast, or save memory, or be the slightest bit clever. As is true with all programming, trying to be fast and/or clever is a recipe for disaster.

For calibration, my Python encoder and decoder are about 50-60 lines each, and my Java versions are about 80 lines each. If your solutions are a lot bigger, you are probably off on the wrong track.

Keep clearly in mind that your programs must handle arbitrary bytes, not just ASCII characters. You have to use I/O routines that read and write a byte at a time. The Unix tool od ("octal dump") is useful for looking at binary data; try od -tx1 to see bytes in hex, or use xxd, where the -b flag shows you the output in bits.

You can use the program openssl as a reference implementation:

	openssl enc -e -base64 <input >base64output   # encode
	openssl enc -d -base64 <base64input >output   # decode

The standard says "The encoded output stream must be represented in lines of no more than 76 characters each." The openssl encoder outputs a newline more frequently, after every 64 characters. You will find it easier to do direct comparisons if your encoder does the same. Also be aware that in some cases, openssl does not put a newline after the final line of output. This seems like a bad implementation; your encoder must put a newline at the end of its output unless the input is empty.

Prepare good test cases ahead of time: this standard is quite clear and compact, but still requires careful attention. Test early, test often, know what you are testing. Automate the testing as much as you can, with a shell script or a makefile. Part of the job here is to mechanize testing but you will also find it helpful to validate your code on real MIME messages like the ones that encode attachments in mail. Grab those with an editor, put them in a file, decode, and see if the result is acceptable to the target application (Word, Excel, MP3 player, etc.).

There are lots of Base64 libraries and implementations floating around. Do not use or consult them. The purpose of this assignment is for you to write your own code to implement a common standard, using only that standard as a reference. Referencing any other material or implementations related to that standard is a grievous breach of academic integrity.

Submission and Evaluation

When you are finished, submit your four source files encode.java, decode.java, encode.py, decode.py, your testing program tester.sh, and a tarball called tests.tar that contains all of your test files. Use the CS Dropbox link dropbox.cs.princeton.edu/COS333_S2018/asgn3. You can put your properly-named test files in a tar file with this command:

	tar cf tests.tar test??

We will test your code by a process somewhat like this:

	tar xf tests.tar
	javac encode.java
	javac decode.java
	for i in $(ls test?? ourtests*)
	do
	    python encode.py <$i >x0		# your Python encoder
	    java decode <x0 >y0   		# your Java decoder
	    cmp $i y0				# compare to original
	    java encode <$i >x1   		# your Java encoder
	    openssl enc -d -base64 <x1 >y1	# reference decoder
	    cmp $i y1				# compare to original
	    openssl enc -e -base64 <$i >x2	# reference encoder
	    python decode.py <x2 >y2		# your Python decoder
	    cmp $i y2				# compare to original
	    # etc. for other combinations
	done

You can use this as a model for your own tester.sh.

Note that it is not sufficient that your encoders and decoders work together; each must work independently.