COS 126: Assignment xyz. Prefix Codes

COS 126

Prefix Codes

Programming Assignment xyz

Due: Wednesday, 11:59pm

Write a program, decipher.c, to decipher messages encoded using a prefix code, given the encoding tree. Such codes are widely used in applications that must compress data into as few bits as possible, including JPEG for images and MP3 for music.

Prefix codes. A prefix code is most easily represented by a binary tree in which the external nodes are labelled with single characters that are combined to form the message. The "code" for each character is a binary encoding of the path down from the root of the tree to the external node that holds the character: a 0 bit identifies a left branch in the path, and a 1 bit identifies a right branch. For example, in the following tree, in which shaded circles are internal nodes and transparent ovals are external nodes,

the code for `b' is 111, because the external node holding the is reached from the root by taking 3 consecutive right branches. The other codes in this tree are

The underscore, "_", denotes a space character. Note that the encodings have variable numbers of bits. This is a fundamental property of prefix codes, and it ensures that bits can be saved by using short encodings for frequently used characters.

A second fundamental property of prefix codes is that messages can be formed by simply stringing together the code bits, without keeping track of which bit corresponds to which character. For example, the bitstring

0111110010110101001111100100

encodes the message "abracadabra ". The first 0 must encode 'a', then the next three 1's must encode 'b', then 110 must encode r, and so on as follows:

|0|111|110|0|1011|0|1010|0|111|110|0|100
 a  b   r  a  c   a  d   a  b   r  a  _

The codes can be run together because no encoding is a prefix of another one. This property defines a prefix code, and it allows us to represent the encoding with a tree, as shown above. Thus, to decode a given bit string: Start at the root of the tree, take one bit at a time, and go left when the bit is 0 and right when it's 1, until you reach an external node. Then, print the character in that external node, and start over at the root. Implementing this algorithm is your primary task in this assignment.

Encoding the tree. To decode a bit string, you need an encoding tree, and to represent the tree, you need a linear encoding of the tree itself. One method is to do a preorder walk of the tree, printing, for each external node, the number of left links travelled to get to that node (since the last node) followed by the node's label. The linear representation of the tree above using this method is:

1 a 2 _ 1 d 0 c 1 r 0 b

To read an encoding tree, you can call buildcode(), declared in /u/cs126/examples/buildcode.h:

typedef struct node *link;
struct node {
   char character;
   link left, link right;
};

/***************************************************
 *  reads a linear encoding of a prefix code tree  *
 *  from standard input, builds the tree, and      *
 *  returns the root of that tree.                 *
 ***************************************************/
link buildcode(void);

buildcode() uses node structures for both internal and external nodes; internal nodes have null character fields and external nodes have null left and right fields. The source code for buildcode is in /u/cs126/examples/buildcode.c.

Decoding. Write decode.c, a program that reads an encoding tree and a encoded message from the standard input and writes the decoded message on the standard output. The encoded message will follow the encoded tree. For simplicity, the encoded message will consist of a sequence of 0 and 1 characters. In an actual application, these bits would be packed eight to the byte, thus using 1/8th the space. Here's the input file /u/cs126/data/files/prefix/abra.huff:

1 a 2 _ 1 d 0 c 1 r 0 b 
0111110010110101001111100100

Your program should produce the following output:

abracadabra_
28 bits read.

Here's an outline for decipher.c. Your main task is to write decode().

#include <stdio.h>
#include "buildcode.h"

int decode(link tree) {
   /*************************************************
    *  Decode message and print to standard input.  *
    *  Return number of bits read.                  *
    *************************************************/
}

int main(void) {
   int b;
   link tree;
   tree = maketree();
   b = decode(tree);
   printf("%d bits read.", b);
   return 0;
}

Test your program on the inputs in the directory /u/cs126/data/files/prefix/. Also run your program on the file /u/cs126/files/prefix/message2.huff. It contains a 289-character message. The Huffman encoding uses 1140 bits, which would normally be stored in 143 characters (8 bits per character), or 49% of the space required by normal ASCII. This kind of savings is typical. Note that for large messages the amount of space needed to store the tree encoding is negligible compared to storing the message itself, so we have ignored this quantity in the calculation.

Extra credit. In addition to printing out the decoded message, print out the code table itself. That is, for each external node in the tree, print the node label followed by its bit encoding. For example:

% a.out < /u/cs126/files/prefix/abra.huff
_  1 100
a  5 0
b  2 111
c  1 1011
d  1 1010
r  2 110
abracadabra_

Challenge for the bored. Write a Huffman encoder. It should read text from standard input, computes its Huffman tree, and print the Huffman code and tree to standard output. Your program should produce data files in the same format as abra.dat.

This is a modified and simplified version of an assignment created by David Hanson.