COS 126 Prefix Codes |
Programming Assignment xyz Due: Wednesday, 11:59pm |
Write a program, decipher.c
, to decipher messages encoded
using a prefix code, given the encoding tree. Such codes are widely
used in applications that must compress data into as few bits as possible,
including JPEG for images and MP3 for music.
Prefix codes. A prefix code is most easily represented by a binary tree in which the external nodes are labelled with single characters that are combined to form the message. The "code" for each character is a binary encoding of the path down from the root of the tree to the external node that holds the character: a 0 bit identifies a left branch in the path, and a 1 bit identifies a right branch. For example, in the following tree, in which shaded circles are internal nodes and transparent ovals are external nodes,
the code for `b'
is 111
, because the
external node holding the
is reached from the root by
taking 3 consecutive right branches. The other codes in this tree are
a 5 0 c 1 1011 d 1 1010 r 2 110 _ 1 100
The underscore, "_", denotes a space character. Note that the encodings have variable numbers of bits. This is a fundamental property of prefix codes, and it ensures that bits can be saved by using short encodings for frequently used characters.
A second fundamental property of prefix codes is that messages can be formed by simply stringing together the code bits, without keeping track of which bit corresponds to which character. For example, the bitstring
0111110010110101001111100100
encodes the message "abracadabra
".
The first 0 must encode 'a'
, then the next three 1's must
encode 'b'
, then 110
must encode
r
, and so on as follows:
|0|111|110|0|1011|0|1010|0|111|110|0|100 a b r a c a d a b r a _
The codes can be run together because no encoding is a prefix of another one. This property defines a prefix code, and it allows us to represent the encoding with a tree, as shown above. Thus, to decode a given bit string: Start at the root of the tree, take one bit at a time, and go left when the bit is 0 and right when it's 1, until you reach an external node. Then, print the character in that external node, and start over at the root. Implementing this algorithm is your primary task in this assignment.
Encoding the tree. To decode a bit string, you need an encoding tree, and to represent the tree, you need a linear encoding of the tree itself. One method is to do a preorder walk of the tree, printing, for each external node, the number of left links travelled to get to that node (since the last node) followed by the node's label. The linear representation of the tree above using this method is:
To read an encoding tree, you can call1 a 2 _ 1 d 0 c 1 r 0 b
buildcode()
,
declared in
/u/cs126/examples/buildcode.h
:
typedef struct node *link; struct node { char character; link left, link right; }; /*************************************************** * reads a linear encoding of a prefix code tree * * from standard input, builds the tree, and * * returns the root of that tree. * ***************************************************/ link buildcode(void);
buildcode()
uses node
structures for both
internal and external nodes; internal nodes have null character
fields and external nodes have null left
and right
fields. The source code for buildcode is in
/u/cs126/examples/buildcode.c
.
Decoding.
Write decode.c
, a program that reads an encoding tree and a
encoded message from the standard input and writes the decoded message on the
standard output. The encoded message will follow the encoded tree.
For simplicity, the encoded message will consist of a sequence
of 0
and 1
characters.
In an actual application, these bits would be packed eight to the byte,
thus using 1/8th the space.
Here's the input file
/u/cs126/data/files/prefix/abra.huff
:
Your program should produce the following output:1 a 2 _ 1 d 0 c 1 r 0 b 0111110010110101001111100100
Here's an outline forabracadabra_ 28 bits read.
decipher.c
.
Your main task is to write decode()
.
#include <stdio.h> #include "buildcode.h" int decode(link tree) { /************************************************* * Decode message and print to standard input. * * Return number of bits read. * *************************************************/ } int main(void) { int b; link tree; tree = maketree(); b = decode(tree); printf("%d bits read.", b); return 0; }
Test your program on the inputs in the directory
/u/cs126/data/files/prefix/
.
Also run your program on the file
/u/cs126/files/prefix/message2.huff
.
It contains a 289-character message. The Huffman encoding uses 1140 bits, which
would normally be stored in 143 characters (8 bits per character), or 49% of the
space required by normal ASCII. This kind of savings is typical. Note that
for large messages the
amount of space needed to store the tree encoding is negligible compared to
storing the message itself, so we have ignored this quantity in the calculation.
Extra credit. In addition to printing out the decoded message, print out the code table itself. That is, for each external node in the tree, print the node label followed by its bit encoding. For example:
% a.out < /u/cs126/files/prefix/abra.huff _ 1 100 a 5 0 b 2 111 c 1 1011 d 1 1010 r 2 110 abracadabra_
Challenge for the bored.
Write a Huffman encoder. It should read text from standard input,
computes its Huffman tree, and print the Huffman code and tree
to standard output.
Your program should produce data files in the same format as
abra.dat
.