COS 126 Prefix Codes |
Due: Wed. 4/16, 11:59pm |
Write a program, decipher.c
, to decipher messages encoded
using a prefix code, given the encoding tree. Such codes are widely
used in applications that must compress data into as few bits as possible.
A prefix code is most easily represented by a binary tree in which the external nodes labelled with single characters or short strings that can be combined to for the message. The "code" for each string is a binary encoding of the path down from the root of the tree to the external node that holds string. A 0 bit identifies left branches in the path and a 1 bit identifies right branches. For example, in the following tree, in which shaded circles are internal nodes and transparent ovals are external nodes,
the code for the
is 000, because the external node holding
the
is reached from the root by taking 3 left branches. The other
codes in this tree are
the 000 was 00100 f 001010 c 001011 h 0011 at 010 in 011 _ 1
The underscore, "_", denotes a space character. Note that the encodings have variable numbers of bits. This is a fundamental property of prefix codes, and it ensures that bits can be saved by using short encodings for frequently used message fragments.
A second fundamental property of prefix codes is that messages can be formed by simply stringing together the code bits, without keeping track of which bit corresponds to which message fragment. For example, the bitstring
000100101101010111000100110101001001001010010
encodes the message "the cat in the hat was fat
".
The first three 0's must encode the
, then the 1 bit must encode a
space, then 001011 must encode c
, and so on as follows ("_"
denotes a space):
000 the 1 _ 001011 c 010 at 1 _ 011 in 1 _ 000 the 1 _ 0011 h 010 at 1 _ 00100 was 1 _ 001010 f 010 at
The codes can be run together because no encoding is a prefix of another one. This property defines a prefix code, and it allows us to represent the encoding with a tree, as shown above. Thus, to decode a given bit string: Start at the root of the tree, take one bit at a time, and go left when the bit is 0 and right when it's 1, until you reach an external node. Then, print the string in that external node, and start over at the root. Implementing this algorithm is your primary task in this assignment.
To decode a bit string, you need an encoding tree, and to represent the tree, we need a linear encoding of the tree itself. One method is to do a preorder walk of the tree, printing, for each external node, the number of left links travelled to get to that node (since the last node) followed by the node's label. The linear representation of the tree above using this method is
3 the 2 was 1 f 0 c 0 h 1 at 0 in 0 _
where "_" represents the space character. To read an encoding
tree, you can call buildcode
, declared in
/u/cs126/examples/buildcode.h
:
struct node { char *label; struct node *left, *right; }; extern struct node *buildcode(void); /* reads a linear encoding of a prefix code tree from the standard input, builds the tree, and returns the root of that tree. */
buildcode
uses node
structures for both internal
and external nodes; internal nodes have null label
fields and
external nodes have null left
and right
fields. The
source code for buildcode is in
/u/cs126/examples/buildcode.c
.
Write decipher.c
, a program that reads an encoding tree and a
encoded message from the standard input and writes the decoded message on the
standard output. Here's an outline for decipher.c
:
#include <stdio.h> #include "buildcode.h" void decipher(struct node *tree) { /* decode and print the standard input */ } int main(void) { struct node *codetree = buildcode(); decipher(codetree); return 0; }
Your task is to write decipher
. For simplicity, assume that
the input message appears as 0
and 1
characters
following the encoding tree in the standard input. In an actual application,
these bits would be packed eight to the byte, thus using 1/8th the space. Test
your program on the data in /u/cs126/data/cat
,
which is the message "the cat in the hat was fat
".
You'll need to compile and load buildcode
along with your program,
and load /u/cs126/lib/libmisc.a
, because buildcode
calls
emalloc
and
strsave
. All of this can
be done with the command
% /u/cs126/bin/lcc -n decipher.c buildcode.c
/u/cs126/bin/lcc
will find buildcode.h
and
buildcode.c
in /u/cs126/examples
, unless you have
your own copies. Assignment 8 explains more about
/u/cs126/bin/lcc
.
Run your program on the file
/u/cs126/data/code
,
which is a message encoded using a single-character encoding built according to
frequency of usage of letters in English.
/u/cs126/data/code
contains a 289-character message, and the code uses 1267 bits, which would
normally be stored in 159 characters, or 56% of the space required by normal
ASCII. This kind of savings is typical.
Turn in your code and your documentation with the command
/u/cs126/bin/submit 9 readme decipher.c
Add an option "-d
" to print out the code table. That
is, for each external node in the tree, print the node label followed by its bit
encoding. For example:
% a.out -d </u/cs126/data/cat the 000 was 00100 f 001010 c 001011 h 0011 at 010 in 011 _ 1 the cat in the hat was fat