DATA COMPRESSION
Guy Jacobson
AT&T Research

DATA COMPRESSION
A hot field today, with much active research and development on many levels.

Where is compression?
  Software:  general-purpose programs (compress, gzip, pkzip)
             image/signal data file formats (jpeg, RealAudio)
             operating system (Stacker, DoubleSpace)
  Hardware:  modems, disk controllers, tape backup units
             soon to be TV/radio

What are the advantages of compression?
1) Save space = disks / tapes / memory feel bigger
   Archiving
   Fitting into main memory / into a fixed-size ROM
2) Decrease transmission time over a slow network:
   modems / faxes / ordinary telephony / digital TV / radio

What are the disadvantages?
1) Usually pay a price in access time ("time-space tradeoff")
   Sometimes you actually save time *and* space:
   slow disks / network / better cache performance

"Signal" compression vs "text" compression
1) "Signal": digitized information representing signal measurements
   a) Often sampled over regular intervals of time and/or space
      Audio (time * nchannels)
      Still images (space^2)
      Video (time * space^2)
   b) Often represents something that humans can perceive directly (as above)
   c) Sometimes not: seismic / weather data measurements, CAT / MRI data
2) "Text": (using the term loosely here) symbolic data
   A sequence of characters chosen from a particular alphabet
   ASCII / ones and zeroes
   General-purpose compression

LOSSY vs LOSSLESS compression:
Can you reconstruct the original uncompressed data *exactly*?
Sometimes this is important (almost always for text data);
sometimes not (for signal data).

The two main ideas in data compression:
1) Save space by removing redundancy
2) Save space by removing irrelevant information
Lossless compression uses only idea 1; lossy uses both 1 and 2.

When is information "irrelevant"?
1) Imperceptible data in audio/video
2) Signal data is noisy anyway; too much precision is simply storing a lot of noise.
3) Quality-space tradeoff

Compressed file formats (just a sampling)
  compress (*.Z)   Unix standard compress program (LZW)
  pack (*.z)       Ancient Unix program (adaptive Huffman coding)
  gzip (*.gz)      Gnu's zip compressor (LZ77)
  pkzip (*.zip)    Phil Katz's popular PC compression program
Another tradeoff: good compression vs fast encode/decode times
Other PC/Mac/Amiga formats: lharc, pak, arj, uc2, sqz, sit, pit, ha, etc.
  v.42bis          Modem standard with built-in compression (LZW)

Image formats
  JPEG (*.jpg)   Joint Photographic Experts Group
                 color and grey-scale images (DCT-based)
                 Lossy--built-in "knob" to control the quality-space tradeoff
                 A lossless variant exists
                 Complicated, but public domain software exists
  GIF            Graphics Interchange Format (color quantization and LZW)
  MPEG           Moving Picture Experts Group, for sequences of images (movies)
                 Very complicated, but public domain software exists
  JBIG           bilevel images
  fax group 3/4  bilevel images

More Handy BUZZWORDS
  CODEC                 CODer/DECoder [hardware]
  Huffman coding        A variable-length binary code
  Arithmetic coding     A fancier code where strings of symbols are mapped
                        into a single rational number
  Lempel-Ziv            Family of substitutional text compression algorithms.
                        Many general-purpose compressors are LZ-based.
                        Two subfamilies: LZ77 and LZ78
  RLE                   Run-length encoding
  Vector quantization   Replace a group of samples by a representative from
                        a vector codebook. Color quantization [a la GIF] is
                        one example; pops up frequently in other signal
                        compression.
  DCT                   Discrete Cosine Transform [similar to the Fourier
                        transform]: transform into the frequency domain.
                        JPEG/MPEG use this
  Wavelets              Another fancy signal compression technique
  Fractal compression   Ditto
  Perceptual coding     JPEG/MPEG, DAT rely on the fact that the human
                        perceptual system (ok, eyes and ears), wonderful
                        though they are, simply can't take in all the
                        information in a picture or sound sample. Why store
                        what you can't perceive?

The human eye [ear] is very good at seeing some things, but pretty bad at seeing other things
  GOOD: finding edges, luminance, low-frequency color information
  BAD:  color information (especially high frequency)

JPEG outline:
  Transform from RGB space into luminance/chrominance space
  Break up the image into small blocks (say 8x8 pixels)
  Perform a 2-dim DCT on each block
  Throw away many of the least significant bits
    especially chrominance
    especially high spatial frequency
  Encode the leftover stuff cleverly

RGB coding is *already* an approximation catering to the human eye.
Dogs/Martians watching color TV--what do they see?

Sometimes you need lossless encoding even for signal data
1) Maybe we don't know what's important and what isn't.
   Seismic data
2) Maybe the data is going to have to undergo many compression/decompression steps.
3) Maybe we don't want to get hit with a malpractice suit.
   Medical data

Data compression can't *always* work
It's impossible to design a program that compresses every input file
[losslessly]. Why? A counting argument: there are fewer short files than
long ones, so no lossless program can map every file to a strictly shorter one.
The reason we can design effective compression programs is that real-world
files are usually special--they don't contain as much "information" as they
could. Their "entropy" is low.

A crash course in Information theory
Entropy - a measure of the information content of a data source
Let's say we have a "source" that emits symbols from an n-symbol alphabet
with different probabilities Prob(S1), Prob(S2), ... Prob(Sn). Naturally
     Sum   Prob(Si) = 1.0
   1<=i<=n
and every Prob(Si) is between 0 and 1. Then entropy is defined to be:
     Sum   -Prob(Si) * log Prob(Si)
   1<=i<=n
(Log is base two here. Entropies are measured in bits.)
Entropy measures the true information content (as a "bits / symbol" rate).
Entropies *add* for a sequence.

Example: suppose there are 256 equally likely symbols. Then the entropy is:
   256 * (-1/256 * -8) = 8
We really *need* 8 bits per symbol to encode such a data source.

More about entropy
Now suppose that the symbols are ASCII, and follow a distribution more like
what you see in (typical English) text files:
   Prob (' ') = .1741
   Prob ('e') = .0976
   Prob ('t') = .0701
   Prob ('a') = .0615
   ...
   Prob ('q') = .0013
   Prob ('z') = .0011
   ...
   Prob (\0377) = 0
Now, performing the summation, we see that the entropy is only about 4.47
bits per symbol. So information from this source is less than 8 bits per
symbol. The more skewed from uniform the probability distribution, the lower
the entropy--and hence the more opportunity for compression.
Now, how do we take advantage of that fact?

Huffman coding
(Named for Mr. Huffman, naturally)
A variable-length binary code: assign each symbol in the alphabet a different
sequence of ones and zeroes.
Basic idea: assign common symbols short sequences, and uncommon ones longer
sequences.
Prefix property: no code is a proper prefix of any other code. We need this
to decode uniquely, because we simply concatenate the codes together to form
a long binary string.
Algorithm: bottom-up tree building to assign codes. Simple, efficient.
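
Here's a minimal sketch of the tree building in Python (the names
huffman_code and entropy are just illustrative, and the tie-breaking--hence
the exact bit patterns--is arbitrary): repeatedly merge the two least
probable subtrees, prefixing '0' on one side and '1' on the other.

    # A sketch, not a production coder: build a Huffman code bottom-up with a
    # heap of (probability, tiebreak, partial-code) entries, then compare the
    # code's expected length against the entropy of the source.
    import heapq
    from math import log2

    def huffman_code(probs):
        """probs: {symbol: probability}. Returns {symbol: bit string}."""
        heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)            # two least probable subtrees...
            p2, i, c2 = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in c1.items()}
            merged.update({s: '1' + c for s, c in c2.items()})
            heapq.heappush(heap, (p1 + p2, i, merged))  # ...merge them and repeat
        return heap[0][2]

    def entropy(probs):
        """Shannon entropy in bits/symbol: Sum of -p * log2(p)."""
        return sum(-p * log2(p) for p in probs.values() if p > 0)

    probs = {'a': 0.5, 'b': 0.1, 'c': 0.1, 'd': 0.3}
    code = huffman_code(probs)
    cost = sum(probs[s] * len(code[s]) for s in probs)
    print(code)                  # code lengths 1, 2, 3, 3 (exact bits depend on tie-breaking)
    print(cost, entropy(probs))  # ~1.7 bits/symbol vs ~1.685 bits of entropy

Running it on the a/b/c/d distribution of the example below reproduces the
code lengths and the 1.7 vs 1.685 comparison worked out by hand there.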
Huffman coding example
Example: Prob(a) = .5, Prob(b) = .1, Prob(c) = .1, Prob(d) = .3
Huffman tree:
        *
       / \
      /   \
     a     *
          / \
         /   \
        d     *
             / \
            /   \
           b     c
codes: a = '0'; d = '10'; b = '110'; c = '111'
cost (bits/symbol) = .5 * 1 + .1 * 3 + .1 * 3 + .3 * 2 = 1.7   (better than 2)
but ...
entropy = .5 * 1 + .1 * 3.322 + .1 * 3.322 + .3 * 1.737 = 1.685   (better than 1.7)
...Not perfect

Huffman coding isn't perfect
Huffman coding "wastes" fractional bits whenever some probabilities are not
of the form 1/2^n. It's especially bad when there are very probable items:
   Prob(a) = .99; Prob(b) = .01
The entropy is .08 bits/symbol, but Huffman codes still need 1 bit/symbol.
Q: How can we do better?
A: Arithmetic coding.

Arithmetic coding
Assign each symbol a continuous half-open subinterval of [0, 1) whose length
is proportional to the symbol's probability:

   0                  .5  .6  .7          1
   [==================)[==)[==)[==========)
            a           b   c       d

Define Q(Si) = Sum Prob(Sj), the total probability of the symbols before Si.
              j < i
To encode a message S1 S2 ... Sn, start with the interval [0, 1) and narrow
it once per symbol: if the current interval is [l, l+w), symbol Si shrinks it
to [l + w*Q(Si), l + w*(Q(Si) + Prob(Si))).  (A toy sketch of this interval
narrowing appears after the discussion of context below.)
The compressed output is a fraction a/b lying inside the final interval;
that pins down the message as long as the spacing 1/b is no larger than the
size s of the final interval, i.e. s >= 1/b.
Setting b = 2^k and taking logs, this means k must be [at least] -log s.
Our final interval is of size
   s = Prob(S1) * Prob(S2) * Prob(S3) * .... * Prob(Sn)
taking logs:
   -log s = -log(Prob(S1)) - log(Prob(S2)) - log(Prob(S3)) - .... - log(Prob(Sn))
since the probability of Si is Prob(Si), this works out to, on the average,
   -n * Sum Prob(Si) * log(Prob(Si))  =  n * entropy
         i
so k = ceiling(n * entropy)   [cool! less than one bit wasted, total]
Details I've glossed over:
* How you know when to stop (how long the message is)
* Making this all efficient in practice
  (Lots of tricks and approximations involved)

Static vs Dynamic [adaptive] coding
Now we have some ideas on how to do data compression when we know the
probability distribution in advance. But how do you know what the
probabilities are going to be?
1) Different files have different statistics, so we may want to compute
   statistics in a first pass, and include them along with the data.
2) How do we efficiently represent the probability distribution?
Another alternative:
Adaptive coding: start off with a simple statistical model of the
distribution, but change the probabilities as we go and learn more about the
input. For example: keep counts of how many times you've seen each symbol
(better start all counts at 1) and use the current probability distribution
to do your coding (a tiny sketch of such a model appears below, after the
discussion of context).
Advantages: can adapt to different statistics; one-pass algorithm
Disadvantage: extra complexity involved in updating the probabilities
Pack used adaptive Huffman coding. *All* general-purpose compressors nowadays
use some kind of adaptive coding.

Context
Previously, we considered each symbol to be independent of what came before.
The source was *memoryless*. This is a dumb-ass assumption. Real data doesn't
behave like this. When you see a "q" in English, the probability is unusually
high that the next symbol is a "u," for example.
We were using "order-0" statistics. If we keep separate sets of probabilities
based on what the previous symbol was, we would do better (order-1
statistics), and if we based our probability distribution on the n previous
symbols (order-n statistics) we might do better still.
Problems:
1) Higher-order statistics take up a lot of memory
2) In adaptive coding, we don't have very good higher-order statistics early on
   Another buzzword: the zero-frequency problem
3) Fancy techniques combine statistics of different orders ["blending"],
   relying on lower-order statistics until higher-order ones are available.
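
Here's the promised toy sketch of arithmetic coding's interval narrowing,
using exact Fractions rather than the integer tricks a real coder needs. The
function names and the reuse of the a/b/c/d probabilities from the Huffman
example are just illustrative, and the decoder is simply told the message
length instead of using a stop symbol.

    # A sketch only: exact-arithmetic interval narrowing, impractical but it
    # shows the idea.  Each symbol owns a half-open subinterval of [0, 1).
    from fractions import Fraction

    def intervals(probs):
        """Map each symbol to [Q(s), Q(s) + Prob(s))."""
        result, low = {}, Fraction(0)
        for sym, p in probs.items():
            result[sym] = (low, low + p)
            low += p
        return result

    def encode(message, probs):
        """Narrow [low, low+width) once per symbol; return the final interval."""
        ivals = intervals(probs)
        low, width = Fraction(0), Fraction(1)
        for sym in message:
            a, b = ivals[sym]
            low, width = low + width * a, width * (b - a)
        return low, width        # any number in [low, low+width) identifies the message

    def decode(value, length, probs):
        """Recover `length` symbols from a number inside the final interval."""
        ivals = intervals(probs)
        out = []
        for _ in range(length):
            for sym, (a, b) in ivals.items():
                if a <= value < b:
                    out.append(sym)
                    value = (value - a) / (b - a)   # rescale back to [0, 1)
                    break
        return ''.join(out)

    probs = {'a': Fraction(1, 2), 'b': Fraction(1, 10),
             'c': Fraction(1, 10), 'd': Fraction(3, 10)}
    low, width = encode('adab', probs)
    print(decode(low, 4, probs))    # -> 'adab'
    print(float(width))             # interval size s; about -log2(s) ~= 7.1 bits pin it down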
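
And here's the tiny sketch of the adaptive idea from the static-vs-dynamic
discussion above: an order-0 model that keeps per-symbol counts, all starting
at 1 so nothing ever has probability zero. The class name AdaptiveModel and
the byte alphabet are just illustrative; a real compressor would feed these
probabilities to an arithmetic coder like the one sketched above.

    # A sketch, assuming a byte alphabet: an adaptive order-0 model that
    # starts every count at 1 and learns as symbols arrive.
    from math import log2

    class AdaptiveModel:
        def __init__(self, alphabet_size=256):
            self.counts = [1] * alphabet_size   # "better start all counts at 1"
            self.total = alphabet_size

        def prob(self, symbol):
            """Current estimate of Prob(symbol)."""
            return self.counts[symbol] / self.total

        def update(self, symbol):
            """Learn from the symbol just coded."""
            self.counts[symbol] += 1
            self.total += 1

    model, bits = AdaptiveModel(), 0.0
    for byte in b'abracadabra':
        bits += -log2(model.prob(byte))   # ideal code length for this symbol
        model.update(byte)
    print(bits)   # total bits an ideal coder would spend with this model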
Modeling and Coding
We can separate the problem of data compression into two parts:
1) Guessing a probability distribution for the next symbol
2) Encoding the next symbol, based on this distribution
Arithmetic coding has pretty much solved the second part of the problem, but
the first part ("modeling") is a subject of active research. (It's also an
important problem in lots of other areas: speech / handwriting recognition,
for example.)
The highest-performance (in terms of compression rate) general-purpose
compressors around today use fancy higher-order models plus arithmetic
coding. But these aren't used that much in practice, because they are slow
(arithmetic coding, fancy model updating).

Substitutional compressors
Most general-purpose data compressors in use today use some kind of
substitutional data compression. In 1977, Lempel and Ziv observed that text
tends to have many repeated substrings, and they proposed the following idea
for data compression. Here's the rough scheme.

Input:
   THEY ARE HERE AND THERE
   01234567890123456789012
             1111111111222

Output (need some kind of further coding, of course):
   add 'THEY ARE '             THEY ARE *
   copy from 1 for length 2    THEY ARE HE*
   copy from 6 for length 3    THEY ARE HERE *
   add 'AND T'                 THEY ARE HERE AND T*
   copy from 9 for length 4    THEY ARE HERE AND THERE*

The compressed file consists of both new symbols and pointers into previous
parts of the string. (In the original scheme, pointers and new symbols
alternated.)
Lempel and Ziv proved that their scheme was *asymptotically optimal* for
certain kinds of sources with finite state memory--if you ran for long
enough, the coding rate would approach the entropy.
Many general-purpose compressors use LZ77 variants (including gzip, which
combines it with fancy coding for the pointers and chars to get good
performance).

Dictionary compressors
A subclass of substitutional compressors. There are lots and lots of
substrings, and not all of them may be likely to repeat. A dictionary
compressor parses its input as it reads it, and can only emit pointers to
substrings that are in the dictionary.

LZ78 parse (a toy parser along these lines is sketched at the end of these notes):
   T,H,E,Y, ,A,R,E ,HE,RE, A,N,D, T,HER,E$
Commas show the parsing into dictionary entries.

   OUTPUT      DICTIONARY
   0 'T'       1 'T'
   0 'H'       2 'H'
   0 'E'       3 'E'
   0 'Y'       4 'Y'
   0 ' '       5 ' '
   0 'A'       6 'A'
   0 'R'       7 'R'
   3 ' '       8 'E '
   2 'E'       9 'HE'
   7 'E'       10 'RE'
   5 'A'       11 ' A'
   0 'N'       12 'N'
   0 'D'       13 'D'
   5 'T'       14 ' T'
   9 'R'       15 'HER'
   3 '$'       16 'E$'

Many variants on this idea too. Unix compress uses one called LZW. Not nearly
state of the art--it was designed to be something quick and easy to implement
in hardware. But it caught on (for a while) as a standard.

Loose ends
Programs you should know about (in the Unix world):
   compress, uncompress, zcat
   gzip, gunzip, gzcat
   xv (to look at jpeg or gif files, and convert to/from jpeg from other formats)
Patent issues: a big mess. Lots of data compression algorithms are covered by
patents nowadays; it remains to be seen what the consequences of this are, or
to what degree the patents are enforceable.
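
Finally, the toy LZ78-style parser promised in the dictionary-compressor
section. The names lz78_encode/lz78_decode are just illustrative, and real
variants such as LZW differ in the details: grow the current phrase while it
is still in the dictionary, then emit (dictionary index, next symbol) and add
the new phrase as a dictionary entry.

    # A sketch of an LZ78-style dictionary parse: each output pair is
    # (index of the longest phrase already in the dictionary, next raw symbol),
    # and that phrase + symbol becomes a new dictionary entry.
    def lz78_encode(text):
        dictionary = {'': 0}        # entry 0 is the empty phrase
        output, phrase = [], ''
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch        # keep extending the current match
            else:
                output.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)
                phrase = ''
        if phrase:                  # flush a leftover match at end of input
            output.append((dictionary[phrase[:-1]], phrase[-1]))
        return output

    def lz78_decode(pairs):
        phrases = ['']              # rebuild the same dictionary on the fly
        out = []
        for index, ch in pairs:
            phrase = phrases[index] + ch
            phrases.append(phrase)
            out.append(phrase)
        return ''.join(out)

    pairs = lz78_encode('THEY ARE HERE AND THERE$')
    print(pairs[:4])                # [(0, 'T'), (0, 'H'), (0, 'E'), (0, 'Y')]
    print(lz78_decode(pairs))       # 'THEY ARE HERE AND THERE$'

On 'THEY ARE HERE AND THERE$' it emits exactly the (index, symbol) pairs
shown in the table above.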