DATA COMPRESSION
Guy Jacobson
AT&T Research

DATA COMPRESSION
A hot field today, with much active research and development on many levels.

Where is compression?
  Software:  general-purpose programs (compress, gzip, pkzip)
             image/signal data file formats (jpeg, RealAudio)
             operating system (Stacker, DoubleSpace)
  Hardware:  modems, disk controllers, tape backup units
             soon to be TV/radio

What are the advantages of compression?
1) Save space = disks / tapes / memory feel bigger
   Archiving
   Fitting into main memory / into a fixed-size ROM
2) Decrease transmission time over a slow network:
   modems / faxes / ordinary telephony / digital TV / radio

What are the disadvantages?
1) Usually pay a price in access time ("time-space tradeoff")
   Sometimes you actually save time *and* space:
   slow disks / network / better cache performance

"Signal" compression vs "text" compression
1) "Signal": digitized information representing signal measurements
   a) Often sampled over regular intervals of time and/or space
      Audio (time * nchannels)
      Still images (space^2)
      Video (time * space^2)
   b) Often represents something that humans can perceive directly (as above)
   c) Sometimes not: seismic / weather data measurements, CAT / MRI data
2) "Text": (using the term loosely here) symbolic data
   A sequence of characters chosen from a particular alphabet
   ASCII / ones and zeroes
   General-purpose compression

LOSSY vs LOSSLESS compression:
Can you reconstruct the original uncompressed data *exactly*?
Sometimes this is important (almost always for text data);
sometimes not (for signal data).

The two main ideas in data compression:
1) Save space by removing redundancy
2) Save space by removing irrelevant information
Lossless compression uses only idea 1; lossy uses both 1 and 2.

When is information "irrelevant"?
1) Imperceptible data in audio/video
2) Signal data is noisy anyway; too much precision is simply storing a lot of noise.
3) Quality-space tradeoff

Compressed file formats (just a sampling)
  compress (*.Z)   Unix standard compress program (LZW)
  pack (*.z)       Ancient Unix program (adaptive Huffman coding)
  gzip (*.gz)      Gnu's zip compressor (LZ77)
  pkzip (*.zip)    Phil Katz's popular PC compression program
Another tradeoff: good compression vs fast encode/decode times
Other PC/Mac/Amiga formats: lharc, pak, arj, uc2, sqz, sit, pit, ha, etc.
  v.42bis          Modem standard with built-in compression (LZW)

Image formats
  JPEG (*.jpg)   Joint Photographic Experts Group
                 color and grey-scale images (DCT-based)
                 Lossy--built-in "knob" to control the quality-space tradeoff
                 A lossless variant exists
                 Complicated, but public domain software exists
  GIF            Graphics Interchange Format (color quantization and LZW)
  MPEG           Moving Picture Experts Group, for sequences of images (movies)
                 Very complicated, but public domain software exists
  JBIG           bilevel images
  fax group 3/4  bilevel images

More Handy BUZZWORDS
  CODEC                 CODer/DECoder [hardware]
  Huffman coding        A variable-length binary code
  Arithmetic coding     A fancier code where strings of symbols are mapped
                        into a single rational number
  Lempel-Ziv            Family of substitutional text compression algorithms.
                        Many general-purpose compressors are LZ-based.
                        Two subfamilies: LZ77 and LZ78
  RLE                   Run-length encoding
  Vector quantization   Replace a group of samples by a representative from
                        a vector codebook. Color quantization [a la GIF] is
                        one example; pops up frequently in other signal
                        compression.
  DCT                   Discrete Cosine Transform [similar to the Fourier
                        transform]: transform into the frequency domain.
                        JPEG/MPEG use this
  Wavelets              Another fancy signal compression technique
  Fractal compression   Ditto
  Perceptual coding     JPEG/MPEG, DAT rely on the fact that the human
                        perceptual system (ok, eyes and ears), wonderful
                        though they are, simply can't take in all the
                        information in a picture or sound sample. Why store
                        what you can't perceive?

The human eye [ear] is very good at seeing some things, but pretty bad at seeing other things
  GOOD: finding edges, luminance, low-frequency color information
  BAD:  color information (especially high frequency)

JPEG outline:
  Transform from RGB space into luminance/chrominance space
  Break up the image into small blocks (say 8x8 pixels)
  Perform a 2-dim DCT on each block
  Throw away many of the least significant bits
    especially chrominance
    especially high spatial frequency
  Encode the leftover stuff cleverly

RGB coding is *already* an approximation catering to the human eye.
Dogs/Martians watching color TV--what do they see?

Sometimes you need lossless encoding even for signal data
1) Maybe we don't know what's important and what isn't.
   Seismic data
2) Maybe the data is going to have to undergo many compression/decompression steps.
3) Maybe we don't want to get hit with a malpractice suit.
   Medical data

Data compression can't *always* work
It's impossible to design a program that compresses every input file
[losslessly]. Why? A counting argument: there are fewer short files than
long ones, so no lossless program can map every file to a strictly shorter one.
The reason we can design effective compression programs is that real-world
files are usually special--they don't contain as much "information" as they
could. Their "entropy" is low.

A crash course in Information theory
Entropy - a measure of the information content of a data source
Let's say we have a "source" that emits symbols from an n-symbol alphabet
with different probabilities Prob(S1), Prob(S2), ... Prob(Sn). Naturally
     Sum   Prob(Si) = 1.0
   1<=i<=n
and every Prob(Si) is between 0 and 1. Then entropy is defined to be:
     Sum   -Prob(Si) * log Prob(Si)
   1<=i<=n
(Log is base two here. Entropies are measured in bits.)
Entropy measures the true information content (as a "bits / symbol" rate).
Entropies *add* for a sequence.

Example: suppose there are 256 equally likely symbols. Then the entropy is:
   256 * (-1/256 * -8) = 8
We really *need* 8 bits per symbol to encode such a data source.

More about entropy
Now suppose that the symbols are ASCII, and follow a distribution more like
what you see in (typical English) text files:
   Prob (' ') = .1741
   Prob ('e') = .0976
   Prob ('t') = .0701
   Prob ('a') = .0615
   ...
   Prob ('q') = .0013
   Prob ('z') = .0011
   ...
   Prob (\0377) = 0
Now, performing the summation, we see that the entropy is only about 4.47
bits per symbol. So information from this source is less than 8 bits per
symbol. The more skewed from uniform the probability distribution, the lower
the entropy--and hence the more opportunity for compression.
Now, how do we take advantage of that fact?

Huffman coding
(Named for Mr. Huffman, naturally)
A variable-length binary code: assign each symbol in the alphabet a different
sequence of ones and zeroes.
Basic idea: assign common symbols short sequences, and uncommon ones longer
sequences.
Prefix property: no code is a proper prefix of any other code. We need this
to decode uniquely, because we simply concatenate the codes together to form
a long binary string.
Algorithm: bottom-up tree building to assign codes. Simple, efficient.
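
Here's a minimal sketch of the tree building in Python (the names
huffman_code and entropy are just illustrative, and the tie-breaking--hence
the exact bit patterns--is arbitrary): repeatedly merge the two least
probable subtrees, prefixing '0' on one side and '1' on the other.

    # A sketch, not a production coder: build a Huffman code bottom-up with a
    # heap of (probability, tiebreak, partial-code) entries, then compare the
    # code's expected length against the entropy of the source.
    import heapq
    from math import log2

    def huffman_code(probs):
        """probs: {symbol: probability}. Returns {symbol: bit string}."""
        heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)            # two least probable subtrees...
            p2, i, c2 = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in c1.items()}
            merged.update({s: '1' + c for s, c in c2.items()})
            heapq.heappush(heap, (p1 + p2, i, merged))  # ...merge them and repeat
        return heap[0][2]

    def entropy(probs):
        """Shannon entropy in bits/symbol: Sum of -p * log2(p)."""
        return sum(-p * log2(p) for p in probs.values() if p > 0)

    probs = {'a': 0.5, 'b': 0.1, 'c': 0.1, 'd': 0.3}
    code = huffman_code(probs)
    cost = sum(probs[s] * len(code[s]) for s in probs)
    print(code)                  # code lengths 1, 2, 3, 3 (exact bits depend on tie-breaking)
    print(cost, entropy(probs))  # ~1.7 bits/symbol vs ~1.685 bits of entropy

Running it on the a/b/c/d distribution of the example below reproduces the
code lengths and the 1.7 vs 1.685 comparison worked out by hand there.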
Huffman coding example
Example: Prob(a) = .5, Prob(b) = .1, Prob(c) = .1, Prob(d) = .3
Huffman tree:
        *
       / \
      /   \
     a     *
          / \
         /   \
        d     *
             / \
            /   \
           b     c
codes: a = '0'; d = '10'; b = '110'; c = '111'
cost (bits/symbol) = .5 * 1 + .1 * 3 + .1 * 3 + .3 * 2 = 1.7   (better than 2)
but ...
entropy = .5 * 1 + .1 * 3.322 + .1 * 3.322 + .3 * 1.737 = 1.685   (better than 1.7)
...Not perfect

Huffman coding isn't perfect
Huffman coding "wastes" fractional bits whenever some probabilities are not
of the form 1/2^n. It's especially bad when there are very probable items:
   Prob(a) = .99; Prob(b) = .01
The entropy is .08 bits/symbol, but Huffman codes still need 1 bit/symbol.
Q: How can we do better?
A: Arithmetic coding.

Arithmetic coding
Assign each symbol a continuous half-open subinterval of [0, 1) whose length
is proportional to the symbol's probability:

   0                  .5  .6  .7          1
   [==================)[==)[==)[==========)
            a           b   c       d

Define Q(Si) = Sum Prob(Sj), the total probability of the symbols before Si.
              j < i
To encode a message S1 S2 ... Sn, start with the interval [0, 1) and narrow
it once per symbol: if the current interval is [l, l+w), symbol Si shrinks it
to [l + w*Q(Si), l + w*(Q(Si) + Prob(Si))).  (A toy sketch of this interval
narrowing appears after the discussion of context below.)
The compressed output is a fraction a/b lying inside the final interval;
that pins down the message as long as the spacing 1/b is no larger than the
size s of the final interval, i.e. s >= 1/b.
Setting b = 2^k and taking logs, this means k must be [at least] -log s.
Our final interval is of size
   s = Prob(S1) * Prob(S2) * Prob(S3) * .... * Prob(Sn)
taking logs:
   -log s = -log(Prob(S1)) - log(Prob(S2)) - log(Prob(S3)) - .... - log(Prob(Sn))
since the probability of Si is Prob(Si), this works out to, on the average,
   -n * Sum Prob(Si) * log(Prob(Si))  =  n * entropy
         i
so k = ceiling(n * entropy)   [cool! less than one bit wasted, total]
Details I've glossed over:
* How you know when to stop (how long the message is)
* Making this all efficient in practice
  (Lots of tricks and approximations involved)

Static vs Dynamic [adaptive] coding
Now we have some ideas on how to do data compression when we know the
probability distribution in advance. But how do you know what the
probabilities are going to be?
1) Different files have different statistics, so we may want to compute
   statistics in a first pass, and include them along with the data.
2) How do we efficiently represent the probability distribution?
Another alternative:
Adaptive coding: start off with a simple statistical model of the
distribution, but change the probabilities as we go and learn more about the
input. For example: keep counts of how many times you've seen each symbol
(better start all counts at 1) and use the current probability distribution
to do your coding (a tiny sketch of such a model appears below, after the
discussion of context).
Advantages: can adapt to different statistics; one-pass algorithm
Disadvantage: extra complexity involved in updating the probabilities
Pack used adaptive Huffman coding. *All* general-purpose compressors nowadays
use some kind of adaptive coding.

Context
Previously, we considered each symbol to be independent of what came before.
The source was *memoryless*. This is a dumb-ass assumption. Real data doesn't
behave like this. When you see a "q" in English, the probability is unusually
high that the next symbol is a "u," for example.
We were using "order-0" statistics. If we keep separate sets of probabilities
based on what the previous symbol was, we would do better (order-1
statistics), and if we based our probability distribution on the n previous
symbols (order-n statistics) we might do better still.
Problems:
1) Higher-order statistics take up a lot of memory
2) In adaptive coding, we don't have very good higher-order statistics early on
   Another buzzword: the zero-frequency problem
3) Fancy techniques combine statistics of different orders ["blending"],
   relying on lower-order statistics until higher-order ones are available.
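
Here's the promised toy sketch of arithmetic coding's interval narrowing,
using exact Fractions rather than the integer tricks a real coder needs. The
function names and the reuse of the a/b/c/d probabilities from the Huffman
example are just illustrative, and the decoder is simply told the message
length instead of using a stop symbol.

    # A sketch only: exact-arithmetic interval narrowing, impractical but it
    # shows the idea.  Each symbol owns a half-open subinterval of [0, 1).
    from fractions import Fraction

    def intervals(probs):
        """Map each symbol to [Q(s), Q(s) + Prob(s))."""
        result, low = {}, Fraction(0)
        for sym, p in probs.items():
            result[sym] = (low, low + p)
            low += p
        return result

    def encode(message, probs):
        """Narrow [low, low+width) once per symbol; return the final interval."""
        ivals = intervals(probs)
        low, width = Fraction(0), Fraction(1)
        for sym in message:
            a, b = ivals[sym]
            low, width = low + width * a, width * (b - a)
        return low, width        # any number in [low, low+width) identifies the message

    def decode(value, length, probs):
        """Recover `length` symbols from a number inside the final interval."""
        ivals = intervals(probs)
        out = []
        for _ in range(length):
            for sym, (a, b) in ivals.items():
                if a <= value < b:
                    out.append(sym)
                    value = (value - a) / (b - a)   # rescale back to [0, 1)
                    break
        return ''.join(out)

    probs = {'a': Fraction(1, 2), 'b': Fraction(1, 10),
             'c': Fraction(1, 10), 'd': Fraction(3, 10)}
    low, width = encode('adab', probs)
    print(decode(low, 4, probs))    # -> 'adab'
    print(float(width))             # interval size s; about -log2(s) ~= 7.1 bits pin it down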
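
And here's the tiny sketch of the adaptive idea from the static-vs-dynamic
discussion above: an order-0 model that keeps per-symbol counts, all starting
at 1 so nothing ever has probability zero. The class name AdaptiveModel and
the byte alphabet are just illustrative; a real compressor would feed these
probabilities to an arithmetic coder like the one sketched above.

    # A sketch, assuming a byte alphabet: an adaptive order-0 model that
    # starts every count at 1 and learns as symbols arrive.
    from math import log2

    class AdaptiveModel:
        def __init__(self, alphabet_size=256):
            self.counts = [1] * alphabet_size   # "better start all counts at 1"
            self.total = alphabet_size

        def prob(self, symbol):
            """Current estimate of Prob(symbol)."""
            return self.counts[symbol] / self.total

        def update(self, symbol):
            """Learn from the symbol just coded."""
            self.counts[symbol] += 1
            self.total += 1

    model, bits = AdaptiveModel(), 0.0
    for byte in b'abracadabra':
        bits += -log2(model.prob(byte))   # ideal code length for this symbol
        model.update(byte)
    print(bits)   # total bits an ideal coder would spend with this model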
Modeling and Coding
We can separate the problem of data compression into two parts:
1) Guessing a probability distribution for the next symbol
2) Encoding the next symbol, based on this distribution
Arithmetic coding has pretty much solved the second part of the problem, but
the first part ("modeling") is a subject of active research. (It's also an
important problem in lots of other areas: speech / handwriting recognition,
for example.)
The highest-performance (in terms of compression rate) general-purpose
compressors around today use fancy higher-order models plus arithmetic
coding. But these aren't used that much in practice, because they are slow
(arithmetic coding, fancy model updating).

Substitutional compressors
Most general-purpose data compressors in use today use some kind of
substitutional data compression. In 1977, Lempel and Ziv observed that text
tends to have many repeated substrings, and they proposed the following idea
for data compression. Here's the rough scheme.

Input:
   THEY ARE HERE AND THERE
   01234567890123456789012
             1111111111222

Output (need some kind of further coding, of course):
   add 'THEY ARE '             THEY ARE *
   copy from 1 for length 2    THEY ARE HE*
   copy from 6 for length 3    THEY ARE HERE *
   add 'AND T'                 THEY ARE HERE AND T*
   copy from 9 for length 4    THEY ARE HERE AND THERE*

The compressed file consists of both new symbols and pointers into previous
parts of the string. (In the original scheme, pointers and new symbols
alternated.)
Lempel and Ziv proved that their scheme was *asymptotically optimal* for
certain kinds of sources with finite state memory--if you ran for long
enough, the coding rate would approach the entropy.
Many general-purpose compressors use LZ77 variants (including gzip, which
combines it with fancy coding for the pointers and chars to get good
performance).

Dictionary compressors
A subclass of substitutional compressors. There are lots and lots of
substrings, and not all of them may be likely to repeat. A dictionary
compressor parses its input as it reads it, and can only emit pointers to
substrings that are in the dictionary.

LZ78 parse (a toy parser along these lines is sketched at the end of these notes):
   T,H,E,Y, ,A,R,E ,HE,RE, A,N,D, T,HER,E$
Commas show the parsing into dictionary entries.

   OUTPUT      DICTIONARY
   0 'T'       1 'T'
   0 'H'       2 'H'
   0 'E'       3 'E'
   0 'Y'       4 'Y'
   0 ' '       5 ' '
   0 'A'       6 'A'
   0 'R'       7 'R'
   3 ' '       8 'E '
   2 'E'       9 'HE'
   7 'E'       10 'RE'
   5 'A'       11 ' A'
   0 'N'       12 'N'
   0 'D'       13 'D'
   5 'T'       14 ' T'
   9 'R'       15 'HER'
   3 '$'       16 'E$'

Many variants on this idea too. Unix compress uses one called LZW. Not nearly
state of the art--it was designed to be something quick and easy to implement
in hardware. But it caught on (for a while) as a standard.

Loose ends
Programs you should know about (in the Unix world):
   compress, uncompress, zcat
   gzip, gunzip, gzcat
   xv (to look at jpeg or gif files, and convert to/from jpeg from other formats)
Patent issues: a big mess. Lots of data compression algorithms are covered by
patents nowadays; it remains to be seen what the consequences of this are, or
to what degree the patents are enforceable.
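
Finally, the toy LZ78-style parser promised in the dictionary-compressor
section. The names lz78_encode/lz78_decode are just illustrative, and real
variants such as LZW differ in the details: grow the current phrase while it
is still in the dictionary, then emit (dictionary index, next symbol) and add
the new phrase as a dictionary entry.

    # A sketch of an LZ78-style dictionary parse: each output pair is
    # (index of the longest phrase already in the dictionary, next raw symbol),
    # and that phrase + symbol becomes a new dictionary entry.
    def lz78_encode(text):
        dictionary = {'': 0}        # entry 0 is the empty phrase
        output, phrase = [], ''
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch        # keep extending the current match
            else:
                output.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)
                phrase = ''
        if phrase:                  # flush a leftover match at end of input
            output.append((dictionary[phrase[:-1]], phrase[-1]))
        return output

    def lz78_decode(pairs):
        phrases = ['']              # rebuild the same dictionary on the fly
        out = []
        for index, ch in pairs:
            phrase = phrases[index] + ch
            phrases.append(phrase)
            out.append(phrase)
        return ''.join(out)

    pairs = lz78_encode('THEY ARE HERE AND THERE$')
    print(pairs[:4])                # [(0, 'T'), (0, 'H'), (0, 'E'), (0, 'Y')]
    print(lz78_decode(pairs))       # 'THEY ARE HERE AND THERE$'

On 'THEY ARE HERE AND THERE$' it emits exactly the (index, symbol) pairs
shown in the table above.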