CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 23, 2001 04:21 PM

Data Compression - Dictionary Methods

- in its simplest form it is just like a dictionary where we assign codes to each entry
- if the word is not matched (found) in the dictionary, we write a flag, the length of the word and the word itself
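
The simplest form described above can be sketched as follows. This is only an illustration: the word list `DICTIONARY`, the token shapes, and the function names are assumptions made for the example, not part of any particular compressor.

```python
# Hypothetical toy dictionary; a real coder would use a much larger word list.
DICTIONARY = ["the", "of", "and", "data", "compression"]
CODE_OF = {word: i for i, word in enumerate(DICTIONARY)}

def encode(words):
    """Replace known words by their code; escape unknown words."""
    tokens = []
    for w in words:
        if w in CODE_OF:
            tokens.append(("code", CODE_OF[w]))
        else:
            # Not in the dictionary: write a flag, the length, and the word.
            tokens.append(("raw", len(w), w))
    return tokens

def decode(tokens):
    """Invert encode(): look codes up, copy raw words through."""
    return [DICTIONARY[t[1]] if t[0] == "code" else t[2] for t in tokens]
```

For instance, `encode("the art of data compression".split())` escapes only "art" and codes the other four words.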
 
- dictionary methods often achieve the best compression ratios

Abraham Lempel and Jacob Ziv developed the first such methods in the 1970s (LZ77 and LZ78)
- the vast majority of dictionary compression methods are based on these

LZ77 (Sliding Window)
 
- maintain a window of the input stream that is divided into 2 parts:
search buffer | look-ahead buffer
(usually big).......(not-so-much)
 
- the encoder takes the start of the look-ahead buffer and scans backwards through the search buffer for the longest match it can find; it returns the offset of the longest match or, if several matches are equally long, of the last one found
 
- tokens consist of [ offset, length, next symbol in look-ahead buffer]
 
- selecting last match is simplest
 
many variations
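
A minimal sketch of the token scheme above, assuming a string input, a 4096-symbol search buffer and a 16-symbol look-ahead buffer (both sizes are illustrative defaults, as are the function names). Ties between equally long matches go to the last one found during the backward scan, which is one reading of the rule above.

```python
def lz77_encode(data, search_size=4096, lookahead_size=16):
    """Return a list of (offset, length, next_symbol) tokens."""
    tokens = []
    pos = 0                                  # boundary between the two buffers
    while pos < len(data):
        best_off, best_len = 0, 0
        start = max(0, pos - search_size)
        # Keep one symbol in reserve for the token's "next symbol" field.
        max_len = min(lookahead_size, len(data) - pos - 1)
        for cand in range(pos - 1, start - 1, -1):   # scan backwards
            length = 0
            while length < max_len and data[cand + length] == data[pos + length]:
                length += 1
            # '>=' keeps the last match found among equally long ones.
            if length > 0 and length >= best_len:
                best_off, best_len = pos - cand, length
        tokens.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return tokens

def lz77_decode(tokens):
    """Rebuild the text by copying from the already decoded output."""
    out = []
    for off, length, sym in tokens:
        for _ in range(length):
            out.append(out[-off])            # copy one symbol; may overlap
        out.append(sym)
    return "".join(out)
```

The exhaustive scan is the slow part of this sketch; the variations mentioned above mostly differ in how they speed up or restrict that search.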

LZ78
 
- keeps the entire dictionary in memory (sometimes, it gets big)
- each token it emits is a dictionary pointer plus the code of the next symbol
- dictionary is actually built as a trie
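
The points above can be sketched as follows. For brevity this example keeps the dictionary in a Python dict keyed by whole phrases rather than in a trie, and it never limits the dictionary size; both are simplifications, and the function names are made up for the illustration.

```python
def lz78_encode(text):
    """Return a list of (dictionary_pointer, next_symbol) tokens."""
    dictionary = {"": 0}            # phrase -> index; 0 is the empty phrase
    tokens = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch            # keep extending the current match
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                      # input ended inside a known phrase
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens

def lz78_decode(tokens):
    """Rebuild the same dictionary while decoding."""
    phrases = [""]
    out = []
    for index, ch in tokens:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Because every new phrase is an existing phrase plus one symbol, a trie node per phrase gives the same lookups without storing the full strings.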
 
QUESTION: What to do if the dictionary gets full?
1. freeze the dictionary
2. dump the dictionary and start again
3. selectively delete entries (oldest, least-recently-used)

LZRW1
 
- idea is to find match w/o searching
- uses hash table
- fast but not efficient; doesn't always find the longest match
- uses all available memory
- encodes block by block
search buffer = 4K; look-ahead = 16-bytes
1 pointer to divide them
 
1. hash the left-most 3 chars of look-ahead
= 12 bits; call it I ; index to array of pointers
2. retrieve ptr P from entry I of the array and replace that entry with the current window_ptr
if P is out of range or points to a string with different 1st 3 chars
then output char and advance window_ptr by 1
else find longest match
- output length (4 bits, as length-1) and offset (12 bits)
- advance window_ptr by length
 
each group of 16 items has 16-bit control word to say which are literals and which are copy items
 
Decoder doesn't need the hash table
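
The steps above can be sketched roughly as follows. This is not the original LZRW1 source: the hash function is an arbitrary illustrative one, items are kept as Python tuples instead of being packed behind 16-bit control words, and the hash table slot is overwritten with the current window position. It does show that the decoder works from the items alone, without a hash table.

```python
def _hash3(data, pos):
    """Map 3 bytes to a 12-bit index (an illustrative hash, not the original)."""
    key = (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2]
    return ((key * 40543) >> 4) & 0xFFF

def lzrw1_compress(data):
    """Return items: ("lit", byte) or ("copy", offset, length)."""
    table = [-1] * 4096                  # 2**12 slots holding positions in data
    items = []
    pos = 0                              # plays the role of window_ptr
    while pos < len(data):
        if pos + 3 > len(data):          # fewer than 3 bytes left to hash
            items.append(("lit", data[pos]))
            pos += 1
            continue
        i = _hash3(data, pos)
        p, table[i] = table[i], pos      # retrieve P, store current position
        offset = pos - p
        if p < 0 or offset > 4095 or data[p:p + 3] != data[pos:pos + 3]:
            items.append(("lit", data[pos]))
            pos += 1
        else:
            length = 3
            limit = min(16, len(data) - pos)   # look-ahead is 16 bytes
            while length < limit and data[p + length] == data[pos + length]:
                length += 1
            items.append(("copy", offset, length))  # fits 12 + 4 bits
            pos += length
    return items

def lzrw1_decompress(items):
    """No hash table needed: copy items just refer back into the output."""
    out = bytearray()
    for item in items:
        if item[0] == "lit":
            out.append(item[1])
        else:
            _, offset, length = item
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```

Only one candidate position is ever examined per step, which is why the method is fast but can miss the longest match.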
 
compresses about 10% worse than UNIX compress but runs about 4 times faster
 
in 68000 assembler: an average of 13 machine instructions to compress and 4 instructions to decompress 1 byte

 

 
 
for sample code, see: ftp://ftp.ross.net/clients/ross/compression/original/old_lzrw1.c

LZW
 
- variant of LZ78
- eliminates the 2nd field of the token
- emits only pointers
- initializes dictionary to base alphabet
 
1. build string as long as matches are found
2. when search fails emit last found pointer and add new phrase to the dictionary
 
when dictionary gets full: dump it (emit clear code and start again)
 
- strings only get longer 1 char at a time
- generally not good for text
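
A minimal sketch of the two steps above, assuming byte input, a base alphabet of all 256 byte values, and an unbounded dictionary (so no clear code is ever emitted). The function names are chosen for the example.

```python
def lzw_encode(data):
    """Return a list of dictionary pointers (integers) for a bytes input."""
    dictionary = {bytes([i]): i for i in range(256)}   # base alphabet
    phrase = b""
    codes = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                  # 1. keep building the string
        else:
            codes.append(dictionary[phrase])    # 2. emit last found pointer
            dictionary[candidate] = len(dictionary)   # and add the new phrase
            phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])
    return codes

def lzw_decode(codes):
    """Rebuild the dictionary one step behind the encoder."""
    if not codes:
        return b""
    phrases = [bytes([i]) for i in range(256)]
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(phrases):
            entry = phrases[code]
        else:
            # Pointer to the phrase the encoder has only just created.
            entry = prev + prev[:1]
        phrases.append(prev + entry[:1])
        out.append(entry)
        prev = entry
    return b"".join(out)
```

Each new dictionary entry is the previous match plus one symbol, which is the "strings only get longer 1 char at a time" behaviour noted above.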
 

UNIX Compression <compress>
 
- uses LZW w/ growing dictionary
- start with 512 entries (256 already filled in) : 9 bits
- when dictionary full, pointers grow to 10 bits
- user specifies the max. pointer size (9-16 bits); 16 is the default
- when D fills, continues with static D but monitors compression ratio
- if ratio falls below predefined threshold delete D and start again
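
The growing pointer size can be illustrated with a small helper. This is a sketch of the idea only, with a made-up function name; the real `compress` program may switch widths at slightly different points.

```python
def pointer_bits(entries, start_bits=9, max_bits=16):
    """Bits needed to address a dictionary holding `entries` entries.

    Starts at 9 bits (512 entries) and grows by one bit each time the
    dictionary outgrows the current size, up to the user-chosen maximum.
    """
    needed = (entries - 1).bit_length()
    return min(max_bits, max(start_bits, needed))
```

With these assumptions `pointer_bits(512)` is 9 and `pointer_bits(513)` is 10; once the maximum width is reached the dictionary stays static.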

GIF - Graphics Interchange Format
 
current standard 89a
 
- not really a data compression method; it is actually a graphics file format
- uses variant of LZW
 
b = #bits per pixel (B/W = 2; 256 colours = 8)
 
- starts with 2^(b+1) entries
- grows to 2^12
- then goes static but monitors compression ratio
- if it decides to discard the dictionary, emits 2^b (the <clear code>)
 
- ptrs are output in blocks of 8-bit bytes
- each block is preceded by a header byte giving the block size (max = 255 bytes); the sequence of blocks is terminated by a zero byte
- the last block contains the <eof> code (2^b + 1)
- ptrs are packed into bytes lsb (least significant bit) first
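
The output layout above can be sketched like this, assuming the pointers and their bit widths have already been produced by the LZW stage. The function names are invented for the example, and headers, colour tables and the rest of the file format are left out.

```python
def gif_special_codes(b):
    """Return (clear_code, eof_code) for b bits per pixel: 2^b and 2^b + 1."""
    clear = 1 << b
    return clear, clear + 1

def pack_codes_lsb_first(codes_with_widths):
    """Pack (code, width_in_bits) pairs into bytes, least significant bit first."""
    acc = 0          # bit accumulator
    nbits = 0        # number of valid bits currently in acc
    out = bytearray()
    for code, width in codes_with_widths:
        acc |= code << nbits
        nbits += width
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)       # flush the final partial byte
    return bytes(out)

def to_sub_blocks(payload):
    """Split into blocks of at most 255 bytes, each led by its size byte."""
    out = bytearray()
    for i in range(0, len(payload), 255):
        chunk = payload[i:i + 255]
        out.append(len(chunk))
        out += chunk
    out.append(0)                    # zero byte terminates the sequence
    return bytes(out)
```

For b = 2, the three 3-bit codes 4 (clear), 1 and 5 (eof) pack into the two bytes 0x4C 0x01, which become the block 02 4C 01 followed by the terminating 00.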

ZIP
- uses a "deflation" algorithm that is a combination of an LZ77 variant with static Huffman
- uses a 32 KB sliding window and a 258-byte look-ahead
- if the current string is not found in the window, it is emitted as a string of literal bytes
 
- input is divided into blocks of arbitrary length
- blocks containing uncompressed data are limited to 64Kbytes
- a block is terminated when the "deflate" encoder determines it would be useful
- literals and match lengths are compressed with one Huffman tree, match distances with another
- trees are stored at the start of the block
 
- all input strings of length 3 are inserted in the hash table
- hash index is computed for the next 3 bytes
- if hash chain for this index not empty, check all strings in the chain and select the longest match
- chains are singly-linked, most recent first - there are no deletions; the tail end of the chain just gets discarded
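
A rough sketch of the match search described above. It is not the zlib implementation: chains are Python lists keyed by the exact 3 bytes instead of linked lists behind a hash index, the Huffman stage is omitted, and all names are chosen for the example.

```python
def find_matches(data, window=32 * 1024, max_len=258, min_len=3):
    """Return tokens: an int for a literal byte, or a (distance, length) pair."""
    chains = {}                       # 3-byte string -> positions, oldest first
    tokens = []
    pos = 0
    while pos < len(data):
        key = data[pos:pos + min_len]
        best_len = best_dist = 0
        if len(key) == min_len:
            # Walk the chain most recent first.
            for cand in reversed(chains.get(key, [])):
                if pos - cand > window:
                    break             # the tail of the chain is out of the window
                length = min_len
                limit = min(max_len, len(data) - pos)
                while length < limit and data[cand + length] == data[pos + length]:
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, pos - cand
        step = best_len if best_len else 1
        tokens.append((best_dist, best_len) if best_len else data[pos])
        for p in range(pos, pos + step):   # insert every string of length 3
            k = data[p:p + min_len]
            if len(k) == min_len:
                chains.setdefault(k, []).append(p)
        pos += step
    return tokens

def expand(tokens):
    """Undo find_matches(): copy literals and back-references."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(t)
    return bytes(out)
```

Unlike this sketch, a real encoder bounds how far it walks each chain, trading a little compression for speed.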
 
For further discussion, see:
http://www.cdrom.com/pub/infozip/zlib/rfc-gzip.htm

PKZIP

CRCs
- we know what parity is: it detects many transmission errors (but not all)
 
CRC = Cyclic Redundancy Code (also called Cyclic Redundancy Check)
 
CRC is a glorified vertical parity
 
(excerpt from:Author : Ross N. Williams.
Net : ross@guest.adelaide.edu.au.
FTP : ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt)
The basic idea of CRC algorithms is simply to treat the message as an enormous binary number, to divide it by another fixed binary number, and to make the remainder from this division the checksum. Upon receipt of the message, the receiver can perform the same division and compare the remainder with the "checksum" (transmitted remainder).
 
Example: Suppose the message consisted of the two bytes (6,23) as in the previous example. These can be considered to be the hexadecimal number 0617 which can be considered to be the binary number 0000-0110-0001-0111. Suppose that we use a checksum register one-byte wide and use a constant divisor of 1001, then the checksum is the remainder after 0000-0110-0001-0111 is divided by 1001. While in this case, this calculation could obviously be performed using common garden variety 32-bit registers, in the general case this is messy. So instead, we'll do the division using good-'ol long division which you learnt in school (remember?).
 
Except this time, it's in binary:
          ...0000010101101 = 00AD = 173 = QUOTIENT
         _____-___-___-___-
9= 1001 ) 0000011000010111 = 0617 = 1559 = DIVIDEND
DIVISOR.. 0000.,,....,.,,,
          ----.,,....,.,,,
           0000,,....,.,,,
           0000,,....,.,,,
           ----,,....,.,,,
            0001,....,.,,,
            0000,....,.,,,
            ----,....,.,,,
             0011....,.,,,
             0000....,.,,,
             ----....,.,,,
              0110...,.,,,
              0000...,.,,,
              ----...,.,,,
               1100..,.,,,
               1001..,.,,,
               ====..,.,,,
                0110.,.,,,
                0000.,.,,,
                ----.,.,,,
                 1100,.,,,
                 1001,.,,,
                 ====,.,,,
                  0111.,,,
                  0000.,,,
                  ----.,,,
                   1110,,,
                   1001,,,
                   ====,,,
                    1011,,
                    1001,,
                    ====,,
                     0101,
                     0000,
                     ----
                      1011
                      1001
                      ====
                      0010 = 02 = 2 = REMAINDER
 
In decimal this is "1559 divided by 9 is 173 with a remainder of 2".
Although the effect of each bit of the input message on the quotient
is not all that significant, the 4-bit remainder gets kicked about
quite a lot during the calculation, and if more bytes were added to
the message (dividend) its value could change radically again very
quickly. This is why division works where addition doesn't.
In case you're wondering, using this 4-bit checksum the transmitted
message would look like this (in hexadecimal): 06172 (where the 0617
is the message and the 2 is the checksum). The receiver would divide
0617 by 9 and see whether the remainder was 2.
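
The worked example above can be checked with a few lines of code. This sketch uses ordinary integer division, exactly as the excerpt does at this point; practical CRC algorithms replace it with carry-free (XOR) division. The function names and the 4-bit checksum field are assumptions taken from this one example.

```python
def long_division(dividend, divisor, nbits=16):
    """Binary long division, one dividend bit at a time."""
    quotient = remainder = 0
    for i in range(nbits - 1, -1, -1):
        # Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:          # the divisor "goes into" this step
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

def make_frame(message, divisor=0b1001):
    """Append the 4-bit remainder to the message."""
    return (message << 4) | (message % divisor)

def check_frame(frame, divisor=0b1001):
    """Receiver side: redo the division and compare remainders."""
    message, checksum = frame >> 4, frame & 0xF
    return message % divisor == checksum
```

Here `long_division(0x0617, 0b1001)` gives quotient 173 and remainder 2, and `make_frame(0x0617)` gives 0x06172, matching the transmitted message above.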
