- CPSC 461: Copyright (C) 2003 Katrin Becker 1998-2002 Last Modified May 31, 2003 12:55 AM
- Data Compression - Dictionary Methods
- - in its simplest form it is just like a dictionary where we assign codes to each entry
- usually work on STRINGS of symbols
- - each string gets encoded by creating some sort of TOKEN
- - typically, if the word is not matched (found) in the dictionary, we write a flag, the length of the word and the word itself
A Simple Example: word -> index
Output consists of indices and raw words (when they aren't found)
need to distinguish between raw words and indices
Suppose we have 19-bit index entries: 2^19 = 524,288, i.e. ~500,000 words
Word found: flag bit = 0; 19-bit index
Word NOT found: flag bit = 1; 7-bit word size; WORD
Would this be good? How can we tell?
Ave. Word size = 5 chars
a typical RAW word would be 6 bytes (48 bits): 1 flag bit + 7-bit word size + 5 chars x 8 bits
an encoded word would be 20 bits: 1 flag bit + 19-bit index
OK, so this would work provided we have a reasonable number of repeated words (and hey, who doesn't these days)
Continuing: Let's say, the probability of finding a match = P
After reading and compressing N words, the size of the output will be:
N [ 20P + 48(1-P) ]
= N [ 48 - 28P ]
assuming 5 chars per word for the input stream, the input size is:
N * 5 * 8 = 40N bits.
We can say we have compression when:
N [ 48 - 28P ] < 40 N
This implies that: 28P > 8, i.e. P > 8/28, or about 0.29
In other words, we need a match rate of better than 29% to get compression using this algorithm.
This would work for English text. Would it work for programs?
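The break-even arithmetic above can be checked with a short sketch. The bit widths come straight from the example; the function and constant names are made up for illustration, and the 5-character average word is the same assumption the notes make.

```python
from fractions import Fraction

FLAG_BITS = 1        # 0 = index follows, 1 = raw word follows
INDEX_BITS = 19      # 2**19 = 524,288 dictionary entries
LENGTH_BITS = 7      # word size field for raw words
BITS_PER_CHAR = 8
AVG_WORD_CHARS = 5   # average word size assumed in the notes


def hit_bits():
    """Bits written for a word found in the dictionary."""
    return FLAG_BITS + INDEX_BITS


def miss_bits(word_chars=AVG_WORD_CHARS):
    """Bits written for a word NOT found (flag, size, raw chars)."""
    return FLAG_BITS + LENGTH_BITS + word_chars * BITS_PER_CHAR


def output_bits(n_words, p_match):
    """Expected output size: N [ 20P + 48(1-P) ]."""
    return n_words * (hit_bits() * p_match + miss_bits() * (1 - p_match))


def break_even_match_rate():
    """Match rate P at which the output is exactly as big as the 40N-bit input."""
    input_bits_per_word = AVG_WORD_CHARS * BITS_PER_CHAR
    return Fraction(miss_bits() - input_bits_per_word,
                    miss_bits() - hit_bits())


if __name__ == "__main__":
    print(hit_bits(), miss_bits())            # 20 48
    print(float(break_even_match_rate()))     # about 0.2857
    print(output_bits(1000, Fraction(1, 2)))  # 34000
```

Changing AVG_WORD_CHARS shows how sensitive the threshold is to word length, which is one way to approach the "would it work for programs?" question.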
- - dictionary methods often achieve the best compression ratios
- - Abraham Lempel and Jacob Ziv developed the first methods in the 1970s (LZ77 and LZ78)
- - the vast majority of dictionary compression methods are based on these
- LZ77 (Sliding Window)
- <see the {ppt} show>
- EXAMPLE
- - maintain a window of the input stream that is divided into 2 parts:
- search buffer | look-ahead buffer
(usually big).......(not-so-much)
- - takes the first part of the look-ahead buffer and scans backwards through the search buffer to find as long a match as it can; returns the offset of the longest match or, among equally long matches, the most recently seen (last) one
- - tokens consist of [ offset, length, next symbol in look-ahead buffer]
- - selecting the last match is simplest; many variations exist
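A minimal sketch of the sliding-window idea, not a tuned implementation. The buffer sizes are arbitrary defaults, the names lz77_encode and lz77_decode are made up for illustration, and the tie-breaking rule (keep the nearest of several equally long matches) is an assumption; other variants choose differently.

```python
def lz77_encode(data, search_size=4096, lookahead_size=16):
    """Return a list of (offset, length, next_symbol) tokens for a string."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        start = max(0, pos - search_size)
        # Keep one symbol in reserve for the "next symbol" field of the token.
        max_len = min(lookahead_size, len(data) - pos) - 1
        # Scan the search buffer backwards; with a strict ">" below, ties
        # keep the first candidate examined, i.e. the nearest occurrence.
        for cand in range(pos - 1, start - 1, -1):
            length = 0
            while length < max_len and data[cand + length] == data[pos + length]:
                length += 1
            if length > best_length:
                best_offset, best_length = pos - cand, length
        tokens.append((best_offset, best_length, data[pos + best_length]))
        pos += best_length + 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the string; copies run symbol by symbol so they may overlap."""
    out = []
    for offset, length, symbol in tokens:
        for _ in range(length):
            out.append(out[-offset])
        out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    print(lz77_encode("aaaa"))   # [(0, 0, 'a'), (1, 2, 'a')]
```

The brute-force scan is quadratic in the window size; real encoders replace it with hashing or trees, which is the motivation for the LZRW1 and ZIP sections later in these notes.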
- LZ78
- <see the {ppt} show>
- EXAMPLE
- - keeps the entire dictionary in memory (sometimes, it gets big)
- - emits a dictionary pointer and the code of the next symbol
- - dictionary is actually built as a trie
- QUESTION: What to do if the dictionary gets full?
- 1. Freeze the dictionary
- 2. Dump the dictionary and start again
- 3. Selectively delete entries (oldest, least-recently-used)
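A small sketch of the LZ78 token stream. It assumes an unbounded dictionary (so none of the "dictionary full" policies is exercised) and uses a Python dict keyed by phrase as a stand-in for the trie; the function names are made up for illustration.

```python
def lz78_encode(data):
    """Return (dictionary_pointer, next_symbol) tokens; pointer 0 = empty phrase."""
    dictionary = {}          # phrase -> index, stand-in for the trie
    tokens = []
    phrase = ""
    for symbol in data:
        candidate = phrase + symbol
        if candidate in dictionary:
            phrase = candidate                 # keep extending the match
        else:
            tokens.append((dictionary.get(phrase, 0), symbol))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended in the middle of a known phrase: no next symbol.
        tokens.append((dictionary[phrase], ""))
    return tokens


def lz78_decode(tokens):
    """Rebuild the text while growing the same phrase list as the encoder."""
    phrases = [""]           # index 0 is the empty phrase
    out = []
    for index, symbol in tokens:
        phrase = phrases[index] + symbol
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


if __name__ == "__main__":
    print(lz78_encode("abababab"))
    # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
```

Note that the decoder rebuilds the dictionary from the tokens alone, so nothing besides the token stream has to be transmitted.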
- LZRW1
- - idea is to find match w/o searching
- - uses hash table
- - fast but not efficient; doesn't always find the longest match
- - uses all available memory
- - encodes block by block
- search buffer = 4 KB; look-ahead = 16 bytes; 1 pointer to divide them
- 1. Hash the left-most 3 chars of the look-ahead buffer
- = 12 bits; call it I; used as an index into an array of 4096 pointers
- 2. Retrieve pointer P from entry I and replace that entry with a pointer to the current position
- if P is out of range or points to a string with different first 3 chars
- then output the char as a literal and advance window_ptr by 1
- else find the longest match
- - output length (4 bits, stored as length-1) and offset (12 bits)
- - advance window_ptr by length
- each group of 16 items has a 16-bit control word to say which are literals and which are copy items
- the decoder doesn't need the hash table
- compression is about 10% worse than UNIX compress, but it runs about 4x faster
- in 68000 assembler: avg. 13 machine instr. to compress and 4 instr. to decompress 1 byte
-
- for sample code, see: ftp://ftp.ross.net/clients/ross/compression/original/old_lzrw1.c
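The "find a match without searching" idea can be sketched as follows. This is a simplified illustration, not the code at the FTP link: items are returned as Python tuples instead of being packed behind 16-bit control words, the hash function is only an example of a 3-byte to 12-bit hash, and all names are hypothetical.

```python
TABLE_SIZE = 1 << 12   # 4096 pointers, indexed by a 12-bit hash
MAX_OFFSET = 4095      # largest offset a 12-bit field can hold
MIN_LENGTH = 3         # three chars are hashed, so shorter matches are ignored
MAX_LENGTH = 16        # 4-bit length field, stored as length-1


def hash3(data, pos):
    """Illustrative 12-bit hash of the 3 bytes starting at pos."""
    mixed = (data[pos] << 8) ^ (data[pos + 1] << 4) ^ data[pos + 2]
    return ((mixed * 40543) >> 4) & 0xFFF


def lzrw1_like_encode(data):
    """Return ('literal', byte) and ('copy', offset, length) items."""
    table = [None] * TABLE_SIZE
    items = []
    pos = 0
    while pos < len(data):
        match_len = 0
        offset = 0
        if pos + MIN_LENGTH <= len(data):
            index = hash3(data, pos)
            cand = table[index]
            table[index] = pos            # always remember the newest position
            if cand is not None and pos - cand <= MAX_OFFSET:
                offset = pos - cand
                limit = min(MAX_LENGTH, len(data) - pos)
                # Only this single candidate is checked: no search.
                while match_len < limit and data[cand + match_len] == data[pos + match_len]:
                    match_len += 1
        if match_len >= MIN_LENGTH:
            items.append(("copy", offset, match_len))
            pos += match_len
        else:
            items.append(("literal", data[pos]))
            pos += 1
    return items


def lzrw1_like_decode(items):
    """The decoder needs no hash table: it only copies."""
    out = bytearray()
    for item in items:
        if item[0] == "literal":
            out.append(item[1])
        else:
            for _ in range(item[2]):
                out.append(out[-item[1]])
    return bytes(out)
```

Because only one candidate per hash slot is examined, the match found is not always the longest one in the window, which is the speed versus compression trade-off noted above.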
- LZW
- EXAMPLE
- <see the {ppt} show>
- - variant of LZ78
- - eliminates the 2nd field of the token
- - emits only pointers
- - initializes dictionary to base alphabet
- 1. Build string as long as matches are found
- 2. When search fails emit last found pointer and add new phrase to the dictionary
- when dictionary gets full: dump it (emit clear code and start again)
- - strings only get longer 1 char at a time - generally not good for text
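The two LZW steps above can be sketched as follows, assuming a byte alphabet (256 initial entries) and an unbounded dictionary, so no clear code is ever emitted. Function names are made up for illustration.

```python
def lzw_encode(data):
    """Return a list of dictionary pointers for a bytes object."""
    dictionary = {bytes([i]): i for i in range(256)}   # base alphabet
    phrase = b""
    codes = []
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:
            phrase = candidate                 # 1. build string while it matches
        else:
            codes.append(dictionary[phrase])   # 2. emit last found pointer
            dictionary[candidate] = len(dictionary)   # and add the new phrase
            phrase = bytes([value])
    if phrase:
        codes.append(dictionary[phrase])
    return codes


def lzw_decode(codes):
    """Rebuild the bytes; the dictionary is reconstructed on the fly."""
    if not codes:
        return b""
    phrases = [bytes([i]) for i in range(256)]
    previous = phrases[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code < len(phrases):
            entry = phrases[code]
        else:
            # Pointer to the phrase the encoder had only just created.
            entry = previous + previous[:1]
        out += entry
        phrases.append(previous + entry[:1])
        previous = entry
    return bytes(out)


if __name__ == "__main__":
    print(lzw_encode(b"ABABABA"))   # [65, 66, 256, 258]
```

In the example output, pointer 258 is used by the encoder before the decoder has defined it, which is why the decoder needs the special case.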
- UNIX Compression <compress>
- - uses LZW w/ growing dictionary
- - start with 512 entries (256 already filled in): 9-bit pointers
- - when the dictionary is full, pointers grow to 10 bits, and so on
- - user specifies the max. pointer size (9-16 bits); 16 is the default
- - when D (the dictionary) fills, continues with static D but monitors compression ratio
- - if the ratio falls below a predefined threshold, delete D and start again
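A tiny sketch of the growing pointer size. It only computes the smallest width, between 9 bits and the user-specified maximum, that can hold a given pointer value; the exact moment at which the real compress program switches widths is not modelled, and the names are hypothetical.

```python
MIN_BITS = 9    # 512 entries, 256 of them pre-filled


def pointer_width(pointer, max_bits=16):
    """Smallest pointer size (9..max_bits) able to represent `pointer`."""
    if not MIN_BITS <= max_bits <= 16:
        raise ValueError("max_bits must be in the range 9..16")
    if pointer >= (1 << max_bits):
        raise ValueError("pointer does not fit: the dictionary is full")
    return max(MIN_BITS, pointer.bit_length())


def dictionary_is_full(entries, max_bits=16):
    """True once every pointer of max_bits bits is in use."""
    return entries >= (1 << max_bits)


if __name__ == "__main__":
    print(pointer_width(511), pointer_width(512))   # 9 10
```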
- GIF
- - Graphics Interchange Format
- current standard 89a
- - not really a data compression method; actually a graphics file format
- - uses variant of LZW
- b = #bits per pixel (B/W = 2; 256 colours = 8)
- - starts with 2^(b+1) entries - grows to 2^12
- - then goes static but monitors compression ratio
- - if it decides to discard the dictionary, emits 2^b as the <clear code>
- - ptrs are output in blocks of 8-bit bytes
- - each block has a header giving the block size (max = 255 bytes); the sequence of blocks is terminated by a zero byte
- - last block has <eof> (2^b + 1)
- - ptrs are stored with lsb (least significant bit) on left
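The output side of GIF can be sketched as follows: pointers are packed least significant bit first, then cut into size-prefixed blocks of at most 255 bytes, with a zero byte closing the sequence. For simplicity the sketch assumes every pointer has the same width (a real GIF encoder widens them as the dictionary grows), takes the clear code as 2^b and eof as 2^b + 1, and uses made-up function names.

```python
def special_codes(b):
    """Return (clear_code, eof_code) for b bits per pixel."""
    return (1 << b, (1 << b) + 1)


def pack_codes_lsb_first(codes, width):
    """Pack fixed-width pointers into bytes, least significant bit first."""
    buffer = 0
    bits_in_buffer = 0
    out = bytearray()
    for code in codes:
        buffer |= code << bits_in_buffer     # new bits go above the old ones
        bits_in_buffer += width
        while bits_in_buffer >= 8:
            out.append(buffer & 0xFF)
            buffer >>= 8
            bits_in_buffer -= 8
    if bits_in_buffer:
        out.append(buffer & 0xFF)            # final partial byte, zero padded
    return bytes(out)


def split_into_blocks(payload, max_block=255):
    """Prefix each block with its size and end with a zero byte."""
    blocks = bytearray()
    for start in range(0, len(payload), max_block):
        chunk = payload[start:start + max_block]
        blocks.append(len(chunk))
        blocks += chunk
    blocks.append(0)
    return bytes(blocks)


if __name__ == "__main__":
    packed = pack_codes_lsb_first([4, 1, 5], 3)
    print(packed.hex())                      # 4c01
    print(split_into_blocks(packed).hex())   # 024c0100
```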
- ZIP
- - uses a "deflation" algorithm that is a combination of an LZ77 variant with static Huffman
- - uses a 32 KB sliding window and a 258-byte look-ahead
- - if the current string is not found in the window, it is emitted as a string of literal bytes
- - input is divided into blocks of arbitrary length
- - blocks containing uncompressed data are limited to 64 KB
- - block is terminated when the "deflate" encoder determines it would be useful
- - literals and match lengths are compressed with one Huffman tree, match distances with another
- - trees are stored at the start of the block
- - all input strings of length 3 are inserted in the hash table
- - hash index is computed for the next 3 bytes
- - if hash chain for this index not empty, check all strings in the chain and select the longest match
- - chains are singly-linked, most recent first
- - there are no deletions; the tail end of the chain just gets discarded
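A sketch of the hash-chain match search described above, leaving out the Huffman stage entirely. As a simplification, the 3-byte string itself is used as the dictionary key instead of a computed hash index, a Python list stands in for each singly-linked chain, and the max_chain cut-off and all names are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 32 * 1024
MIN_MATCH = 3
MAX_MATCH = 258


def insert_string(data, pos, chains):
    """Put the 3-byte string at pos at the head of its chain."""
    if pos + MIN_MATCH <= len(data):
        chains[data[pos:pos + MIN_MATCH]].insert(0, pos)


def find_longest_match(data, pos, chains, max_chain=128):
    """Return (distance, length) of the longest match, or (0, 0)."""
    if pos + MIN_MATCH > len(data):
        return (0, 0)
    best_distance, best_length = 0, 0
    limit = min(MAX_MATCH, len(data) - pos)
    # Chains are most recent first; anything past max_chain is ignored.
    for cand in chains[data[pos:pos + MIN_MATCH]][:max_chain]:
        distance = pos - cand
        if distance > WINDOW_SIZE:
            break                      # older entries are even farther away
        length = 0
        while length < limit and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_distance, best_length = distance, length
    return (best_distance, best_length)


def tokenize(data):
    """Split bytes into ('literal', byte) and ('match', distance, length)."""
    chains = defaultdict(list)
    tokens = []
    pos = 0
    while pos < len(data):
        distance, length = find_longest_match(data, pos, chains)
        if length >= MIN_MATCH:
            tokens.append(("match", distance, length))
            step = length
        else:
            tokens.append(("literal", data[pos]))
            step = 1
        for p in range(pos, pos + step):   # every string gets inserted
            insert_string(data, p, chains)
        pos += step
    return tokens


def detokenize(tokens):
    """Inverse of tokenize, used here only to check the sketch."""
    out = bytearray()
    for token in tokens:
        if token[0] == "literal":
            out.append(token[1])
        else:
            for _ in range(token[2]):
                out.append(out[-token[1]])
    return bytes(out)
```

In a real deflate encoder the literal/length values and the distances from this stage would then be fed to the two Huffman trees.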
- For further discussion, see:
- http://www.cdrom.com/pub/infozip/zlib/rfc-gzip.htm
- PKZIP
- CRCs
- - we know what parity is: it detects many transmission errors (but not all)
- - CRC = Cyclic Redundancy Code
- - CRC is a glorified vertical parity
*for a more complete discussion, see the source of this excerpt: Ross N. Williams (Net: ross@guest.adelaide.edu.au), FTP: ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt
The basic idea of CRC algorithms is simply to treat the message as an enormous binary number, to divide it by another fixed binary number, and to make the remainder from this division the checksum. Upon receipt of the message, the receiver can perform the same division and compare the remainder with the "checksum" (transmitted remainder).
Example: Suppose the message consisted of the two bytes (6,23) as in the previous example. These can be considered to be the hexadecimal number 0617 which can be considered to be the binary number 0000-0110-0001-0111. Suppose that we use a checksum register one-byte wide and use a constant divisor of 1001, then the checksum is the remainder after 0000-0110-0001-0111 is divided by 1001. While in this case, this calculation could obviously be performed using common garden variety 32-bit registers, in the general case this is messy. So instead, we'll do the division using good-'ol long division which you learnt in school (remember?).
- Except this time, it's in binary:
              ...0000010101101 = 00AD =  173 = QUOTIENT
            _____-___-___-___-
    9= 1001 ) 0000011000010111 = 0617 = 1559 = DIVIDEND
    DIVISOR   0000.,,....,.,,,
              ----.,,....,.,,,
               0000,,....,.,,,
               0000,,....,.,,,
               ----,,....,.,,,
                0001,....,.,,,
                0000,....,.,,,
                ----,....,.,,,
                 0011....,.,,,
                 0000....,.,,,
                 ----....,.,,,
                  0110...,.,,,
                  0000...,.,,,
                  ----...,.,,,
                   1100..,.,,,
                   1001..,.,,,
                   ====..,.,,,
                    0110.,.,,,
                    0000.,.,,,
                    ----.,.,,,
                     1100,.,,,
                     1001,.,,,
                     ====,.,,,
                      0111.,,,
                      0000.,,,
                      ----.,,,
                       1110,,,
                       1001,,,
                       ====,,,
                        1011,,
                        1001,,
                        ====,,
                         0101,
                         0000,
                         ----
                          1011
                          1001
                          ====
                          0010 = 02 = 2 = REMAINDER
- In decimal this is "1559 divided by 9 is 173 with a remainder of 2". Although the effect of each bit of the input message on the quotient is not all that significant, the 4-bit remainder gets kicked about quite a lot during the calculation, and if more bytes were added to the message (dividend) its value could change radically again very quickly. This is why division works where addition doesn't. In case you're wondering, using this 4-bit checksum the transmitted message would look like this (in hexadecimal): 06172 (where the 0617 is the message and the 2 is the checksum). The receiver would divide 0617 by 9 and see whether the remainder was 2.
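The worked example can be reproduced with a bit-by-bit long division. Note that this sketch follows the excerpt and uses ordinary arithmetic division; practical CRC algorithms use carry-less (XOR) polynomial division instead, which is not shown here. The function names are made up for illustration.

```python
DIVISOR = 0b1001      # the constant divisor 9 from the example
MESSAGE_BITS = 16     # the two-byte message 0x0617


def binary_long_division(dividend, divisor, dividend_bits):
    """Divide most significant bit first, as in the hand calculation."""
    quotient = 0
    remainder = 0
    for shift in range(dividend_bits - 1, -1, -1):
        # Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> shift) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


def append_checksum(message):
    """Sender: message followed by the 4-bit remainder."""
    _, checksum = binary_long_division(message, DIVISOR, MESSAGE_BITS)
    return (message << 4) | checksum


def receiver_accepts(transmitted):
    """Receiver: redo the division and compare remainders."""
    message, checksum = transmitted >> 4, transmitted & 0xF
    return binary_long_division(message, DIVISOR, MESSAGE_BITS)[1] == checksum


if __name__ == "__main__":
    print(binary_long_division(0x0617, DIVISOR, MESSAGE_BITS))   # (173, 2)
    print(hex(append_checksum(0x0617)))                          # 0x6172
```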