CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 23, 2001 04:35 PM

Data Compression - Intro & Intuitive Methods

"I have made this letter longer than usual because I lack the time to make it shorter."

Blaise Pascal (1623-1662)

 
WHY?
less storage
faster transmission
faster processing
 
There are many methods for data compression. They are designed for different types of data and produce different results, but all are based on the same principle: removing redundancy.
 
The general idea is to remove redundancy but the side-effect is that the data becomes less reliable, more error prone.
 
Increasing reliability usually involves adding check bits and redundancy - data grows in size.
 
Compression and reliability are essentially opposites. Reliability coding is relatively recent, whereas compression existed long before computers (Braille, Morse Code).
 
General Law of Compression:
Apply shorter codes to common events and long codes to rare events.
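
A small numeric illustration of this law, as a sketch only: the four-symbol source, its probabilities, and both code tables are made up for the example and do not come from any particular compression method.

```python
# Hypothetical four-symbol source; probabilities and codes are illustrative only.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Variable-length code: the most common symbol gets the shortest code.
variable_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Fixed-length code: every symbol costs 2 bits regardless of frequency.
fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}


def average_length(code, probs):
    """Expected number of bits per symbol under the given code table."""
    return sum(probs[symbol] * len(code[symbol]) for symbol in probs)


print(average_length(fixed_code, probabilities))     # 2.0 bits per symbol
print(average_length(variable_code, probabilities))  # 1.75 bits per symbol
```

The saving comes entirely from the common symbol "a" costing 1 bit instead of 2; the rare symbols actually get longer codes.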
 
Another approach: Use Different Notation - e.g. code or abbrev. instead of longer text
cost = readability; time to encode & decode; increased program complexity
 
Question: Why can't an already compressed file be compressed further?
answer: there is little or no redundancy left so there is nothing left to remove
If it were possible then successive passes should be able to reduce the size of a file to just one byte or even one bit.
 
The compressor (encoder) together with the decompressor (decoder) is sometimes called a codec.
 
Some forms of data compression are non-adaptive : they work the same way regardless of the data
- they are best used to compress similar kinds of data (e.g. Group3 or Group4 methods for fax compression)
 
Others are adaptive : this usually involves looking at the raw data and modifying the method's operations and/or parameters accordingly (e.g. Adaptive Huffman method)
- some of these require 2 passes over the data: the 1st examines the data and collects stats; the 2nd compresses (this makes them semi-adaptive)
- doing 2 complete passes is slow so many use estimates or adapt as input is compressed
- some are locally adaptive - the algorithm adapts itself to local conditions in the data stream
 
Lossy vs Lossless Compression:
lossy: compresses by discarding some information
: when decompressed we cannot get the original data back
: can sometimes get away with this on images or sounds (but not on images used for analysis, like astronomical or forensic images)
lossless: the decompressed data is identical to the original
: needed for data such as text files containing programs, which may become useless if even one byte is thrown out (though some parts can sometimes be discarded, like white space, or font and style data in other text files)
Question: What other kinds of text files cannot afford to lose white space?
 
Symmetrical Compression: compressor and decompressor essentially the same but work in opposite directions.
 
Asymmetric Compression: one way works considerably harder than the other. This may be worthwhile when the codec is used more frequently in one direction than the other.
 
Data can be encoded in streaming mode (compressed byte by byte as it arrives) or in block mode (the input is split into blocks, and each block is often treated separately).
 
Most methods are physical: they look at the bits and ignore the meaning.
Some are logical: they respond to the content - they are usually special purpose.
 
A Bit of Arithmetic:
1. compression ratio = (size of the output stream) / (size of the input stream)
 
2. compression factor = 1 / compression ratio = (size of the input stream) / (size of the output stream)
 
3. bpp = bits per pixel = number of bits needed on average to encode 1 pixel of an image
 
4. compression gain = 100 * log_e(reference size / compressed size)
    - called percent log ratio
    - can be used to compare compression methods against each other
5. compression speed: cycles per byte (CPB)
    - average number of cycles needed to compress one byte
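
The measures above can be sketched as a few one-line functions. The function names and the sample sizes (a 1000-byte input compressed to 250 bytes) are assumptions made for the example.

```python
import math


def compression_ratio(input_size, output_size):
    """Output size divided by input size; values below 1 mean the data shrank."""
    return output_size / input_size


def compression_factor(input_size, output_size):
    """Inverse of the compression ratio; values above 1 mean the data shrank."""
    return input_size / output_size


def compression_gain(reference_size, compressed_size):
    """Percent log ratio: 100 times the natural log of reference/compressed."""
    return 100 * math.log(reference_size / compressed_size)


def bits_per_pixel(compressed_bytes, pixel_count):
    """Average number of compressed bits spent on each pixel."""
    return 8 * compressed_bytes / pixel_count


# Hypothetical example: a 1000-byte file compressed to 250 bytes.
print(compression_ratio(1000, 250))   # 0.25
print(compression_factor(1000, 250))  # 4.0
print(compression_gain(1000, 250))    # about 138.6
```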
     
There exist two sets of files publicly available to test compression algorithms:
1. "The Calgary Corpus": /home/ftp/pub/projects/text.compression.corpus
2. "The Canterbury Corpus": http://corpus.canterbury.ac.nz/
- the second is somewhat more representative and less arbitrary than the first (both are internationally known)

 

probability model: used in statistical methods
 
3 main approaches to compression:
1. intuitive
2. statistical
3. dictionary
 
 
Braille (1820s): each cell is a 3x2 grid of dots = 6 bits of information = 64 codes
 
- letters, digits, and punctuation don't use up all 64 possibilities, so the rest are used for common words (and, for, of) and common strings (th, ation, ound).
 
Braille achieves a small but significant compression. This is important because Braille letters are physically large.

 

The use of acronyms amounts to compression of a kind as well (S.A.L.T., ASCII, NSA, ...)

Another example is the old CDC Display Code, commonly used in 2nd generation computers (and a few 3rd generation ones too). This was before the days of CRTs and upper/lower case letters, and printers were very limited. Each character takes 6 bits.

CDC Display Code (row = bit positions 543, column = bit positions 210):

        0     1     2     3     4     5     6     7
  0           A     B     C     D     E     F     G
  1     H     I     J     K     L     M     N     O
  2     P     Q     R     S     T     U     V     W
  3     X     Y     Z     0     1     2     3     4
  4     5     6     7     8     9     +     -     *
  5     /     (     )     $     =     <sp>  ,     .
  6     a     [     ]     :     `     _     V     ^
  7                 <     >     d     e     ¬     ;
 
Some forms of compression amount to compaction... they are irreversible: e.g. removing white space - can be used on programs, not tables usually. In some kinds of text it is possible to remove all characters except letters and spaces - even the letters can be case flattened.
 
Some Other Approaches:
- remove the white space and record where it was in a bit string that precedes or follows the text (1 = a space stood at this position):
WE ARE NOT ALONE
0010001000100000
- since ASCII codes are 7 bits, they can be packed back to back instead of being stored one per byte; compression ratio = 7/8 = 0.875
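
Both ideas can be sketched in a few lines. The function names are made up for the example, the bit string is kept as a text string of '0' and '1' for readability, and the choice to pad the last packed byte with zero bits is an assumption.

```python
def strip_spaces(text):
    """Return (bitmap, letters); the bitmap has '1' wherever a space stood."""
    bitmap = "".join("1" if ch == " " else "0" for ch in text)
    letters = text.replace(" ", "")
    return bitmap, letters


def restore_spaces(bitmap, letters):
    """Rebuild the original text from the bitmap and the space-free letters."""
    remaining = iter(letters)
    return "".join(" " if bit == "1" else next(remaining) for bit in bitmap)


def pack_7bit(text):
    """Pack 7-bit ASCII codes back to back (last byte padded with zero bits)."""
    if any(ord(ch) > 127 for ch in text):
        raise ValueError("only 7-bit ASCII can be packed this way")
    bits = "".join(format(ord(ch), "07b") for ch in text)
    bits += "0" * (-len(bits) % 8)  # pad up to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


bitmap, letters = strip_spaces("WE ARE NOT ALONE")
print(bitmap)                        # 0010001000100000
print(letters)                       # WEARENOTALONE
print(len(pack_7bit("ABCDEFGH")))    # 7 bytes instead of 8
```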
 
Baudot code (~1880) a 5-bit telegraph code
In Baudot, characters are expressed using five bits. Baudot uses two code sub-sets, the "letter set" (LTRS), and the "figure set" (FIGS). The FIGS character (11011) signals that the following code is to be interpreted as being in the FIGS set, until this is reset by the LTRS (11111) character.
binary hex LTRS FIGS
--------------------------
00011 03 A -
11001 19 B ?
01110 0E C :
01001 09 D $
00001 01 E 3
01101 0D F !
11010 1A G &
10100 14 H #
00110 06 I 8
01011 0B J BELL
01111 0F K (
10010 12 L )
11100 1C M .
01100 0C N ,
11000 18 O 9
10110 16 P 0
10111 17 Q 1
01010 0A R 4
00101 05 S '
10000 10 T 5
00111 07 U 7
11110 1E V ;
10011 13 W 2
11101 1D X /
10101 15 Y 6
10001 11 Z "
01000 08 CR CR
00010 02 LF LF
00100 04 SP SP
11111 1F LTRS LTRS
11011 1B FIGS FIGS
00000 00 [..unused..]
 
Where CR is carriage return, LF is linefeed, BELL is the bell, SP is space, and STOP is the stop character.
Note: these bit values are often shown in inverse order, depending (presumably) on which side of the paper tape you were looking at.
Local implementations of Baudot may differ in the use of #, STOP, BELL, and '.
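
The shift mechanism can be sketched as follows, using the letter and figure codes from the table above. This is only an illustration: it assumes the receiver starts in the letter set, it handles letters, printable figures, and space only (CR, LF, and BELL are left out), and the function name is made up.

```python
# Code, letter, and figure columns copied from the table above.
TABLE = """
00011 A -
11001 B ?
01110 C :
01001 D $
00001 E 3
01101 F !
11010 G &
10100 H #
00110 I 8
01011 J BELL
01111 K (
10010 L )
11100 M .
01100 N ,
11000 O 9
10110 P 0
10111 Q 1
01010 R 4
00101 S '
10000 T 5
00111 U 7
11110 V ;
10011 W 2
11101 X /
10101 Y 6
10001 Z "
"""

LTRS, FIGS, SPACE = "11111", "11011", "00100"

letters, figures = {}, {}
for line in TABLE.strip().splitlines():
    code, letter, figure = line.split()
    letters[letter] = code
    if len(figure) == 1:  # skip BELL, which is not a printable character
        figures[figure] = code


def baudot_encode(text):
    """Return a list of 5-bit codes, inserting a shift code when the set changes."""
    out = []
    shift = "LTRS"  # assumption: the receiver starts in the letter set
    for ch in text.upper():
        if ch == " ":
            out.append(SPACE)  # space has the same code in both sets
            continue
        if ch in letters:
            needed, code = "LTRS", letters[ch]
        elif ch in figures:
            needed, code = "FIGS", figures[ch]
        else:
            raise ValueError(f"no code for {ch!r} in this sketch")
        if needed != shift:
            out.append(LTRS if needed == "LTRS" else FIGS)
            shift = needed
        out.append(code)
    return out


print(baudot_encode("A1"))  # ['00011', '11011', '10111']
```

Note that a text that alternates between letters and figures pays for a shift code at every change, so the 5-bit saving depends on long runs within one set.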
Dictionary Data: (lexicographically sorted data)
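- can use front compression: each entry is stored as the number of leading characters it shares with the previous entry, followed by the remaining characters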
 
E.g.
Uncompressed Compressed
eel eel (1st can't be compressed)
eerie 2rie
efface 1fface
effect 3ect
effeminate 4minate
effervesce 4rvesce
effigy 3igy
effrontery 3rontery
egg 1gg
ego 2o
egregious 2regious
egress 4ss
eisteddfod 1isteddfod
eject 1ject
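
A minimal sketch of front compression that reproduces the table above. It assumes the words themselves contain no digits, so that the leading count can be separated from the suffix when decoding; the function names are made up for the example.

```python
def front_compress(words):
    """Replace the prefix shared with the previous word by its length."""
    out, prev = [], ""
    for word in words:
        shared = 0
        while (shared < min(len(prev), len(word))
               and prev[shared] == word[shared]):
            shared += 1
        # The first word has no predecessor and is stored as-is.
        out.append(word if not out else f"{shared}{word[shared:]}")
        prev = word
    return out


def front_decompress(entries):
    """Inverse of front_compress (assumes the words contain no digits)."""
    words = []
    for entry in entries:
        if not words:
            words.append(entry)
            continue
        digits = len(entry) - len(entry.lstrip("0123456789"))
        shared = int(entry[:digits])
        words.append(words[-1][:shared] + entry[digits:])
    return words


words = ["eel", "eerie", "efface", "effect", "effeminate"]
print(front_compress(words))  # ['eel', '2rie', '1fface', '3ect', '4minate']
```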
 
MacWrite WordProcessor uses 4 bits for the 15 most common characters (" etnroaisdlhcfp") plus an escape character. The common letters use 4 bits and all others use the escape + ASCII = 12 bits.
- each paragraph is encoded separately and if longer when compressed, left as ASCII
- each paragraph is then preceded by a bit to say if it's encoded or not.
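
The scheme described above can be sketched as follows. This is not MacWrite's actual format: the assignment of nibble values in string order, the use of value 15 as the escape, and the function names are all assumptions made for the illustration.

```python
COMMON = " etnroaisdlhcfp"  # the 15 common characters, as listed above
ESCAPE = 15                 # assumption: the 16th nibble value is the escape


def encode_nibbles(paragraph):
    """Return a list of 4-bit values: 1 per common character, 3 per other."""
    nibbles = []
    for ch in paragraph:
        index = COMMON.find(ch)
        if index >= 0:
            nibbles.append(index)
        else:
            code = ord(ch)
            if code > 255:
                raise ValueError("sketch assumes 8-bit characters")
            # escape nibble followed by the 8-bit code split into two nibbles
            nibbles += [ESCAPE, code >> 4, code & 0x0F]
    return nibbles


def encode_paragraph(paragraph):
    """Return (flag, payload): flag 1 = nibble-coded, 0 = left as 8-bit text."""
    nibbles = encode_nibbles(paragraph)
    if len(nibbles) * 4 < len(paragraph) * 8:
        return 1, nibbles
    return 0, [ord(ch) for ch in paragraph]


print(encode_paragraph("the cat"))  # (1, [2, 11, 1, 0, 12, 6, 2])
print(encode_paragraph("XYZ"))      # (0, [88, 89, 90])
```

A paragraph of common lower-case letters shrinks to half its size, while one made mostly of capitals or digits would grow to 12 bits per character and is therefore left alone.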
 
RLE (Run Length Encoding) Text Compression
 
If data item d occurs n consecutive times then replace n occurrences with a single pair dn.
 
We need some escape sequence to tell us a compressed pair is coming up.
- now we have @dn; if n is 8 bits, the maximum is 255 repetitions and each tuple takes 3 bytes
- there is no gain in compressing runs of fewer than 3 characters (a run of exactly 3 only breaks even, so the real gain starts at 4), so we can leave doubles alone and make @d0 mean 3 occurrences; this gives us a range of 3-258
- if we come across more than 258 occurrences, we can simply start a new tuple after the first 258 are compressed
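
A minimal sketch of this scheme on byte strings. It assumes the escape character '@' never occurs in the input (a real format would need a rule for that case), and the function names are made up.

```python
ESC = ord("@")   # escape byte announcing a (character, count) pair
MAX_RUN = 258    # count byte 0..255 stands for 3..258 occurrences


def rle_encode(data):
    """Replace runs of 3 or more equal bytes by the triple @, d, n."""
    if ESC in data:
        raise ValueError("sketch assumes '@' does not occur in the input")
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while (i + run < len(data) and data[i + run] == data[i]
               and run < MAX_RUN):
            run += 1
        if run >= 3:
            out += bytes([ESC, data[i], run - 3])
        else:
            out += data[i:i + run]  # singles and doubles are copied as-is
        i += run
    return bytes(out)


def rle_decode(data):
    """Expand every @, d, n triple back into n + 3 copies of d."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            out += bytes([data[i + 1]]) * (data[i + 2] + 3)
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(rle_encode(b"aabbbbc"))        # b'aa@b\x01c'
print(len(rle_encode(b"x" * 600)))   # 9 (three triples: 258 + 258 + 84)
```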
 
Problems: not good for English text since text has few run lengths > 3
 
Relative Encoding (differencing)
 
Used where strings of numbers don't differ from each other by much.
 
E.g. Telemetry: collecting data at certain intervals to be transmitted to a central location for processing
like temperatures:
70,71,72.5,73.1,.... becomes
70,1, 1.5,0.6,.... the differences are smaller and can be expressed in fewer bits.
 
Sometimes the differences are large, and then the actual value is used. Now we need a flag to say which is which. This can be done with strings of flag bits saying which values are source values and which are differences; the flag bits can be packed and transmitted intermittently (e.g. every 8 or 16 values). Another way is to mix them: each unit is either one 16-bit measurement or two 8-bit differences, and the flag says whether the unit is a measurement or a difference pair.
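
A minimal sketch of relative encoding with a per-value flag. To keep the arithmetic exact, the temperatures are assumed to be stored as integer tenths of a degree (70.0 becomes 700); the limit of 127 (what fits in a signed 8-bit difference) and the function names are also assumptions made for the example.

```python
def delta_encode(values, limit=127):
    """Return (flag, number) pairs: flag 0 = raw value, flag 1 = difference."""
    out, prev = [], None
    for value in values:
        if prev is not None and abs(value - prev) <= limit:
            out.append((1, value - prev))   # small change: send the difference
        else:
            out.append((0, value))          # first value or large jump: send as-is
        prev = value
    return out


def delta_decode(pairs):
    """Rebuild the original values from (flag, number) pairs."""
    values = []
    for flag, number in pairs:
        values.append(values[-1] + number if flag else number)
    return values


readings = [700, 710, 725, 731, 1200]  # tenths of a degree; last one is a jump
print(delta_encode(readings))
# [(0, 700), (1, 10), (1, 15), (1, 6), (0, 1200)]
```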
 
RLE Image Compression:
 
Black and White Images - only needs to record counts of runs - 17,1,55 means 17 white pixels followed by 1 black followed by 55 more white..... If the first pixel is black then the first value will be 0
 
The size of the compressed stream depends on the complexity of the image.
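
A minimal sketch for one row of a black and white image, following the convention above that the first count always refers to white. Pixels are represented as 0 (white) and 1 (black); that representation and the function names are assumptions made for the example.

```python
def rle_row(pixels):
    """Return run lengths for a row of 0/1 pixels, starting with a white run."""
    runs, current, count = [], 0, 0
    for pixel in pixels:
        if pixel == current:
            count += 1
        else:
            runs.append(count)      # close the run (0 if the row starts black)
            current, count = pixel, 1
    runs.append(count)
    return runs


def unrle_row(runs):
    """Expand run lengths back into pixels, alternating white and black."""
    pixels, colour = [], 0
    for length in runs:
        pixels.extend([colour] * length)
        colour ^= 1
    return pixels


row = [0] * 17 + [1] + [0] * 55
print(rle_row(row))          # [17, 1, 55]
print(rle_row([1, 1, 0]))    # [0, 2, 1]
```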
 
Can also use it for grey scale images....
- again we'll need an escape character but,
- since all values may be used:
1. if there are fewer than 8 bits of grey levels, can reserve 1 bit to flag grey level / count
2. if 8 or 16 bits, can reduce the number of grey levels by 1 and use the freed value as a flag
3. create bit strings that indicate which of the coming 8 / 16 bytes are values and which are counts
 
Colour Maps usually have 3 bytes/pixel (RGB)
- they can be separated into 3 groups which are encoded separately
 
Encoding of images is best done row by row
- gives us choices for scanning/ display
- there is likely to be a big difference between the end of one row and the start of the next so stringing rows together gives us no advantage
 
- encoding can be done row by row, column by column, diagonally, interlaced
- one can simply include information at the start of the file to tell which method was used.
- this also allows us to reference and/ or extract part of an image
- we can then merge two images without having to decompress
 
Move-to-Front:
- builds an alphabet of characters that appear in the data
- each time a character is encountered, it is encoded by its position in the alphabet and then moved to the front of the alphabet (alphabet keeps changing orders)
- it is adaptive
- tends to produce smaller code values than a fixed alphabet when the same symbols recur close together
 
E.g. "the truth is out there"
 
w/o move to front:
the alphabet: t h e _ r u i s o
C = 0,1,2,3,0,4,5,0,1,3,6,7,3,8,5,0,3,0,1,2,4,2
 
with move to front:
next symbol   current alphabet   code generated
t             the_ruiso          0
h             the_ruiso          1
e             hte_ruiso          2
_             eht_ruiso          3
t             _ehtruiso          3
r             t_ehruiso          4
u             rt_ehuiso          5
t             urt_ehiso          2
h             tur_ehiso          5
_             htur_eiso          4
i             _htureiso          6
s             i_htureso          7
_             si_htureo          2
o             _sihtureo          8
u             o_sihture          6
t             uo_sihtre          6
_             tuo_sihre          3
t             _tuosihre          1
h             t_uosihre          6
e             ht_uosire          8
r             eht_uosir          8
e             reht_uosi          1
(final)       erht_uosi
C = 0,1,2,3,3,4,5,2,5,4,6,7,2,8,6,6,3,1,6,8,8,1
 

 

 
We can then take it a step further if we want and encode C (the code string)
 
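To decode, we must save the alphabet in its original order. Notice that move-to-front doesn't always improve on compression. We can easily tell which is better by averaging the values in C: in this example the average is 60/22 (about 2.7) without move-to-front and 91/22 (about 4.1) with it, so move-to-front does worse here.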
 
Move-to-front works well when there are concentrations of identical symbols but not if common symbols are distributed evenly throughout the data.
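
A minimal sketch of move-to-front using the example above, with the starting alphabet passed in explicitly so the decoder can begin from the same order. The function names are made up for the example.

```python
def mtf_encode(text, alphabet):
    """Encode each symbol as its current position, then move it to the front."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        position = symbols.index(ch)
        codes.append(position)
        symbols.insert(0, symbols.pop(position))
    return codes


def mtf_decode(codes, alphabet):
    """Inverse of mtf_encode; needs the alphabet in its original order."""
    symbols = list(alphabet)
    out = []
    for position in codes:
        ch = symbols.pop(position)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


text = "the_truth_is_out_there"
codes = mtf_encode(text, "the_ruiso")
print(codes)
# [0, 1, 2, 3, 3, 4, 5, 2, 5, 4, 6, 7, 2, 8, 6, 6, 3, 1, 6, 8, 8, 1]
print(mtf_decode(codes, "the_ruiso") == text)  # True
```

On an input with long runs, such as "aaaabbbb", almost every code is 0, which is the situation where a later statistical coder benefits most.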
 
Variations:
1. move-ahead-k
2. wait-c-and-move
3. if text - treat each word separately rather than each character

