- CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 23, 2001 04:35 PM
Data Compression - Intro & Intuitive Methods
- "I have made this letter longer than usual because I lack the time to make it shorter."
Blaise Pascal (1623-1662)
-
- WHY?
- less storage
- faster transmission
- faster processing
-
- There are many methods of data compression, suited to different types of data and producing different results. All, however, are based on the same principle: removing redundancy.
-
- The general idea is to remove redundancy but the side-effect is that the data becomes less reliable, more error prone.
-
- Increasing reliability usually involves adding check bits and redundancy - data grows in size.
-
- Compression and reliability are essentially opposites: the latter is relatively recent whereas the former existed long before computers. (Braille, Morse Code)
-
- General Law of Compression:
- Apply shorter codes to common events and long codes to rare events.
-
- Another approach: Use Different Notation - e.g. code or abbrev. instead of longer text
- cost = readability; time to encode & decode; increased program complexity
-
- Question: Why can't an already compressed file be compressed further?
- answer: there is little or no redundancy left so there is nothing left to remove
- If it were possible then successive passes should be able to reduce the size of a file to just one byte or even one bit.
-
- A compressor (encoder) together with its decompressor (decoder) is sometimes called a codec.
-
- Some forms of data compression are non-adaptive : they work the same way regardless of the data
- - they are best used to compress similar kinds of data (e.g. Group3 or Group4 methods for fax compression)
-
- Others are adaptive : this usually involves looking at the raw data and modifying its operations &/or parameters accordingly (e.g. Adaptive Huffman method)
- - some of these require 2 passes over the data: the 1st examines the data and collects stats; the 2nd compresses (this makes them semi-adaptive)
- - doing 2 complete passes is slow so many use estimates or adapt as input is compressed
- - some are locally adaptive - the algorithm adapts itself to local conditions in the data stream
-
- Lossy vs Lossless Compression:
- lossy: compresses by discarding some information
- : when decompressed we don't get the original data back (can't)
- : can sometimes get away with this on images or sounds (but not on images used for analysis like astronomical images or forensic images)
- : others, like text files containing programs may become useless if even one byte is thrown out (some parts may be discarded like white space, or font&style data in other text files)
- Question: What other kinds of text files cannot afford to lose white space?
-
- Symmetrical Compression: compressor and decompressor essentially the same but work in opposite directions.
-
- Asymmetric Compression: one way works considerably harder than the other. This may be worthwhile when the codec is used more frequently in one direction than the other.
-
- Data can be encoded in streaming mode : compressed byte by byte or in block mode which often treats each block separately.
-
- Most methods are physical: they look at the bits and ignore the meaning.
- Some are logical: they respond to the content - they are usually special purpose.
-
- A Bit of Arithmetic:
- 1. compression ratio = (size of the output stream) / (size of the input stream)
- 2. compression factor = 1/ compression ratio
-
- 3. bpp = bits per pixel = number of bits needed on average to compress 1 pixel of an image
-
- 4. compression gain = 100 log_e (reference size / compressed size)
- - called the percent log ratio; it can be used to compare compression methods against each other
- 5. compression speed: cycles per byte (CPB)
- - average number of cycles to compress one byte
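- The size-based measures above (items 1, 2 and 4) are simple to compute. The sketch below is only an illustration: the function names and the sample sizes (1000 bytes in, 250 bytes out) are made up for the example, and the gain is taken to be 100 times the natural log of reference size over compressed size.

```python
import math

def compression_ratio(input_size, output_size):
    # Below 1.0 means the output is smaller than the input.
    return output_size / input_size

def compression_factor(input_size, output_size):
    # Reciprocal of the ratio: above 1.0 means compression was achieved.
    return input_size / output_size

def compression_gain(reference_size, compressed_size):
    # "Percent log ratio": 100 * ln(reference size / compressed size).
    return 100 * math.log(reference_size / compressed_size)

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
ratio = compression_ratio(1000, 250)
factor = compression_factor(1000, 250)
gain = compression_gain(1000, 250)
print(ratio, factor, round(gain, 1))
```

- A ratio of 0.25 and a factor of 4 describe the same result; the gain is mainly useful when two methods are compared against the same reference size.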
-
- There exist two sets of files publicly available to test compression algorithms:
- 1. "The Calgary Corpus": /home/ftp/pub/projects/text.compression.corpus
- 2. "The Canterbury Corpus": http://corpus.canterbury.ac.nz/
- - the second is somewhat more representative and less arbitrary than the first (both are internationally known)
- probability model: used in statistical methods
- : builds a model for the data before compression begins
-
- 3 main approaches to compression:
- 1. intuitive
- 2. statistical
- 3. dictionary
-
-
- Braille: 1820's; composed of 3x2 dots; = 6 bits of information = 64 codes
-
- uses just letters, digits, and punctuation - this doesn't use up all 64 possibilities so the rest are for common words (and, for, of) and strings (th, ation, ound).
-
- Braille achieves a small but significant compression. This is important as Braille letters are physically large
The use of acronyms amounts to compression of a kind as well (S.A.L.T., ASCII, NSA, ...)
Another example is the old 6-bit CDC Display Code commonly used in 2nd generation computers (and a few 3rd generation ones too). This was before the days of CRTs and upper/lower case letters, when printers were very limited.
-
| Bits 543 \ Bits 210 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | | A | B | C | D | E | F | G |
| 1 | H | I | J | K | L | M | N | O |
| 2 | P | Q | R | S | T | U | V | W |
| 3 | X | Y | Z | 0 | 1 | 2 | 3 | 4 |
| 4 | 5 | 6 | 7 | 8 | 9 | + | - | * |
| 5 | / | ( | ) | $ | = | <sp> | , | . |
| 6 | a | [ | ] | : | ` | _ | V | ^ |
| 7 | | | < | > | d | e | ¬ | ; |
-
- Some forms of compression amount to compaction... they are irreversible: e.g. removing white space - can be used on programs, not tables usually. In some kinds of text it is possible to remove all characters except letters and spaces - even the letters can be case flattened.
-
- Some Other Approaches:
- - remove the white space and add a bit string, placed before or after the text, with a 1 at each position where a space was:
- WE ARE NOT ALONE
- 0010001000100000
- - since ASCII is 7 bits they can be packed; compression ratio = 7/8 = 0.875
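- Both ideas above can be sketched in a few lines. This is an illustration, not a standard format: the function names are invented here, the bit string is kept as a Python string of '0'/'1' characters for readability, and pack7 assumes every character code is below 128.

```python
def strip_spaces(text):
    # One bit per original character: 1 where a space was, 0 elsewhere.
    bits = "".join("1" if ch == " " else "0" for ch in text)
    return text.replace(" ", ""), bits

def restore_spaces(letters, bits):
    # Walk the bit string, emitting a space for 1 and the next letter for 0.
    remaining = iter(letters)
    return "".join(" " if b == "1" else next(remaining) for b in bits)

def pack7(text):
    # Write each 7-bit code back to back, then pad to a whole byte.
    bitstring = "".join(format(ord(ch), "07b") for ch in text)
    bitstring += "0" * (-len(bitstring) % 8)
    return bytes(int(bitstring[i:i + 8], 2)
                 for i in range(0, len(bitstring), 8))

letters, bits = strip_spaces("WE ARE NOT ALONE")
packed = pack7("WE ARE NOT ALONE")
print(letters, bits, len(packed))
```

- For the 16-character example the packed form takes 14 bytes, which is the 7/8 ratio quoted above.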
-
- Baudot code (~1880) a 5-bit telegraph code
- In Baudot, characters are expressed using five bits. Baudot uses two code sub-sets, the "letter set" (LTRS) and the "figure set" (FIGS). The FIGS character (11011) signals that the following codes are to be interpreted as being in the FIGS set, until this is reset by the LTRS (11111) character.
- binary hex LTRS FIGS
--------------------------
00011 03 A -
11001 19 B ?
01110 0E C :
01001 09 D $
00001 01 E 3
01101 0D F !
11010 1A G &
10100 14 H #
00110 06 I 8
01011 0B J BELL
01111 0F K (
10010 12 L )
11100 1C M .
01100 0C N ,
11000 18 O 9
10110 16 P 0
10111 17 Q 1
01010 0A R 4
00101 05 S '
10000 10 T 5
00111 07 U 7
11110 1E V ;
10011 13 W 2
11101 1D X /
10101 15 Y 6
10001 11 Z "
01000 08 CR CR
00010 02 LF LF
00100 04 SP SP
11111 1F LTRS LTRS
11011 1B FIGS FIGS
00000 00 [..unused..]
-
- Where CR is carriage return, LF is linefeed, BELL is the bell, SP is space, and STOP is the stop character.
- Note: these bit values are often shown in inverse order, depending (presumably) on which side of the paper tape you were looking at.
- Local implementations of Baudot may differ in the use of #, STOP, BELL, and '.
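- The shift mechanism is easier to see in code. The sketch below uses the A-Z rows of the table above plus SP, LTRS and FIGS. It assumes, for the sake of the example, that a message starts in letter shift; the names encode, decode and ROWS are invented here.

```python
# Code, letter and figure columns copied from the table above (A-Z rows).
ROWS = """
00011 A -
11001 B ?
01110 C :
01001 D $
00001 E 3
01101 F !
11010 G &
10100 H #
00110 I 8
01011 J BELL
01111 K (
10010 L )
11100 M .
01100 N ,
11000 O 9
10110 P 0
10111 Q 1
01010 R 4
00101 S '
10000 T 5
00111 U 7
11110 V ;
10011 W 2
11101 X /
10101 Y 6
10001 Z "
"""

LTRS, FIGS, SPACE = 0b11111, 0b11011, 0b00100

BY_CODE, LETTER_CODE, FIGURE_CODE = {}, {}, {}
for line in ROWS.splitlines():
    if not line:
        continue
    bits, letter, figure = line.split()
    if figure == "BELL":
        figure = "\a"
    code = int(bits, 2)
    BY_CODE[code] = (letter, figure)
    LETTER_CODE[letter] = code
    FIGURE_CODE[figure] = code

def encode(text):
    codes, shift = [], LTRS          # assumption: start in letter shift
    for ch in text.upper():
        if ch == " ":                # SP is the same in both sets
            codes.append(SPACE)
            continue
        wanted = LTRS if ch in LETTER_CODE else FIGS
        if wanted != shift:          # emit a shift code only on a change
            codes.append(wanted)
            shift = wanted
        table = LETTER_CODE if wanted == LTRS else FIGURE_CODE
        codes.append(table[ch])
    return codes

def decode(codes):
    out, column = [], 0              # column 0 = LTRS, 1 = FIGS
    for c in codes:
        if c == LTRS:
            column = 0
        elif c == FIGS:
            column = 1
        elif c == SPACE:
            out.append(" ")
        else:
            out.append(BY_CODE[c][column])
    return "".join(out)

print([format(c, "05b") for c in encode("PAGE 39")])
```

- "PAGE 39" needs one FIGS code before the digits, so 7 characters become 8 five-bit codes (40 bits). Text that switches often between letters and figures pays for a shift code at every switch.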
-
- Dictionary Data: (lexicographically sorted data)
- - can use front compression
-
- E.g.
| Uncompressed | Compressed |
| eel | eel (1st can't be compressed) |
| eerie | 2rie |
| efface | 1fface |
| effect | 3ect |
| effeminate | 4minate |
| effervesce | 4rvesce |
| effigy | 3igy |
| effrontery | 3rontery |
| egg | 1gg |
| ego | 2o |
| egregious | 2regious |
| egress | 4ss |
| eisteddfod | 1isteddfod |
| eject | 1ject |
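- A minimal sketch of front compression that reproduces the table above. Two things are assumptions made for the example: the shared prefix count is capped at 9 so it always fits in one digit, and the first word is stored as-is with no count.

```python
def front_compress(words):
    out, prev = [], None
    for word in words:
        if prev is None:
            out.append(word)         # first word cannot be compressed
        else:
            n = 0
            # Count leading characters shared with the previous word.
            while n < min(len(prev), len(word), 9) and prev[n] == word[n]:
                n += 1
            out.append(f"{n}{word[n:]}")
        prev = word
    return out

def front_decompress(entries):
    words = [entries[0]]
    for entry in entries[1:]:
        n = int(entry[0])            # single-digit prefix count
        words.append(words[-1][:n] + entry[1:])
    return words

WORDS = ["eel", "eerie", "efface", "effect", "effeminate", "effervesce",
         "effigy", "effrontery", "egg", "ego", "egregious", "egress",
         "eisteddfod", "eject"]
print(front_compress(WORDS))
```

- Each entry depends on the one before it, so finding a word in the middle of the list means decoding from the start (or from a periodic uncompressed entry).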
-
- The MacWrite word processor uses 4-bit codes for the 15 most common characters (" etnroaisdlhcfp") plus an escape code. The common characters take 4 bits each; all others take the escape + ASCII = 12 bits.
- - each paragraph is encoded separately and if longer when compressed, left as ASCII
- - each paragraph is then preceded by a bit to say if it's encoded or not.
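- The scheme described above can be sketched as follows. This is not MacWrite's actual file format: the assignment of codes 0-14 by position in the string, the use of 15 as the escape, and the function name are all assumptions made for the illustration.

```python
COMMON = " etnroaisdlhcfp"    # the 15 common characters; assumed codes 0..14
ESCAPE = 15                   # assumed value of the escape code

def encode_paragraph(text):
    nibbles = []
    for ch in text:
        idx = COMMON.find(ch)
        if idx >= 0:
            nibbles.append(idx)                     # 4 bits
        else:
            code = ord(ch)                          # assumes 8-bit characters
            nibbles += [ESCAPE, code >> 4, code & 0x0F]   # 12 bits
    # Keep the coded form only if it is actually shorter than 8 bits/char.
    if len(nibbles) * 4 >= len(text) * 8:
        return False, text
    return True, nibbles

print(encode_paragraph("the hat"))
print(encode_paragraph("XYZ"))
```

- The first element of the result plays the role of the per-paragraph flag bit. Upper-case letters and digits are all escaped, so a paragraph made up mostly of those is left as plain text.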
-
- RLE (Run Length Encoding) Text Compression
-
- If data item d occurs n consecutive times then replace n occurrences with a single pair dn.
-
- We need some escape sequence to tell us a compressed pair is coming up.
- - now we have @dn; if n is 8 bits, that gives at most 255 repetitions, and each tuple takes 3 bytes
- - there is no gain in compressing any run of fewer than 3 characters (a run of exactly 3 only breaks even), so we can leave doubles alone and make @d0 mean 3 occurrences; this gives us a range of 3-258
- - if we come across more than 258 occurrences, we can simply start a new run after the first 258 are compressed
-
- Problems: not good for English text since text has few run lengths > 3
- : the escape character must be unused in the data
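- A minimal sketch of this scheme on byte strings, using "@" as the escape and a count byte n that stands for n + 3 repetitions. Working on Python bytes objects and rejecting input that contains the escape are choices made for the example.

```python
ESC = ord("@")    # escape byte; must not occur in the data

def rle_encode(data: bytes) -> bytes:
    if ESC in data:
        raise ValueError("escape byte occurs in the data")
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Measure the run starting at i, capped at 258 repetitions.
        while i + run < len(data) and data[i + run] == data[i] and run < 258:
            run += 1
        if run >= 3:
            out += bytes([ESC, data[i], run - 3])   # @ d n, n = run - 3
        else:
            out += data[i:i + run]                  # singles and doubles as-is
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == ESC:
            out += bytes([packed[i + 1]]) * (packed[i + 2] + 3)
            i += 3
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)

print(rle_encode(b"aaaaabbc"))
```

- A run of 300 identical bytes is written as two tuples: one for the first 258 and one for the remaining 42.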
-
- Relative Encoding (differencing)
-
- Used where strings of numbers don't differ from each other by much.
-
- E.g. Telemetry: collecting data at certain intervals to be transmitted to a central location for processing
- like temperatures:
- 70,71,72.5,73.1,.... becomes
- 70,1, 1.5,0.6,.... the differences are smaller and can be expressed in fewer bits.
-
- Sometimes the differences are large; then the actual value is sent instead, and we need a flag to say which is which. This can be done with strings of flag bits saying which values are measurements and which are differences; the flags can be packed and transmitted intermittently (say, every 8 or 16 values). Another way is to mix them: each unit is either one 16-bit measurement or two 8-bit differences, and the flag says whether it is a measurement or a difference pair.
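- A simplified sketch of differencing with a flag per value. To avoid floating-point rounding, the temperatures are assumed to be stored as integer tenths of a degree (70.0 becomes 700). The "M"/"D" tags stand in for the flag bits, and a difference is used only when it fits in a signed 8-bit range. The last reading (1200) is made up to show the fallback to a full measurement.

```python
def encode(readings):
    out = [("M", readings[0])]          # first value is always a measurement
    for prev, cur in zip(readings, readings[1:]):
        diff = cur - prev
        if -128 <= diff <= 127:
            out.append(("D", diff))     # small change: send the difference
        else:
            out.append(("M", cur))      # large jump: send the value itself
    return out

def decode(items):
    values = []
    for flag, x in items:
        values.append(x if flag == "M" else values[-1] + x)
    return values

readings = [700, 710, 725, 731, 1200]   # tenths of a degree
coded = encode(readings)
print(coded)
```

- Packing the flags into separate bit strings, or grouping two differences into one 16-bit unit as described above, changes only the layout of the output and not this logic.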
-
- RLE Image Compression:
-
- Black and White Images - only needs to record counts of runs - 17,1,55 means 17 white pixels followed by 1 black followed by 55 more white..... If the first pixel is black then the first value will be 0
-
- The size of the compressed stream depends on the complexity of the image.
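- A minimal sketch for one row of a black and white image, with 0 for white and 1 for black. Counting always starts with white, so a row that begins with a black pixel gets a leading 0. The function names are invented for the example.

```python
def encode_row(pixels):
    runs, colour, count = [], 0, 0      # runs alternate, starting with white
    for p in pixels:
        if p == colour:
            count += 1
        else:
            runs.append(count)          # close the current run
            colour, count = p, 1
    runs.append(count)
    return runs

def decode_row(runs):
    pixels, colour = [], 0
    for n in runs:
        pixels += [colour] * n
        colour ^= 1                     # flip between white and black
    return pixels

row = [0] * 17 + [1] + [0] * 55
print(encode_row(row))
print(encode_row([1, 1, 0]))
```

- A row that alternates colour at every pixel produces one count per pixel, which is why the compressed size depends on the complexity of the image.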
-
- Can also use it for grey scale images....
- - again we'll need an escape character but,
- - since all values may be used:
- 1. if < 8 bits grey levels can reserve 1 bit to flag grey scale/ count
- 2. if 8 or 16 bits, can give up one grey level and use that value as a flag
- 3. create bit strings that indicate which coming 8/ 16 bytes are values and which are counts
-
- Colour Maps usually have 3 bytes/pixel (RGB)
- - they can be separated into 3 groups which are encoded separately
-
- Encoding of images is best done row by row
- - gives us choices for scanning/ display
- - there is likely to be a big difference between the end of one row and the start of the next so stringing rows together gives us no advantage
-
- - encoding can be done row by row, column by column, diagonally, interlaced
- - one can simply include information at the start of the file to tell which method was used.
- - this also allows us to reference and/ or extract part of an image
- - we can then merge two images without having to decompress
-
- Move-to-Front:
- - builds an alphabet of characters that appear in the data
- - each time a character is encountered, it is encoded by its position in the alphabet and then moved to the front of the alphabet (alphabet keeps changing orders)
- - it is adaptive
- - tends to produce shorter codes with move-to-front than without
-
- E.g. "the truth is out there"
-
- w/o move to front:
- the alphabet: t h e _ r u i s o
- C = 0,1,2,3,0,4,5,0,1,3,6,7,3,8,5,0,3,0,1,2,4,2
-
- with move to front:
| next symbol of text | current alphabet | code generated |
| t | the_ruiso | 0 |
| h | the_ruiso | 1 |
| e | hte_ruiso | 2 |
| _ | eht_ruiso | 3 |
| t | _ehtruiso | 3 |
| r | t_ehruiso | 4 |
| u | rt_ehuiso | 5 |
| t | urt_ehiso | 2 |
| h | tur_ehiso | 5 |
| _ | htur_eiso | 4 |
| i | _htureiso | 6 |
| s | i_htureso | 7 |
| _ | si_htureo | 2 |
| o | _sihtureo | 8 |
| u | o_sihture | 6 |
| t | uo_sihtre | 6 |
| _ | tuo_sihre | 3 |
| t | _tuosihre | 1 |
| h | t_uosihre | 6 |
| e | ht_uosire | 8 |
| r | eht_uosir | 8 |
| e | reht_uosi | 1 |
| | erht_uosi | |
-
- We can then take it a step further if we want and encode C (the code string)
-
- To decode, we must save the alphabet in its original order. Notice that move-to-front doesn't always improve on compression. We can tell which is better easily by averaging the values in C.
-
- Move-to-front works well when there are concentrations of identical symbols but not if common symbols are distributed evenly throughout the data.
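- A minimal sketch of move-to-front, run on the same example and starting alphabet as the table above (with "_" standing for a space). The function names are invented for the illustration.

```python
def mtf_encode(text, alphabet):
    symbols = list(alphabet)
    codes = []
    for ch in text:
        i = symbols.index(ch)               # current position is the code
        codes.append(i)
        symbols.insert(0, symbols.pop(i))   # move the symbol to the front
    return codes

def mtf_decode(codes, alphabet):
    symbols = list(alphabet)                # decoder needs the original order
    out = []
    for i in codes:
        ch = symbols.pop(i)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)

text = "the_truth_is_out_there"
alphabet = "the_ruiso"
codes = mtf_encode(text, alphabet)
static = [alphabet.index(ch) for ch in text]   # without move-to-front
print(codes, sum(codes) / len(codes))
print(static, sum(static) / len(static))
```

- On this text the move-to-front codes sum to 91 and the fixed-alphabet codes to 60, so the averages show that move-to-front does worse here. The text has few runs of repeated symbols, which is the situation described above.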
-
- Variations:
- 1. move-ahead-k
- 2. wait-c-and-move
- 3. if text - treat each word separately rather than each character