CPSC 461: Copyright (C) 2003 Katrin Becker 1998-2002 Last Modified May 28, 2003 09:27 PM

Data Compression - Information Theory Basics


A few words on Information Theory....

Information Theory is primarily concerned with ways to quantify information.

It's based on the idea that the amount of information a message contains is directly related to the amount of surprise in the message.

 

Let's say s = #symbols transmitted / time unit
If there are n symbols, they must be (at least) $\log_2 n$ bits long.
R. V. L. Hartley (1928) first defined a measure of information. He said:
- suppose there are S symbols and a message of length l
- this means we can have $S^l$ messages
- the amount of information is defined as the log of the number of distinguishable messages

The amount of information transmitted / time unit (in bits) = $H = s \log_2 n$
(s = #symbols/time unit; $\log_2 n$ = size of a symbol in bits; H = #bits / time unit)
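
To make these quantities concrete, here is a minimal Python sketch. The function names are hypothetical, and taking the log to base 2 (so that the answers come out in bits) is an assumption; Hartley's definition only says "the log".

```python
import math

def bits_per_symbol(n):
    """Smallest whole number of bits for a fixed-size code over n symbols."""
    return math.ceil(math.log2(n))

def hartley_information(num_symbols, length):
    """log2 of the number of distinguishable messages, S**l, i.e. l * log2(S)."""
    return length * math.log2(num_symbols)

def information_rate(s, n):
    """H = s * log2(n): bits per time unit when s symbols are sent per time unit."""
    return s * math.log2(n)

print(bits_per_symbol(4))         # 4 symbols fit in 2 bits each
print(hartley_information(4, 8))  # a message of 8 symbols drawn from 4
print(information_rate(10, 4))    # 10 symbols per time unit, 4 symbols
```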
Claude Shannon is credited with extending this and allowing for the fact that not all messages have the same chance of occurring. His major contribution had to do with adding the element of uncertainty to the definitions first devised by Hartley. It is perhaps fitting that at least one source discussing Shannon's famous paper [A Mathematical Theory of Communication] has a quote from his obituary at the top of the paper and then says he lives in Michigan. It would seem there is some uncertainty as to his state of being. (Most would now concur that Shannon died early in 2001.)

To get a better picture of the information, we must assign probabilities of occurrence to the individual symbols:

(given n symbols)

sum-of-all-probabilities = 1 = $P_1 + P_2 + P_3 + \dots + P_n$
If all P are equal we get $nP = 1$,
implying that $P = 1/n$ (also $n = 1/P$), and resulting in
$H = s \log_2 n = s \log_2 (1/P) = -s \log_2 P$
Entropy 
Since symbol $a_i$ occurs a fraction $P_i$ of the time, it occurs on average $s P_i$ times in each time unit, so its contribution to H is $-s P_i \log_2 P_i$.
The sum of all of them is: $H = -s \sum_{i=1}^{n} P_i \log_2 P_i$ (this is called entropy)
We can define the entropy of a single symbol $a_i$ as $-P_i \log_2 P_i$.

Summed over all the symbols, this gives the smallest number of bits needed, on average, to represent a symbol.
The term entropy comes from thermodynamics and refers to disorder. It is more than simply the smallest number of bits needed to represent the symbol. When we are looking at the entropy of a message we can also say something about the disorder of the message, in other words its "randomness". Entropy is at a maximum when all the probabilities are the same (note: this is the opposite of what the Solomon book claims). To get an intuitive feel for what this means, imagine that the game is "guess the symbol". If the probabilities of all symbols are equal and there are, say, 4 symbols, we have a 1 in 4 chance of guessing right. On the other hand, if the probabilities are, say, 0.49, 0.25, 0.25, and 0.01 respectively, we would be wise to guess a1, since we would be right nearly half the time. So, maximum entropy happens when you have the worst chances of guessing the symbols right. It is also when we need the greatest number of bits to represent those symbols.
In addition to saying something about the disorder of the message, entropy also tells us the minimum number of bits needed to represent the symbols. Clearly, when all the probabilities are equal:
$-\sum_{i=1}^{n} P_i \log_2 P_i = \log_2 n$
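
The claim that entropy peaks when all probabilities are equal can be checked numerically. The following is a small sketch, not a definitive implementation; the function name `entropy` and the two probability lists are chosen for illustration (the skewed list reuses the probabilities from the guessing game above).

```python
import math

def entropy(probabilities):
    """Average information per symbol in bits: -sum(P_i * log2(P_i))."""
    # Symbols with probability 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.49, 0.25, 0.25, 0.01]

print(entropy(uniform))  # equals log2(4)
print(entropy(skewed))   # smaller than the uniform case
```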
Redundancy
Defn: the difference between symbol size (actual) and the entropy (minimal)
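$R = \log_2 (1/P) - \left(-\sum_{i=1}^{n} P_i \log_2 P_i\right)$   --OR--
$R = \log_2 n - \left(-\sum_{i=1}^{n} P_i \log_2 P_i\right)$ --OR more useful: --
$R = \text{symbol\_size} - \left(-\sum_{i=1}^{n} P_i \log_2 P_i\right)$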
so the test for fully compressed data is:
R = 0
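
As a rough sketch of the redundancy test, assuming hypothetical function names `entropy` and `redundancy`:

```python
import math

def entropy(probabilities):
    """-sum(P_i * log2(P_i)) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def redundancy(symbol_size, probabilities):
    """R = symbol_size - entropy, in bits per symbol."""
    return symbol_size - entropy(probabilities)

# Four equally likely symbols stored in 2 bits each: nothing left to remove.
print(redundancy(2, [0.25] * 4))

# The same four symbols stored in 8-bit bytes: most of each byte is redundant.
print(redundancy(8, [0.25] * 4))
```

A result of 0 (or very close to it, allowing for floating-point rounding) means the data is fully compressed for the given probabilities.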

Example: we have a 4-symbol alphabet: a1, a2, a3, a4
 
If they all have equal probability then $P_i = 0.25$ for all i
entropy = $-4 (0.25 \log_2 0.25) = 2$ (which = $\log_2 n$)
so entropy = maximum entropy and we use 00, 01, 10, 11
 
but if the probabilities are: 0.49, 0.25, 0.25, 0.01
then
entropy = $-(0.49 \log_2 0.49 + 2 (0.25 \log_2 0.25) + 0.01 \log_2 0.01)$
$\approx -(-0.504 - 1.0 - 0.066) = 1.57$
 
so the smallest # of bits required, on average, per symbol = 1.57 so we could assign:
a1 = 1, a2 = 01, a3 = 000, a4 = 001
 
If we assign 2 bit codes, redundancy is 2.0 - 1.57 = 0.43 which suggests the use of variable sized codes. (In other words, we are wasting bit space if we assign 2-bit codes to all of these symbols)

If we use the variable codes suggested, what is the redundancy?
We can calculate average symbol size by summing each symbol's probability multiplied by that symbol's code length:
symbol size = (.49 * 1 + .25 * 2 + .25 * 3 + .01 * 3)
 = (.49 + .5 + .75 + .03)
 = 1.77
Now, if we compare this against the entropy of the symbols, we get:
$R = \text{symbol\_size} - \left(-\sum_{i=1}^{n} P_i \log_2 P_i\right)$
R = 1.77 - 1.57
    = 0.20, not too bad (small is good in this case).
And, if we compare same-size (2-bit) codes against the entropy of the symbols, we get:
$R = \log_2 n - \left(-\sum_{i=1}^{n} P_i \log_2 P_i\right)$
R = 2.0 - 1.57
    = 0.43, not as good.
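
The arithmetic in this example is easy to get wrong by hand, so it can be worth checking with a short script. This is a sketch only; the dictionary layout and function names are assumptions made for illustration, while the probabilities and codes are the ones used in the example.

```python
import math

# Probabilities and variable-size codes from the four-symbol example.
probabilities = {"a1": 0.49, "a2": 0.25, "a3": 0.25, "a4": 0.01}
codes = {"a1": "1", "a2": "01", "a3": "000", "a4": "001"}

def entropy(probs):
    """-sum(P_i * log2(P_i)) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_code_length(probs, code_table):
    """Sum of each symbol's probability times its code length."""
    return sum(probs[sym] * len(code_table[sym]) for sym in probs)

h = entropy(probabilities.values())
variable_size = average_code_length(probabilities, codes)
fixed_size = math.log2(len(probabilities))  # same-size codes for 4 symbols

print(f"entropy: {h:.2f}")
print(f"variable-size codes: size {variable_size:.2f}, R = {variable_size - h:.2f}")
print(f"same-size codes:     size {fixed_size:.2f}, R = {fixed_size - h:.2f}")
```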

If we are to assign varying length codes, they must be assigned such that all are unambiguous. In other words, we must be able to distinguish between symbols and 'decode' them when we are reading them in as one great long bit string, essentially one bit at a time. The best way to do this is by applying the prefix property, which says that once a certain bit pattern has been assigned as the code of a symbol, no other code should start with that pattern. (This makes it a prefix code.)
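
The prefix property can be tested mechanically. Below is a minimal sketch, assuming a hypothetical helper named `is_prefix_code`. The first code set is the one suggested earlier for a1 to a4; the second is a made-up set in which the code 1 is also the start of every other code.

```python
def is_prefix_code(codes):
    """Return True if no code is the start of a different code in the list."""
    words = list(codes)
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            if i != j and second.startswith(first):
                return False
    return True

print(is_prefix_code(["1", "01", "000", "001"]))  # satisfies the prefix property
print(is_prefix_code(["1", "10", "11", "111"]))   # "1" starts every other code
```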


Here's what we mean...

symbol                            same size codes    variable codes, 1   variable codes, 2     variable codes, 3
a1 [A]                            00                 1                   1                     1
a2 [B]                            01                 01                  10                    10
a3 [C]                            10                 001                 11                    100
a4 [D]                            11                 000                 111                   110
sample message                    1011011010010001   00001010011101      1111110111110110      1001101010010010110
sample message (decode grouping)                                         11 11 11 0111110110   100 110 10 100 100 10 110
intended message                  CDBCCBAB           DBBCAAB             CDBCCBAB              CDBCCBAB
decoded message                   CDBCCBAB           DBBCAAB             CCC???                CDBCCBD
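
The difference between the code sets can be reproduced with a short script. This is an illustrative sketch under stated assumptions: `decode_prefix` and `all_parses` are hypothetical helper names, and the bit-at-a-time decoder is only reliable when the code set has the prefix property. The code tables and messages are the "variable codes, 1" and "variable codes, 3" columns above.

```python
def decode_prefix(bits, codes):
    """Decode one bit at a time, emitting a symbol as soon as a code matches."""
    lookup = {word: sym for sym, word in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            decoded.append(lookup[current])
            current = ""
    if current:
        raise ValueError("leftover bits: " + current)
    return "".join(decoded)

def all_parses(bits, codes):
    """List every way of splitting the bit string into codes."""
    if not bits:
        return [""]
    results = []
    for sym, word in codes.items():
        if bits.startswith(word):
            results += [sym + rest for rest in all_parses(bits[len(word):], codes)]
    return results

variable_1 = {"A": "1", "B": "01", "C": "001", "D": "000"}
variable_3 = {"A": "1", "B": "10", "C": "100", "D": "110"}

print(decode_prefix("00001010011101", variable_1))
print(all_parses("1001101010010010110", variable_3))
```

With a prefix code there is only one way to read the bit string. With "variable codes, 3" the same bits can be split in more than one way, which is why the decoded message can differ from the intended one.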