461 - Data Compression Review

DATA COMPRESSION REVIEW QUESTIONS

SHORT ANSWER QUESTIONS

[ 3 marks ] Why is Run Length Encoding generally a poor choice for compressing text?
[ 2 marks ] In Static Huffman Compression, if the resultant Huffman Tree is full and balanced, what can be said about the frequencies of the symbols?
[ 5 marks ] Next to Arithmetic Coding, Huffman Coding is said to produce the best compression ratios but LZW (and its variants) is the method of choice for image compression. Explain this discrepancy. (Hint: What do they actually encode?)
[ 2 marks ] If one considers that a common bit pattern is a representation of the null value, why might the use of odd parity be preferred over even parity for transmission of data?
[ 5 marks ] Entropy is defined as the number of bits required to represent one symbol of a set and is based on the probability of that symbol appearing in a given message. Its value is greatest when the probabilities of all the symbols are equal. How many bits would be required to represent a symbol whose probability of occurrence has been set to 0? Explain.
[ 2 marks ] What is one important aspect of a file that is usually lost when it gets compressed?
[ 4 marks ] We want to compress a file using RLE, but all possible bit strings are used in the data, so none is readily available for use as a repetition code. What do we do? (Offer 2 solutions.)
[ 3 marks ] Give one real-life situation (be specific) where compression with loss of data is acceptable.
[ 3 marks ] Match the compression method with the type of compression:

1. Assigning variable-length codes	____ run length encoding
2. Suppressing repeated sequences	____ sampling, as in voice coding
3. Irreversible compression	____ Huffman encoding
4. Using different notation

LONG QUESTIONS

[10 marks total] You are given a message that contains 4 symbols:
a1, a2, a3, a4
that appear with the following probabilities: a1: 0.5, a2: 0.3, a3: 0.1, a4: 0.1
They are currently encoded as straightforward two-bit values.
The entropy for the message is: H = -SUM( p_i * log2(p_i) ), so
H = -((.5 log2 .5) + (.3 log2 .3) + (.1 log2 .1) + (.1 log2 .1))
= -(-0.500 + -0.521 + -0.332 + -0.332)
= -(-1.685) = 1.685 bits per symbol
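The entropy sum can be checked numerically. The following is a minimal sketch, not part of the original question; the names `probabilities` and `entropy_bits` are illustrative assumptions.

```python
import math

# Symbol probabilities copied from the question (a1..a4).
probabilities = {"a1": 0.5, "a2": 0.3, "a3": 0.1, "a4": 0.1}

def entropy_bits(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p)).

    Assumes every probability is strictly positive.
    """
    return -sum(p * math.log2(p) for p in probs)

H = entropy_bits(probabilities.values())
print(f"entropy = {H:.3f} bits per symbol")  # about 1.685
```

Each 0.1 term contributes 0.1 * log2(0.1), which is roughly -0.332, so the two rare symbols together account for about 0.66 of the total.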

Varying length codes have been assigned to the symbols as follows:
a1: 1, a2: 01, a3: 000, a4: 001
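
A code table like this one can only be used if a decoder can tell where each codeword ends. The sketch below checks that property (no codeword is a prefix of another) and decodes a short sample; the helper names `is_prefix_free` and `decode` and the sample symbol sequence are assumptions made for illustration.

```python
# Code table copied from the question.
codes = {"a1": "1", "a2": "01", "a3": "000", "a4": "001"}

def is_prefix_free(table):
    """True if no codeword is a prefix of a different codeword."""
    words = list(table.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, table):
    """Decode a bit string by matching codewords left to right."""
    inverse = {word: symbol for symbol, word in table.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # a complete codeword has been read
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a codeword")
    return symbols

# Hypothetical sample message, chosen only to exercise every codeword length.
sample = ["a1", "a2", "a1", "a4", "a3"]
encoded = "".join(codes[s] for s in sample)
print(is_prefix_free(codes), encoded, decode(encoded, codes))
```

Greedy left-to-right matching is enough here because the table is prefix-free; it would not be safe for an arbitrary table.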

A. [5 marks]: Given the preceding information, what compression ratio would you expect in a typical message?

B. [5 marks]: If the frequencies vary, at what point would it be pointless to try to use variable-length codes? (Hint: I'm looking for a formula that uses the probabilities but not the formulas for entropy or redundancy.)

[15 marks total] Describe the characteristics of a data set (e.g. text sorted in ascending order) that would be best compressed using each of the following algorithms. In each case, explain why.

1. Arithmetic Coding:
2. Run Length Encoding:
3. LZW and its variants:
4. Huffman Coding:
5. Front Compression:

[10 marks total] Decode the following message, which has been encoded using LZW compression (case has been flattened).

[25 marks total] Encode the following string using first LZ77 and then LZW compression. The "base codes" are A-Z and <space> (assume flattened case). Show your 'dictionaries'. Compare the results (either qualitatively or quantitatively).

Enter the codes for the dictionary here:

The codes: (please enter them row by row)

[15 marks total] The encoder for the LZRW1 dictionary method of data compression uses a hash table to find match lengths. The hash table entries are simply pointers to the locations of matches in the search window. This hash table is implemented using linked chains for overflow entries. The key is a 12-bit value that is the result of hashing the first 3 characters of the match. Synonyms are chained, and the chains are linked such that the most recent synonym is first. These chains are truncated at a specified length rather than growing arbitrarily long. Explain how this might be implemented (i.e. What collision resolution technique must be used? How are the ends of the chains allowed to 'drop off'? How must the chains be linked? How is a new synonym added? How can we ensure that space occupied by out-of-date synonyms can be re-used?)
You may use point form, pseudo-code, or diagrams as necessary to support your explanation.

[10 marks total] Encode the following string using LZW compression. The "base codes" are a-z and <space>. Show your dictionary.
there are hits and there are misses and then there are misses
Enter the codes for the dictionary here:

The codes: (please enter them row by row)
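
For reference, the LZW encoding loop can be sketched as below. This is a textbook-style sketch under stated assumptions, not a marking scheme: the base codes are assumed to be numbered a-z = 0-25 and <space> = 26, new phrases take the next free number, and the demo input is deliberately a short string other than the one in the question. The name `lzw_encode` is illustrative.

```python
import string

def lzw_encode(text, alphabet):
    """Return (output codes, final dictionary) for text using basic LZW.

    Assumes every character of text appears in alphabet.
    """
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                      # keep extending the match
        else:
            codes.append(dictionary[current])        # emit longest known phrase
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            current = ch
    if current:
        codes.append(dictionary[current])            # flush the final phrase
    return codes, dictionary

# Assumed numbering: a-z are codes 0-25 and <space> is 26.
ALPHABET = string.ascii_lowercase + " "

codes, dictionary = lzw_encode("abababab", ALPHABET)
print(codes)  # [0, 1, 27, 29, 1]
```

In the demo, "ab" is added as code 27, "ba" as 28, "aba" as 29 and "abab" as 30, which shows how repeated phrases are replaced by progressively longer dictionary entries.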
[25 marks total]
A) [15/25 marks] Consider the following data for a static Huffman tree:
a. A bit string representing a binary tree structure:
[ 1 = node exists; 0 = no node]
1111111110011000000000011000000
b. A set of terminal symbols:
A B C D E F G
Draw the tree represented by the data.

B) [10/25 marks] Given the following tree and encoded message, decompress the message (this is not a trick; the message says something).