Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 12, 2000 01:35 PM

Programming Assignment:

Data Compression Experiment

(may be done in groups of 2|4|6)

(Group members must notify TA by email or in writing; all members of the group must be named at the same time. There will be no changes without the express approval of the TA. Notification must be made no later than OCTOBER 6. After this time there will be NO NEW GROUPS or GROUP CHANGES) 


For this assignment you are to implement a number of data compression algorithms, test them and then compare their relative performances in terms of speed and compression ratio. Each pair of members of the group must implement one technique (one person will implement the compression routine and the other will do the de-compression routine) and no two algorithms in your group may be closely related (for example, LZW and LZ77 are considered to be closely related).
Each pair in your group is to choose one algorithm to implement. Each choice must come from a different category [i.e. one from each colmn below].
Intuitive Methods
Bonus
Dictionary Methods
Bonus
Statistical Methods
Bonus
Arithmetic Methods
Bonus
run-length encoding
[0]
LZ77
[2]
Huffman Coding
[4]
Arithmetic coding
[6]
front compression
[0]
LZ78
[2]
Adaptive Huffman Coding
[6]
other*
[??]
relative encoding
[0]
LZRW1
[2]
Shannon-Fanno Coding
[4]
move-to-front
[0]
LZW
[2]
PPM
[6]
other*
[?]
GIF
[?!]
other*
[??]
other*
[?]
other*: If you wish to implement an algorithm not on the list, please check with the instructor first.
When these algorithms are tested, make sure they are all executed on the same data set(s) and under similar circumstances (i.e. same server, similar system load, etc.) Each must be run at least 10 times in order to verify the results.
If timings prove too difficult to measure on individual data files, time the program on a set of data files and use that to compute individual timings. One suggestion is to have the program compress the same file 100 or 1000 times. (Write a shell script for this)
The Data Set:
You will be provided with 4 data-sets: one containing all text; one containing ASCII representations of numbers; and one containing binary files, and one containing files already compressed. Run each CODEC on all four.
Collecting Results:
Calculate the following sets of values for each CODEC:
( Each "value set" should be calculated using as many trials as possible (minimum 10 per dataset). You need to find the minimum, maximum, and the average)
  1. the compression ratio (compressed size in bytes/ uncompressed size in bytes as a %)
  2. the time taken to compress a file (in ms/MegaByte)
  3. the time taken to decompress a file (in ms/MegaByte)
To gather the timing information see the UNIX man pages under clock.

For each CODEC, you should provide a table like the following:
CODEC Compression Ratio Compression Speed Decompression Speed
MIN MAX AVE MIN MAX AVE MIN MAX AVE
text-set
number-set
binary-set
other-set

A facility will be set up to allow you to submit these results once you have them. They will then be made available to the entire class and each individual will write up their own conclusions.

Conclusions:
Was there a clear winner?
Which ones works best on which types of files?
What general comments can be made about the results presented?
If it turns out that different groups have implemented the same algorithms and the 'statistical' results are different, how do you account for that?
Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 12, 2000 01:35 PM