CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 5, 2000 11:13 AM

HASHING - Part I

WHAT IS HASHING?

 

NAME      ASCII CODES    PRODUCT    HOME ADDRESS
BALL      66 65          4290       290
LOWELL    76 79          6004       004
TREE      84 82          6888       888
OLIVER    79 76          6004       004 ***
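
The home addresses listed above appear to come from multiplying the ASCII codes of the first two letters of each name and keeping the last three digits of the product. A minimal Python sketch under that assumption (the function name `home_address` is made up for illustration) shows how LOWELL and OLIVER end up at the same address:

```python
def home_address(name: str) -> int:
    """Multiply the ASCII codes of the first two letters, keep the last three digits."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000


for name in ["BALL", "LOWELL", "TREE", "OLIVER"]:
    # LOWELL (76 * 79) and OLIVER (79 * 76) give the same product, hence a collision.
    print(f"{name:8s} {home_address(name):03d}")
```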

 

Given that the keys conform to certain patterns, a perfect hashing function might be found. Here are four such algorithms:

1. Quotient Reduction. Assumes the keys are in ascending order and evenly spread, although not necessarily consecutive:

hash(w) = (w + s) / M     (integer division; s and M are chosen to fit the keys)

For example, if the keys are

1, 3, 8, 14, 17, 23

then (w + 3) / 5 yields

0, 1, 2, 3, 4, 5
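
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `quotient_reduction` is an illustrative choice, and the division is assumed to be integer division.

```python
def quotient_reduction(w: int, s: int, m: int) -> int:
    """hash(w) = (w + s) / M, using integer division."""
    return (w + s) // m


keys = [1, 3, 8, 14, 17, 23]
addresses = [quotient_reduction(w, s=3, m=5) for w in keys]
print(addresses)  # [0, 1, 2, 3, 4, 5]
```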

2. Remainder Reduction. Here the keys don't have to be evenly distributed.

hash(w) = ((d + wq) mod M) / K

In this case as well, d, q, M, K must be chosen only after close examination of the keys themselves.
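
One way to get a feel for how the constants depend on the keys is a brute-force search. The sketch below is not part of the method itself: the helper `find_parameters`, its search ranges, and the sample keys are all assumptions made for illustration, and the division by K is assumed to be integer division.

```python
from itertools import product


def remainder_reduction(w: int, d: int, q: int, m: int, k: int) -> int:
    """hash(w) = ((d + w*q) mod M) / K, using integer division."""
    return ((d + w * q) % m) // k


def find_parameters(keys, limit=50):
    """Hypothetical helper: try small d, q, M, K until the keys map onto 0..n-1."""
    n = len(keys)
    target = set(range(n))
    for m, k, q, d in product(range(n, limit), range(1, 10), range(1, 10), range(10)):
        if {remainder_reduction(w, d, q, m, k) for w in keys} == target:
            return d, q, m, k
    return None  # nothing found inside the (arbitrary) search ranges


keys = [1, 3, 8, 14, 17, 23]
params = find_parameters(keys)
if params is not None:
    d, q, m, k = params
    print(params, [remainder_reduction(w, d, q, m, k) for w in keys])
```

For real key sets the search space grows quickly, which is why the constants are normally chosen by studying the keys rather than by blind enumeration.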

3. Associated Value Hashing: (a.k.a. Cichelli's Algorithm) Assumes the keys are character strings.

hash(key) = key length
          + associated value of the key's first character
          + associated value of the key's last character

This one involves extensive searching of the keys to find appropriate associated values for the first and last characters based on the frequency of use of each character in both the first and last positions.

The Algorithm:

  1. Order the set of keys by the sum of the frequencies of occurrence of their first and last characters across the entire set. Place the keys with the most frequently occurring first and last characters at the top of the ordering.
  2. Modify the ordering. Move any key whose first and last characters both appear in prior keys as high in the ordering as possible while this condition still holds.
  3. Assign the value 0 to the first and last characters of the first key in the set.
  4. For each remaining key in the set, in order,
    1. If both the first and last characters have already been assigned values, hash the key to determine whether a conflict arises. If it does, discontinue processing on this key and recursively apply this step to the previous key (backtrack).
    2. If only one of the first or last characters has already been assigned a value, assign a value to the other character by trying all possibilities from 0 to the maximum allowable value. If a conflict still exists after trying all possible values for the other character, discontinue processing on this key and recursively apply this step to the previous key (backtrack).
    3. If both the first and last characters are as yet unassigned, vary the first and then the last character, trying each combination. If all combinations cause conflicts, discontinue processing on this key and recursively apply this step to the previous key (backtrack).
  5. If all keys have been processed, terminate successfully; otherwise terminate unsuccessfully.
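
The steps above can be sketched as a small backtracking search in Python. This is a simplified illustration, not Cichelli's original program: the names `cichelli` and `cichelli_hash`, the sample keys, and the `max_value` limit are assumptions, the keys are assumed to be unique, and the sketch only checks that hash values are distinct (it does not force them into a minimal range). The first key receives 0 for both characters on the first attempt simply because the search starts at 0.

```python
from collections import Counter
from itertools import product


def cichelli_hash(key, values):
    """hash(key) = length + value of first character + value of last character."""
    return len(key) + values[key[0]] + values[key[-1]]


def cichelli(keys, max_value):
    """Return {character: associated value}, or None if the search fails."""
    # Step 1: order keys by how often their first and last characters occur.
    freq = Counter()
    for key in keys:
        freq.update((key[0], key[-1]))
    remaining = sorted(keys, key=lambda k: -(freq[k[0]] + freq[k[-1]]))

    # Step 2: pull forward keys whose first and last characters were already seen.
    order, seen = [], set()
    while remaining:
        key = remaining.pop(0)
        order.append(key)
        seen.update((key[0], key[-1]))
        ready = [k for k in remaining if k[0] in seen and k[-1] in seen]
        order.extend(ready)
        remaining = [k for k in remaining if k not in ready]

    values, used = {}, set()

    def place(i):
        """Try to assign values for order[i:]; returning False means backtrack."""
        if i == len(order):
            return True
        key = order[i]
        # Characters of this key that have no associated value yet (0, 1, or 2).
        free = [c for c in dict.fromkeys((key[0], key[-1])) if c not in values]
        for combo in product(range(max_value + 1), repeat=len(free)):
            values.update(zip(free, combo))
            h = cichelli_hash(key, values)
            if h not in used:
                used.add(h)
                if place(i + 1):
                    return True
                used.remove(h)
        for c in free:  # undo this key's assignments before backtracking
            del values[c]
        return False

    return values if place(0) else None


keys = ["cat", "dog", "bird", "fish", "horse"]
values = cichelli(keys, max_value=4)
if values is not None:
    print({k: cichelli_hash(k, values) for k in keys})
```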

4. Reciprocal Hashing:

hash(w) = (c / (d * w + e)) mod n     (integer division)

n is the number of keys in the set. The constants c, d, and e must be found such that (c / (d * w + e)) mod n is different for each key; d and e are chosen so that the values d * w + e are pairwise relatively prime for all keys in the set.
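
A small search makes the two conditions concrete. The Python sketch below is an illustration only: the helper `find_constants`, its search limits, and the sample keys are assumptions, the keys are assumed to be positive integers, and the division is assumed to be integer division.

```python
from itertools import combinations
from math import gcd


def reciprocal_hash(w: int, c: int, d: int, e: int, n: int) -> int:
    """hash(w) = (c / (d*w + e)) mod n, using integer division."""
    return (c // (d * w + e)) % n


def find_constants(keys, limit=1000):
    """Hypothetical helper: search small c, d, e that satisfy both conditions."""
    n = len(keys)
    for d in range(1, 20):
        for e in range(20):
            transformed = [d * w + e for w in keys]
            # d and e must make the transformed keys pairwise relatively prime.
            if any(gcd(a, b) != 1 for a, b in combinations(transformed, 2)):
                continue
            # c must then give every key a different hash value.
            for c in range(1, limit):
                if len({reciprocal_hash(w, c, d, e, n) for w in keys}) == n:
                    return c, d, e
    return None  # nothing found inside the (arbitrary) search limits


keys = [2, 3, 5, 7]
constants = find_constants(keys)
if constants is not None:
    c, d, e = constants
    print(constants, [reciprocal_hash(w, c, d, e, len(keys)) for w in keys])
```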

In most cases, producing a perfect hashing algorithm isn't easy. Suppose we have 4000 records and 5000 places. It can be shown that only 1 out of 10^120,000 hashing algorithms avoids collisions totally. (By comparison, there are roughly 10^80 atoms in the observable universe.) Don't waste your time trying.

The solution is to reduce collisions to an acceptable level, for example so that only 1 out of 10 searches results in a collision. This leaves the average access time still pretty low.

Ways to reduce the number of collisions:

  1. Spread out the records: use a hashing function that distributes the records evenly among the available addresses. The example wouldn't work well, since there are lots of names that start with 'JO' or 'Mc'.
  2. Use extra memory: the example spreads 75 records among 1000 spaces. This works but wastes a lot of space.
  3. Put more than one record at a single address: make every file address big enough to hold several records (buckets).

HASH means to 'chop into small pieces ... muddle or confuse.'

A SIMPLE HASHING ALGORITHM:

  1. represent the key in numerical form
  2. fold and add
  3. divide by a prime number and use the remainder as an address

Step 1. Represent the key in numerical form: a common (and simple) approach is to take the ASCII codes of the key's characters and concatenate them into one number.

Step 2. Fold and add: chop the number into pieces (for example, sets of 2 or 3 codes) and add the pieces together. Take into account the size of the numbers we can work with (with 2-byte signed integers the maximum value is 32,767), and make sure each successive sum stays below that maximum.

  1. Find the largest allowable intermediate result. Using two letters per piece, all upper case, the largest piece is ZZ = 9090. If we keep the running sum below 19937, adding 9090 gives at most 19,936 + 9,090 = 29,026, which is below 32,767, so no overflow occurs.
  2. Use MOD to keep the running sum below 19937. We choose 19937 because it is a prime number, and a prime tends to produce a better random distribution than a non-prime.

Step 3. Divide by the size of the address space: this cuts the number down so that it falls within the range of addresses in the file. Again we can use MOD, and again a prime number produces better results; since we are not necessarily restricted to a specific size, we can choose one.

function hash (key, maxad)

{ key is the record key (12 characters long) and maxad is the size of the address space;
  key[j] stands for the ASCII code of the j-th character, counting from 0 }
    set sum to 0
    set j to 0
    while (j < 12)
        set sum to (sum + 100*key[j] + key[j+1]) mod 19937
        set j to j + 2
    endwhile
    return (sum mod maxad)
end function
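
For readers who want to run the algorithm, here is a direct Python rendering of the function above. Two details are assumptions rather than part of the notes: the name `simple_hash`, and padding short keys with blanks to reach 12 characters.

```python
def simple_hash(key: str, maxad: int, key_len: int = 12) -> int:
    """Fold-and-add hash: pair up ASCII codes, add them mod 19937, then mod maxad."""
    padded = key.ljust(key_len)[:key_len]  # assumption: pad with blanks to 12 chars
    total = 0
    for j in range(0, key_len, 2):
        # Two ASCII codes side by side, e.g. 'L','O' -> 76 and 79 -> 7679.
        pair = 100 * ord(padded[j]) + ord(padded[j + 1])
        total = (total + pair) % 19937
    return total % maxad


print(simple_hash("LOWELL", 101))  # 46
```

With the key LOWELL the folded pieces are 7679, 8769, 7676 and three blank pairs of 3232; the running sum ends at 13883, and 13883 mod 101 = 46.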

Hashing Functions and Record Distributions

  • Square the key and take the middle (randomize keys) - the mid-square method: treat the key as one large number, square it, and extract the number of digits needed from the middle (e.g., 453 * 453 = 205,209; taking the middle two digits gives 52). This works fairly well as long as the keys don't have many leading or trailing zeros.

  • Radix transformation (randomize keys) - convert the key to some other base and take the result modulo the maximum address (e.g., to get addresses 0-99: the key 453 has the base 11 equivalent 382; 382 mod 99 = 85, so 85 is the hash address). There are 2 approaches: 1. Convert to the new base; 2. Multiply the current digits by the new base.
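
Both worked examples above can be reproduced with a short Python sketch. The function names are illustrative, and the radix version follows the first approach (convert to the new base, then read the digits back as a decimal number); how a base-11 digit of 10 should be written is not covered in the notes, so the sketch simply writes it as "10".

```python
def mid_square(key: int, digits: int) -> int:
    """Square the key and take `digits` digits from the middle of the result."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


def radix_transform(key: int, base: int, modulus: int) -> int:
    """Rewrite the key in `base`, read the digits as decimal, reduce mod `modulus`."""
    digits = []
    while key > 0:
        key, rem = divmod(key, base)
        digits.append(str(rem))
    as_decimal = int("".join(reversed(digits)) or "0")
    return as_decimal % modulus


print(mid_square(453, 2))            # 52
print(radix_transform(453, 11, 99))  # 85
```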

