WHAT IS HASHING?

HASH FUNCTION: (also called MAPPING FUNCTION)
like a black box that produces an address every time you drop in a key.
function h(K) that transforms a key into an address that is used as the basis for storing and retrieving records
the address produced (home address) is the address of the record in the file
hashing addresses appear to be random - there is no obvious connection between the key and the location of the record (sometimes called randomizing)
the hashing function may result in two keys being transformed into the same address - this is a collision and must be dealt with
EXAMPLE:
have 75 records and 1000 spaces
key is hashed using ASCII codes of first two characters of name, multiplying and using rightmost 3 digits of result for address

NAME      ASCII CODES   PRODUCT   HOME ADDRESS

BALL      66 65         4290      290
LOWELL    76 79         6004      004
TREE      84 82         6888      888
OLIVER    79 76         6004      004 ***

Notice that two keys (LOWELL and OLIVER) hash to the same address. Keys that hash to the same address are called synonyms.
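The table can be reproduced with a few lines of Python. This is only a sketch: the function name home_address is an illustrative choice, not part of the notes, and it assumes every name has at least two ASCII characters.

```python
def home_address(name):
    """Multiply the ASCII codes of the first two characters; keep the rightmost 3 digits."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # the rightmost three digits of the product


for name in ["BALL", "LOWELL", "TREE", "OLIVER"]:
    print(f"{name:<8}{home_address(name):03d}")
# LOWELL and OLIVER both print 004, so they are synonyms.
```

Because multiplication is commutative, any two names whose first two letters are swapped (LO and OL) will always collide under this function.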
To solve:
a) choose a hashing algorithm to minimize collisions
b) get tricky about how to store records (collision resolution algorithms)
Ideally we want to use an algorithm that produces no collisions (perfect hashing algorithm).

Given that the keys conform to certain patterns, a perfect hashing function might be found. Here are four such algorithms:

1. Quotient Reduction. It assumes the keys are in ascending order and evenly spread, although not contiguous:

hash(w) = (w + s) / M     (integer division - the remainder is discarded)

let's say the keys are:

1, 3, 8, 14, 17, 23, then (w + 3) / 5 yields

0, 1, 2, 3, 4, 5
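A quick check of this example, assuming the division is integer division. The function name quotient_reduction is chosen here for illustration.

```python
def quotient_reduction(w, s, m):
    """hash(w) = (w + s) / M, using integer division."""
    return (w + s) // m


keys = [1, 3, 8, 14, 17, 23]
print([quotient_reduction(w, 3, 5) for w in keys])  # [0, 1, 2, 3, 4, 5]
```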

2. Remainder Reduction. Here the keys don't have to be evenly distributed.

hash(w) = ((d + wq) mod M) / K

In this case as well, d, q, M, K must be chosen only after close examination of the keys themselves.

3. Associated Value Hashing (a.k.a. Cichelli's Algorithm). Assumes the keys are character strings.

hash(key) = key length
          + associated value of the key's first character
          + associated value of the key's last character

This one involves extensive searching of the keys to find appropriate associated values for the first and last characters based on the frequency of use of each character in both the first and last positions.

The Algorithm:

Order the set of keys based upon the sum of the frequencies of occurrence of the first and last characters of the keys in the entire set. Place the keys with the most frequently occurring first and last characters at the top of the ordering.
Modify the ordering. Move any key whose first and last characters both appear in earlier keys as high in the ordering as possible while that condition still holds.

Assign the value 0 to the first and last characters of the first key in the ordering.

For each remaining key in the set, in order,
1. If the first and last characters have already been assigned values, hash the key to determine if a conflict arises. If yes, discontinue processing on this key and recursively apply this step with the previous key (backtrack).
2. If either the first or last character has already been assigned a value, assign a value to the other character by trying all possibilities from 0 to the maximum allowable value. If a conflict still exists after trying all possible values for the other character, then discontinue processing on this key and recursively try this step with the previous key (backtrack).
3. If both the first and last characters are as yet unassigned, vary the first and then the last character, trying each combination. If all combinations cause conflicts, then discontinue processing on this key and recursively apply this step to the previous key (backtrack).
If all keys have been processed, terminate successfully, else terminate unsuccessfully.
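The search can be sketched as a small backtracking program. This is a simplified illustration, not a faithful reproduction of Cichelli's procedure: it performs only the first (frequency) ordering pass, and instead of fixing the first key's characters at 0 it simply tries 0 first and backtracks if needed. The names cichelli_hash, find_values and max_value, and the sample keys, are assumptions made for the example.

```python
from collections import Counter


def cichelli_hash(key, values):
    """hash(key) = key length + value of first character + value of last character."""
    return len(key) + values[key[0]] + values[key[-1]]


def find_values(keys, max_value):
    """Backtracking search for associated values; returns a dict, or None on failure."""
    # Order keys by how often their first and last characters occur in the set.
    freq = Counter(k[0] for k in keys) + Counter(k[-1] for k in keys)
    ordered = sorted(keys, key=lambda k: freq[k[0]] + freq[k[-1]], reverse=True)

    values = {}   # character -> associated value chosen so far
    used = set()  # hash addresses already occupied

    def place(i, pending):
        """Place ordered[i]; `pending` lists its characters that still need a value."""
        if pending:
            char, rest = pending[0], pending[1:]
            for v in range(max_value + 1):
                values[char] = v
                if place(i, rest):
                    return True
            del values[char]  # every value conflicts: backtrack to the previous key
            return False
        address = cichelli_hash(ordered[i], values)
        if address in used:
            return False
        used.add(address)
        if place_next(i + 1):
            return True
        used.discard(address)  # a later key failed: release this address
        return False

    def place_next(i):
        if i == len(ordered):
            return True  # all keys processed: terminate successfully
        ends = dict.fromkeys((ordered[i][0], ordered[i][-1]))  # de-duplicated
        return place(i, [c for c in ends if c not in values])

    return dict(values) if place_next(0) else None


keys = ["cat", "dog", "bird", "horse", "mouse"]
values = find_values(keys, max_value=4)
for k in keys:
    print(k, cichelli_hash(k, values))
```

A small max_value keeps the table compact but makes failure more likely; raising it widens the search at the cost of a sparser address range.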

4. Reciprocal Hashing:

hash(w) = (c / (d * w + e)) mod n

n is the number of keys in the set, and the division is integer division. The constants d and e are chosen so that the values d * w + e are pairwise relatively prime for all keys in the set, and the constant c must then be found such that (c / (d * w + e)) mod n is different for each key.
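For a small key set the constant c can be found by brute force. The sketch below assumes integer division and uses keys that are already pairwise relatively prime, so d = 1 and e = 0 suffice. The function name find_c and the search bound limit are illustrative choices, not part of the method.

```python
from itertools import combinations
from math import gcd


def find_c(keys, d=1, e=0, limit=10_000):
    """Return the smallest c for which (c // (d*w + e)) % n is distinct for every key."""
    n = len(keys)
    divisors = [d * w + e for w in keys]
    # The method requires the transformed keys to be pairwise relatively prime.
    assert all(gcd(a, b) == 1 for a, b in combinations(divisors, 2))
    for c in range(1, limit):
        if len({(c // v) % n for v in divisors}) == n:
            return c
    return None  # nothing found below the (arbitrary) search bound


keys = [2, 3, 5, 7]
c = find_c(keys)
print(c, [(c // w) % len(keys) for w in keys])  # 6 [3, 2, 1, 0]
```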

In most cases producing a perfect hashing algorithm isn't easy. Suppose we have 4000 records and 5000 places. It can be shown that only about 1 out of 10^120,000 hashing algorithms avoids collisions totally. (BTW there are only about 10^80 atoms in the observable Universe.) Don't waste your time trying.

The solution is to reduce the collisions to an acceptable level - for example, so that only 1 out of 10 searches results in a collision. This keeps the average number of accesses per search pretty low.

Ways to reduce the number of collisions:

spread out the records: choose a hashing function that distributes the records evenly among the available addresses. The name example above wouldn't work well - there are lots of names that start with 'JO' or 'Mc'
use extra memory: the example spreads 75 records among 1000 spaces. This works but wastes a lot of space
put more than one record at a single address: make every file address big enough to hold several records (buckets)

HASH means to 'chop into small pieces ... muddle or confuse.'

A SIMPLE HASHING ALGORITHM:

represent the key in numerical form
fold and add
divide by a prime number and use the remainder as an address

Step 1. represent the key in numerical form: a common approach (and simple) is to use the ASCII codes of the key to make a number: concatenate them

Step 2. fold and add: cut the number into pieces (like sets of 2 or 3 codes) and add them together. Take into account the size of the numbers we can work with (with 2-byte signed integers the max. value is 32,767). We have to make sure each successive sum stays below our allowable max.

find the largest allowable intermediate result. If using two letters per piece and they are all upper case, the largest piece is ZZ = 9090. If we keep every running sum below 19937, adding 9090 gives at most 19936 + 9090 = 29026, which doesn't overflow, so we're OK.
Use MOD to keep each running sum below 19937 - choose 19937 because it's a prime number and produces a better random distribution than a non-prime.

Step 3. divide by the size of the address space: to cut down the number produced so it falls within the range of addresses within the file. Again can use MOD. Again a prime number produces better results and we are not necessarily restricted to a specific number.

function hash (key, maxad)

{ key is the record key (12 chars long) and maxad is the number of addresses }
{ key[j] stands for the ASCII code of character j, counting from 0 }
    set sum to 0
    set j to 0
    while (j < 12)
        set sum to (sum + 100*key[j] + key[j+1]) mod 19937
        set j to j + 2
    endwhile
    return (sum mod maxad)
end function
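A runnable version of the function might look like the sketch below. Padding short keys with blanks to 12 characters is an assumption (the notes only say the key is 12 chars long), as is the name fold_and_add_hash. Note that 100*code + code only acts as concatenation when both codes have two digits (upper-case letters, digits, blanks).

```python
def fold_and_add_hash(key, maxad):
    """Fold the key into 2-character pieces, add them mod 19937, reduce mod maxad."""
    padded = key.ljust(12)[:12]  # assumption: blank-pad short keys, truncate long ones
    total = 0
    for j in range(0, 12, 2):
        piece = 100 * ord(padded[j]) + ord(padded[j + 1])  # e.g. "LO" -> 7679
        total = (total + piece) % 19937  # keep the running sum small
    return total % maxad


print(fold_and_add_hash("LOWELL", 101))  # 46
```

For LOWELL the pieces are 7679, 8769, 7676 and three blank pairs of 3232; the running sum ends at 13883, and 13883 mod 101 = 46.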

Hashing Functions and Record Distributions

Ideal is when we have no synonyms, worst is when all keys hash to the same address, acceptable is when we have a small number of synonyms.
Some other Hashing Methods:
Examine keys for a pattern (use natural order of keys) - either all or part of the key can be examined (useful for static files); EXAMPLE: after looking at the keys we determine that the 9th, 7th, 5th, and 2nd digits within the keys are fairly evenly distributed, so we can extract those 4 digits to form the address.

Fold parts of the key (use natural order of keys) - destroys original key pattern but may preserve certain subsets of keys that naturally spread themselves out
Divide the key by a number (use natural order of keys) - preserves consecutive key sequences - but produces collisions when there are several consecutive sequences

- truncation - cut off part of the number and use it
- extraction - pull out digits from elsewhere

Square the key and take the middle (randomize keys) - mid-square method: treat the key as one large number, square it, and extract the number of digits needed from the middle (453 * 453 = 205,209 - taking the middle two digits gives us 52). This works pretty well as long as keys don't have many leading or trailing zeros.

Radix transformation (randomize keys) - convert the key to some other base; take the result modulo the maximum address (ex. to get addr 0-99: if key is 453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address) - 2 approaches: 1. Convert to new base; 2. Multiply current digits by new base
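The worked numbers in this list can be checked with a short sketch. The function names and the sample key "123456789" are illustrative assumptions. The radix function follows the first approach (convert to the new base, then read the digits as a decimal number), and writing a digit of 10 or more as its decimal text is a simplification made here.

```python
def extract_digits(key, positions):
    """Digit extraction: form an address from the digits at 1-based positions."""
    return int("".join(key[p - 1] for p in positions))


def mid_square(key, digits):
    """Square the key and return `digits` digits taken from the middle."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


def radix_transform(key, base, modulus):
    """Rewrite the key in `base`, read the digits as decimal, take the remainder."""
    digits = []
    while key > 0:
        key, remainder = divmod(key, base)
        digits.append(str(remainder))
    converted = int("".join(reversed(digits)) or "0")
    return converted % modulus


print(extract_digits("123456789", [9, 7, 5, 2]))  # 9752
print(mid_square(453, 2))                         # 52
print(radix_transform(453, 11, 99))               # 85
```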
