sequential search is O(N) access
B-Trees give us O(log_k N)
O(1) means that a lookup always takes the same small number of accesses, no matter how large the file is - this is the ideal
hashing can give us O(1)
EXAMPLE:
have 75 records and 1000 spaces
key is hashed by taking the ASCII codes of the first two characters of the name, multiplying them together, and using the rightmost 3 digits of the product as the address
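A minimal Python sketch of the hash described in this example. The function name `two_char_hash` and the sample names are illustrative assumptions, not part of the original notes.

```python
def two_char_hash(name: str) -> int:
    """Multiply the ASCII codes of the first two characters and keep
    the rightmost three decimal digits, giving an address in 0..999."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000  # rightmost 3 digits of the product


# Hypothetical sample names, chosen only for illustration.
for name in ["BALL", "LOWELL", "TREE"]:
    print(name, two_char_hash(name))
```

For these sample names the products are 66 * 65 = 4290, 76 * 79 = 6004, and 84 * 82 = 6888, so the addresses are 290, 4 (i.e. 004), and 888.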
Given that the keys conform to certain patterns, a perfect hashing function might be found. Here are four such algorithms:
1. Quotient Reduction. It assumes the keys are in ascending order and evenly spread, although not contiguous:
hash(w) = (w + s) / M   (integer division; the remainder is discarded)
let's say the keys are:
1, 3, 8, 14, 17, 23, then (w + 3) / 5 yields
0, 1, 2, 3, 4, 5
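The example can be checked with a few lines of Python. This is only a sketch; the function name `quotient_reduction` is an assumption, and `//` is used because the formula relies on integer division.

```python
def quotient_reduction(w: int, s: int, m: int) -> int:
    """hash(w) = (w + s) / M using integer division."""
    return (w + s) // m


keys = [1, 3, 8, 14, 17, 23]
addresses = [quotient_reduction(w, s=3, m=5) for w in keys]
print(addresses)  # [0, 1, 2, 3, 4, 5]
```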
2. Remainder Reduction. Here the keys don't have to be evenly distributed.
hash(w) = ((d + wq) mod M) / K   (integer division)
In this case as well, d, q, M, K must be chosen only after close examination of the keys themselves.
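A small sketch of remainder reduction in Python. The key set and the constants d = 0, q = 3, M = 11, K = 2 are made up for illustration and were found by hand for exactly these keys; a different key set would need its own constants.

```python
def remainder_reduction(w: int, d: int, q: int, m: int, k: int) -> int:
    """hash(w) = ((d + w*q) mod M) / K using integer division."""
    return ((d + w * q) % m) // k


# Hypothetical, unevenly distributed keys.
keys = [2, 3, 5, 40, 41, 44]
addresses = [remainder_reduction(w, d=0, q=3, m=11, k=2) for w in keys]
print(addresses)  # [3, 4, 2, 5, 1, 0]
```

Each of the six keys lands on a different address in 0..5, so for this particular key set the function is perfect.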
3. Associated Value Hashing: (a.k.a. Cichelli's Algorithm) Assumes the keys are character strings.
hash(key) = key length +
associated value of the key's first character +
associated value of the key's last character
This one involves extensive searching of the keys to find appropriate associated values for the first and last characters based on the frequency of use of each character in both the first and last positions.
The Algorithm:
Assign the value 0 to the first and last characters of the key in the set.
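To make the formula concrete, here is a sketch in Python that only evaluates the hash. The key set and the associated-value table are hypothetical and were picked by hand; the search that Cichelli's algorithm performs to find such a table is not shown.

```python
# Hypothetical associated values; any letter not listed counts as 0.
ASSOCIATED_VALUE = {"f": 2, "g": 2}


def associated_value_hash(key: str) -> int:
    """hash(key) = key length + value(first char) + value(last char)."""
    first = ASSOCIATED_VALUE.get(key[0], 0)
    last = ASSOCIATED_VALUE.get(key[-1], 0)
    return len(key) + first + last


keys = ["cat", "dog", "bird", "fish"]  # hypothetical key set
table = {key: associated_value_hash(key) for key in keys}
print(table)  # {'cat': 3, 'dog': 5, 'bird': 4, 'fish': 6}
```

With every associated value at 0 the hashes would be 3, 3, 4, 4, so the nonzero values for "f" and "g" are what separate the colliding keys.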
4. Reciprocal Hashing:
hash(w) = (c / (d * w + e)) mod n   (integer division)
n is the number of keys in the set. The constants d and e are chosen so that the values d * w + e are pairwise relatively prime for all keys in the set, and c must then be found such that (c / (d * w + e)) mod n is different for each key.
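A sketch of reciprocal hashing in Python. The key set is hypothetical and already pairwise relatively prime, so d = 1 and e = 0 are assumed; the helper `find_c` is an illustrative brute-force search, not an efficient method.

```python
def reciprocal_hash(w: int, c: int, d: int, e: int, n: int) -> int:
    """hash(w) = (c / (d*w + e)) mod n using integer division."""
    return (c // (d * w + e)) % n


def find_c(keys, d=1, e=0, limit=10_000):
    """Try c = 1, 2, ... and return the first one that gives every
    key a different address, or None if none is found below limit."""
    n = len(keys)
    for c in range(1, limit):
        if len({reciprocal_hash(w, c, d, e, n) for w in keys}) == n:
            return c
    return None


keys = [3, 5, 7, 11]  # hypothetical keys, pairwise relatively prime
c = find_c(keys)
addresses = [reciprocal_hash(w, c, 1, 0, len(keys)) for w in keys]
print(c, addresses)
```

For these keys, c = 45 is one constant that works: 45/3 = 15, 45/5 = 9, 45/7 = 6, 45/11 = 4, which give addresses 3, 1, 2, 0 after mod 4.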
In most cases producing a perfect hashing algorithm isn't easy. Suppose we have 4000 records and 5000 places. It can be shown that only 1 out of 10^120,000 possible hashing algorithms avoids collisions totally. (BTW there are ~10^60 atoms in the Universe) Don't waste your time trying.
The solution is to reduce collisions to an acceptable level - say only 1 out of 10 searches results in a collision - which still keeps the average access cost pretty low.
Ways to reduce the # collisions:
HASH means to 'chop into small pieces ... muddle or confuse.'
A SIMPLE HASHING ALGORITHM:
Step 1. represent the key in numerical form: a common (and simple) approach is to take the ASCII codes of the key's characters and concatenate them to make a number
Step 2. fold and add: cut the number into pieces (like sets of 2 or 3 codes) and add them together. Take into account the size of the numbers we want to work with (if 2-byte signed integers, then max. value = 32,767). Each successive sum must stay below our allowable max.; taking the running sum MOD some number after each addition does this, and a prime number works well.
Step 3. divide by the size of the address space: take the sum MOD the number of addresses so that the result falls within the range of addresses in the file. Again a prime number produces better results, so when the size of the address space is ours to choose we can pick a prime.
function hash (key, maxad)
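The three steps can be sketched in Python as follows. This is one possible reading, not the notes' own routine: padding the key to 12 characters with blanks, folding two characters at a time, and using the prime 19937 to keep the running sum under 32,767 are all assumptions made for this sketch.

```python
def fold_and_add_hash(key: str, maxad: int,
                      key_length: int = 12, fold_prime: int = 19937) -> int:
    """Fold-and-add hash returning an address in 0..maxad-1.

    key_length (even) and fold_prime are assumed values, see above.
    """
    # Step 1: fixed-length key; each character is used as its ASCII code.
    padded = key.ljust(key_length)[:key_length]
    total = 0
    # Step 2: fold two characters at a time and add.
    for i in range(0, key_length, 2):
        # "Concatenate" two codes, e.g. 76 and 79 become 7679.
        pair_value = ord(padded[i]) * 100 + ord(padded[i + 1])
        # With 7-bit ASCII, 19936 + 12827 < 32767, so no sum overflows.
        total = (total + pair_value) % fold_prime
    # Step 3: reduce to the address space.
    return total % maxad


print(fold_and_add_hash("LOWELL", 101))  # 46
```

For "LOWELL" the folded pieces are 7679, 8769, 7676 and three blank pairs of 3232 each; the running sum ends at 13883, and 13883 mod 101 = 46.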
Some other Hashing Methods:
Examine keys for a pattern (use natural order of keys) - either all or part of the key can be examined (useful for static files); EXAMPLE: after looking at the keys we determine that the 9th, 7th, 5th, and 2nd digits within the keys are fairly evenly distributed, so we can extract those 4 digits to form the address.
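A short Python sketch of the digit extraction in this example. The 9-digit key is made up, and positions are counted from 1 starting at the left, which is an assumption about how the notes number the digits.

```python
def extract_digits(key: str, positions) -> int:
    """Build an address from the digits at the given 1-based positions."""
    return int("".join(key[p - 1] for p in positions))


# Hypothetical 9-digit key; take the 9th, 7th, 5th and 2nd digits.
print(extract_digits("379452461", [9, 7, 5, 2]))  # 1457
```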
Divide the key by a number (use natural order of keys) - preserves consecutive key sequences - but produces collisions when there are several consecutive sequences
- truncation - cut off part of the number and use it
- extraction - pull out digits from elsewhere
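The behaviour of the division method can be seen in a small Python sketch, assuming that "divide" means taking the remainder after division by the table size. The table size 101 and the two key runs are hypothetical.

```python
def division_hash(key: int, table_size: int) -> int:
    """Address = remainder of key divided by the table size."""
    return key % table_size


run_a = range(310, 315)  # one run of consecutive keys
run_b = range(411, 416)  # a second run, exactly 101 higher

print([division_hash(k, 101) for k in run_a])  # [7, 8, 9, 10, 11]
print([division_hash(k, 101) for k in run_b])  # [7, 8, 9, 10, 11]
```

Each run keeps its consecutive order, but the two runs map onto the same addresses, which is the collision problem noted above.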
Square the key and take the middle (randomize keys) - mid-square method: treat the key as one large number, square it, and extract the number of digits needed from the middle (453 * 453 = 205,209 - taking the middle 2 digits gives us 52). This works well as long as keys don't have many leading or trailing zeros.
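The mid-square example can be reproduced with a short Python sketch. The function name and the rule for picking the middle (leaning left when the leftover digits cannot be split evenly) are assumptions.

```python
def mid_square_hash(key: int, num_digits: int) -> int:
    """Square the key and return num_digits digits from the middle."""
    squared = str(key * key)
    # Start so the slice is centred; leans left if centring is uneven.
    start = (len(squared) - num_digits) // 2
    return int(squared[start:start + num_digits])


print(mid_square_hash(453, 2))  # 453 * 453 = 205209, middle digits: 52
```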
Radix transformation (randomize keys) - convert the key to some other base; take the result modulo the maximum address (ex. to get addr 0-99: if key is 453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address) - 2 approaches: 1. Convert to the new base; 2. Multiply the current digits by the new base
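A Python sketch of the first approach (convert to the new base), reproducing the 453 example. The function name is an assumption, as is reading the base-11 digits back as a decimal number, which is what the example appears to do.

```python
def radix_transform_hash(key: int, base: int, modulus: int) -> int:
    """Write key in the given base, read the digits back as if they
    were decimal, then reduce modulo modulus."""
    digits = []
    n = key
    while n > 0:
        digits.append(n % base)  # least significant digit first
        n //= base
    value = 0
    for digit in reversed(digits):
        # A digit of 10 or more (possible when base > 10) simply
        # carries into the next decimal place.
        value = value * 10 + digit
    return value % modulus


print(radix_transform_hash(453, 11, 99))  # 453 is 382 in base 11; 382 mod 99 = 85
```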