- CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified November 16, 2001 11:16 AM
HASHING - Part V
Dynamic Hashing
-
- The main idea behind dynamic hashing is to allow us to extend the address space. Our goal is to increase the actual number of addresses rather than just allowing each address [i.e. hash function value] to hold more keys.
-
- To make this work we need more than one hash function.
-
- We start with h0(k) and, as the address space grows, move to h1(k).
-
- Note h0(k) and h1(k) must be related so that both will work at the same time if necessary (we don't want to have to re-load the table).
-
- One simple way to do this is to use a hash function that can produce a fractional result and then use successive portions of the fraction to address subdivisions of the original hash address.
-
- Dynamic hashing allows for slower, more gradual growth. By contrast, extendible hashing deals with growth by doubling the directory (but there are no additional addresses to hash to).
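The "successive portions of the fraction" idea can be sketched roughly as follows. This is only an illustration, not the method from the notes: the multiplicative hash `fractional_hash`, the helper `address`, and the choice of binary subdivision are all assumptions made for the example.

```python
def fractional_hash(key: int) -> float:
    """Hypothetical hash into [0, 1): multiplicative (golden-ratio) hashing."""
    A = 0.6180339887498949  # assumed constant, not from the notes
    return (key * A) % 1.0


def address(key: int, n_base: int, depth: int) -> tuple[int, int]:
    """Return (original bucket, subdivision index within that bucket).

    The integer part of the scaled hash picks one of n_base original
    buckets; the next `depth` binary digits of the leftover fraction
    pick one of 2**depth subdivisions of that bucket.
    """
    scaled = fractional_hash(key) * n_base
    base = int(scaled)              # plays the role of h0(k)
    rest = scaled - base            # leftover fraction in [0, 1)
    sub = int(rest * 2 ** depth)    # successive bits of the fraction
    return base, sub


if __name__ == "__main__":
    for depth in range(3):
        print(depth, address(12345, 5, depth))
```

Because each extra level only reads more digits of the same fraction, a key's deeper address always refines its shallower one, which is what lets old and new hash functions work at the same time.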
Linear Hashing
Assume we have an address space consisting of N buckets with addresses 0, 1, ..., N-1.
For placement of new keys: When a bucket (any bucket) becomes full, splitting occurs in a predetermined order. There is no directory.
Assume we have a hash function h0(k) = k mod 5, so the address space is buckets 0 through 4, with the split pointer p at bucket 0.
When we split, bucket 0 gets split first and roughly half the keys from bucket 0 are moved into the new bucket 5. Currently all keys in bucket 0 end in 0 or 5, so to split we move all keys that end in 5 to the new bucket. The split pointer p then advances to bucket 1.
When searching now:
If h0(k) < p then the bucket we are looking for has already been split, so we use h1(k) = k mod 10. Once all buckets are split, we only use h1(k) and we start the process again.
We need some way to handle full buckets that aren't due to split yet, so we must reserve a number of overflow buckets. Each time we add an overflow bucket we do another split. Eventually those buckets that have overflowed will be redistributed.
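The lookup rule and the split step can be sketched in a few lines. The function names `lh_address` and `split_bucket` are made up for this illustration; `n0` is the number of buckets at the start of the current round (5 in the example) and `p` is the split pointer.

```python
def lh_address(key: int, n0: int, p: int) -> int:
    """Bucket for `key` when buckets 0..p-1 have already been split."""
    a = key % n0                # h0(k)
    if a < p:                   # that bucket was split earlier in this round
        a = key % (2 * n0)      # so use h1(k)
    return a


def split_bucket(keys: list[int], n0: int, p: int) -> tuple[list[int], list[int]]:
    """Split bucket p: return (keys that stay in p, keys that move to n0 + p)."""
    stay = [k for k in keys if k % (2 * n0) == p]
    move = [k for k in keys if k % (2 * n0) == p + n0]
    return stay, move


if __name__ == "__main__":
    # Bucket 0 under h0(k) = k mod 5 holds keys ending in 0 or 5.
    print(split_bucket([10, 15, 20, 25], n0=5, p=0))
    # After the split, p = 1, so key 15 is looked up with h1.
    print(lh_address(15, n0=5, p=1))
```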
- An alternative to actually waiting for a bucket to get full is to set an overall load factor. Fix an upper and lower bound on the load factor and then we expand (contract) whenever we get above (below) the bound.
- The expected cost of expanding or collapsing these buckets varies cyclically.
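As a rough sketch of the load-factor variant, the toy table below splits whenever the load factor exceeds an upper bound. Everything here is an assumption made for illustration: the class name `LinearHashTable`, the parameters `capacity` and `max_load`, and the use of a growing Python list to stand in for overflow buckets. Contraction at a lower bound is left out.

```python
class LinearHashTable:
    """Toy linear hash table for integer keys, split on load factor."""

    def __init__(self, n0: int = 5, capacity: int = 2, max_load: float = 0.8):
        self.n0 = n0                # buckets at the start of this round
        self.p = 0                  # next bucket to split
        self.capacity = capacity    # nominal records per bucket
        self.max_load = max_load    # upper bound on the load factor
        self.buckets = [[] for _ in range(n0)]
        self.count = 0

    def _address(self, key: int) -> int:
        a = key % self.n0
        if a < self.p:              # already split: use the next function
            a = key % (2 * self.n0)
        return a

    def load_factor(self) -> float:
        return self.count / (len(self.buckets) * self.capacity)

    def _split(self) -> None:
        m = 2 * self.n0
        old = self.buckets[self.p]
        self.buckets.append([k for k in old if k % m != self.p])  # bucket n0 + p
        self.buckets[self.p] = [k for k in old if k % m == self.p]
        self.p += 1
        if self.p == self.n0:       # round finished: every bucket was split
            self.n0, self.p = m, 0

    def insert(self, key: int) -> None:
        # A list longer than `capacity` stands in for an overflow chain.
        self.buckets[self._address(key)].append(key)
        self.count += 1
        while self.load_factor() > self.max_load:
            self._split()

    def contains(self, key: int) -> bool:
        return key in self.buckets[self._address(key)]


if __name__ == "__main__":
    t = LinearHashTable()
    for k in range(100):
        t.insert(k)
    print(len(t.buckets), t.n0, t.p, round(t.load_factor(), 3))
```

Note that the bucket that triggers the split is usually not the one that gets split; the split pointer alone decides the order.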
Spiral Storage
Spiral storage deals with the cyclical cost increases of linear hashing. The trick here is in how we map the hashing function onto the address space itself. To make this work we need a progression of address 'sets' whose values partially overlap from one phase to the next.
-
- We start with a function that maps uniformly into [0,1), in other words 0 <= h(k) < 1 (a standard system random number generator might work using the key itself as the seed)
-
- Our goal is to map the key onto the address range given by:
- [floor(d^S), floor(d^(S+1)) - 1]
- We specify d as our growth factor (it must be some value > 1).
-
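- Given that N is related to the number of addresses we have to use, we calculate S as follows (taking d = 2, as in the examples later on):
- d^(S+1) - d^S = N
- solving for S we get
- S = log_d N
- (For a general d the difference is d^S (d-1), so S = log_d N gives about N (d-1) addresses.)
- We generate an address in the given address space using the function floor(d^x). This one is called the expansion function.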
-
- Note that this scheme intentionally loads the beginning of the address space more heavily than the end.
- We have a set number of buckets; each the same size (i.e. each bucket holds the same number of records) BUT we distribute the keys unevenly:
- - earlier buckets experience a heavier load than later buckets
- - the initial hash function produces a random distribution but the mapping function does not
- { the mapping function maps the hash result onto an address }
- The Result: earlier buckets fill first.
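A small simulation makes the uneven loading visible. This is a sketch under assumptions: uniform random numbers stand in for hash values, and the helper names `spiral_address` and `bucket_shares`, the seed, and the trial count are all invented for the example. The mapping x = ceiling(S - h) + h is the one used in the worked examples of these notes.

```python
import math
import random
from collections import Counter


def spiral_address(h: float, d: float, S: float) -> int:
    """Map a hash value h in [0, 1) to an address via floor(d^x)."""
    x = math.ceil(S - h) + h        # x lies in [S, S+1) with fractional part h
    return math.floor(d ** x)


def bucket_shares(d: float = 2.0, N: int = 5,
                  trials: int = 100_000, seed: int = 461) -> dict[int, float]:
    """Fraction of uniformly hashed keys that land in each address."""
    rng = random.Random(seed)       # fixed seed so runs are repeatable
    S = math.log(N, d)
    counts = Counter(spiral_address(rng.random(), d, S) for _ in range(trials))
    return {addr: counts[addr] / trials for addr in sorted(counts)}


if __name__ == "__main__":
    for addr, share in bucket_shares().items():
        print(addr, f"{100 * share:.1f}%")
```

With d = 2 and N = 5 the shares should fall off steadily from address 5 to address 9, even though the hash values themselves are uniform.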
-
- Now, when we need more space....
- add new buckets at the end (logically/conceptually at least)
- move all the keys from the 1st bucket into the new buckets (using a NEW mapping function)
- "decommission" the first bucket
To make this work:
- we need a series of mapping functions whose values "overlap" so that records in older buckets (i.e. ones that didn't get moved when we grew) can still be found.
We need a function F(x) and a related function F'(x) [and another related function F''(x) for the next time we grow...]. F'(x) must have more values (addresses) than F(x) did.
- Let's start with a key K and some function hash(K) such that hash(K) lies in [0, 1).
- We also need 2 other values:
- d : the growth factor (i.e. how many buckets to add each time we grow)
- N : a 'scale' factor; the number of addresses we have is (approximately) given by N * (d-1) = A
- In other words: N = A / (d-1).
- The value hash(K) must be mapped onto a real number x in the range [S, S+1). The fractional parts of x and hash(K) must agree; the examples that follow use x = ceiling(S - hash(K)) + hash(K).
-
- Once we have d and N we can calculate S:
- N = ceiling(d^(S+1)) - floor(d^S) when d = 2 (for a general d this difference is about N (d-1)), so...
- S = log_d N
-
- Example 1: N=8 (8 addresses) d=2 (growth factor)
- so 2^(S+1) - 2^S = 8 => S = log_2 8 = 3
- first address is 2^3 = 8
- last address is 2^4 - 1 = 15
- if h(K) = 0.4 then x = ceiling(3 - 0.4) + 0.4 = 3.4
- trunc(2^3.4) = 10 (10 is the address we map to)
- Example 2:
- Suppose we have a growth factor (d) of 2 and a scale factor (N) of 5:
-
-
- N=5 (5 addresses) d=2 (growth factor)
- so 2^(S+1) - 2^S = 5 => S = log_2 5 = 2.32193
-
- first address is 2^2.32193 = 5
- last address is 2^3.32193 - 1 = 9
-
- if h(K) = 0.75 then x = ceiling(2.32193 - 0.75) + 0.75 = 2.75
- floor(2^2.75) = 6 (6 is the address we map to)
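Both worked examples can be checked with a short script. The helper names `scale_exponent`, `to_x` and `address` are made up for this sketch; the formulas are the ones used in the examples.

```python
import math


def scale_exponent(d: float, N: float) -> float:
    """S = log_d N."""
    return math.log(N) / math.log(d)


def to_x(h: float, S: float) -> float:
    """The x in [S, S+1) whose fractional part equals h."""
    return math.ceil(S - h) + h


def address(h: float, d: float, N: float) -> int:
    """Expansion function floor(d^x) applied to the mapped hash value."""
    return math.floor(d ** to_x(h, scale_exponent(d, N)))


if __name__ == "__main__":
    print(address(0.4, 2, 8))    # Example 1: h(K) = 0.4
    print(address(0.75, 2, 5))   # Example 2: h(K) = 0.75
```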
-
- Here's how the rest of the values map:
-
| x | H(key) | address | % of keys |
|---|---|---|---|
| [2.32193, 2.58496) | [0.32193, 0.58496) | 5 | 26.3 |
| [2.58496, 2.80735) | [0.58496, 0.80735) | 6 | 22.2 |
| [2.80735, 3.00000) | [0.80735, 1.00000) | 7 | 19.3 |
| [3.00000, 3.16993) | [0.00000, 0.16993) | 8 | 17.0 |
| [3.16993, 3.32193) | [0.16993, 0.32193) | 9 | 15.2 |

- When bucket 5 fills and we need to grow the address space we add:

| x | H(key) | address | % of keys |
|---|---|---|---|
| [3.32193, 3.45943) | [0.32193, 0.45943) | 10 | 13.7 |
| [3.45943, 3.58496) | [0.45943, 0.58496) | 11 | 12.6 |

- We move records from bucket 5 into the new ones and DELETE bucket 5.
- And when bucket 6 fills we grow the address space again, so we add:

| x | H(key) | address | % of keys |
|---|---|---|---|
| [3.58496, 3.70044) | [0.58496, 0.70044) | 12 | 11.5 |
| [3.70044, 3.80735) | [0.70044, 0.80735) | 13 | 10.7 |
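The rows of the table follow directly from logarithms: address a receives every x in [log_d a, log_d (a+1)), so its share of the keys is the width of that interval. A minimal sketch, with the helper names `bucket_row` and `print_table` invented for the illustration:

```python
import math


def bucket_row(addr: int, d: float = 2.0) -> tuple[float, float, float, float]:
    """Return (x_low, x_high, hash_low, share) for one address."""
    lo = math.log(addr, d)
    hi = math.log(addr + 1, d)
    return lo, hi, lo % 1.0, hi - lo    # share = width of the x interval


def print_table(first: int, last: int, d: float = 2.0) -> None:
    """Print the x range, starting hash value and key share per address."""
    for a in range(first, last + 1):
        lo, hi, h_lo, share = bucket_row(a, d)
        print(f"{a:>3}  x in [{lo:.5f}, {hi:.5f})  "
              f"h from {h_lo:.5f}  {100 * share:5.1f}%")


if __name__ == "__main__":
    print_table(5, 13)
```

The shares of any set of live buckets first..(d * first - 1) add up to 1, which is why buckets 5 to 9 and buckets 6 to 11 each account for all of the keys.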
- When we need to grow we simply change S from log_2 5 to log_2 6 and plug the new S back into the rest of our function.
-
- So now NOTHING will map to 5, and those keys that used to map to 5 will instead map to 10 and 11.
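The growth step can be sketched by computing each record's address before and after S changes. The names `address` and `grow` are hypothetical, and records are represented only by their hash values; `first` is the lowest live bucket, so S = log_d first.

```python
import math


def address(h: float, d: float, first: int) -> int:
    """Address of hash value h when the lowest live bucket is `first`."""
    S = math.log(first) / math.log(d)
    x = math.ceil(S - h) + h
    return math.floor(d ** x)


def grow(hashes: list[float], d: float, first: int) -> dict[float, tuple[int, int]]:
    """Return {h: (old address, new address)} for records that must move
    when the lowest live bucket advances from `first` to `first + 1`."""
    moved = {}
    for h in hashes:
        old, new = address(h, d, first), address(h, d, first + 1)
        if old != new:
            moved[h] = (old, new)
    return moved


if __name__ == "__main__":
    # Hash values 0.4 and 0.5 sit in bucket 5; 0.75 and 0.1 do not.
    print(grow([0.4, 0.5, 0.75, 0.1], d=2, first=5))
```

Only the records in the decommissioned bucket change address; everything in buckets 6 to 9 is found exactly where it was before.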