CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 5, 2000 11:19 AM

Predicting the Distribution of Records

There are no convenient mathematical tools for predicting collisions when a distribution is better than random, but there are tools for understanding the behaviour when records are distributed randomly. These can still give us useful estimates. The likelihood of collisions in hash tables relates to a well-known mathematical diversion:

The Birthday Surprise: How many randomly chosen people need to be in a room before it becomes likely that two of them will have the same birthday (month and day only)? Since (apart from leap years) there are 365 possible birthdays, most people guess that the answer will be in the hundreds, but in fact it is 23!

Get person 1 and check off her birthday; get person 2, and the probability that he will have a different birthday is 364/365; the probability for person 3 is 363/365, and so on. If the first m - 1 people have different birthdays, then the probability that person m has a different birthday is

$$\frac{365 - m + 1}{365}$$

People's birthdays are independent of each other, so the probabilities multiply, and the probability that m people all have different birthdays is

$$\frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - m + 1}{365}$$

The expression becomes less than 0.5 whenever m ≥ 23 (it is about 0.524 for m = 22 and about 0.493 for m = 23).
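The product above is easy to check numerically. The following is a minimal sketch, assuming 365 equally likely birthdays; the function names `prob_all_different` and `smallest_group` are made up for this illustration.

```python
def prob_all_different(m, days=365):
    """Probability that m people all have different birthdays."""
    prob = 1.0
    for i in range(m):
        # Person i + 1 must avoid the i birthdays already checked off.
        prob *= (days - i) / days
    return prob


def smallest_group(threshold=0.5, days=365):
    """Smallest m for which prob_all_different(m) falls below threshold."""
    m = 1
    while prob_all_different(m, days) >= threshold:
        m += 1
    return m


if __name__ == "__main__":
    for m in (10, 20, 22, 23, 24, 30):
        print(m, round(prob_all_different(m), 4))
    print("smallest m with probability below 0.5:", smallest_group())
```

Changing `days` to the number of addresses in a hash table gives a rough feel for how few keys are needed before a collision becomes likely.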

Poisson Distribution:

We want to predict the number of collisions expected in a file that can hold only one record per address.
Look at what happens to a single address when a hash function is applied to a key.
When all of the keys are hashed, what are the chances that:
1. none will hash to the given address?
2. exactly one will hash to the given address?
3. exactly 2 keys will hash to the given address?
4. exactly 3, 4, ... keys will hash to the given address?
5. all keys will hash to the given address?
When a single key is hashed, there are two possible outcomes:

1. the address was not chosen (A)

2. the address was chosen (B)

We express the probabilities as follows:

$$p(B) = b = \frac{1}{N},$$

since the address has one chance in N of being chosen, and

$$p(A) = a = \frac{N - 1}{N} = 1 - \frac{1}{N},$$

since the address has N - 1 chances in N of not being chosen.

If there are 10 addresses, then the probability of a given address being chosen is 0.1 and the probability of it not being chosen is 0.9. If two keys are hashed, p(BB) = 0.1 * 0.1 = 0.01, since the two outcomes are independent of each other.

Other outcomes are possible when we hash two keys:

1 key 'hits' while the other misses:

$$p(BA) = b \times a = \frac{1}{N} \times \left(1 - \frac{1}{N}\right) = 0.1 \times 0.9 = 0.09 \quad \text{for } N = 10$$

In general when we want to know the probability of a certain sequence of outcomes, such as BABBA we can compute the result with simple substitution:

p(BABBA) = b * a * b * b * a = b³a²; for N = 10: (0.1)³ * (0.9)² = 0.00081

If we want to know the probability of 2 As and 2 Bs, regardless of order, we start with one particular ordering:

p(AABB) = a * a * b * b = a²b²; for N = 10: (0.9)² * (0.1)² = 0.0081

The other orderings are BBAA, BABA, BAAB, ABBA and ABAB, and the probability of each is the same.

Since these six orderings are mutually exclusive, the probability of 2 As and 2 Bs is the sum of the individual probabilities: 0.0081 * 6 = 0.0486
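The counting argument can be checked by brute force. This is only an illustrative sketch: it lists every sequence of As and Bs of a given length, keeps those with the required number of As, and adds up their probabilities. The helper names `sequence_probability` and `orderings` are assumptions, not part of the notes.

```python
from itertools import product


def sequence_probability(seq, n_addresses):
    """Probability of one particular sequence of outcomes, e.g. "BABBA"."""
    b = 1 / n_addresses   # B: the address was chosen
    a = 1 - b             # A: the address was not chosen
    prob = 1.0
    for outcome in seq:
        prob *= a if outcome == "A" else b
    return prob


def orderings(num_a, num_b):
    """All distinct sequences containing num_a As and num_b Bs."""
    length = num_a + num_b
    return ["".join(s) for s in product("AB", repeat=length)
            if s.count("A") == num_a]


if __name__ == "__main__":
    seqs = orderings(2, 2)
    print(seqs)
    print("number of orderings:", len(seqs))
    print("total probability for N = 10:",
          sum(sequence_probability(s, 10) for s in seqs))
```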

In general, the event "r trials result in r - x As and x Bs" can happen in as many ways as r - x letters A can be distributed among r places. The probability of each such way is

$$a^{r-x} b^{x}$$

and the number of such ways is given by the formula

$$C = \frac{r!}{(r - x)!\, x!}$$

This is the formula for the number of ways to select x items out of a set of r items.

When r keys are hashed, the probability that an address will be chosen x times and not chosen r - x times can be expressed as

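$$p(x) = C\, a^{r-x} b^{x}$$

If we know there are N addresses, we get

$$p(x) = C \left(1 - \frac{1}{N}\right)^{r-x} \left(\frac{1}{N}\right)^{x}$$

We can use this to compute the probability that a given address will have 0 records assigned to it:

$$p(0) = C \left(1 - \frac{1}{N}\right)^{r} \left(\frac{1}{N}\right)^{0}$$

If x = 1 we can compute the probability that one record will be assigned to a given address:

$$p(1) = C \left(1 - \frac{1}{N}\right)^{r-1} \left(\frac{1}{N}\right)^{1}$$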
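For small r and N the formula can be evaluated directly. The sketch below is one possible transcription, using Python's `math.comb` for C; the function name `binomial_p` is an assumption made for illustration.

```python
from math import comb


def binomial_p(x, r, n_addresses):
    """Probability that a given address is chosen exactly x times
    when r keys are hashed into n_addresses addresses."""
    b = 1 / n_addresses          # address chosen
    a = 1 - b                    # address not chosen
    return comb(r, x) * a ** (r - x) * b ** x


if __name__ == "__main__":
    # Two keys, ten addresses: miss twice, hit once, hit twice.
    for x in range(3):
        print(x, binomial_p(x, 2, 10))
```

For r = 4 and N = 10, `binomial_p(2, 4, 10)` covers the case of 2 As and 2 Bs in any order.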

If N and r are large, this is a pain to compute, so we can use an approximation: the Poisson function.

The Poisson Function Applied to Hashing

$$p(x) = \frac{(r/N)^{x}\, e^{-(r/N)}}{x!}$$

where N = the number of available addresses; r = the number of records to be stored; x = the number of records assigned to a given address. Then p(x) gives the probability that a given address will have had x records assigned to it after the hashing function has been applied to all r records.

Suppose that there are 1,000 addresses (N = 1,000) and 1,000 records whose keys are to be hashed to the addresses (r = 1,000). Since r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) is

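$$p(0) = \frac{1^{0} e^{-1}}{0!} = 0.368$$

Similarly, the probabilities that a given address will have exactly one, two, or three keys hashed to it are

$$p(1) = \frac{1^{1} e^{-1}}{1!} = 0.368$$

$$p(2) = \frac{1^{2} e^{-1}}{2!} = 0.184$$

$$p(3) = \frac{1^{3} e^{-1}}{3!} = 0.061$$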
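To see how close the approximation is, the Poisson values can be printed next to the binomial values they stand in for. This is a sketch under the same assumptions as the text (N = r = 1,000); the names `poisson_p` and `binomial_p` are chosen here for illustration.

```python
from math import comb, exp, factorial


def poisson_p(x, r, n_addresses):
    """Poisson approximation to the probability of x keys at one address."""
    ratio = r / n_addresses
    return ratio ** x * exp(-ratio) / factorial(x)


def binomial_p(x, r, n_addresses):
    """Binomial probability of x keys at one address, for comparison."""
    b = 1 / n_addresses
    return comb(r, x) * (1 - b) ** (r - x) * b ** x


if __name__ == "__main__":
    r = n = 1000
    for x in range(4):
        print(x, round(poisson_p(x, r, n), 4), round(binomial_p(x, r, n), 4))
```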

In general, if there are N addresses, then the expected number of addresses with x records assigned to them is

Np(x)

Another way to think about it is that p(x) gives the proportion of addresses having x logical records assigned by hashing.

Predicting Collisions for a Full File

10,000 records and 10,000 addresses:

How many addresses are expected to have NO records assigned?

$$p(0) = \frac{1^{0} e^{-1}}{0!} = 0.3679$$

Number of addresses with no records = 10,000 * p(0) = 3,679

Similarly, for addresses with 1, 2 and 3 records:

10,000 * p(1) = 0.3679 * 10,000 = 3,679

10,000 * p(2) = 0.1839 * 10,000 = 1,839

10,000 * p(3) = 0.0613 * 10,000 = 613

This means that 3,679 addresses will be empty; 3,679 will have 1 record (fine); 1,839 will have 2 records (a small problem: 1,839 overflow records); and 613 will have 3 records (a bigger problem: 2 * 613 = 1,226 overflow records for these 613 addresses).

We need a plan for dealing with these overflows, and for reducing the number of records that actually overflow.
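A small simulation shows how closely a random assignment follows these predictions. This is a sketch only: uniformly random integers stand in for the output of a hash function (an assumption, since real hash functions and real keys may do better or worse than random), and the names `simulate` and `expected_addresses` are invented for this example.

```python
import random
from collections import Counter
from math import exp, factorial


def simulate(r, n_addresses, seed=0):
    """Assign r keys to random addresses; return {load: number of addresses}."""
    rng = random.Random(seed)
    keys_per_address = Counter(rng.randrange(n_addresses) for _ in range(r))
    load_counts = Counter(keys_per_address.values())
    # Addresses that never appeared received no keys at all.
    load_counts[0] = n_addresses - len(keys_per_address)
    return load_counts


def expected_addresses(x, r, n_addresses):
    """N * p(x) using the Poisson function."""
    ratio = r / n_addresses
    return n_addresses * ratio ** x * exp(-ratio) / factorial(x)


if __name__ == "__main__":
    counts = simulate(10_000, 10_000)
    for x in range(4):
        print(x, counts[x], round(expected_addresses(x, 10_000, 10_000)))
```

The simulated counts vary with the seed, so they should land near the predicted values rather than match them exactly.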

How Much Extra Memory do We Need?

Since perfect hashing functions are nearly impossible to find for dynamic files, one way to deal with overflow is to allow extra space in the file. Packing density is the ratio of the number of records to be stored, r, to the number of available spaces, N: packing density = r/N (also called the loading factor).

E.g. 75 records in 100 spaces = 75/100 = 75%

Packing density gives a measure of the amount of space in a file that is actually used. The raw size of a file and of its address space don't matter; what matters is the relative size of the two, which is the packing density.

Obviously, the more free space you have allotted (lower packing density), the less likely you are to have a collision. The trade-off is: how much space are we willing to waste to cut down on collisions?

That depends on many things: the disk, the O/S, how the file is used, etc.

Predicting Collisions

$$p(x) = \frac{(r/N)^{x}\, e^{-(r/N)}}{x!}$$

Note that r and N always occur together as the ratio r/N, the packing density, so only the ratio matters: 500 records in 1,000 addresses behave the same as 500,000 records in 1,000,000 addresses.

If r = 500; N = 1,000; r/N = 500/1,000 = 0.5 = packing density.

How many addresses should have no records assigned?

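$$N p(0) = 1{,}000 \times \frac{(0.5)^{0} e^{-0.5}}{0!} = 1{,}000 \times 0.607 = 607$$

How many addresses should have exactly 1 record?

$$N p(1) = 1{,}000 \times \frac{(0.5)^{1} e^{-0.5}}{1!} = 1{,}000 \times 0.303 = 303$$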

How many addresses should have 1 record plus 1 or more synonyms?

Np(2) + Np(3) + Np(4) + ... = N[p(2) + p(3) + p(4) + ...]

We really only need to compute the terms up to p(5), because after that they become insignificantly small, so p(2) + p(3) + p(4) + p(5) = 0.0758 + 0.0126 + 0.0016 + 0.0002 = 0.0902; the number of addresses with one or more synonyms is just 1,000 times this: about 90.

Here's another way to look at it:

# of addresses with 1 record plus 1 or more synonyms:

{eqn1} N * [p(2) + p(3) + p(4) + ...]

We know that

{eqn2} p(0) + p(1) + p(2) + p(3) + p(4) + ... = 1

(sum of all probabilities must equal 1)

Turning the eqn. around, we get:

{eqn3} p(2) + p(3) + p(4) + ... = 1 - [p(0) + p(1)]

Now, plug eqn3 back into eqn1 to get:

N * [1 - (p(0) + p(1))] (using the values calculated above for 500 records, 1,000 addresses and a packing density of 0.5)

= 1,000 * [1 - (0.607 + 0.303)]

= 1,000 * [1 - 0.91]

= 1,000 * 0.09 = 90 (notice that this does not depend on cutting the sum off after a few terms, as the estimate before did)
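The two routes to this number can be compared side by side. The following is a sketch, with the packing density passed in directly; `synonyms_truncated` and `synonyms_complement` are hypothetical names for the truncated sum and for the 1 - [p(0) + p(1)] form.

```python
from math import exp, factorial


def poisson_p(x, density):
    """Poisson probability of x records at one address; density = r/N."""
    return density ** x * exp(-density) / factorial(x)


def synonyms_truncated(n_addresses, density, last_term=5):
    """N * [p(2) + p(3) + ... + p(last_term)]."""
    return n_addresses * sum(poisson_p(x, density)
                             for x in range(2, last_term + 1))


def synonyms_complement(n_addresses, density):
    """N * [1 - (p(0) + p(1))], with no truncation of the sum."""
    return n_addresses * (1 - (poisson_p(0, density) + poisson_p(1, density)))


if __name__ == "__main__":
    print(synonyms_truncated(1000, 0.5))
    print(synonyms_complement(1000, 0.5))
```

The truncated sum is always slightly smaller, because the terms it drops are all positive.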

Assuming 1 record space per address, how many overflow records are to be expected?

1 overflow record for p(2) so 1*Np(2)

2 overflow records for p(3) so 2*Np(3)

3 overflow records for p(4) so 3*Np(4)

4 overflow records for p(5) so 4*Np(5)

so 1,000 * (1 * 0.0758 + 2 * 0.0126 + 3 * 0.0016 + 4 * 0.0002) = 1,000 * 0.1066 ≈ 107 overflow records expected

What percentage of records should be overflow records?

107/500 records = .214 = 21.4%
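The weighted sum can be written as a short loop. This is a sketch assuming one record space per address, as in the text; the cut-off `max_x` is an assumption made so that the sum is finite, and the function names are illustrative.

```python
from math import exp, factorial


def poisson_p(x, density):
    """Poisson probability of x records at one address; density = r/N."""
    return density ** x * exp(-density) / factorial(x)


def expected_overflow(r, n_addresses, max_x=20):
    """Sum of (x - 1) * N * p(x): an address holding x records
    contributes x - 1 overflow records."""
    density = r / n_addresses
    return n_addresses * sum((x - 1) * poisson_p(x, density)
                             for x in range(2, max_x + 1))


if __name__ == "__main__":
    overflow = expected_overflow(500, 1000)
    print(round(overflow), "overflow records,",
          round(100 * overflow / 500, 1), "% of all records")
```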

Sooo.. with a packing density of only 50% we can expect that 21% of all records will have to be stored somewhere other than at their home address. At 100% we expect 36.8% overflow and at 10% we still get 4.8%.

Having only 4.8% collisions also sounds nice but with a packing density of 10% this means there are 9 empty spaces for every one used and STILL there are collisions!

36.8% looks good but consider this... if the file is full and 36.8% of the records CANNOT be stored at their home address.... WHERE ARE THEY? Answer... THEY MIGHT BE AT SOMEONE ELSE'S ADDRESS ... which makes the ones that have been displaced 'homeless' too!! (Q: What does this depend on?)

In general, collisions can be predicted as follows:

r = the number of records to be assigned

N = the number of addresses available

L = r/N = the loading factor

e = 2.7183 (the base of the natural logarithm)

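Addresses with 0 records = Np(0) = N * $e^{-L}$

Addresses with 1 record = Np(1) = N * $L e^{-L}$

Addresses with 1 or more synonyms ≈ Np(2) + Np(3) + Np(4) + Np(5)

= N * $\left[ \frac{L^{2} e^{-L}}{2!} + \frac{L^{3} e^{-L}}{3!} + \frac{L^{4} e^{-L}}{4!} + \frac{L^{5} e^{-L}}{5!} \right]$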

Another way to look at this:

{eqn4} # expected overflow records =

r - N * [1 - p(0)]

where r is the total # of records

N is the total number of addresses

so N*[1-p(0)] is the expected number of addresses that have one or more records assigned (p(0) is the proportion of addresses with no keys, so the addresses with 1 or more records are all the rest)

thus, the number of overflow records

= total records - # of occupied addresses

(since every record hashes to one of the occupied addresses, and each occupied address holds one and only one record that is not an overflow record)

Substituting from before:

= 500 - 1,000 * (1 - 0.607)

= 500 - 393

= 107 COLLISIONS
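Because eqn4 needs only p(0), it is convenient for tabulating overflow against packing density. The sketch below assumes the Poisson value p(0) = e^(-r/N) and one record per address; `overflow_records` and `overflow_percentage` are names chosen for this illustration.

```python
from math import exp


def overflow_records(r, n_addresses):
    """eqn4: r - N * [1 - p(0)], with p(0) = e^(-r/N)."""
    p0 = exp(-r / n_addresses)
    return r - n_addresses * (1 - p0)


def overflow_percentage(density):
    """Overflow records as a percentage of all records.
    Only the ratio r/N matters, so N = 1 and r = density is used."""
    return 100 * overflow_records(density, 1) / density


if __name__ == "__main__":
    print(round(overflow_records(500, 1000)))
    for density in (0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"{density:.0%} packing density: "
              f"{overflow_percentage(density):.1f}% overflow")
```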

Unless your list is static and you have created a perfect hashing function, there will be collisions. SO, every hashing algorithm must have a way of dealing with collisions. A function that has a large number of collisions is said to exhibit primary clustering. A probe is an access to a distinct location (i.e. one look into the list).