Predicting the Distribution of Records
No nice mathematical tools are available for predicting collisions for distributions that are better than random, but there are tools for understanding this behaviour when records are distributed randomly. These can still give us useful estimates. The likelihood of collisions in hash tables relates to a well-known mathematical diversion:
The Birthday Surprise: How many randomly chosen people need to be in a room before it becomes likely that two people will have the same birthday (month and day only)? Since (apart from leap years) there are 365 possible birthdays, most people guess that the answer will be in the hundreds, but in fact it is only 23!
Get person 1; check off her birthday; get person 2; the probability that he will have a different birthday is 364/365; the probability for person 3 is 363/365, etc. If the first m-1 people have different birthdays, then the probability that person m has a different birthday is
(365 - m + 1) / 365
People's birthdays are independent of each other, so the probabilities multiply, and the probability that m people all have different birthdays is
$\frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{365 - m + 1}{365}$
This expression becomes less than 0.5 whenever m ≥ 23 (it is about 0.524 for m = 22 and about 0.493 for m = 23).
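The product above is easy to evaluate by machine. The following is a minimal sketch, assuming 365 equally likely birthdays; the function names prob_all_distinct and smallest_group are made up for this illustration.

```python
def prob_all_distinct(m, days=365):
    """Probability that m people all have different birthdays."""
    p = 1.0
    for i in range(m):
        # Person i+1 must avoid the i birthdays already taken.
        p *= (days - i) / days
    return p


def smallest_group(threshold=0.5, days=365):
    """Smallest m for which prob_all_distinct(m) drops below threshold."""
    m = 1
    while prob_all_distinct(m, days) >= threshold:
        m += 1
    return m


print(smallest_group())  # group size at which a shared birthday becomes likely
```

The same loop works for a hash table: replace days with the number of addresses N to see how few keys are needed before a collision becomes likely.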
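When a single key is hashed to one of N addresses, there are two possible outcomes with respect to a given address:
1. the address was not chosen (A)
2. the address was chosen (B)
p(B) = b = $\frac{1}{N}$,
since the address has one chance in N of being chosen, and
p(A) = a = $\frac{N - 1}{N} = 1 - \frac{1}{N}$
since the address has N - 1 chances in N of not being chosen.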
If there are 10 addresses, then the probability of a given address being chosen is 0.1 and the probability of it not being chosen is 0.9. If two keys are hashed, the probability that both choose the address is p(BB) = 0.1 * 0.1 = 0.01, since the two hashes are independent of each other.
Other outcomes are possible when we hash two keys. For example, one key 'hits' while the other misses:
p(BA) = b * a = $\frac{1}{N} \cdot \left(1 - \frac{1}{N}\right)$ = 0.1 * 0.9 = 0.09 for N = 10
In general, when we want to know the probability of a certain sequence of outcomes, such as BABBA, we can compute the result with simple substitution:
p(BABBA) = b * a * b * b * a = $b^3 a^2$; for N = 10: $(0.1)^3 (0.9)^2 = 0.00081$
If we want to know the probability of 2 As and 2 Bs, regardless of order, we do it this way:
p(AABB) = a * a * b * b; for N = 10: $(0.9)^2 (0.1)^2 = 0.0081$
The other orderings are BBAA, BABA, BAAB, ABBA and ABAB, and the probability of each is the same.
Since these six orderings are mutually exclusive, the probability of 2 As and 2 Bs is the sum of the individual probabilities: 0.0081 * 6 = 0.0486
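To check this kind of counting argument, one can simply enumerate every sequence of As and Bs. The sketch below assumes b = 1/N and a = 1 - 1/N as defined earlier; prob_of_sequence and prob_of_counts are hypothetical helper names, not part of any library.

```python
from itertools import product


def prob_of_sequence(seq, n_addresses):
    """Probability of one specific sequence of 'A' (miss) / 'B' (hit) outcomes."""
    b = 1 / n_addresses   # address chosen
    a = 1 - b             # address not chosen
    p = 1.0
    for outcome in seq:
        p *= b if outcome == "B" else a
    return p


def prob_of_counts(n_a, n_b, n_addresses):
    """Count the orderings with n_a As and n_b Bs and sum their probabilities."""
    ways = 0
    total = 0.0
    for seq in product("AB", repeat=n_a + n_b):
        if seq.count("B") == n_b:
            ways += 1
            total += prob_of_sequence(seq, n_addresses)
    return ways, total


print(prob_of_sequence("BABBA", 10))
print(prob_of_counts(2, 2, 10))
```

Enumeration is only practical for a handful of keys, which is why the closed-form count of orderings is needed for realistic values of r.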
In general, the event "r trials result in r - x As and x Bs" can happen in as many ways as r - x letters A can be distributed among r places. The probability of each such way is
$a^{r-x} b^x$
and the number of such ways is given by the formula
$C = \frac{r!}{(r-x)!\,x!}$
This is the formula for the number of ways to select x items out of a set of r items.
When r keys are hashed, the probability that an address will be chosen x times and not chosen r - x times can be expressed as
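p(x) = $C\,a^{r-x} b^x$
If we know there are N addresses, we get
p(x) = $C \left(1 - \frac{1}{N}\right)^{r-x} \left(\frac{1}{N}\right)^x$
We can use this to compute the probability that a given address will have 0 records assigned to it:
p(0) = $C \left(1 - \frac{1}{N}\right)^{r} \left(\frac{1}{N}\right)^0 = \left(1 - \frac{1}{N}\right)^{r}$ (since C = 1 when x = 0)
If x = 1 we can compute the probability that exactly one record will be assigned to a given address:
p(1) = $C \left(1 - \frac{1}{N}\right)^{r-1} \left(\frac{1}{N}\right)^1 = \frac{r}{N} \left(1 - \frac{1}{N}\right)^{r-1}$ (since C = r when x = 1)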
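As a rough illustration, the formula for p(x) can be written directly with math.comb (available in Python 3.8 and later). The name binomial_p is an assumption made for this sketch.

```python
from math import comb


def binomial_p(x, r, n_addresses):
    """Probability that a given address is chosen exactly x times in r hashes."""
    b = 1 / n_addresses   # p(B): address chosen
    a = 1 - b             # p(A): address not chosen
    # comb(r, x) is C = r! / ((r - x)! x!)
    return comb(r, x) * a ** (r - x) * b ** x


# 4 keys hashed into 10 addresses
for x in range(5):
    print(x, binomial_p(x, 4, 10))
```

For very large r and x the binomial coefficient becomes too large to convert to a float, which is one practical reason to look for an approximation.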
If N and r are large, this is a pain to compute, so we can use an approximation: the Poisson function.
p(x) = $\frac{(r/N)^x\, e^{-(r/N)}}{x!}$
where N = the number of available addresses; r = the number of records to be stored; x = the number of records assigned to a given address; then p(x) gives the probability that a given address will have had x records assigned to it after the hashing function has been applied to all r records.
Suppose that there are 1,000 addresses (N = 1,000) and 1,000 records whose keys are to be hashed to the addresses (r = 1,000). Since r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) is
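p(0) = $\frac{1^0 e^{-1}}{0!}$ = 0.368
Similarly, the probabilities that a given address will have exactly one, two or three keys hashed to it are
p(1) = $\frac{1^1 e^{-1}}{1!}$ = 0.368
p(2) = $\frac{1^2 e^{-1}}{2!}$ = 0.184
p(3) = $\frac{1^3 e^{-1}}{3!}$ = 0.061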
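To see how close the approximation is, the sketch below evaluates the Poisson function next to the binomial formula for N = r = 1,000. The function names poisson_p and binomial_p are chosen here for illustration only.

```python
from math import comb, exp, factorial


def poisson_p(x, r, n_addresses):
    """Poisson approximation to the probability of x records at one address."""
    load = r / n_addresses
    return load ** x * exp(-load) / factorial(x)


def binomial_p(x, r, n_addresses):
    """Binomial probability of x records at one address."""
    b = 1 / n_addresses
    return comb(r, x) * (1 - b) ** (r - x) * b ** x


for x in range(4):
    print(x,
          round(poisson_p(x, 1000, 1000), 3),
          round(binomial_p(x, 1000, 1000), 3))
```

For these sizes the two columns agree to within about 0.001, so the cheaper Poisson form is normally good enough for estimating collisions.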
In general, if there are N addresses, then the expected number of addresses with x records assigned to them is
Np(x)
Another way to think about it is that p(x) gives the proportion of addresses having x logical records assigned by hashing.
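Example: 10,000 records and 10,000 addresses (r/N = 1).
How many addresses are expected to have NO records assigned?
p(0) = $\frac{1^0 e^{-1}}{0!}$ = 0.3679
# addresses w/ no records = 10,000 * p(0) = 0.3679 * 10,000 = 3,679
# addresses w/ 1 record = 10,000 * p(1) = 0.3679 * 10,000 = 3,679
# addresses w/ 2 records = 10,000 * p(2) = 0.1839 * 10,000 = 1,839
# addresses w/ 3 records = 10,000 * p(3) = 0.0613 * 10,000 = 613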
This means that 3,679 addresses will be empty; 3,679 will have 1 record (fine); 1,839 will have 2 (small problem); and 613 will have 3 (bigger problem), giving 1,226 overflow records for these 613 addresses.
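The expected counts N * p(x) can be tabulated in a few lines. This is only a sketch under the Poisson assumption; expected_addresses is a hypothetical name.

```python
from math import exp, factorial


def expected_addresses(x, r, n_addresses):
    """Expected number of addresses that receive exactly x records: N * p(x)."""
    load = r / n_addresses
    return n_addresses * load ** x * exp(-load) / factorial(x)


r = n = 10_000
for x in range(4):
    count = round(expected_addresses(x, r, n))
    overflow = max(x - 1, 0) * count   # every record beyond the first overflows
    print(f"{x} records: {count} addresses, {overflow} overflow records")
```

Each address keeps one record at home, so an address with x records contributes x - 1 overflow records.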
We need a plan for dealing with these overflows, and for reducing the number of records that actually overflow.
Since perfect hashing functions are nearly impossible to find for dynamic files, one way to deal with overflow is to allow extra space in the file. Packing density is the ratio of the number of records r to be stored to the number of available spaces N: packing density = r/N (also called the loading factor).
E.g. 75 records in 100 spaces = 75/100 = 75%
Packing density gives a measure of the amount of space in a file that is actually used. The raw size of a file and of its address space don't matter; what matters is the relative size of the two, i.e. the packing density.
Obviously, the more free space you have allotted (the lower the packing density), the less likely you are to have a collision. The trade-off is how much space we are willing to waste to cut down on collisions.
p(x) = $\frac{(r/N)^x\, e^{-(r/N)}}{x!}$
Note that r and N always occur as a ratio in this formula, so only their ratio matters: 500 records in 1,000 addresses have the same behaviour as 500,000 records in 1,000,000 addresses. r/N = packing density
If r = 500; N = 1,000; r/N = 500/1,000 = 0.5 = packing density.
Np(0) = 1,000 * $\frac{(0.5)^0 e^{-0.5}}{0!}$ = 1,000 * 0.607 = 607
Np(1) = 1,000 * $\frac{(0.5)^1 e^{-0.5}}{1!}$ = 1,000 * 0.303 = 303
The number of addresses with 1 record plus 1 or more synonyms is
Np(2) + Np(3) + Np(4) + ... = N[p(2) + p(3) + p(4) + ...]
We really only need to compute the first few terms, because after that they become insignificantly small: p(2) + p(3) + p(4) + p(5) = 0.0758 + 0.0126 + 0.0016 + 0.0002 = 0.0902. The number of addresses with one or more synonyms is just 1,000 times this: about 90.
Here's another way to look at it:
# of addresses with 1 record plus 1 or more synonyms:
{eqn1} N * [p(2) + p(3) + p(4) + ...]
We know that
{eqn2} p(0) + p(1) + p(2) + p(3) + p(4) + ... = 1
(sum of all probabilities must equal 1)
Turning the eqn. around, we get:
{eqn3} p(2) + p(3) + p(4) + ... = 1 - [p(0) + p(1)]
Now, plug eqn3 back into eqn1 to get:
N * [1 - (p(0) + p(1))] (using the values calculated above for 500 records, 1,000 addresses, packing density of 0.5)
= 1,000 * [1 - (0.607 + 0.303)]
= 1,000 * [1 - 0.91]
= 1,000 * 0.09 = 90 (notice that this form needs only two terms and does not truncate the series as before)
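The two ways of counting addresses with synonyms can be compared numerically. The sketch below assumes the Poisson function for p(x); the helper names are invented for this example.

```python
from math import exp, factorial


def poisson_p(x, load):
    """Poisson probability of x records at an address, for load = r/N."""
    return load ** x * exp(-load) / factorial(x)


def synonyms_by_complement(r, n_addresses):
    """N * [1 - (p(0) + p(1))]: no truncation of the series needed."""
    load = r / n_addresses
    return n_addresses * (1 - (poisson_p(0, load) + poisson_p(1, load)))


def synonyms_by_truncated_sum(r, n_addresses, max_x=5):
    """N * [p(2) + ... + p(max_x)]: drops the tail of the series."""
    load = r / n_addresses
    return n_addresses * sum(poisson_p(x, load) for x in range(2, max_x + 1))


print(synonyms_by_complement(500, 1000))
print(synonyms_by_truncated_sum(500, 1000))
```

At a packing density of 0.5 the dropped tail is tiny, but at higher densities more terms of the sum are needed, whereas the complement form always needs only p(0) and p(1).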
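To count the overflow records themselves, note that an address with x records keeps one at home and overflows the other x - 1:
1 overflow record for p(2) so 1*Np(2)
2 overflow records for p(3) so 2*Np(3)
3 overflow records for p(4) so 3*Np(4)
4 overflow records for p(5) so 4*Np(5)
so 1,000 * (1*0.0758 + 2*0.0126 + 3*0.0016 + 4*0.0002) = 1,000 * 0.1066 = 107 overflow records expected
107/500 records = 0.214 = 21.4%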
So with a packing density of only 50% we can expect that about 21% of all records will have to be stored somewhere other than at their home address. At 100% we expect 36.8% overflow, and at 10% we still get 4.8%.
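These percentages can be reproduced for any packing density. The following is a minimal sketch that sums (x - 1) * p(x) under the Poisson assumption; overflow_fraction and the cut-off max_x are choices made for this illustration.

```python
from math import exp, factorial


def overflow_fraction(density, max_x=30):
    """Expected fraction of records not stored at their home address."""
    # Overflow records per address: an address with x records overflows x - 1.
    per_address = sum(
        (x - 1) * density ** x * exp(-density) / factorial(x)
        for x in range(2, max_x + 1)
    )
    # N * per_address overflow records out of r = N * density records.
    return per_address / density


for density in (0.1, 0.5, 0.75, 1.0):
    print(f"{density:.0%} packing density: {overflow_fraction(density):.1%} overflow")
```

The 50% figure comes out slightly below 21.4% here because nothing is rounded before dividing.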
Having only 4.8% overflow sounds nice, but with a packing density of 10% there are 9 empty spaces for every one used and STILL there are collisions!
36.8% may not look too bad, but consider this: if the file is full and 36.8% of the records CANNOT be stored at their home address, WHERE ARE THEY? Answer: THEY MIGHT BE AT SOMEONE ELSE'S ADDRESS, which makes the records that have been displaced 'homeless' too! (Q: What does this depend on?)
In general, collisions can be predicted as follows:
r = the number of records to be assigned
N = the number of addresses available
L = r/N = the loading factor
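e = 2.7183 (the base of the natural logarithm)
Addresses with 0 records = Np(0) = N * $e^{-L}$
Addresses with 1 record = Np(1) = N * $L\,e^{-L}$
Addresses with
1 or more synonyms = Np(2) + Np(3) + Np(4) + Np(5)
= N * $\left[\frac{L^2 e^{-L}}{2!} + \frac{L^3 e^{-L}}{3!} + \frac{L^4 e^{-L}}{4!} + \frac{L^5 e^{-L}}{5!}\right]$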
Another way to look at this:
{eqn4} # expected overflow records =
r - N * [1 - p(0)]
where r is the total # of records
N is the total number of addresses
so N*[1 - p(0)] is the expected number of occupied addresses, i.e. those with one or more records (p(0) is the proportion of addresses with no keys, so those with 1 or more = all the rest)
thus, the number of overflow records
= total records - # of occupied addresses
(since all records have to be hashed to the occupied addresses, and each occupied address has one and only one record that is not an overflow record)
Substituting from before:
= 500 - 1,000 * (1 - 0.607)
= 500 - 393
= 107 overflow records (collisions)
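One way to gain confidence in this prediction is to compare it with a small simulation. The sketch below assumes a perfectly random hash, modelled with Python's random module; predicted_overflow and simulated_overflow are hypothetical names, and the seed and trial count are arbitrary.

```python
import random
from math import exp


def predicted_overflow(r, n_addresses):
    """r - N * [1 - p(0)], with p(0) = e^(-r/N) from the Poisson function."""
    return r - n_addresses * (1 - exp(-r / n_addresses))


def simulated_overflow(r, n_addresses, trials=20, seed=0):
    """Average overflow count when r keys pick addresses uniformly at random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        occupied = {rng.randrange(n_addresses) for _ in range(r)}
        # Each occupied address holds exactly one non-overflow record.
        total += r - len(occupied)
    return total / trials


print(predicted_overflow(500, 1000))
print(simulated_overflow(500, 1000))
```

A real hash function applied to real keys may do better or worse than random, so the simulated figure is only a baseline for comparison.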
Unless your list is static and you have created a perfect hashing function, there will be collisions. SO, every hashing algorithm must have a way of dealing with collisions. A function that has a large number of collisions is said to exhibit primary clustering. A probe is an access to a distinct location (i.e. one look into the list).