CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 5, 2000 11:30 AM
HASHING - Part IV
  • simple idea - store a number of records at one (logical) address; that group of record slots is a bucket
  • on sector-addressed disks a bucket can be one sector, and on block-addressed disks it can be one block
  • there may still be an overflow situation but it will happen much less often
  • packing density remains the same but the ratio of records to addresses changes
  • with packing density of 75%, and 750 records we need 1000 addresses so r/N = 0.75
  • with 2 records per bucket there are now 500 addresses and r/N = 1.5; plugging this value back into the function that tells us how many records can't be stored at their home addresses, overflow drops from 29.6% to 18.7%
  • if we keep packing density the same and increase bucket size to 10 (100 addresses, r/N = 7.5), overflow drops to only 4%
  • buckets are managed much like blocks (we need to keep a record count and information about where the records start and stop)
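The overflow figures above can be reproduced with a short calculation. This is only a sketch: it assumes the "function" referred to is the Poisson estimate of how many records hash to a single address, and the name `overflow_fraction` and its parameters are illustrative rather than taken from the notes.

```python
import math

def overflow_fraction(records, addresses, bucket_size):
    """Estimated fraction of records that cannot be stored in their home bucket.

    Assumes the number of records hashing to one address follows a Poisson
    distribution with mean m = records / addresses (an assumption about
    which function the notes have in mind).
    """
    m = records / addresses
    # Expected overflow per address is the sum over x > b of (x - b) * p(x).
    # Because the mean of the distribution is m, that equals
    # m - b + (sum over x <= b of (b - x) * p(x)), which is a finite sum.
    shortfall = sum(
        (bucket_size - x) * math.exp(-m) * m**x / math.factorial(x)
        for x in range(bucket_size + 1)
    )
    return (m - bucket_size + shortfall) / m

# 750 records at 75% packing density, for bucket sizes 1, 2 and 10.
for b, n in [(1, 1000), (2, 500), (10, 100)]:
    print(f"bucket size {b:2d}: {overflow_fraction(750, n, b):.1%} overflow")
```

Under that assumption the three cases come out at roughly 29.6%, 18.7% and 4.0%, in line with the figures quoted in the list.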
     
    Dealing with overflow:
    essentially the same problem as single record collision resolution
     
    consecutive spill addressing: store the overflow records in the nearest available bucket
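A minimal sketch of consecutive spill addressing, assuming "nearest available" means the next bucket (in address order, wrapping around) that still has a free slot. The bucket size, the `key % n` hash and the function names are illustrative.

```python
BUCKET_SIZE = 2  # records per bucket (illustrative)

def make_file(n_buckets):
    return [[] for _ in range(n_buckets)]

def home_bucket(key, n_buckets):
    return key % n_buckets  # stand-in for a real hash function

def insert(buckets, key):
    """Store key in its home bucket, or spill to the next bucket with room."""
    n = len(buckets)
    start = home_bucket(key, n)
    for step in range(n):
        index = (start + step) % n
        if len(buckets[index]) < BUCKET_SIZE:
            buckets[index].append(key)
            return index
    raise RuntimeError("file is full")

def find(buckets, key):
    """Return the bucket index holding key, or None.

    With no deletions, a bucket that still has room ends the search:
    the key would have been stored there if it had spilled this far.
    """
    n = len(buckets)
    start = home_bucket(key, n)
    for step in range(n):
        index = (start + step) % n
        if key in buckets[index]:
            return index
        if len(buckets[index]) < BUCKET_SIZE:
            return None
    return None

f = make_file(4)
for k in (0, 4, 8):       # all three hash to bucket 0
    insert(f, k)
print(f)                  # 8 has spilled into bucket 1
```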
     
    bucket chaining: create an overflow area; a new bucket is created in the overflow area and a link made from the primary bucket to the new bucket
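Bucket chaining can be sketched as follows. The class names, the in-memory `overflow_area` list and the `key % n` hash are assumptions made for illustration; on disk the link would be a bucket address rather than an object reference.

```python
class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None   # link to the next bucket in the chain, if any

class ChainedFile:
    def __init__(self, n_primary, bucket_size):
        self.primary = [Bucket() for _ in range(n_primary)]
        self.overflow_area = []          # buckets created on demand
        self.bucket_size = bucket_size

    def insert(self, key):
        bucket = self.primary[key % len(self.primary)]
        # Walk the chain until a bucket with room is found,
        # creating a new overflow bucket at the end if necessary.
        while len(bucket.records) >= self.bucket_size:
            if bucket.overflow is None:
                bucket.overflow = Bucket()
                self.overflow_area.append(bucket.overflow)
            bucket = bucket.overflow
        bucket.records.append(key)

    def find(self, key):
        """Return how many links were followed to reach key, or None."""
        bucket = self.primary[key % len(self.primary)]
        hops = 0
        while bucket is not None:
            if key in bucket.records:
                return hops
            bucket = bucket.overflow
            hops += 1
        return None

f = ChainedFile(n_primary=4, bucket_size=2)
for k in (0, 4, 8, 12, 16):              # all hash to primary bucket 0
    f.insert(k)
print(f.find(0), f.find(8), f.find(16))  # links followed for each key
```

The number of links followed is the extra cost of a search, so long chains are a sign that the primary area is too small.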
     
    (figure not reproduced: overflow area)

    DELETIONS

    more complicated than deletions from other structures

    progressive overflow searches *MUST* always end at an empty address (deletion breaks this: emptying a slot in the middle of a probe sequence makes later searches stop too early)

    if chained: we can follow the links and adjust them (like deletion from a linked list)

    What about coalesced chains?

    Deletions from Open Addressing schemes present a special problem: if we delete a value and simply mark its slot empty, a later search for one of its synonyms stops at that empty slot, so we won't even know to look at the synonym. We still want to re-open locations rather than just setting tombstones, because a tombstone never ends a search: as tombstones accumulate, unsuccessful searches eventually have to scan the entire table, all of the time.
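The problem can be demonstrated with a small progressive overflow table. This is a sketch only; the class name, the `key % size` hash and the `use_tombstone` flag are illustrative.

```python
EMPTY = None
TOMBSTONE = "<deleted>"

class ProbeTable:
    def __init__(self, size):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indexes starting at the home address, wrapping around."""
        n = len(self.slots)
        start = key % n
        for step in range(n):
            yield (start + step) % n

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is TOMBSTONE:
                self.slots[i] = key
                return i
        raise RuntimeError("table full")

    def find(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:      # only a truly empty slot ends the search
                return None
            if self.slots[i] == key:
                return i
        return None

    def delete(self, key, use_tombstone=True):
        i = self.find(key)
        if i is None:
            return False
        self.slots[i] = TOMBSTONE if use_tombstone else EMPTY
        return True

# 3, 10 and 17 are synonyms (home address 3 in a table of 7).
broken = ProbeTable(7)
for k in (3, 10, 17):
    broken.insert(k)
broken.delete(10, use_tombstone=False)
print(broken.find(17))   # None: the search stops at the emptied slot

safe = ProbeTable(7)
for k in (3, 10, 17):
    safe.insert(k)
safe.delete(10)
print(safe.find(17))     # 5: the tombstone lets the search continue
```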

     

     What about deletion from tables inserted using Brent's Method or Binary Tree Insertion?

     How would Algorithm R have to be changed?

    - we'd have to test the key to see if j is a suitable location for the key at i by hashing it and stepping to j
    - if not, i is discarded as a viable option

    If j has a synonym it will be found this way (for sure?)

    If no synonym is found, is it safe to leave the space as empty?
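For reference, here is a sketch of the unmodified idea for plain progressive overflow: after a slot is emptied, any later record in the same run is moved back if the hole lies between its home address and its current position. It is written in the spirit of Knuth's Algorithm R but probes upward, so it may differ in detail from the version presented in class; `home`, `insert`, `find` and `delete` are illustrative names.

```python
EMPTY = None

def home(key, n):
    return key % n  # stand-in for a real hash function

def insert(table, key):
    n = len(table)
    i = home(key, n)
    for _ in range(n):
        if table[i] is EMPTY:
            table[i] = key
            return i
        i = (i + 1) % n
    raise RuntimeError("table full")

def find(table, key):
    n = len(table)
    i = home(key, n)
    for _ in range(n):
        if table[i] is EMPTY:
            return None
        if table[i] == key:
            return i
        i = (i + 1) % n
    return None

def delete(table, key):
    """Remove key and close the gap so every search still ends at an empty slot."""
    i = find(table, key)
    if i is None:
        return False
    n = len(table)
    j = i
    while True:
        table[i] = EMPTY                 # i is the current hole
        while True:
            j = (j + 1) % n
            if table[j] is EMPTY:
                return True              # end of the run: nothing left to move
            k = home(table[j], n)
            # If k lies cyclically in (i, j], the record at j is still
            # reachable from its home address and must stay where it is.
            if i <= j:
                reachable = i < k <= j
            else:
                reachable = k > i or k <= j
            if not reachable:
                break
        table[i] = table[j]              # move the record back into the hole
        i = j                            # its old slot becomes the new hole

t = [EMPTY] * 7
for key in (3, 10, 17):                  # synonyms with home address 3
    insert(t, key)
delete(t, 10)
print(t, find(t, 17))
```

The test on `k` is the step that would need rethinking for Brent's Method or Binary Tree Insertion, where a record's position no longer follows from a simple linear walk from its home address.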

  • treat the index as the hash table and have it point to the records or sets of records (linked list, entry sequenced, sorted)

    EXTENDIBLE HASHING

  • all previous methods have dealt with relatively static files
  • up till now the solution to a full file has been to re-load
  • there is another way (extendible hashing)
  • the main difference: with static hashing we know the number of addresses needed in advance; with extendible hashing we don't
  • we must approach synonyms and full buckets differently (e.g. progressive overflow with extendible hashing??? - there would be a maximum address but no maximum number of records [!!])
  • we need a hybrid structure
  • we can incorporate ideas from other data structures (like B-Trees) into our hybrid structure...
    One approach is to use tries... (also called radix searching)
     
    - the keys are used to determine the branching structure
     
    - e.g. if it's numeric each node has 10 possible branches
    - the leaves hold the records
     
    If we actually represent this structure as a tree we have to do comparisons to find records. This is not in the spirit of hashing [ O(1) search] but if we flatten this structure we can turn it into a directory of hash addresses and pointers to buckets.
     
    Step 1 is to extend the nodes of the trie so it is a full structure. Using the binary values of the trie, it looks like this:
     

    This gives us the same direct addressing (no searching) associated with hashing since we now use the same portion of the key in all cases.
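A minimal sketch of the flattened directory: the leading bits of the hashed key index straight into a table of bucket labels, with no comparisons. The figure is not reproduced here, so the layout (A for 00, B for 01, C for both 10 and 11) is an assumption chosen to agree with the discussion of splitting that follows; the 8-bit hash width and the names are also illustrative.

```python
HASH_BITS = 8                       # width of the hash value (illustrative)
DEPTH = 2                           # leading bits used by the directory
DIRECTORY = ["A", "B", "C", "C"]    # entries for 00, 01, 10, 11 (assumed layout)

def bucket_for(hash_value):
    """Pick a bucket from the top DEPTH bits of an 8-bit hash value."""
    return DIRECTORY[hash_value >> (HASH_BITS - DEPTH)]

print(bucket_for(0b01101100))   # leading bits 01
print(bucket_for(0b11000001))   # leading bits 11
```

Two directory entries pointing at the same bucket is what lets a bucket like C split later without the directory itself having to grow.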
     
    Remember, we're talking about extendible hashing here: the file will grow. So, what do we do when a bucket is full? We want to let it grow without increasing the search lengths too much. If bucket C overflows the solution is simple. Since C contains keys that start with both 10 and 11 all we have to do is split it and redistribute the keys:
     
    But what if bucket B overflows? The solution this time is to "grow the tree" to the next level:

     
     
    We now have room for growth again without changing our hash function or re-loading the table.
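Both cases (a simple split, and growing the directory by one level) can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation: it uses the leading bits of a 32-bit value taken from SHA-256, keeps everything in memory, and the class names, `BUCKET_SIZE` and `HASH_BITS` are invented for the example.

```python
import hashlib

HASH_BITS = 32      # width of the hash value (illustrative)
BUCKET_SIZE = 4     # records per bucket (illustrative)

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # how many leading hash bits all keys here share
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0            # leading bits used by the directory
        self.directory = [Bucket(0)]     # always 2**global_depth entries

    def _hash(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        return int.from_bytes(digest[:4], "big")

    def _index(self, key):
        return self._hash(key) >> (HASH_BITS - self.global_depth)

    def find(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < BUCKET_SIZE:
                bucket.keys.append(key)
                return
            self._split(bucket)          # make room, then try again

    def _split(self, bucket):
        if bucket.depth == HASH_BITS:
            raise RuntimeError("too many keys with identical hash values")
        if bucket.depth == self.global_depth:
            # "Grow the tree": each directory entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        # Entries whose next bit is 1 are redirected to the new bucket.
        shift = self.global_depth - bucket.depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = sibling
        # Redistribute the keys between the old bucket and its sibling.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)

table = ExtendibleHash()
for n in range(100):
    table.insert(f"key{n}")
print(table.global_depth, len(table.directory), table.find("key42"))
```

Note that only the overflowing bucket is ever rewritten; the other buckets are untouched even when the directory doubles.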
     
    OK, so why use hashing at all? Why not just do this using the original keys?!
     
    Well, one of the properties that's fundamental to a good hashing algorithm is that it randomizes the keys, spreading them evenly throughout the address space. By applying a good hash function to the keys we can expect the buckets to fill at a more or less even rate, which largely avoids the problems of clustering and lop-sided trees. By retaining the hashing function we also retain even growth.
     
    Now, what about deletion? This is after all dynamic hashing so we have to allow for shrinkage as well as growth. When do we combine buckets? Which ones do we combine when we do? We probably shouldn't just wait till one is empty before we take action.
     
    In the example above only B and E can be combined. These are referred to as buddy buckets. None of the others can be combined. If however, B and E do get combined, we are now in a position to collapse the directory.
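The buddy rule can be sketched as follows. The directory layout is an assumption reconstructed from the text (the figure is not reproduced): three leading bits, with A, C and D each covering two entries and B and E one entry each. The function names and the `depths` table are illustrative.

```python
def find_buddy(directory, depths, index, global_depth):
    """Return the buddy of the bucket at directory[index], or None.

    A bucket has a buddy only if it uses every bit of the directory
    address; the buddy is then the entry whose last bit differs.
    """
    bucket = directory[index]
    if global_depth == 0 or depths[bucket] < global_depth:
        return None
    return directory[index ^ 1]

def can_collapse(directory, depths, global_depth):
    """The directory can halve once no bucket needs the last address bit."""
    return global_depth > 0 and all(depths[b] < global_depth for b in directory)

# Assumed state after B has split into B and E (entries 000 through 111).
directory = ["A", "A", "B", "E", "C", "C", "D", "D"]
depths = {"A": 2, "B": 3, "E": 3, "C": 2, "D": 2}

print(find_buddy(directory, depths, 2, 3))   # E is B's buddy
print(find_buddy(directory, depths, 0, 3))   # None: A uses only two bits
print(can_collapse(directory, depths, 3))    # False while B and E are separate

# Combine B and E, then check again.
directory[3] = "B"
depths["B"] = 2
del depths["E"]
print(can_collapse(directory, depths, 3))    # True: every bucket now spans two entries
```

Whether two buddies should actually be combined (for example, only when their records fit in one bucket with some room to spare) is a separate policy decision that this sketch leaves open.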
