CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 4, 2000 05:38 PM

B+TREES

 

B+ Trees and Indexed Sequential Files
 
Indexed Sequential Files attempt to give the best of both worlds: indexed (direct) access and sequential access (physically contiguous records, no seeking)
 
- random access has been achieved using indices but if we want to access sequentially we end up with one seek per access
- sequential organization makes sequential access efficient but leaves us having to do linear searches (or at best binary searches) for random access; we also end up with a lot of overhead when adding or deleting records
- examples of files that require both: student files; credit card systems
 
The Sequence Set
- forget the index for now and let's concentrate on keeping the records in order (sequence set)
- forget about sorting and resorting the entire file - it's too expensive
- need to localize the changes
ANSWER : blocking - using one block to hold > 1 record
 
Using Blocks
- insertion can cause overflow just like B-Trees; splitting process is similar but there is no promotion [slide 1]
- deletion can cause underflow (less than half full) [slide 2]
1. If a neighbouring block is also half full then concatenate the two, and free one block for re-use; neighbours are logically adjacent (next in key order), not necessarily physically adjacent
2. If the neighbouring blocks are more than half full then redistribute records between them
associated costs:
- internal fragmentation means the file takes up more space - this can be reduced somewhat by redistributing before splitting and by using two-to-three splits (two full blocks become three).
- Sequential access is only guaranteed within one block
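
The split / concatenate / redistribute rules above can be sketched as follows. This is a minimal sketch, not the course's implementation: the sequence set is modelled as a Python list of sorted lists (list order stands in for the logical links between blocks), and BLOCK_CAPACITY, insert and delete are made-up names. It also assumes we concatenate whenever the two blocks fit into one, and redistribute otherwise.

```python
from bisect import insort

BLOCK_CAPACITY = 4               # assumed maximum records per block
MIN_FILL = BLOCK_CAPACITY // 2   # the "half full" threshold


def insert(blocks, i, key):
    """Insert key into blocks[i]; split in two on overflow (no promotion)."""
    insort(blocks[i], key)
    if len(blocks[i]) > BLOCK_CAPACITY:
        mid = len(blocks[i]) // 2
        blocks[i:i + 1] = [blocks[i][:mid], blocks[i][mid:]]


def delete(blocks, i, key):
    """Remove key from blocks[i]; repair underflow using a logical neighbour."""
    blocks[i].remove(key)
    if len(blocks[i]) >= MIN_FILL or len(blocks) == 1:
        return
    j = i + 1 if i + 1 < len(blocks) else i - 1   # logical neighbour
    lo, hi = min(i, j), max(i, j)
    merged = blocks[lo] + blocks[hi]
    if len(merged) <= BLOCK_CAPACITY:
        blocks[lo:hi + 1] = [merged]              # concatenate, free one block
    else:
        mid = len(merged) // 2                    # redistribute evenly
        blocks[lo], blocks[hi] = merged[:mid], merged[mid:]
```

Note that only one or two blocks are touched by any change, which is the localization we were after.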
 
Choice of Block Size:
- Should be small enough that we can hold several blocks in RAM at once, so we can do splitting, concatenation etc.
- Want to keep overhead for random accesses to a minimum - we need to read in an entire block even if we only want one record
- Remember clusters (the minimum # of sectors allocated at one time); data can be accessed sequentially within one cluster since we only need one seek to get to the entire cluster
- If you get to choose your own cluster sizes: need to consider internal fragmentation as well as RAM limitations
 
Back to Indices...
Borrow from the B-Trees :
 
Prefix B+ Tree
To build our first simple Prefix B+Tree - we use the separators as the keys in the B-Tree pages and attach pointers into the sequence set to the leaves [slide 5]
 
It's called simple prefix because of the way the separators are chosen: each separator is the shortest prefix of a key that still distinguishes one sequence set block from the next
since the index set is a B-Tree, a node containing N separators points to N + 1 children
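
One way to picture the separators is the sketch below. It assumes string keys and the convention that a search key equal to a separator goes to the right-hand block; shortest_separator and find_block are made-up names, and the index is flattened to a single sorted list of separators rather than a real B-Tree.

```python
from bisect import bisect_right


def shortest_separator(last_left, first_right):
    """Shortest prefix s of first_right with last_left < s <= first_right."""
    assert last_left < first_right
    for n in range(1, len(first_right) + 1):
        prefix = first_right[:n]
        if prefix > last_left:
            return prefix
    return first_right


def find_block(separators, key):
    """N separators select one of N + 1 blocks; ties go to the right."""
    return bisect_right(separators, key)
```

For example, the blocks ending in BOLEN and starting with CAGE need only the separator C, while CAMP and CARTER need CAR.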
 
Maintenance:
 
Deletions:
1. Deletions that don't involve redistribution or concatenations can leave the index alone
2. Deletions that involve redistribution may change one index key
3. Deletions that involve concatenation may involve redistribution or concatenation in the index keys as well as changes to the index keys
Insertions:
1. Insertions that don't involve redistribution or splitting don't change the index; since we insert on the basis of the existing index set it makes sense that the index set will still be valid
2. Insertions that involve redistribution may change the index in the same way as a deletion that results in redistribution
3. Insertions that result in splitting result in adding a key to the index

Looking at it another way:

changes to records that don't result in redistribution, splitting or concatenation [slide 6]
no change to the index set
changes to records that result in redistribution
possible change to key value but the index structure remains intact
changes that result in splitting [slide 7]
require addition of a new key to the index set (may cause redistribution or splitting of the index set)
changes that result in concatenation [slide 8]
require deletion of a key in the index set (may cause redistribution or concatenation of the index set)
all changes work from the bottom up
- changes are first made in the record block and the required information is passed 'upward' to the index set (which is a B-Tree)
 
What information must be passed to the index set?
If blocks are split? (A new key to insert)
If blocks are concatenated? (A key to delete)
If blocks are redistributed? (A key to change)
We already know how to do the first two in a B-Tree, the third simply involves a change of key (replace key A with value B) and does not affect the structure otherwise
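
The three kinds of information can be sketched as three small operations on the index. This is only an illustration under a simplifying assumption: the index set is a single sorted Python list in which separators[i] sits between block i and block i + 1 (a real index set is a B-Tree, where the insert and delete may themselves cause splits or concatenations). The function names are made up.

```python
def on_split(separators, i, new_sep):
    """Block i split into blocks i and i + 1: insert a key between them."""
    separators.insert(i, new_sep)


def on_concatenate(separators, i):
    """Blocks i and i + 1 were merged: delete the key between them."""
    del separators[i]


def on_redistribute(separators, i, new_sep):
    """Records moved between blocks i and i + 1: replace key A with value B."""
    separators[i] = new_sep
```

Only the first two change the number of keys; the third leaves the structure alone.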
 
Index Set Block Size
physical size of index node usually same as physical size of block in sequence set (call index node an index block (great!))
1. Want good fit between block size, physical characteristics of disk, and available memory
2. Common block size makes for simpler buffering - can implement sequence set blocks as virtual too - we can have a virtual simple prefix B+tree (!!)
3. The whole thing is often in one file so manipulation is simpler if everything has the same size block
 
Internal Structure of Index Set Blocks:
- so far we've used only B-Trees with a fixed number of keys - the whole point behind using the shortest separator is to save space - we can't if we fix the size of the space for the keys
- if the index set contains keys of varying lengths.... we need a way to tell where keys start and end....
answer - MAKE AN INDEX INTO THE INDEX SET !!!
 
HOLD ON -- don't try to keep track of all the details all the time -- look at the details when necessary but look at the big picture
 
- an index into the index tells us where each of the keys begins - we still need a way to tell where they end (maybe a null character or something); we need also to know how many keys are in this block; with that information we can do a binary search into this monstrosity
 
- now the index set looks like this:
- separator count (# of keys)
- total length of (all) separators
- separators (the keys themselves)
- index to separators (index to index - tells where keys begin)
- relative block numbers (the blocks of records to which the separators or keys point)
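
A rough sketch of such a block, using Python's struct module. The field widths (2-byte counts and offsets, 4-byte relative block numbers), the little-endian byte order and the function names are all assumptions made for the example. Instead of a terminator character, the end of each key is taken from the start of the next key (or from the total length, for the last one).

```python
import struct


def pack_index_block(separators, rbns):
    """Serialize separators (bytes) plus count + 1 relative block numbers."""
    assert len(rbns) == len(separators) + 1
    body = b"".join(separators)
    offsets, pos = [], 0
    for s in separators:
        offsets.append(pos)          # where each separator begins
        pos += len(s)
    n = len(separators)
    return (struct.pack("<HH", n, len(body)) + body
            + struct.pack(f"<{n}H", *offsets)
            + struct.pack(f"<{n + 1}I", *rbns))


def search_index_block(block, key):
    """Binary-search the separators; return the RBN of the child to follow."""
    n, total = struct.unpack_from("<HH", block, 0)
    seps_at = 4                      # separators start after the two counts
    offs_at = seps_at + total        # index to separators
    rbns_at = offs_at + 2 * n        # relative block numbers
    offsets = struct.unpack_from(f"<{n}H", block, offs_at) + (total,)
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        sep = block[seps_at + offsets[mid]:seps_at + offsets[mid + 1]]
        if key < sep:
            hi = mid
        else:
            lo = mid + 1             # key equal to separator goes right
    return struct.unpack_from("<I", block, rbns_at + 4 * lo)[0]
```

The separator count and the total length are exactly what the search needs to locate the other three parts of the block.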
 
- a block is not just an arbitrary chunk cut out of a homogeneous file; it is more than just a set of records
- these blocks have a sophisticated structure all their own
- this idea is more useful as block size increases
- with very large blocks it is imperative that we have an efficient way of processing all data
- the node within the B-Tree index set here is of variable order
- # of separators is directly limited by block size
- it still has a maximum order and therefore a minimum depth
- since it is variable order, determining when a block is full or half full is no longer simple
- decisions about when to split, concatenate or redistribute are more complicated
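
To see why "full" is no longer a simple count, here is a small sketch that measures an index block in bytes. The sizes (4-byte header, 2-byte index entry per separator, 4-byte relative block number) and the function names are assumptions for illustration, and "half full" is taken to mean half of the block's bytes.

```python
HEADER = 4   # separator count + total length, 2 bytes each (assumed)
OFFSET = 2   # one index-to-separators entry per separator (assumed)
RBN = 4      # one relative block number (assumed)


def bytes_used(separators):
    """Bytes occupied by a block holding these separators (as bytes objects)."""
    n = len(separators)
    return HEADER + sum(len(s) for s in separators) + OFFSET * n + RBN * (n + 1)


def fits(separators, new_sep, block_size):
    """Would the block still fit after adding new_sep? If not, split."""
    return bytes_used(separators + [new_sep]) <= block_size


def is_underfull(separators, block_size):
    """Less than half the block's bytes in use."""
    return bytes_used(separators) < block_size // 2
```

With a 32-byte block holding C and CAR, the separator D still fits but DENVER does not, so whether the block is full depends on the lengths of the keys, not just how many there are.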
 
Loading a Simple Prefix B+Tree
B+Trees
Simple Prefix B+Trees are but one form of B+Trees
 
plain B+Trees use an actual key as a separator
 
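when to use a plain B+Tree instead of a simple prefix B+Tree?
1. When the extra cost of managing variable-length separators outweighs the benefits of shorter separators
2. When the key set doesn't compress well (the shortest separators are not much shorter than the keys themselves)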
 
remember other simpler file structures - use them when they will do the trick
 
can use parts of various structures to build hybrids - such as using a simple index into a blocked sequence set
 
SUMMARY: B-Trees; B*Trees, B+Trees
IMPORTANT DIFFERENCES:
B-TREES: info is grouped as pairs (key and associated info) distributed over all nodes
B+TREES: all key and record info is contained in a linked sequence set; accessed through index set containing separators
SIMPLE PREFIX B+TREES: a B+Tree whose index set uses the shortest separators (prefixes) derived from the keys rather than whole keys
