Indexing

Indexed Sequential [SLIDE 1]

an index lets you impose order on a file without rearranging its records
lets you impose several orders simultaneously without changing record order (multiple access paths)
allows keyed access to variable-length records
index can be sorted - allows binary searches
index can be in separate file (so size isn't fixed) or as part of data file
index 'records' can be fixed for easy manipulation
index file may be small enough to load into memory for fast manipulation
index file can be managed as array data structure within the program
Record Addition: [SLIDE 2] add record to index file (possible insertion); append to data file
Record Deletion: [SLIDE 3] delete index entry (mark it, and defer the physical removal until we're ready to save the file); mark data record as deleted (update free-list)
Update Record: if the key field is changed, the result is much like a delete followed by an append;
if the key field is unchanged, the index entry changes only if the record must be relocated
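
The add / delete / lookup operations above can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's implementation: the class name `SimpleIndex`, the 2-byte length prefix, and the in-memory `bytearray` standing in for the data file are all hypothetical choices. Deletion here removes the index entry immediately rather than marking it.

```python
import bisect


class SimpleIndex:
    """Sorted in-memory index of (key, byte offset) over variable-length records."""

    def __init__(self):
        self.data = bytearray()  # stands in for the data file (assumption)
        self.keys = []           # sorted keys, so binary search works
        self.offsets = []        # offsets[i] is the byte offset for keys[i]
        self.free = []           # offsets of deleted records (crude free-list)

    def add(self, key, record):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            raise KeyError(f"duplicate key: {key}")
        payload = record.encode("utf-8")
        offset = len(self.data)
        # Append to the data file: 2-byte length prefix, then the payload.
        self.data += len(payload).to_bytes(2, "big") + payload
        # Insert into the index at the position that keeps it sorted.
        self.keys.insert(i, key)
        self.offsets.insert(i, offset)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            return None
        off = self.offsets[i]
        n = int.from_bytes(self.data[off:off + 2], "big")
        return self.data[off + 2:off + 2 + n].decode("utf-8")

    def delete(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            return False
        self.free.append(self.offsets[i])  # data record is only marked reusable
        del self.keys[i]
        del self.offsets[i]
        return True


idx = SimpleIndex()
idx.add("K2", "a short record")
idx.add("K1", "a noticeably longer record than the first one")
print(idx.keys)        # index is sorted even though the data file is not
print(idx.find("K1"))
```

Note that the data file stays in entry order; only the small index is kept sorted.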

If index too big for memory:

use hashing - fast way to get an address
use tree-structured index
creative sorting (e.g. group together the index entries for the records used most often)

Multiple Keys [SLIDE 4]

keep a primary key that is unique; lookup using secondary keys which point to primary key which holds byte offset
want to keep byte offset info in as few places as possible so there's less to update
all additions/deletions/modifications require updating ALL key indices (adding/removing/rearranging etc.)

Secondary Keys

secondary keys are usually based on some field(s) in the data record; the primary key is sometimes inaccessible (not meaningful) to the user
Update doesn't change any key: simplest; indices unaffected unless record must be moved - then update only key that has byte offset (primary key)
Update changes secondary key: [SLIDE 5] rearrange secondary key index if necessary
Update changes primary key: rearrange primary key index; update pointers in all secondary indices (can be RRN_index or string/number directly related to data field). Secondary indices should not need to be moved
once you have several secondary keys, you can begin to search using logical expressions (AND/OR combinations of keys); this involves parsing the request and building an efficient search request (Selective Indices)
Problems: [SLIDE 6]
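
A minimal sketch of the secondary-key scheme described above, assuming made-up keys and offsets: secondary indices hold (secondary key, primary key) pairs, and only the primary index holds byte offsets. The names `primary_index`, `by_composer`, `by_title`, `lookup`, `offsets_for`, and `relocate` are hypothetical.

```python
import bisect

# Primary index: the only place byte offsets are stored (illustrative values).
primary_index = {"K1": 0, "K2": 64, "K3": 150, "K4": 211}

# Secondary indices: sorted (secondary key, primary key) pairs.
by_composer = sorted([("BEETHOVEN", "K1"), ("DVORAK", "K2"),
                      ("BEETHOVEN", "K3"), ("COREA", "K4")])
by_title = sorted([("SYMPHONY NO. 9", "K1"), ("SYMPHONY NO. 9", "K2"),
                   ("QUARTET", "K3"), ("TOUCHSTONE", "K4")])


def lookup(secondary, value):
    """Return the set of primary keys whose secondary key equals value."""
    start = bisect.bisect_left(secondary, (value, ""))
    found = set()
    for sec_key, prim_key in secondary[start:]:
        if sec_key != value:
            break
        found.add(prim_key)
    return found


def offsets_for(primary_keys):
    """De-reference primary keys to byte offsets via the primary index."""
    return sorted(primary_index[k] for k in primary_keys)


def relocate(primary_key, new_offset):
    """Moving a record touches the primary index only."""
    primary_index[primary_key] = new_offset


# Logical expressions become set operations on primary keys.
both = lookup(by_composer, "BEETHOVEN") & lookup(by_title, "SYMPHONY NO. 9")
either = lookup(by_composer, "BEETHOVEN") | lookup(by_title, "SYMPHONY NO. 9")
print(both, offsets_for(both))
print(sorted(either))
```

Because the secondary indices never store offsets, relocating a record leaves them untouched, which is the point of keeping byte-offset information in as few places as possible.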

updates may be expensive (secondary indices will get rearranged a lot)
duplicate secondary keys result in wasted space
Solution 1: create an array of references for each secondary key (e.g. all the 'Becker's share one entry) - cuts down on the need to rearrange the secondary key index - just change the 'pointer' and leave the entry in place

Problem: it wastes space if secondary key records are fixed size

Inverted Lists

Solution: create linked list of entries for secondary keys
inverted because we are trying to go back from secondary key to primary key
now we won't have to rearrange the secondary key index often: only when a new secondary key value appears

Now we have LOTS of files: [SLIDE 7]

1. Data File
2. Primary Index Key File
3. Secondary Index Key File
4. Secondary Key Linked List of Entries File
There are as many sets of (3) and (4) as there are secondary keys.
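
To make files (3) and (4) concrete, here is a small sketch under stated assumptions: a dict named `heads` stands in for the sorted secondary index file, a Python list named `entries` stands in for the entry-sequenced linked-list file, and -1 marks the end of a list. The class name `InvertedList` and the choice to insert at the head of each list are illustrative simplifications.

```python
class InvertedList:
    """Secondary key -> linked list of primary keys, stored by record number."""

    END = -1  # sentinel "no next entry"

    def __init__(self):
        self.heads = {}    # secondary key -> RRN of first entry in the list file
        self.entries = []  # list file: [primary_key, next_rrn], append-only

    def add(self, secondary_key, primary_key):
        rrn = len(self.entries)
        # New entry points at the previous head, then becomes the head.
        old_head = self.heads.get(secondary_key, self.END)
        self.entries.append([primary_key, old_head])
        self.heads[secondary_key] = rrn

    def primary_keys(self, secondary_key):
        """Follow the chain of next pointers and collect the primary keys."""
        rrn = self.heads.get(secondary_key, self.END)
        result = []
        while rrn != self.END:
            primary_key, rrn = self.entries[rrn]
            result.append(primary_key)
        return result


inv = InvertedList()
inv.add("BEETHOVEN", "K1")
inv.add("DVORAK", "K2")
inv.add("BEETHOVEN", "K3")  # existing secondary key: only a pointer changes
print(inv.primary_keys("BEETHOVEN"))
print(len(inv.heads), len(inv.entries))
```

Adding a second 'BEETHOVEN' entry appends one fixed-size record to the list file and updates one pointer; the secondary index itself gains an entry only when a new secondary key value appears.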

Advantages: [SLIDE 8]

Secondary Key File only needs to be rearranged when a new key is saved
Rearranging Secondary Key File will likely be faster because it tends to be smaller
Less need for sorting = less overhead
Secondary Key Linked List contains fixed records that are entry sequenced - never needs sorting (need to manage holes though)

Disadvantages:

Records in the Secondary Key Linked List aren't necessarily grouped together - affects disk access (could institute paging system)

Binding: When are keys bound to physical addresses?

Primary Keys are bound when they are created
Secondary Keys are bound when they are used (by de-referencing through the primary key). This may result in more disk accesses for searches, but it substantially cuts down the cost of updates and is less error prone than direct binding

Practice Safe Programming!

Changes should affect as few places as possible.

Code execution is fast and cheap. (yeah, right!)

A FEW WORDS ON SEARCHING....

Binary Search is O(log₂ N), which is pretty good. But for a large file of 1,000,000 records, a binary search takes about 20 comparisons (log₂ 1,000,000 ≈ 20); if that means 20 disk accesses, this could be unacceptable.

Interpolation Search:

more like how we do it

choose next location based on its estimated position in the file

$$\text{next} = \text{Lower} + \frac{\text{key}(\text{Sought}) - \text{key}(\text{Lower})}{\text{key}(\text{Upper}) - \text{key}(\text{Lower})}\,(\text{Upper} - \text{Lower})$$
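O(log₂log₂N) on average when keys are roughly uniformly distributed (worst case O(N) for badly skewed keys); each probe takes more computation than in binary search, but there are fewer disk accesses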
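
A minimal sketch of interpolation search next to binary search, counting probes so the two can be compared. It assumes numeric keys in a sorted in-memory list; each probe stands in for one disk access. The function names and the evenly spaced test keys are illustrative only.

```python
def interpolation_search(keys, sought):
    """Return (position or None, probes) for sought in a sorted numeric list."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper and keys[lower] <= sought <= keys[upper]:
        if keys[upper] == keys[lower]:
            nxt = lower  # all remaining keys are equal; avoid dividing by zero
        else:
            # Estimate the position from where sought falls in the key range.
            nxt = lower + ((sought - keys[lower]) * (upper - lower)
                           // (keys[upper] - keys[lower]))
        probes += 1
        if keys[nxt] == sought:
            return nxt, probes
        if keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return None, probes


def binary_search(keys, sought):
    """Return (position or None, probes) using the midpoint at every step."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper:
        mid = (lower + upper) // 2
        probes += 1
        if keys[mid] == sought:
            return mid, probes
        if keys[mid] < sought:
            lower = mid + 1
        else:
            upper = mid - 1
    return None, probes


keys = list(range(0, 3_000_000, 3))  # 1,000,000 evenly spaced keys
print(interpolation_search(keys, 299_997))
print(binary_search(keys, 299_997))
```

Evenly spaced keys are the best case for interpolation; with clustered keys the estimate is poorer and the probe count rises.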

Self-Organizing Lists:

each record that is accessed is moved to the head of the list (move-to-front)

most frequently used records are closest to the front

Transpose Method:

switch sought record with its immediate predecessor

commonly sought records tend to migrate to the head of the list

Count Method:

keep a count of # times record is accessed

record gets moved to location ahead of those with lower counts
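
The three reorganizing rules can be sketched on a plain Python list, where position 0 is the head. This is an illustrative sketch only: the function names are hypothetical, and `counts` is an assumed dict of access counts kept alongside the list.

```python
def move_to_front(items, key):
    """Move the accessed record to the head; return where it was found."""
    i = items.index(key)
    items.insert(0, items.pop(i))
    return i


def transpose(items, key):
    """Swap the accessed record with its immediate predecessor."""
    i = items.index(key)
    if i > 0:
        items[i - 1], items[i] = items[i], items[i - 1]
    return i


def count_access(items, counts, key):
    """Bump the record's count and move it ahead of records with lower counts."""
    i = items.index(key)
    counts[key] = counts.get(key, 0) + 1
    j = i
    while j > 0 and counts.get(items[j - 1], 0) < counts[key]:
        items[j - 1], items[j] = items[j], items[j - 1]
        j -= 1
    return i


a = ["A", "B", "C", "D"]
move_to_front(a, "C")
print(a)

b = ["A", "B", "C", "D"]
transpose(b, "C")
print(b)

c, counts = ["A", "B", "C", "D"], {}
count_access(c, counts, "D")
count_access(c, counts, "B")
print(c, counts)
```

Move-to-front reacts fastest to a change in access pattern, transpose moves records more gradually, and the count method needs extra storage for the counters.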