CPSC 461: Copyright (C) 2003 Katrin Becker Last Modified June 21, 2003 03:15 PM
Nyberg, Barclay, Cvetanovic, Gray, Lomet
"AlphaSort: A Cache-Sensitive Parallel External Sort"
- Summary
Keywords:
cache-sensitive, memory-intensive, clustered data structures, cache locality, file striping, QuickSort, replacement-selection, MinuteSort, PennySort
- The Sort Benchmark:
- Input is a disk-resident file of one million 100-byte records
- Records have 10-byte key fields and can't be compressed
- The Input record keys are in random order
- The output file must be a permutation of the input file sorted in key-ascending order.
Performance metric is the elapsed time of the following seven steps:
- Launch the Sort Program
- Open the Input File and Create the Output File
- Read the Input File
- Sort the records in key-ascending order
- Write the output file
- Close the Files
- Terminate the Program
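The benchmark's input and its correctness condition can be sketched as follows. This is only an illustration: `make_records` and `is_valid_output` are hypothetical helper names, and the random payload is an assumption (the notes only require 100-byte records with random, incompressible 10-byte keys).

```python
import random

RECORD_SIZE = 100  # bytes per record, per the benchmark
KEY_SIZE = 10      # leading bytes of each record form the key


def make_records(n, seed=0):
    """Generate n records with random 10-byte keys (hypothetical generator)."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(RECORD_SIZE))
            for _ in range(n)]


def is_valid_output(input_records, output_records):
    """True if the output is a permutation of the input in key-ascending order."""
    keys = [rec[:KEY_SIZE] for rec in output_records]
    in_order = all(a <= b for a, b in zip(keys, keys[1:]))
    same_records = sorted(input_records) == sorted(output_records)
    return in_order and same_records
```

A real run would time all seven steps, including program launch and file open/close, not just the sort itself.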
Bottlenecks during a sort:
Typical Memory Hierarchy:
- Registers
- On-chip instruction and data caches (I-cache and D-cache)
- unified (program & data) CPU-Board Cache (B-cache)
- Main memory
- Disks
- near-line & off-line storage
Optimizing the use of processor cache on a sort:
- sort record groups as they arrive from disk
- sort (key-prefix, pointer) pairs rather than the whole record
- merge the sorted runs using a replacement-selection tree
Replacement Selection:
- Replacement-selection sort produces runs that are (on average) twice as large as memory.
- worst-case behaviour very close to average behaviour
- terrible cache locality (this can be addressed by paging the tree)
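A minimal sketch of replacement-selection run generation, assuming a binary heap stands in for the selection tree and that `memory_size` (a hypothetical parameter) is at least 1 and counts how many keys fit in memory. It is meant to show why runs can grow beyond memory size, not to reproduce the paper's implementation.

```python
import heapq

_DONE = object()  # sentinel marking the end of the input


def replacement_selection_runs(items, memory_size):
    """Split `items` into sorted runs using a heap of at most `memory_size` keys."""
    it = iter(items)
    # Heap entries are (run_number, key), so keys tagged for the next run
    # sort after every key that still belongs to the current run.
    heap = []
    for key in it:
        heap.append((0, key))
        if len(heap) == memory_size:
            break
    heapq.heapify(heap)

    runs, current, current_run = [], [], 0
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current_run:
            # The current run is exhausted; start the next one.
            runs.append(current)
            current, current_run = [], run_no
        current.append(key)
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            # A new key smaller than the one just written must wait for the next run.
            tag = run_no if nxt >= key else run_no + 1
            heapq.heappush(heap, (tag, nxt))
    if current:
        runs.append(current)
    return runs
```

Each incoming key that is not smaller than the last key written extends the current run, which is how runs end up longer than the memory that holds them. The heap accesses jump around memory, which is the cache-locality problem noted above.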
QuickSort:
- QuickSort produces runs that are half as large as memory.
- Worst-case behaviour is O(N^2)
- In practice, it still has superior performance.
- The Sorts:
- Record Sort
- This is the sort with which we are most familiar.
- The records themselves are compared against each other and moved around until they are all “in place”.
- *many accesses*, *many moves*, *no space overhead*
- Pointer Sort
- Place pointers to the records in the correct order without moving the records themselves.
- When done, retrieve and move the records.
- Still requires a record access for each compare.
- *many accesses*, *one move*, *little space overhead*
- Key/Pointer Sort
- Get keys, and place key-pointer pairs in the correct order without moving the records themselves.
- When done, retrieve and move the records.
- Compares need no record access, because the full key travels with the pointer.
- *one access*, *one move*, *moderate space overhead*
- Key-Prefix Pointer Sort
- Get prefixes, and place prefix-pointer pairs in the correct order without moving the records themselves. Record access required when prefixes are the same.
- When done, retrieve and move the records.
- Most compares are resolved from the prefix alone, with no record access.
- *few accesses*, *one move*, *modest space overhead*
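The key-prefix pointer sort can be sketched as below. The 4-byte `PREFIX_SIZE` is an assumption chosen for illustration, "pointers" are simply list indices, and Python's built-in sort stands in for QuickSort.

```python
from functools import cmp_to_key

KEY_SIZE = 10    # key length from the benchmark
PREFIX_SIZE = 4  # assumed prefix length, chosen for illustration


def key_prefix_pointer_sort(records):
    """Return the records in key order by sorting (prefix, pointer) pairs."""
    # Each pair is small, so many of them fit in cache at once.
    pairs = [(rec[:PREFIX_SIZE], i) for i, rec in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1
        # Prefix tie: only now touch the records to compare the full keys.
        ka, kb = records[a[1]][:KEY_SIZE], records[b[1]][:KEY_SIZE]
        return (ka > kb) - (ka < kb)

    pairs.sort(key=cmp_to_key(compare))
    # One move per record: gather the records in sorted pointer order.
    return [records[i] for _, i in pairs]
```

With random keys, prefix ties are rare, so almost every compare stays within the compact pair array.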

- Shared-Memory Multiprocessor Optimizations:
- - poses a different situation from a multiprocessor system where each processor has its own memory (that case behaves more like a network)
- - break up the sorting work into independent chores that can be handled by "the workers"
- Chores that can be done independently:
- - generating the arrays of prefix-pointer pairs
- - QuickSort the runs (each worker gets one run)
- ROOT merges all key-prefix/pointer pairs to produce a string of sorted pointers
- - gather records into output buffers
- ROOT writes them
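The division of chores between the workers and the root can be sketched as follows. This is an illustration under several assumptions: threads from `concurrent.futures` stand in for the workers, full keys are used instead of key prefixes for brevity, and `run_size` and `buffer_records` are hypothetical parameters. In CPython the threads will not run the sorts truly in parallel, so the sketch shows the structure only, not the speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

KEY_SIZE = 10  # key length from the benchmark


def sort_chore(records, start, stop):
    """Worker chore: build (key, pointer) pairs for one run and sort them."""
    pairs = [(records[i][:KEY_SIZE], i) for i in range(start, stop)]
    pairs.sort()
    return pairs


def gather_chore(records, pointers):
    """Worker chore: copy records into one output buffer in pointer order."""
    return b"".join(records[i] for i in pointers)


def parallel_sort(records, run_size, buffer_records, workers=4):
    """Sort shared in-memory records; returns the bytes the root would write."""
    bounds = [(s, min(s + run_size, len(records)))
              for s in range(0, len(records), run_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers sort their runs independently of each other.
        runs = list(pool.map(lambda b: sort_chore(records, *b), bounds))
        # Root merges the sorted runs into one stream of pointers.
        pointers = [i for _, i in heapq.merge(*runs)]
        # Workers gather records into buffers; root emits the buffers in order.
        chunks = [pointers[s:s + buffer_records]
                  for s in range(0, len(pointers), buffer_records)]
        buffers = pool.map(lambda c: gather_chore(records, c), chunks)
        return b"".join(buffers)
```

The merge is the one step that is not split up, which is why it is left to the root.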
- File Striping:
- - breaking up file across multiple devices
- - bandwidth growth is near linear until a controller saturates (one controller can cope with more than one disk)
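A minimal sketch of round-robin striping, assuming in-memory byte arrays stand in for the devices and `stripe_unit` (a hypothetical parameter) is the number of bytes placed on one device before moving to the next.

```python
def stripe(data, num_devices, stripe_unit):
    """Distribute `data` round-robin across devices in stripe_unit-sized pieces."""
    devices = [bytearray() for _ in range(num_devices)]
    for n, offset in enumerate(range(0, len(data), stripe_unit)):
        devices[n % num_devices] += data[offset:offset + stripe_unit]
    return devices


def unstripe(devices, stripe_unit, total_length):
    """Reassemble the original byte stream by reading the devices round-robin."""
    out = bytearray()
    n = 0
    while len(out) < total_length:
        dev = devices[n % len(devices)]
        # Piece n is the (n // device_count)-th piece stored on its device.
        pos = (n // len(devices)) * stripe_unit
        out += dev[pos:pos + stripe_unit]
        n += 1
    return bytes(out)
```

Because consecutive pieces live on different devices, a sequential read can keep every device busy at once, which is where the near-linear bandwidth growth comes from.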
- General Comments on Cache Misses:
- - a program that doesn't fit in the cache will suffer from either instruction or data cache misses
- - can reduce cache-misses when using a tree by clustering the nodes of the tree
- - can reduce cache misses by using a "line-list": like a linked list, but each "line" is the size of a cache line and holds several items
- - watch out for potential fragmentation problems (sound familiar?)
- - on the other hand, line-lists can improve memory utilization by reducing the number of pointers required
- - can be used effectively when the data being accessed is bigger than cache
- - can also cluster items B-tree style (but in this case it is far more static and therefore easier to maintain)
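The line-list idea can be sketched as below. `LINE_CAPACITY` is an assumed figure for how many items fit in one cache line, and the class names are hypothetical. Python lists do not map onto hardware cache lines, so this shows the structure only: several items per node, one pointer per node.

```python
LINE_CAPACITY = 7  # assumed number of items that fit in one cache line


class Line:
    """One node of the line-list: several items plus a single next pointer."""

    def __init__(self):
        self.items = []
        self.next = None


class LineList:
    def __init__(self):
        self.head = self.tail = Line()
        self.lines = 1

    def append(self, item):
        # Start a new line only when the last one is full.
        if len(self.tail.items) == LINE_CAPACITY:
            new_line = Line()
            self.tail.next = new_line
            self.tail = new_line
            self.lines += 1
        self.tail.items.append(item)

    def __iter__(self):
        line = self.head
        while line is not None:
            yield from line.items
            line = line.next
```

Twenty items need three lines and three next pointers here, where a plain linked list would need twenty. A partially filled last line is the fragmentation cost.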