- CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 24, 2003 11:38 PM
Introduction
-
- Why is knowledge of file structure and processing important?
- computers have relatively little main memory (RAM)
- files are stored on disk and tape
- DISK VS RAM:
- disks - enormous capacity
- non-volatile
- cheaper
- SLOW
- can store hundreds or thousands of files; maybe millions of records per file
ORGANIZATION IS KEY
The analogy used in Folk & Zoellick's book is a good one; here is a slightly modified version. Suppose a RAM access for some piece of information is equivalent to sitting at a desk and reaching for a nearby book. If that 'information retrieval' takes 20 seconds, then the equivalent of a disk access is going to a store, having them special-order the book from the publisher, waiting for it to arrive, and then bringing it home. The elapsed time for this operation is 58 DAYS! Clearly, we want to do this as rarely as possible: minimize the number of disk accesses required and maximize the amount of USEFUL information retrieved by each read from disk. Ideally, we would need only 1 disk access and would fill memory with only the information we actually need. This ideal is rarely attainable, but it is the goal to keep in mind.
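The arithmetic behind the analogy can be checked with a few lines of Python. This is only a sketch: the 20-second and 58-day figures come from the analogy above, while the 100 ns RAM access time is an assumed value chosen for illustration, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60

ram_analogy_s = 20                       # reaching for a book on the desk
disk_analogy_s = 58 * SECONDS_PER_DAY    # special-ordering the book

# How many times slower the "disk" step is than the "RAM" step.
ratio = disk_analogy_s / ram_analogy_s

# ASSUMPTION: a RAM access of about 100 ns (illustrative only).
assumed_ram_access_ns = 100
implied_disk_access_ms = assumed_ram_access_ns * ratio / 1_000_000

print(f"disk/RAM ratio implied by the analogy: {ratio:,.0f}x")
print(f"implied disk access time: {implied_disk_access_ms:.1f} ms")
```

Under these assumptions the ratio works out to roughly a quarter of a million, which puts a single disk access in the tens of milliseconds.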
- static files fairly easy to work with (but static files are rare)
- more common are files that grow, shrink, and change
- we can still study their behaviour - the goal is to minimize access time for MOST cases
- file structure development follows historical development:
- From the article: Nyberg, Barclay, Cvetanovic, Gray, and Lomet, "AlphaSort: A Cache-Sensitive Parallel External Sort", VLDB Journal, 4, 603-627 (1995)
-
- An Historical Perspective:
-
- Many techniques designed to manipulate information in and between files were first developed when machines were limited and slow and secondary storage devices were crude and slow.
- - most processing was accomplished in a batch environment
- - large files were stored on tape (magnetic tape storage = sequential access ONLY & ALWAYS)
- - networks were not an issue (they weren't even part of the picture)
- - algorithms assumed exclusive access to the system and resources
-
- As the hardware and the discipline matured (disks = direct access -> an index is now reasonable), it became possible to manage bigger and more complicated files (bigger files = more complicated indices), which in turn called for more sophisticated ways of organizing and accessing information within files (binary trees = access time reduced to the order of log2 N, but... only full (balanced) trees are efficient, and trees get lop-sided, so we come to... AVL trees, then B-trees, then B* trees)
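To make the progression from binary trees to B-trees concrete, here is a rough sketch that counts the levels a search must descend. It assumes an idealized, completely full tree and one disk access per level visited; the helper name `levels_needed` and the figure of 255 keys per node are hypothetical, chosen only for illustration. Real B-trees are not always full, so they can be somewhat taller.

```python
def levels_needed(n_keys, keys_per_node):
    """Smallest number of levels of a completely full tree that holds n_keys.

    A node with k keys has k + 1 children, so a full tree with h levels
    holds (k + 1) ** h - 1 keys.
    """
    fanout = keys_per_node + 1
    levels, capacity = 0, 0
    while capacity < n_keys:
        levels += 1
        capacity = fanout ** levels - 1
    return levels

n = 1_000_000
print(levels_needed(n, 1))     # balanced binary tree: 20 levels
print(levels_needed(n, 255))   # multiway tree, 255 keys per node: 3 levels
print(n)                       # completely lop-sided binary tree: n levels
```

If every level costs a disk access, the wide node is the difference between 20 accesses and 3, which is why the larger fan-out of B-trees matters for files on disk.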
-
- Today it is often unnecessary, and sometimes not even possible, to control directly how files are physically laid down on a device. The O/S now handles most of this in cooperation with the device controller, usually better (more efficiently) than you could. Most systems are now multi-user systems, possibly with multiple processors or multiple servers, and having to share resources with other processes adds a level of complication that takes many of the choices we could otherwise make out of our hands.
-
- You've all been told:
- - clear, easy to maintain code is of paramount importance
- - machines and resources are cheap; people are expensive
- - code must be modular, re-usable
-
- Well, here's a little secret...
- Efficiency DOES count, especially when
- 1. processing large amounts of data
- 2. doing large numbers of calculations
- 3. running very large programs
-
- In this course, we will concern ourselves most with #1, but the concepts learned are applicable elsewhere too.
-
- To do this right, you need to:
- - know something about the devices we use (the treatment in this course will be superficial):
- - how information is laid out (physically)
- - how information is accessed
- - know something about the operating systems you use:
- - how files are organized
- - how requests are processed
-
- With this information you can make an informed decision about:
- - configuration of devices (one disk, a RAID, CDs, tape, some hybrid, etc.)
- - which algorithms to use for this application (use some well-known approach or create your own hybrid)
- - how to organize data so access is as efficient as possible (indexed records, hash files, signatures, or another hybrid...)
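As a small taste of the last point, the sketch below shows one simple way to index fixed-length records so that a record can be fetched with a single seek instead of a sequential scan. Everything here is an assumption made for illustration: the record layout (a 4-byte key plus a 12-byte name), the helper names `write_records` and `fetch`, and the use of an in-memory `io.BytesIO` buffer standing in for a file on disk.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte integer key + 12-byte name.
RECORD = struct.Struct("<i12s")

def write_records(f, rows):
    """Write records sequentially; return an index of key -> byte offset."""
    index = {}
    for key, name in rows:
        index[key] = f.tell()
        f.write(RECORD.pack(key, name.encode("ascii")))
    return index

def fetch(f, index, key):
    """Direct access: one seek and one read, no scan of earlier records."""
    f.seek(index[key])
    k, raw = RECORD.unpack(f.read(RECORD.size))
    return k, raw.rstrip(b"\x00").decode("ascii")

f = io.BytesIO()   # stands in for a file on disk
index = write_records(f, [(17, "Folk"), (42, "Zoellick"), (8, "Gray")])
print(fetch(f, index, 42))
```

The index here lives entirely in memory as a dictionary; once an index is too large for that, it has to be organized on disk as well.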