461 - Introduction

CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 24, 2003 11:38 PM

Introduction


Why is knowledge of file structure and processing important?
computers have relatively little main memory (RAM)
files stored on disk and tape
DISK VS RAM:
disks - enormous capacity
non-volatile
cheaper
SLOW
can store hundreds or thousands of files; maybe millions of records per file

ORGANIZATION IS KEY

The analogy used in Folk & Zoellick's book is a good one: here's a slightly modified version. Let's say that RAM access for some 'bit' of information is equivalent to our sitting at a desk and reaching for a nearby book. If the elapsed time for this 'information retrieval' is 20 seconds, then an operation equivalent to a disk access would be like going to a store and having them special order the book from the publisher, waiting for it to arrive and then bringing it home. The elapsed time for this operation is 58 DAYS! Clearly, we want to minimize the number of times we have to do this. We want to minimize the number of disk accesses required and maximize the amount of USEFUL information retrieved during each read from disk. Ideally, we would require only 1 disk access and be able to fill memory with only information we actually need. This ideal is rarely possible, but this is the goal to keep in mind.
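The arithmetic behind the analogy can be checked with a short sketch. The ratio below is derived only from the two figures quoted above (20 seconds and 58 days); it is not a measurement of any particular hardware, and the helper name `total_wait_days` is made up for illustration.

```python
# Figures taken from the analogy: a 20-second "RAM access"
# and a 58-day "disk access".
SECONDS_PER_DAY = 24 * 60 * 60

ram_analogy_seconds = 20
disk_analogy_seconds = 58 * SECONDS_PER_DAY

# Slowdown factor implied by the analogy (disk time / RAM time).
implied_ratio = disk_analogy_seconds / ram_analogy_seconds


def total_wait_days(disk_accesses: int) -> float:
    """Time spent, on the analogy's scale, for a number of disk accesses."""
    return disk_accesses * disk_analogy_seconds / SECONDS_PER_DAY


print(f"implied disk/RAM ratio: {implied_ratio:,.0f}")
print(f"10 disk accesses on the analogy's scale: {total_wait_days(10):.0f} days")
```

On this scale a disk access costs roughly a quarter of a million RAM accesses, which is why each extra trip to the disk matters far more than any extra work done in memory.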

static files fairly easy to work with (but static files are rare)
more common are files that grow, shrink, and change

can still study behaviour - goal is to minimize access time for MOST cases

file structure development follows historical development:

From: Nyberg, Barclay, Cvetanovic, Gray, Lomet, "AlphaSort: A Cache-Sensitive Parallel External Sort", VLDB Journal, 4, 603-627 (1995)


An Historical Perspective:

Many techniques designed to manipulate information in and between files were first developed when machines were limited and slow and secondary storage devices were crude and slow.

- most processing was accomplished in a batch environment
- large files were stored on tape (magnetic tape storage = sequential access ONLY & ALWAYS)
- networks were not an issue (they weren't even part of the picture)
- algorithms assumed exclusive access to the system and resources

As the hardware and the discipline matured (disks = direct access -> an index is now reasonable), it became possible to manage bigger and more complicated files (bigger files = more complicated indices), so more sophisticated ways of organizing and accessing information within files were needed (binary trees = access time reduced to the order of log₂N, but... only full trees are efficient, and trees get lop-sided, so we come to... AVL trees, then B-trees, then B*-trees)
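The remark that only full trees are efficient, and that trees get lop-sided, can be illustrated with a minimal sketch. It builds a plain, unbalanced binary search tree (not an AVL tree or a B-tree) and compares its height for two insertion orders. The function name `bst_height`, the key count and the shuffle seed are all assumptions chosen for the illustration.

```python
import math
import random


def bst_height(keys):
    """Insert keys into an unbalanced binary search tree; return its height in levels."""
    root = None  # each node is a list: [key, left_child, right_child]
    height = 0
    for key in keys:
        if root is None:
            root = [key, None, None]
            height = 1
            continue
        node, depth = root, 1
        while True:
            branch = 1 if key < node[0] else 2  # 1 = left, 2 = right
            if node[branch] is None:
                node[branch] = [key, None, None]
                height = max(height, depth + 1)
                break
            node, depth = node[branch], depth + 1
    return height


n = 1000
sorted_keys = list(range(n))
shuffled_keys = sorted_keys[:]
random.Random(461).shuffle(shuffled_keys)  # arbitrary fixed seed

# Height of the shallowest binary tree that can hold n keys.
best_possible = math.ceil(math.log2(n + 1))

print("minimum possible height:  ", best_possible)
print("height after sorted input:  ", bst_height(sorted_keys))
print("height after shuffled input:", bst_height(shuffled_keys))
```

With keys arriving in sorted order the tree degenerates into a chain, so a search visits up to N nodes instead of about log₂N. Balanced structures such as AVL trees and B-trees exist to prevent exactly this.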

Today it is often not necessary, and sometimes not even possible, to get direct access to control how files are physically laid down on a device. The O/S now handles most of it in cooperation with the device controller, usually better (more efficiently) than you could. Most systems are now multi-user systems, possibly with multiple processors or multiple servers, and having to share resources with other processes adds a level of complication that takes many of the choices we could otherwise make out of our hands.

You've all been told:
- clear, easy to maintain code is of paramount importance
- machines and resources are cheap; people are expensive
- code must be modular, re-usable

Well, here's a little secret... Efficiency DOES count, especially when:
1. processing large amounts of data
2. doing large numbers of calculations
3. programs are very large

In this course, we will concern ourselves most with #1, but the concepts learned are applicable elsewhere too.

To do this right, you need to:
- know something about the devices we use (the treatment in this course will be superficial):
  - how information is laid out (physically)
  - how information is accessed
- know something about the operating systems you use:
  - how files are organized
  - how requests are processed

With this information you can make an informed decision about:
- configuration of devices (one disk, a RAID, CDs, tape, some hybrid, etc.)
- which algorithms to use for this application (use some well-known approach or create your own hybrid)
- how to organize data so access is as efficient as possible (indexed records, hash files, signatures, or another hybrid...)
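As one illustration of the last point, here is a minimal sketch of indexed records next to a plain sequential scan. Everything in it is an assumption made for the example: the 16-byte record layout, the function names, and the use of an in-memory `io.BytesIO` buffer standing in for a file on disk. It is not the method of any particular system.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte unsigned key + 12-byte name.
RECORD = struct.Struct("<I12s")


def build_file(rows):
    """Write records back to back; return (file, index of key -> byte offset)."""
    f = io.BytesIO()  # stands in for a file on disk
    index = {}
    for key, name in rows:
        index[key] = f.tell()
        f.write(RECORD.pack(key, name.encode("ascii")))
    return f, index


def fetch_indexed(f, index, key):
    """One seek and one read: jump straight to the record's offset."""
    f.seek(index[key])
    k, raw = RECORD.unpack(f.read(RECORD.size))
    return k, raw.rstrip(b"\x00").decode("ascii")


def fetch_sequential(f, key):
    """Read records from the start until the key turns up; return (record, reads)."""
    f.seek(0)
    reads = 0
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            return None, reads  # reached end of file without a match
        reads += 1
        k, raw = RECORD.unpack(chunk)
        if k == key:
            return (k, raw.rstrip(b"\x00").decode("ascii")), reads


rows = [(k, f"rec{k}") for k in range(100, 200)]
data_file, key_index = build_file(rows)

print(fetch_indexed(data_file, key_index, 199))
print(fetch_sequential(data_file, 199))
```

The indexed lookup touches one record regardless of where it sits in the file, while the sequential scan reads every record ahead of it. The price is that the index itself has to be stored and kept up to date as the file changes.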