CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified June 15, 2000 02:25 PM

FILE ANATOMY

File:
- collection of bits, usually logically related
- all files need some way of marking the end
How?
1. know size (advantages? disadvantages?)
2. EOF marker (advantages? disadvantages?)
 
Parts of a File
 
Unstructured Files:
Stream Files - basically strings of bytes
- simple to create
- (usually) inefficient to manipulate
- access to information is sequential/ addressable by byte #
- additions typically go at the end
- mods and deletes can be problematic
- often updated by making a new copy
 
Structured Files ("data set"):
- have a logical internal structure
Types:
Record File
Self-Describing Files
Meta-data
Raster/ Vector Images
Mixed Object Files
 
access varies by type
 
Can do object-oriented file access BUT need to be able to recognize, read, convert all conceivable types.
 
Record Files:
- collection of related records
 
Record is collection of related fields treated as a unit
Properties/ Attributes:
- size/ length (fixed or not)
- # of fields (fixed or not)
- can be indexed (ID number)
- has some way to separate records (delimiters/ self-describing)
 
Field is a unit of information (logical rather than physical - conceptual tool)
Properties/ Attributes:
- size/ length (fixed or not)
- has some way to separate fields (delimiters/ self-describing)
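
The two ways of separating fields named above (delimiters vs. self-describing) can be sketched as follows. This is only an illustration: the "|" delimiter, the 2-byte length prefix and all function names are assumptions, not part of the course material.

```python
import struct

def pack_delimited(fields, delim="|"):
    """Join fields with a delimiter; a newline ends the record."""
    return delim.join(fields) + "\n"

def unpack_delimited(line, delim="|"):
    """Split a delimited record back into fields."""
    return line.rstrip("\n").split(delim)

def pack_length_prefixed(fields):
    """Self-describing form: each field is preceded by its 2-byte length."""
    out = bytearray()
    for field in fields:
        raw = field.encode("utf-8")
        out += struct.pack("<H", len(raw)) + raw
    return bytes(out)

def unpack_length_prefixed(data):
    """Walk the record, reading a length and then that many bytes."""
    fields, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from("<H", data, pos)
        pos += 2
        fields.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return fields

fields = ["Ames", "Mary", "123 Maple|Unit 4"]   # last field contains the delimiter
print(len(unpack_delimited(pack_delimited(fields))))                # 4 (one field too many)
print(unpack_length_prefixed(pack_length_prefixed(fields)) == fields)  # True
```

The delimited form is simpler to create and to read by eye, but breaks when the data contains the delimiter; the length-prefixed form costs 2 bytes per field and cannot be scanned for a field without walking the lengths.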
 
Keys:
many record files have records that contain a key
- key is a field used as ID or to control its use
- one key must be the primary key
- unique
- unmodifiable
- canonical
- dataless
 
Files are always variable length. They may or may not have a header record which may or may not be different from the other records.
 
Record Access
- direct access requires some kind of key
- to be useful, key must be unique
- simple keyed-sequential file behaves much like an array; if keys are sorted but not predictable as to location (like names) it is like a simple sorted list; searches are sequential, O(N)
- performance can be improved with blocking but only to a point
- keyed-sequential fine for small files; files that are searched infrequently; files where most common access is for contiguous groups of records
 
direct access requires location of record you need
can be done with index
can be done with relative record number (RRN) - IF record size is known and fixed
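
A minimal sketch of RRN access, assuming fixed 32-byte records, 0-based RRNs and no header record (all three are assumptions for illustration). The byte offset of a record is simply RRN x record size, so one seek finds it.

```python
import io

RECORD_SIZE = 32  # hypothetical fixed record length in bytes

def make_record(text, record_size=RECORD_SIZE):
    """Pad a short string with spaces to exactly one record."""
    raw = text.encode("ascii")
    if len(raw) > record_size:
        raise ValueError("text does not fit in one record")
    return raw.ljust(record_size, b" ")

def read_by_rrn(f, rrn, record_size=RECORD_SIZE):
    """Seek directly to record number `rrn` (0-based) and read it."""
    f.seek(rrn * record_size)
    data = f.read(record_size)
    if len(data) < record_size:
        raise IndexError(f"no record with RRN {rrn}")
    return data

# An in-memory file stands in for a disk file.
f = io.BytesIO()
for name in ["Ames", "Baker", "Chen"]:
    f.write(make_record(name))

print(read_by_rrn(f, 2).rstrip())  # b'Chen'
```

If the file has a header record, its size would be added to the offset before seeking.
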
Choosing Record Structures and Length (fixed size)
- can have fixed fields or variable fields (implications/ tradeoffs?)
- if fixed fields: must leave enough room to store required info without wasting space - potential for being unable to store all we need to
- if variable length fields: leave enough total room and field locations will be variable - more overhead during creation; some overhead in finding fields but may be possible to leave enough space in record as a whole to accommodate all fields (except in pathological cases)
- can speed access to individual fields by mixing fixed and variable fields - some values are by nature of known, fixed length - so put fixed fields at start of record and variable ones at end.
- actual field order need bear no relation to the order in which fields are used
 
Other kinds of Files
Self-Describing Files have some kind of header that describes the structure of the file
- need to know how to read the header or the file is useless
- e.g. each record or field can have an embedded label that is recognizable
 
Metadata (meta X ::= X about X)
- (header) data in file describing primary data in file
e.g. FITS (flexible image transport system) for astronomical data
 
Raster/ Vector Images - requires metadata or known format
 
Mixed Object Files - need a header that identifies objects
trade-off is overhead in file versus overhead as separate files
 
 
Program Management Considerations
 
Efficiency - space and time
Correctness
Portability - O/S, Language, Architecture
Maintainability - mods and lifespan
Comprehensibility
Robustness (reliable, general)
 
Program Development
33% design
17% coding and entering
50% testing & debugging
Organizing Efficiently
Data Compression - see later section
 
Recovering Wasted Space in Record Files
What if updated record is bigger than original?
Put new data at end of file with pointer (increased complexity) or put new record at end of file with pointer (wasted space).
Eventually will need to compact.
Deletion of Records: must be able to recognize deleted records - mark them - need code to recognize deleted records BUT may be able to 'undelete'
How to reuse space?
Compaction = periodic reclamation;
Simplest method - write records (one at a time or build into blocks) to new file.
In place - need to free some space in file then start rewriting.
When? # deleted records = X; or by the clock (end of fiscal year)
 
Dynamic Space Reclamation - when compaction not useful (volatile/interactive)
FIXED RECORDS: (can use RRNs to locate)
deleted records must be marked
must be able to find them
Method 1: sequential search to find first available space
Method 2: know if there are slots and if so where
use simple linked list with head in file header
use linked list handled as stack
delete_rec procedure that marks a record as deleted and adds it to the avail list
write get_location function that returns RRN of reusable slot or next record to append to file; whichever is appropriate
 
VARIABLE LENGTH RECORDS:
can mark them in the same way
trickier to locate
doing deletion is straightforward
writing updated/new record a little more complicated
use byte-offset of record instead of RRN
stack not practical because we'll have to find a slot of adequate size
still write get_location function but needs to search avail-list for appropriate slot
 
First Fit Placement Strategy: if avail-list is unsorted linked list and we take first space that's big enough can end up with tiny fragments that are totally unusable (internal fragmentation if unused space not added to avail-list; external fragmentation if it is)
Best Fit Placement Strategy: find best fit (may be good to sort avail-list) : still leaves little holes
Worst Fit Placement Strategy: take biggest available slot - leaves bigger holes
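
The three placement strategies can be compared with a small sketch. Here the avail-list is modelled as an in-memory Python list of (byte offset, size) pairs rather than a linked list in the file, and the leftover part of a slot is always returned to the list; both choices are simplifying assumptions.

```python
def first_fit(avail, need):
    """Index of the first slot that is big enough, or None."""
    for i, (_, size) in enumerate(avail):
        if size >= need:
            return i
    return None

def best_fit(avail, need):
    """Index of the smallest slot that is big enough, or None."""
    fits = [i for i, (_, size) in enumerate(avail) if size >= need]
    return min(fits, key=lambda i: avail[i][1], default=None)

def worst_fit(avail, need):
    """Index of the largest slot, provided it is big enough, or None."""
    fits = [i for i, (_, size) in enumerate(avail) if size >= need]
    return max(fits, key=lambda i: avail[i][1], default=None)

def get_location(avail, need, choose):
    """Take a slot from the avail list; return its byte offset, or None to append."""
    i = choose(avail, need)
    if i is None:
        return None
    offset, size = avail.pop(i)
    leftover = size - need
    if leftover > 0:
        avail.append((offset + need, leftover))  # unused tail goes back on the list
    return offset

slots = [(100, 40), (300, 12), (500, 90)]
for choose in (first_fit, best_fit, worst_fit):
    avail = list(slots)
    offset = get_location(avail, 10, choose)
    print(choose.__name__, offset, avail[-1])
# first_fit 100 (110, 30)
# best_fit 300 (310, 2)
# worst_fit 500 (510, 80)
```

The leftover sizes (30, 2 and 80 bytes) show the trade-off: best fit leaves a 2-byte hole that is unlikely ever to be reused, while worst fit leaves a hole that is still large enough to hold another record.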
 
PROBLEM: What if we update a record and now it's smaller? If we don't do anything - internal fragmentation and we have no way to reclaim space.
SOLUTION 1: put vacated space in avail-list - external fragmentation
SOLUTION 1A: as above but concatenate adjacent pieces of available space to make bigger slots whenever possible coalescing the holes (works best if list is sorted by location)
SOLUTION 2: 'delete' and re-write the record, thereby putting its slot in avail-list and trying to find a new one
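
Coalescing (SOLUTION 1A) might look like the sketch below, again assuming the avail-list is held as (byte offset, size) pairs. Sorting by location puts neighbouring holes next to each other, so a single pass can merge them.

```python
def coalesce(avail):
    """Merge holes that touch each other; returns a new list sorted by offset."""
    merged = []
    for offset, size in sorted(avail):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This hole starts exactly where the previous one ends: join them.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged

holes = [(130, 20), (100, 30), (200, 10), (150, 5)]
print(coalesce(holes))  # [(100, 55), (200, 10)]
```

Three small holes at 100, 130 and 150 become one 55-byte slot; the hole at 200 is not adjacent to anything and stays as it is.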

GOAL IS TO ACHIEVE BALANCE BETWEEN EFFICIENT USE OF SPACE AND TIME

Finding Things
often need to find record where some field has specific value
Sequential Search = O(N)
Binary Search = O(log N) - needs to be sorted
Problem: Sorting a Disk File is no Small Feat - internal sort (everything in RAM) often not an option - remember sorting requires several passes over the list
Problem: Binary Search on file with a million records still takes about 20 disk accesses!
Problem: Keeping a file sorted is very expensive!
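
To see where "about 20" comes from: each probe of a binary search halves the remaining range, so the worst case for N records is floor(log2 N) + 1 reads. The sketch below counts reads on a small sorted file of fixed-length records; the 8-byte integer-key record layout is an assumption for illustration.

```python
import io
import struct

REC = struct.Struct(">q")  # hypothetical record: a single 8-byte integer key

def binary_search(f, n_records, target):
    """Return (RRN or None, number of records read from the file)."""
    lo, hi, reads = 0, n_records - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * REC.size)              # one seek + read per probe
        (key,) = REC.unpack(f.read(REC.size))
        reads += 1
        if key == target:
            return mid, reads
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, reads

n = 1000
f = io.BytesIO(b"".join(REC.pack(2 * k) for k in range(n)))  # sorted even keys

rrn, reads = binary_search(f, n, 500)
print(rrn)                          # 250
print(reads <= n.bit_length())      # True (never more than floor(log2 n) + 1 reads)
print((1_000_000).bit_length())     # 20 (worst case for a million records)
```

In memory 20 comparisons are trivial, but on disk each probe may cost a separate seek, which is the point of the problem above.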
 
Keysort
also called tag sort
only sort the keys
- create array of keys and pointers to the actual records (means access to the file)
- sort the array
- rewrite the file according to new record order (requires search in file for EACH record)
can make this better by not rewriting the file - SAVE THE INDEX
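
A small keysort sketch, assuming fixed 16-byte records whose first 8 bytes are the key (both sizes are made up for the example). Only the (key, RRN) pairs are sorted; the data file itself is left in its original order and the sorted array is kept as an index.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record length
KEY_SIZE = 8      # hypothetical: the key occupies the first 8 bytes

def build_index(f):
    """One sequential pass: collect (key, RRN) pairs, then sort them in memory."""
    index = []
    f.seek(0)
    rrn = 0
    while True:
        rec = f.read(RECORD_SIZE)
        if len(rec) < RECORD_SIZE:
            break
        index.append((rec[:KEY_SIZE].rstrip(), rrn))
        rrn += 1
    index.sort()
    return index

def read_in_key_order(f, index):
    """Use the saved index to fetch records in key order (one seek per record)."""
    for _, rrn in index:
        f.seek(rrn * RECORD_SIZE)
        yield f.read(RECORD_SIZE)

f = io.BytesIO()
for key, payload in [("Chen", "c"), ("Ames", "a"), ("Baker", "b")]:
    f.write(key.encode().ljust(KEY_SIZE) + payload.encode().ljust(RECORD_SIZE - KEY_SIZE))

index = build_index(f)
print(index)  # [(b'Ames', 1), (b'Baker', 2), (b'Chen', 0)]
print([r[:KEY_SIZE].rstrip() for r in read_in_key_order(f, index)])
# [b'Ames', b'Baker', b'Chen']
```

The index can also be searched with a binary search in memory, after which a single seek retrieves the record.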
 
Pinned Records
a record that can't be moved because other records contain 'physical' pointers to it; if it's moved, those pointers are left dangling
