461 - File Anatomy

CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified June 15, 2000 02:25 PM

FILE ANATOMY

File:


Parts of a File

Unstructured Files:

Stream Files - basically strings of bytes

- simple to create
- (usually) inefficient to manipulate
- access to information is sequential or addressable by byte #
- additions typically go at the end
- mods and deletes can be problematic
- often updated by making a new copy

Structured Files ("data set"):

- have a logical internal structure
Types:

Record File
Self-Describing Files
Meta-data
Raster/ Vector Images
Mixed Object Files

access varies by type

Can do object-oriented file access BUT need to be able to recognize, read, convert all conceivable types.

Record Files:

- collection of related records

Record is collection of related fields treated as a unit

Properties/ Attributes:

- size/ length (fixed or not)
- # of fields (fixed or not)
- can be indexed (ID number)
- has some way to separate records (delimiters/ self-describing)
Field is a unit of information (logical rather than physical - conceptual tool)

Properties/ Attributes:

- size/ length (fixed or not)

- has some way to separate fields (delimiters/ self-describing)
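
The two separation schemes mentioned above (delimiters versus self-describing lengths) might be sketched as follows. The separator characters, the 2-byte length prefix, and all function names are assumptions made for illustration; the delimiter version also assumes no field contains a separator character.

```python
import struct

FIELD_SEP = "|"     # assumed field delimiter
RECORD_SEP = "\n"   # assumed record delimiter

def pack_delimited(records):
    """Join fields with FIELD_SEP and end every record with RECORD_SEP."""
    return "".join(FIELD_SEP.join(fields) + RECORD_SEP for fields in records)

def unpack_delimited(text):
    """Split text back into records, then each record into fields."""
    return [line.split(FIELD_SEP) for line in text.split(RECORD_SEP) if line]

def pack_length_prefixed(fields):
    """Self-describing alternative: each field carries its own 2-byte length."""
    out = b""
    for field in fields:
        out += struct.pack("<H", len(field)) + field
    return out

def unpack_length_prefixed(data):
    """Walk the buffer, reading a length and then that many bytes."""
    fields, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from("<H", data, pos)
        pos += 2
        fields.append(data[pos:pos + n])
        pos += n
    return fields
```

The length-prefixed form costs two bytes per field but lets a field contain any byte value, including what would otherwise be a delimiter.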
Keys:

many record files have records that contain a key
- key is a field used as ID or to control its use
- one key must be the primary key

- unique
- unmodifiable
- canonical
- dataless
Files are always variable length. They may or may not have a header record which may or may not be different from the other records.

Record Access

- direct access requires some kind of key
- to be useful, key must be unique
- simple keyed-sequential file behaves much like an array; if keys are sorted but not predictable as to location (like names) it is like a simple sorted list; searches are sequential, O(N)
- performance can be improved with blocking but only to a point
- keyed-sequential is fine for small files; files that are searched infrequently; files where the most common access is for contiguous groups of records

direct access requires location of record you need

can be done with index
can be done with relative record number (RRN) - IF record size is known and fixed
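
A minimal sketch of RRN access, assuming a 16-byte fixed record and an 8-byte file header (both sizes and the function names are illustrative, not part of the notes). With fixed-size records the byte offset is simple arithmetic: header size + RRN * record size.

```python
import io

RECORD_SIZE = 16   # assumed fixed record length in bytes
HEADER_SIZE = 8    # assumed size of a header that precedes the records

def rrn_to_offset(rrn):
    """Byte offset of record number rrn (RRNs start at 0)."""
    return HEADER_SIZE + rrn * RECORD_SIZE

def read_record(f, rrn):
    """Seek straight to the record; no other records are read."""
    f.seek(rrn_to_offset(rrn))
    data = f.read(RECORD_SIZE)
    if len(data) != RECORD_SIZE:
        raise IndexError("no record with RRN %d" % rrn)
    return data

def demo_file():
    """Build a small in-memory file: a header followed by three records."""
    f = io.BytesIO()
    f.write(b"HDR".ljust(HEADER_SIZE))
    for name in (b"alpha", b"beta", b"gamma"):
        f.write(name.ljust(RECORD_SIZE))
    return f
```

This only works because every record has the same known length; with variable-length records the offset cannot be computed and an index is needed instead.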

Choosing Record Structures and Length (fixed size)

- can have fixed fields or variable fields (implications/ tradeoffs?)
- if fixed fields: must leave enough room to store required info without wasting space - potential for being unable to store all we need to
- if variable length fields: leave enough total room and field locations will be variable - more overhead during creation; some overhead in finding fields but may be possible to leave enough space in record as a whole to accommodate all fields (except in pathological cases)
- can speed access to individual fields by mixing fixed and variable fields - some values are by nature of known, fixed length - so put fixed fields at start of record and variable ones at end.
- actual field order need bear no relation to the order in which fields are used
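
One way to picture the mixed layout is sketched below: the fixed fields sit at known offsets at the start of the record, and the variable fields follow. The particular fields (a 4-byte id, a 2-byte year, then two delimited text fields) and the function names are hypothetical.

```python
import struct

FIXED = struct.Struct("<IH")   # hypothetical fixed part: 4-byte id, 2-byte year
FIELD_SEP = "|"                # assumed delimiter for the variable part

def pack_record(rec_id, year, name, note):
    """Fixed fields first, variable-length fields at the end."""
    variable = FIELD_SEP.join((name, note)).encode("utf-8")
    return FIXED.pack(rec_id, year) + variable

def read_fixed(record):
    """Fixed fields are at known offsets, so no scanning is needed."""
    return FIXED.unpack_from(record, 0)

def read_variable(record):
    """Variable fields must be located by scanning for delimiters."""
    return record[FIXED.size:].decode("utf-8").split(FIELD_SEP)
```

A search on the id never has to touch the variable part of a record.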

Other kinds of Files

Self-Describing Files have some kind of header that describes the structure of the file

- need to know how to read the header or the file is useless
- e.g. each record or field can have an embedded label that is recognizable
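
As a rough sketch of a self-describing file, the header below lists the field names and widths, and the reader uses nothing but that header to interpret the records. The header format (a 4-byte length followed by JSON) is an assumption chosen for brevity, not a standard format.

```python
import io
import json

def write_self_describing(f, fields, rows):
    """fields is a list of (name, width); rows is a list of tuples of str."""
    header = json.dumps({"fields": fields}).encode("ascii")
    f.write(len(header).to_bytes(4, "little"))   # so the reader can find the end of the header
    f.write(header)
    for row in rows:
        for (name, width), value in zip(fields, row):
            f.write(value.encode("ascii").ljust(width)[:width])

def read_self_describing(f):
    """Read the header first, then use it to split each record into fields."""
    n = int.from_bytes(f.read(4), "little")
    fields = json.loads(f.read(n))["fields"]
    rec_size = sum(width for _, width in fields)
    rows = []
    while rec_size > 0:
        chunk = f.read(rec_size)
        if len(chunk) < rec_size:
            break
        row, pos = {}, 0
        for name, width in fields:
            row[name] = chunk[pos:pos + width].decode("ascii").rstrip()
            pos += width
        rows.append(row)
    return rows
```

A program that does not understand the header cannot even tell where the first record starts, which is the point made above.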

Metadata (meta X ::= X about X) - (header) data in file describing primary data in file e.g. FITS (flexible image transport system) for astronomical data
Raster/ Vector Images - requires metadata or known format

Mixed Object Files - need a header that identifies objects

trade-off is overhead in file versus overhead as separate files

Program Management Considerations

Efficiency - space and time

Correctness

Portability - O/S, Language, Architecture

Maintainability - mods and lifespan

Comprehensibility

Robustness (reliable, general)

Program Development

33% design
17% coding and entering
50% testing & debugging

Organizing Efficiently

Data Compression - see later section

Recovering Wasted Space in Record Files

What if an updated record is bigger than the original? Either put the extra data at the end of the file with a pointer to it (increased complexity), or put the whole new record at the end of the file with a pointer to it (wasted space). Eventually the file will need to be compacted.
Deletion of Records: must be able to recognize deleted records - mark them - need code to recognize deleted records BUT may be able to 'undelete'
How to reuse space?

Compaction = periodic reclamation;
Simplest method - write records (one at a time or build into blocks) to new file.
In place - need to free some space in file then start rewriting.
When? # deleted records = X; or by the clock (end of fiscal year)
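
The simplest compaction method (copy the live records to a new file) might look like the sketch below. It assumes fixed 8-byte records and a '*' in the first byte as the deletion mark; both are illustrative choices.

```python
import io

RECORD_SIZE = 8        # assumed fixed record length
DELETED_MARK = b"*"    # assumed mark in the first byte of a deleted record

def compact(src, dst):
    """Copy every record that is not marked deleted; return how many were kept."""
    kept = 0
    src.seek(0)
    while True:
        rec = src.read(RECORD_SIZE)
        if len(rec) < RECORD_SIZE:
            break
        if rec[:1] != DELETED_MARK:
            dst.write(rec)
            kept += 1
    return kept
```

Once the copy has been made, 'undelete' is no longer possible, and every RRN after the first deleted record changes.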

Dynamic Space Reclamation - when compaction not useful (volatile/interactive)

FIXED RECORDS: (can use RRN's to locate)

deleted records must be marked
must be able to find them
Method 1: sequential search to find first available space
Method 2: know if there are free slots and, if so, where
- use a simple linked list (the avail list) with its head in the file header
- handle the linked list as a stack
- write a delete_rec procedure that marks a record as deleted and adds it to the avail list
- write a get_location function that returns the RRN of a reusable slot or of the next record to append to the file, whichever is appropriate

VARIABLE LENGTH RECORDS:

can mark them in the same way
trickier to locate
doing deletion is straightforward
writing updated/new record a little more complicated
use byte-offset of record instead of RRN
stack not practical because we'll have to find a slot of adequate size
still write get_location function but needs to search avail-list for appropriate slot

First Fit Placement Strategy: if avail-list is unsorted linked list and we take first space that's big enough can end up with tiny fragments that are totally unusable (internal fragmentation if unused space not added to avail-list; external fragmentation if it is)

Best Fit Placement Strategy: find best fit (may be good to sort avail-list) : still leaves little holes

Worst Fit Placement Strategy: take biggest available slot - leaves bigger holes
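
The three placement strategies can be compared with a small sketch. Here the avail-list is modelled as an in-memory list of (byte offset, size) pairs rather than a linked list in the file, and the allocate helper is hypothetical; it returns any leftover space to the avail-list.

```python
def first_fit(avail, need):
    """Index of the first slot that is big enough, or None."""
    for i, (off, size) in enumerate(avail):
        if size >= need:
            return i
    return None

def best_fit(avail, need):
    """Index of the smallest slot that is big enough, or None."""
    fits = [i for i, (off, size) in enumerate(avail) if size >= need]
    return min(fits, key=lambda i: avail[i][1]) if fits else None

def worst_fit(avail, need):
    """Index of the largest slot, provided it is big enough, or None."""
    fits = [i for i, (off, size) in enumerate(avail) if size >= need]
    return max(fits, key=lambda i: avail[i][1]) if fits else None

def allocate(avail, need, strategy):
    """Take a slot chosen by strategy; put any leftover back on the avail-list.

    Returns the byte offset to write at, or None if the caller should
    append to the end of the file instead.
    """
    i = strategy(avail, need)
    if i is None:
        return None
    off, size = avail.pop(i)
    if size > need:
        avail.append((off + need, size - need))
    return off
```

For example, with slots of 50, 20 and 80 bytes and a 30-byte record, best fit leaves a 20-byte hole while worst fit leaves a 50-byte one that is more likely to be reusable.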

PROBLEM: What if we update a record and now it's smaller? If we don't do anything - internal fragmentation and we have no way to reclaim space.

SOLUTION 1: put vacated space in avail-list - external fragmentation

SOLUTION 1A: as above but concatenate adjacent pieces of available space to make bigger slots whenever possible coalescing the holes (works best if list is sorted by location)

SOLUTION 2: 'delete' and re-write the record, thereby putting its slot in avail-list and trying to find a new one
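
Coalescing, as in SOLUTION 1A, might be sketched like this, again modelling the avail-list as (byte offset, size) pairs. Sorting by location puts adjacent holes next to each other so that one pass can merge them.

```python
def coalesce(avail):
    """Merge holes that touch: one ends exactly where the next begins."""
    merged = []
    for off, size in sorted(avail):          # sort by location
        if merged and merged[-1][0] + merged[-1][1] == off:
            prev_off, prev_size = merged[-1]
            merged[-1] = (prev_off, prev_size + size)
        else:
            merged.append((off, size))
    return merged
```

Holes of 30, 20 and 5 bytes at offsets 100, 130 and 150 become a single 55-byte slot at offset 100.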

GOAL IS TO ACHIEVE BALANCE BETWEEN EFFICIENT USE OF SPACE AND TIME

Finding Things

often need to find record where some field has specific value
Sequential Search = O(N)
Binary Search = O(log N) - needs to be sorted
Problem: Sorting a Disk File is no Small Feat - internal sort (everything in RAM) often not an option - remember sorting requires several passes over the list
Problem: Binary Search on file with a million records still takes about 20 disk accesses!
Problem: Keeping a file sorted is very expensive!
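
The "about 20 disk accesses" figure comes from log2(1,000,000), which is just under 20. The sketch below counts the reads made by a binary search over a sorted file of fixed-size keys; the 8-byte key-only records are an assumption to keep the example short.

```python
import io
import math

KEY_SIZE = 8   # assumed: each record is just an 8-byte key, file sorted by key

def binary_search(f, n_records, key):
    """Return (RRN or None, number of records read from the file)."""
    lo, hi, reads = 0, n_records - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * KEY_SIZE)      # each probe is a separate seek and read
        k = f.read(KEY_SIZE)
        reads += 1
        if k == key:
            return mid, reads
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, reads

def max_probes(n_records):
    """Upper bound on probes for a file of n_records."""
    return math.floor(math.log2(n_records)) + 1
```

Each probe lands in a different part of the file, so on disk every one of them can cost a separate access.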

Keysort

also called tag sort
only sort the keys

- create array of keys and pointers to the actual records (means access to the file)

- sort the array

- rewrite the file according to new record order (requires search in file for EACH record)

can make this better by not rewriting the file - SAVE THE INDEX
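
A keysort that keeps the index instead of rewriting the file might be sketched as follows. The 16-byte records with a 4-byte key at the front, and the function names, are assumptions for illustration.

```python
import io

RECORD_SIZE = 16   # assumed fixed record length
KEY_SIZE = 4       # assumed: the key is the first 4 bytes of each record

def build_index(f):
    """One sequential pass: collect (key, RRN) pairs, then sort them in RAM."""
    index = []
    f.seek(0)
    rrn = 0
    while True:
        rec = f.read(RECORD_SIZE)
        if len(rec) < RECORD_SIZE:
            break
        index.append((rec[:KEY_SIZE], rrn))
        rrn += 1
    index.sort()
    return index

def read_in_key_order(f, index):
    """Use the saved index to visit records in key order; the file is untouched."""
    for key, rrn in index:
        f.seek(rrn * RECORD_SIZE)
        yield f.read(RECORD_SIZE)
```

Only the keys and RRNs are held in memory, so this works even when the records themselves are too large to sort internally.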

Pinned Records

one that can't be moved because other records contain 'physical' pointers to it; if it's moved, those pointers are left dangling