461 - Gaining Control

CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 20, 2000 03:55 PM

Gaining Control of File Structures and I/O

Question: How do we write a binary tree to a file?

Most of the I/O we've done so far is text.

Advantages:
- it's simple
- data can be manipulated with an editor
- 1:1 relationship between what we see and what we are manipulating
- lots of built-in language support for file ops

Disadvantages:
- inefficient if the values aren't actually text or if data is mixed (e.g. numbers and words)
- takes more space (65,535 takes 2 bytes as a 16-bit int vs. 5 bytes as the characters "65535")
- slower: values need to be converted back and forth to be used
- sometimes tricky to get access other than stream (sequential) access
- the data we want to save may be structured (tree, indexed, etc.) and located in memory using pointers, so saving the structure of the data in the file may be awkward
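
To make the space comparison concrete, here is a minimal sketch that counts the bytes each representation needs. The function names are invented for this illustration, and it assumes the platform provides a 16-bit unsigned type as std::uint16_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes needed to store the value as decimal text (digits only, no separator).
std::size_t text_bytes(std::uint16_t value) {
    return std::to_string(value).size();
}

// Bytes needed to store the same value as a raw 16-bit integer.
std::size_t binary_bytes() {
    return sizeof(std::uint16_t);
}
```

The text form also grows with the value (1 to 5 digits here), while the binary form is always the same width, which is what makes byte offsets predictable.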

For efficient File I/O we must take control over the I/O

- so we can put what we want in the file *where* we want (i.e. direct read/write)
- so we can make use of byte offsets

Question: what's wrong with using regular pointers in the file?

What do we need to take control? {in C++}


The default I/O format in C++ is character so we need to be able to define Binary Files...

- open the file as binary with ios::binary

myfile.open( fname, ios::binary | ios::in | ios::out);
This allows us to read and write raw data (floating point values, other numbers, bit strings, etc.)

- we need to be able to position the r/w file pointers:

myfile.seekg( < ± n bytes>, relative_position )
// positions the get (read) pointer n bytes forward or back from relative
// position which could be one of: ios::beg, ios::cur, ios::end
myfile.seekp( < ± n bytes>, relative_position )
// positions the put (write) pointer n bytes forward or back from relative
// position which could be one of: ios::beg, ios::cur, ios::end

- we need to be able to find out where we are now:

myfile.tellg(); myfile.tellp();
// both return a value of type streampos

- we need to be able to read/write "raw" bytes:

myfile.read( buf_ptr, n_bytes ); // buf_ptr is a char*; n_bytes is how many bytes to transfer
myfile.write( buf_ptr, n_bytes );
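
The four ingredients above (binary open, seek, tell, raw read/write) can be combined as in the following sketch. The helper names open_binary, write_at and read_at are assumptions made for this example, and ios::trunc is added only so that the file is created if it does not exist yet. Values are written in the machine's own byte order.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Open (creating or emptying) a file for binary reading and writing.
std::fstream open_binary(const std::string& fname) {
    return std::fstream(fname, std::ios::binary | std::ios::in |
                               std::ios::out | std::ios::trunc);
}

// Write one 4-byte value at the given byte offset from the start of the file.
bool write_at(std::fstream& f, std::streamoff offset, std::uint32_t value) {
    f.seekp(offset, std::ios::beg);
    f.write(reinterpret_cast<const char*>(&value), sizeof value);
    return f.good();
}

// Read one 4-byte value back from the given byte offset.
bool read_at(std::fstream& f, std::streamoff offset, std::uint32_t& value) {
    f.seekg(offset, std::ios::beg);
    f.read(reinterpret_cast<char*>(&value), sizeof value);
    return f.good();
}
```

After three consecutive 4-byte writes starting at offset 0, tellp() reports 12, and any of the three values can be overwritten or re-read directly without touching the others.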

Now let's say we want to be able to create and manipulate an Indexed, Varying-Length Record File. It's a big file so we can't count on it fitting in memory. We want to be sure each physical read will get us something we can use and that we don't end up generating unnecessary physical reads and writes. We also want to make sure that none of our logical records spans two physical records (blocks). One way to do this is to always read and write whole blocks. In this case we will use 2K chunks, which we take to be the block size of the underlying Unix file system (the actual block size varies from system to system).

Our file will consist of the following main parts:

Header
Index
Records
Free List

The Header:

(remember that all "pointers" will actually be byte-offsets in the file)
Record Count
Ptr to 1st Record
Index Count?
Ptr to Index (?list of index blocks?)
Free List Count?
Ptr to Free List (?the free list itself?)

What value range do we need on the Byte-offsets?

(how many bytes can we address with int, long int; do we need it signed?)
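
One way to explore this question is to ask the compiler. The sketch below (the function name is an assumption for this example) reports how many byte positions an offset of a given integer type can name. It is only meaningful for types of 32 bits or fewer, since the result itself is held in an unsigned long long.

```cpp
#include <cstdint>
#include <limits>

// Number of distinct byte positions (0 .. max) an offset of type T can name.
template <typename T>
unsigned long long addressable_bytes() {
    return static_cast<unsigned long long>(std::numeric_limits<T>::max()) + 1ULL;
}
```

A signed 32-bit offset reaches 2 GB and an unsigned one reaches 4 GB; since a file offset is never negative, the sign bit buys nothing unless a negative value is wanted as a "no such record" marker.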

What if Header turns out to be varying-length? (yuk)
Do we still want it at the beginning of the file?
Is there something to be gained by putting it at the end of the file?

The header is one part of the file that will probably be used a lot so it would be worthwhile to 'unpack' it into a local structure at the start of the program and then write it once, just before closing.

The Index:

If we want to control blocking and buffering we can allocate the index within the file in blocks too (maybe > 1). If we do that, we'll want an efficient way to find out which index block we need - we could create an index to the index and keep that in the header (or at least handy).

The header would then want to know the location of each index block (in order) and perhaps the value of the last key in that block. Then we simply check the header to find out which index block we want to read.
How do we unpack a block once it has been read? How do we pack it again for writing?

Keep in mind that the byte-offsets are only useful in positioning the file pointers for reads and writes. Once read into a buffer they are no longer of use until it's time to re-write to the file (whereupon they had better still be accurate).

Now, what info do we need in the index? (let's say the primary key is a 6-digit id number)

- is there any advantage to keeping this key as char? we can store a 6-digit value in 20 bits
- let's say we use 4-bytes for the key, and 4 bytes for the byte offset to the record, how many index entries can we have per block? (256)
- how many blocks are we likely to need at max? (4000) - 6 digits gives us ~1 million
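
The arithmetic behind these three answers can be checked with a few lines of code. This is a sketch only; the constant and function names are invented for the example, and the sizes are the ones assumed in the list above (2K blocks, 4-byte key, 4-byte offset).

```cpp
#include <cstdint>

const std::uint32_t BLOCK_SIZE = 2048;   // bytes per block
const std::uint32_t ENTRY_SIZE = 4 + 4;  // 4-byte key + 4-byte record offset

// How many index entries fit in one block.
std::uint32_t entries_per_block() {
    return BLOCK_SIZE / ENTRY_SIZE;
}

// How many index blocks are needed for n_keys entries (rounded up).
std::uint32_t index_blocks_needed(std::uint32_t n_keys) {
    return (n_keys + entries_per_block() - 1) / entries_per_block();
}

// Smallest number of bits that can hold every value in 0..max_value.
unsigned bits_needed(std::uint32_t max_value) {
    unsigned bits = 0;
    while (max_value > 0) {
        ++bits;
        max_value >>= 1;
    }
    return bits;
}
```

For 1,000,000 keys the exact block count is 3,907, which the notes round to about 4000.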

When does the index get modified (and in what way?)

- when record modified (requiring a move of the record) - change to byte-offset but that's all
- when new record added (new index entry) - what if the index block is full?
- when a record is deleted (delete index entry?)
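
As an illustration of the first two cases, here is a hedged sketch of one unpacked index block held in memory as a sorted vector. IndexEntry, MAX_ENTRIES, insert_entry and update_offset are names made up for this example; what to do when the block is full (split it, chain a new block) is left to the caller, as in the notes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct IndexEntry {
    std::uint32_t key;     // 6-digit id number
    std::uint32_t offset;  // byte offset of the record in the file
};

const std::size_t MAX_ENTRIES = 256;  // entries per 2K index block

static bool key_less(const IndexEntry& e, std::uint32_t key) {
    return e.key < key;
}

// Insert in key order. Returns false, leaving the block unchanged, when the
// block is already full.
bool insert_entry(std::vector<IndexEntry>& block, IndexEntry e) {
    if (block.size() >= MAX_ENTRIES) return false;
    auto pos = std::lower_bound(block.begin(), block.end(), e.key, key_less);
    block.insert(pos, e);
    return true;
}

// A moved record only needs its byte offset changed; the key stays where it is.
bool update_offset(std::vector<IndexEntry>& block, std::uint32_t key,
                   std::uint32_t new_offset) {
    auto pos = std::lower_bound(block.begin(), block.end(), key, key_less);
    if (pos == block.end() || pos->key != key) return false;
    pos->offset = new_offset;
    return true;
}
```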

If we want fast access to the header and index blocks, how about a fixed size header which has a 'pointer' to a list of index blocks. That way, we can open the file, read the n bytes we know comprise the header and then look in the header to see what else we need to 'load' and how big it is. So then our header could look like this:

struct head_struct { // every field is exactly 4 bytes (uint32_t is from <cstdint>)

uint32_t rec_cnt;
uint32_t rec1_ptr;
uint32_t index_cnt;
uint32_t ilist_ptr;
uint32_t free_cnt;
uint32_t flist_ptr;
};

head_struct Header; // total size = 6 x 4 = 24 bytes

myfile.seekg(0, ios::beg);
myfile.read( (char*)& Header, sizeof Header);
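
The "unpack once at the start, write once before closing" idea for the header might look like the sketch below. FileHeader, load_header and save_header are hypothetical names for this example; the fields mirror the ones listed above, and fixed-width types are used so that the 24-byte size is checked at compile time. The header is stored in the machine's own byte order.

```cpp
#include <cstdint>
#include <fstream>

struct FileHeader {
    std::uint32_t rec_cnt;    // number of records
    std::uint32_t rec1_ptr;   // byte offset of the first record
    std::uint32_t index_cnt;  // number of index blocks
    std::uint32_t ilist_ptr;  // byte offset of the i-list
    std::uint32_t free_cnt;   // number of free-list entries
    std::uint32_t flist_ptr;  // byte offset of the free list
};
static_assert(sizeof(FileHeader) == 24, "header must occupy exactly 24 bytes");

// Read the header from the first 24 bytes of the file.
bool load_header(std::fstream& f, FileHeader& h) {
    f.seekg(0, std::ios::beg);
    f.read(reinterpret_cast<char*>(&h), sizeof h);
    return f.good();
}

// Write the header back to the start of the file (call just before closing).
bool save_header(std::fstream& f, const FileHeader& h) {
    f.seekp(0, std::ios::beg);
    f.write(reinterpret_cast<const char*>(&h), sizeof h);
    f.flush();
    return f.good();
}
```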

We would then expect the first i-list 'pointer' to be 24 (does that mean we don't need it?)
What if we keep the header, i-list, and f-list at the end of the file? What about keeping just the header at the beginning and the i- and f-lists at the end?

Let's get back to the index i-list.

It will consist of pairs of 4-byte numbers where the first is an offset to an index block and the second is the value of the last key in that block.
256 of these pairs make up 2K, so it seems reasonable enough to allocate these in block-sized chunks too.
How many i-list blocks will we need at most? One i-list block gives us access to 256 pairs x 256 index entries/pair = 65,536 index entries. With about 1 million possible keys, 1,000,000 / 65,536 is roughly 15.3, so we shouldn't need more than 16 i-list blocks. With such a small number, we could maybe afford to keep them in the header (does this look familiar yet??)
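
Once the i-list is in memory, finding the right index block for a key is a search over the "last key" values. The following is a sketch under the assumption that the i-list entries are kept in ascending key order; IListEntry and find_index_block are names invented for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct IListEntry {
    std::uint32_t block_offset;  // byte offset of an index block
    std::uint32_t last_key;      // largest key stored in that block
};

static bool last_key_less(const IListEntry& e, std::uint32_t key) {
    return e.last_key < key;
}

// Finds the only index block that could contain key: the first block whose
// last key is >= key. Returns false if key is beyond every block.
bool find_index_block(const std::vector<IListEntry>& ilist, std::uint32_t key,
                      std::uint32_t& block_offset) {
    auto pos = std::lower_bound(ilist.begin(), ilist.end(), key, last_key_less);
    if (pos == ilist.end()) return false;
    block_offset = pos->block_offset;
    return true;
}
```

With this in place a lookup costs at most two physical reads after the header is loaded: one for the index block and one for the block holding the record.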

Now what about reading blocks of varying-length records?

- set to appropriate byte offset (nearest 2K boundary: 2048* int(byte-offset/2048))
- read in a block (unsigned char buffer[2048]?)

myfile.read( (char*) buffer, sizeof buffer);
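
Put together, the two steps above could be wrapped up as in this sketch. The names block_start, offset_in_block and read_block are assumptions for the example. The function returns the number of bytes actually read, because the last block of a file may be short.

```cpp
#include <fstream>

const std::streamoff BLOCK_SIZE = 2048;

// Byte offset of the start of the block that contains byte_offset.
std::streamoff block_start(std::streamoff byte_offset) {
    return BLOCK_SIZE * (byte_offset / BLOCK_SIZE);
}

// Position of byte_offset relative to the start of its block.
std::streamoff offset_in_block(std::streamoff byte_offset) {
    return byte_offset % BLOCK_SIZE;
}

// Reads the whole block containing byte_offset; returns the bytes obtained.
std::streamsize read_block(std::istream& f, std::streamoff byte_offset,
                           unsigned char (&buffer)[2048]) {
    f.clear();  // forget any end-of-file state left by an earlier short read
    f.seekg(block_start(byte_offset), std::ios::beg);
    f.read(reinterpret_cast<char*>(buffer), sizeof buffer);
    return f.gcount();
}
```

A record whose byte offset is 5000, for example, lives in the block starting at 4096 and begins 904 bytes into the buffer.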

Suppose our record consists of:

struct Record {

long int ID;
char* name; // null-terminated, allocated with new[]
float gpa;
unsigned char n_of_courses;
char** courses; // array of n_of_courses null-terminated strings

};

In the file, we've organized the record this way:

2 bytes: record size
2 bytes: length of name (n)
1 byte : # of courses
4 bytes: idno
4 bytes: gpa
n+1 bytes: name (null-terminated)
? bytes: (recsize-(14+n)) list of courses, null separated

- we don't know how many records are in a block (we'll worry about that later)
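
The record size implied by this layout can be computed before anything is written. This is a sketch that assumes every course name, including the last, is stored with its terminating null; FIXED_BYTES and packed_size are names made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// record size (2) + name length (2) + course count (1) + idno (4) + gpa (4)
const std::size_t FIXED_BYTES = 2 + 2 + 1 + 4 + 4;

// Total bytes one record occupies in the file.
std::size_t packed_size(const std::string& name,
                        const std::vector<std::string>& courses) {
    std::size_t size = FIXED_BYTES + name.size() + 1;  // name plus its null
    for (std::size_t i = 0; i < courses.size(); ++i)
        size += courses[i].size() + 1;                 // course plus its null
    return size;
}
```

Knowing the size up front lets us check whether the record still fits in the current block before packing it.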

some stuff we'll need for reading and writing:

int nextrec = 0; // byte 'offset' of the current record within the buffer
int nextch; // for stepping through buffer

// plain int and long int vary in size between compilers, so the pointers
// below use the fixed-width types from <cstdint>
uint16_t* intp; // generic pointer to a 2-byte integer
float* fltp; // generic pointer to float - 4 bytes
int32_t* lip; // generic pointer to a 4-byte integer
unsigned char num; // 1 byte number

Record rec; // place for the unpacked record
int recsize; // size of record in file
int namelen; // size of name string in file

****for this to work, we MUST be sure of the size of the types we are using (int, float, etc.)
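
One way to be sure is to make the compiler check. The sketch below assumes a C++11 compiler and the fixed-width types from <cstdint>; the build stops if any of the size assumptions behind the record layout is false. The function name fixed_part_bytes is invented for the example.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

static_assert(CHAR_BIT == 8, "the layout counts 8-bit bytes");
static_assert(sizeof(std::uint16_t) == 2, "record size and name length are 2 bytes");
static_assert(sizeof(std::int32_t) == 4, "id number is 4 bytes");
static_assert(sizeof(float) == 4, "gpa is assumed to be a 4-byte float");

// Size of the fixed-length part of a record:
// record size, name length, course count, id number, gpa.
constexpr std::size_t fixed_part_bytes() {
    return sizeof(std::uint16_t) + sizeof(std::uint16_t) + sizeof(unsigned char)
         + sizeof(std::int32_t) + sizeof(float);
}
```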

Unpack: to get 1 record (the first)

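// Each field is copied out with memcpy rather than read through a cast
// pointer: buffer is unsigned char[], and the 4-byte fields at offsets 5 and 9
// are not aligned. (memcpy is in <cstring>; uint16_t and int32_t in <cstdint>.)
uint16_t in16; // scratch space for a 2-byte field
int32_t in32; // scratch space for a 4-byte field
const char* src = (const char*) buffer; // the buffer viewed as characters

memcpy( &in16, &buffer[0], 2 ); // the 2-byte record size
recsize = in16;

memcpy( &in16, &buffer[2], 2 ); // the 2-byte name char-count
namelen = in16;

rec.n_of_courses = buffer[4]; // 1-byte value: direct copy

memcpy( &in32, &buffer[5], 4 ); // the 4-byte id number
rec.ID = in32;

memcpy( &rec.gpa, &buffer[9], 4 ); // gpa (assumes a 4-byte float)

rec.name = new char[namelen+1]; // make room for the name
strcpy( rec.name, src + 13 );

nextch = 14 + namelen; // nextch = where we are in the buffer
rec.courses = new char*[rec.n_of_courses]; // make room for the course pointers
for (int i = 0; i < rec.n_of_courses; i++)
{

// Q: what if we don't know max length of a course name?
// (here we measure each one in the buffer before allocating)

rec.courses[i] = new char[strlen(src + nextch) + 1];
strcpy( rec.courses[i], src + nextch );
nextch += strlen(rec.courses[i]) + 1;
}

nextrec += recsize;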

Pack: to 'put' 1 record (the first)


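recsize = 14 + strlen(rec.name);

// we'll add the rest as we figure it out

namelen = strlen(rec.name); // length of name

// Each field is copied in with memcpy rather than stored through a cast
// pointer: buffer is unsigned char[], and the 4-byte fields at offsets 5 and 9
// are not aligned. (memcpy is in <cstring>; uint16_t and int32_t in <cstdint>.)
uint16_t out16; // scratch space for a 2-byte field
int32_t out32; // scratch space for a 4-byte field
char* dst = (char*) buffer; // the buffer viewed as characters

out16 = namelen;
memcpy( &buffer[2], &out16, 2 ); // the 2-byte name char-count

buffer[4] = rec.n_of_courses; // 1-byte value: direct copy

out32 = rec.ID;
memcpy( &buffer[5], &out32, 4 ); // the 4-byte id number

memcpy( &buffer[9], &rec.gpa, 4 ); // gpa (assumes a 4-byte float)

strcpy( dst + 13, rec.name ); // the name

nextch = 14 + namelen; // nextch = where we are in the buffer
for (int i = 0; i < rec.n_of_courses; i++)
{

strcpy( dst + nextch, rec.courses[i] );
nextch += strlen(rec.courses[i]) + 1;
}

// nextch now counts every byte used, including the null after the last
// course, which unpacking needs in order to know where that course ends
recsize = nextch;
out16 = recsize;
memcpy( &buffer[0], &out16, 2 ); // record size

nextrec += recsize;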
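
For comparison, here is a self-contained sketch of the same pack and unpack steps that works at any position in the buffer, not just for the first record. It is not the course's reference solution: StudentRecord, pack_record and unpack_record are names invented for this example, std::string and std::vector replace the raw char pointers, fields are stored in the machine's byte order, and the caller is assumed to have checked that the record fits in the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct StudentRecord {
    std::int32_t id;
    std::string name;
    float gpa;
    std::vector<std::string> courses;  // at most 255, to fit the 1-byte count
};

// Packs rec starting at buf[at]; returns the record size in bytes.
std::size_t pack_record(const StudentRecord& rec, unsigned char* buf, std::size_t at) {
    unsigned char* p = buf + at;
    const std::uint16_t namelen = static_cast<std::uint16_t>(rec.name.size());
    std::memcpy(p + 2, &namelen, 2);                        // name length
    p[4] = static_cast<unsigned char>(rec.courses.size());  // # of courses
    std::memcpy(p + 5, &rec.id, 4);                         // idno
    std::memcpy(p + 9, &rec.gpa, 4);                        // gpa
    std::memcpy(p + 13, rec.name.c_str(), rec.name.size() + 1);
    std::size_t next = 14 + rec.name.size();
    for (std::size_t i = 0; i < rec.courses.size(); ++i) {
        std::memcpy(p + next, rec.courses[i].c_str(), rec.courses[i].size() + 1);
        next += rec.courses[i].size() + 1;
    }
    const std::uint16_t recsize = static_cast<std::uint16_t>(next);
    std::memcpy(p, &recsize, 2);                            // record size, known last
    return next;
}

// Unpacks the record at buf[at]; returns its size so the caller can step on.
std::size_t unpack_record(const unsigned char* buf, std::size_t at, StudentRecord& rec) {
    const unsigned char* p = buf + at;
    std::uint16_t recsize = 0;
    std::memcpy(&recsize, p, 2);
    const unsigned n_courses = p[4];
    std::memcpy(&rec.id, p + 5, 4);
    std::memcpy(&rec.gpa, p + 9, 4);
    const char* text = reinterpret_cast<const char*>(p);
    rec.name = text + 13;                    // copies up to the terminating null
    std::size_t next = 14 + rec.name.size();
    rec.courses.clear();
    for (unsigned i = 0; i < n_courses; ++i) {
        rec.courses.push_back(text + next);
        next += rec.courses.back().size() + 1;
    }
    return recsize;
}
```

Because each call returns the record size, walking through a block is just a matter of adding the result to the current buffer offset, which is the role nextrec plays in the notes.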