461 - Intro to File Formats

CPSC 461: Copyright (C) 2003 Katrin Becker Last Modified May 26, 2003 10:42 PM
FILE FORMATS
Introduction

- talked about different ways to organize data within a file
- the underlying assumption has usually been that we are dealing with records of some sort, often text. It is the fundamental framework for almost all manipulation of stuff in files.
What is Data?
There are in fact many forms of data - different types have different requirements, e.g.
- data collected from simulations
- data collected from our own data compression experiments (should have had 1000's of files)
- astronomical studies
- X-ray; MRI data
- geological surveys
- sensor readings
- population studies

There is a tremendous amount of data that exists in text form; there is also a tremendous amount of data that is impractical to store this way.

We are now able to gather data and do analyses that were impossible only 5-10 years ago.
We now have the memory capacity and compute power to deal with many kinds of media - and not just for storage and play-back. We no longer wish to merely play tunes; we want to analyse them. We no longer just generate pictures for animations: we simulate fur and individual snowflakes.

Some "Fundamental Truths"...
TRUTH: Ultimately, EVERYTHING - all data (including text) is stored as numbers; more precisely, BIT PATTERNS
TRUTH: A single bit pattern can be interpreted many different ways:

eg. 01011101 can be:

an unsigned binary number
a 2's complement value
floating point (IEEE)
ASCII or Unicode
a pixel
a bitmap of spacing
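
As a minimal sketch of this point, the same 8 bits can be read several ways in Python. The function name interpret and the grey-level reading are illustrative assumptions, not part of the notes; the second pattern (11011101) is added only because its high bit is set, so the unsigned and 2's complement readings differ.

```python
import struct

def interpret(bits):
    """Return several readings of one 8-bit pattern given as a string of 0s and 1s."""
    raw = bytes([int(bits, 2)])
    return {
        "unsigned": raw[0],                              # plain binary number
        "twos_complement": struct.unpack("b", raw)[0],   # signed 8-bit value
        "ascii": raw.decode("ascii", errors="replace"),  # character code
        "grey_level": raw[0] / 255,                      # hypothetical 8-bit grey-scale pixel
        "flags": [b == "1" for b in bits],               # eight independent on/off bits
    }

print(interpret("01011101"))  # high bit clear: unsigned and signed readings agree
print(interpret("11011101"))  # high bit set: the two readings differ
```

Nothing in the byte itself says which reading is the right one; that has to come from the program or the file format.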
TRUTH: On the other hand, a single real-life value can be represented many different ways (maybe depending on how we intend to use it):

eg. PI could be:

the 'π' symbol
"3.14159265" as text
332E3134313539323635h [hex of ASCII]
40490FDBh [hex of single FP]
3 [integer]
011 [binary int]
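
The hex strings in the PI list can be reproduced with a short sketch. It assumes big-endian byte order for the single-precision value, which is how the hex is written above; the variable names are illustrative.

```python
import math
import struct

text = "3.14159265"

# The text form: one ASCII byte per character, shown as hex.
ascii_hex = text.encode("ascii").hex().upper()

# The single-precision IEEE form: 4 bytes, big-endian, shown as hex.
float32_hex = struct.pack(">f", math.pi).hex().upper()

# The integer forms: truncate to 3, then write it as a 3-bit binary string.
as_int = int(math.pi)
as_binary = format(as_int, "03b")

print(ascii_hex, len(text), "bytes")
print(float32_hex, 4, "bytes")
print(as_int, as_binary)
```

The text form takes 10 bytes and the single-precision form takes 4; which one is "better" depends on how we intend to use the value.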
TRUTH: All programs work on some form of data
....so....
BEGIN: We MUST find a way to transform real-life objects into numerical representations. If we do this badly, we can go no further.
END: We must find a way to transform our results into a displayable form that we can interpret.

Let's say we are interested in the genetics of purebred dogs and we want to find a way to predict whether or not a particular puppy will develop hip dysplasia. Possible inputs (information which must be transformed into "data"):
- skeletal measurements (numbers?)
- X-ray results (pictures?)
- weight & growth rates (numbers?)
- movement data (measurements? pictures? video?)
all for the individual as well as relatives and ancestors.
We need to represent these in a way they can be compared against each other. How can we compare pictures against numbers?
Once we have begun to analyze this data, we'll probably end up with many new numbers that we hope will tell us various things.

Q: How to display them?

columns of numbers? tables? images?
Chances are, we'll want to keep these data for future analyses
TRUTH: Raw numbers are meaningless without some sort of context.

What are these?
180, 230, 6, 195, 210, 250, 215, 215, 215, 225, 190, 205, 210, 200, 195, 230, 215, 220
They are data values. We need something to relate them to (What are they?). If I say these are yields of pot plants, that helps, but it's still not good enough. We need some way to give these values location.
They are yields of various pot plants in grams.

Now we have a framework in which we can work. Notice, though, that one of the values is WAY out of whack. So we go back and check to discover that that measurement is in the wrong units: they measured in ounces, rather than grams. No worries, we'll fix it (taking an ounce as roughly 30 grams):

180, 230, 180, 195, 210, 250, 215, 215, 215, 225, 190, 205, 210, 200, 195, 230, 215, 220
We now have a 1-Dimensional dataset. We could start to draw conclusions from these values (never mind for the moment that we don't actually have enough data to say anything sensible - it doesn't stop most people). Given the context, we'd say nice things about the high values and nasty things about the low numbers. Here's our plot:

Data Locations are independent variables

Data Values are dependent variables

BUT!!!! Now we find out that these measurements were taken at various times (in other words, different values for "Days from Germination"). Once we add this data location, our dataset takes on a new dimension:

Days from Germination:
85 days	100 days	115 days
180	230	180
195	210	250
215	215	215
225	190	205
210	200	195
230	215	220

Now our picture looks like this:

Independent Variables:

- data locations
- values unique to each data value
- organize and locate the data
Dependent Variables:

- data values
- what we are interested in

The number of data locations associated with each data value gives us the dimensionality of the dataset.
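
One way to picture this, as a rough sketch only: store each data value under a tuple of its data locations, and read the dimensionality off the length of that tuple. The row index used as the first location and the helper name dimensionality are assumptions made for illustration; the values and the days are the ones from the table above.

```python
values = [180, 230, 180, 195, 210, 250, 215, 215, 215,
          225, 190, 205, 210, 200, 195, 230, 215, 220]
days = [85, 100, 115]  # days from germination, one per table column

# 1-D: the only location is the position in the list.
one_d = {(i,): v for i, v in enumerate(values)}

# 2-D: each value is located by (row in the table, days from germination).
two_d = {(i // 3, days[i % 3]): v for i, v in enumerate(values)}

def dimensionality(dataset):
    """Number of data locations (independent variables) attached to each value."""
    return len(next(iter(dataset)))

print(dimensionality(one_d), dimensionality(two_d))
print(two_d[(1, 115)])  # second row, 115-day column
```

The data values are identical in both versions; only the number of locations attached to each one has changed.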

Let's say we add a variety:

- could make it a data value or data location
- if we make it a value, the dataset remains 2D
- if we make it a location, the dataset becomes 3D
- the format would likely be columnar
We could add information about the growing medium; amount of sunlight; ..... Each new attribute would give us a new dimension, and this in turn would mean we need more samples in each category before we could say anything useful about them. We need to ensure each measure is accurate, uses the same units and is measured under the same conditions. For example, if the amount of daily sunlight were measured in hours by some and minutes by others, we may have no useful values (especially if we can't tell which is which). Similarly, if one group measured sunlight as time from sun-up to sun-down, and another group measured sunlight as only when the sun shone directly on the plants we would end up with useless data as there is no way to compare those two different measures, or make any kind of sensible adjustments.
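
The hours-versus-minutes problem can be sketched as follows, assuming each reading was recorded together with its unit. The sample readings and the names MINUTES_PER_UNIT and to_minutes are hypothetical.

```python
# Conversion factors to a single common unit (minutes).
MINUTES_PER_UNIT = {"minutes": 1, "hours": 60}

def to_minutes(value, unit):
    """Convert a sunlight reading to minutes; refuse readings whose unit is unknown."""
    if unit not in MINUTES_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * MINUTES_PER_UNIT[unit]

# Hypothetical readings from different groups, each tagged with its unit.
readings = [(8, "hours"), (450, "minutes"), (7.5, "hours")]
normalized = [to_minutes(value, unit) for value, unit in readings]
print(normalized)
```

This only works because the unit travels with the value. A bare 8 or 450 could not be repaired this way, and no conversion table fixes the second problem (two different definitions of "sunlight").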
Once we've gathered all our data, we will need to decide how to store it. It might be useful to include some information in the final data file that describes how the data are grouped, what measurements and units were used, etc. This is the kind of thing that goes into the header (meta-data). We aren't really restricted in terms of how our data are organized - as long as we have a way to describe that organization.
Metafiles:
- usually mixed text and binary
- all contain headers at least some of which is text
- header describes data
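
A minimal sketch of such a file, assuming a made-up layout: a few key=value text lines, an END marker, then the values as 16-bit little-endian unsigned integers. The layout and the names write_dataset and read_dataset are illustrative only and do not correspond to any real metafile standard.

```python
import io
import struct

def write_dataset(stream, values, units, description):
    """Write a text header describing the data, then the data itself in binary."""
    header = f"description={description}\nunits={units}\ncount={len(values)}\nEND\n"
    stream.write(header.encode("ascii"))
    stream.write(struct.pack(f"<{len(values)}H", *values))

def read_dataset(stream):
    """Read header lines up to END, then use the header to decode the binary part."""
    meta = {}
    while True:
        line = stream.readline().decode("ascii").rstrip("\n")
        if line == "END":
            break
        key, _, value = line.partition("=")
        meta[key] = value
    count = int(meta["count"])
    values = list(struct.unpack(f"<{count}H", stream.read(2 * count)))
    return meta, values

buffer = io.BytesIO()  # stands in for a real file opened in binary mode
write_dataset(buffer, [180, 230, 180, 195], "grams", "plant yields")
buffer.seek(0)
meta, values = read_dataset(buffer)
print(meta, values)
```

The reader needs nothing beyond the file itself: the header says how many values follow and what they mean.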
Other Organizational Formats:
- matrix data
- 3D matrix data
- multi-dimensional matrix data
- polygonal
- structured grids

There are many Standards....
Take a typical (?) evening...
- you make supper, watch TV, talk on the phone, read your email, drive to the store and buy something
How many standards do we encounter?
AC power; electrical outlets; light bulb sockets; TV signals; telephone plugs; networks; communication; gas; car batteries; gauges; UPC codes.......
Guess what is the most widely accepted general purpose data format? ASCII

Advantages:

can see the data
can easily add comments
precision is arbitrary (use as many digits as you need)

Disadvantages:

inefficient
slow to read
random access is awkward
= problem when dataset is very large
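
A rough sketch of the space and random-access trade-off, assuming 1000 made-up values written once as text (one number per line, full precision) and once as 8-byte binary doubles:

```python
import math
import struct

values = [math.pi * i for i in range(1, 1001)]  # made-up sample data

# ASCII form: one number per line, as many digits as needed to round-trip.
text = "".join(f"{v!r}\n" for v in values).encode("ascii")

# Binary form: every value takes exactly 8 bytes.
binary = struct.pack(f"<{len(values)}d", *values)

k = 499  # fetch the 500th value from each form

# Binary: fixed-size records, so the value sits at a computable byte offset.
(from_binary,) = struct.unpack_from("<d", binary, 8 * k)

# Text: lines vary in length, so we must scan past the first k lines.
from_text = float(text.splitlines()[k].decode("ascii"))

print(len(text), len(binary))
print(from_binary == from_text)
```

With short values (say "180") the text form can actually be smaller than the binary one; the cost shows up with high-precision values and with the scan needed to find a given record.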
NOW: What IS Data?

Does it have to do with the bit patterns?
Does it have to do with the 'values' represented?

Is an executable file a data file?
Is a Java Byte-Code file a data file? (Hint: Who uses it?)
What is data to a compiler? RTS? JVM? O/S?
Main Data File Format Types:

Documents (text; word-processor; LaTeX; PostScript; etc.)
Scientific
Graphics (raster; vector)

Audio (sound & voice)

Animation (moving pictures)
Multimedia (video; combined)
Electronic Data Interchange (EDI : formats for e-commerce)
Low-Level Data Formats:

2's Complement Binary
Floating Point (there are several standards)
ASCII
ISO 16-bit Unicode
Evaluating file formats:
- intended purpose (scientific, display....)
- general / specific
- what is the data organization? flat? sequential? hierarchical? relational? object-oriented?
- ASCII / binary / mixed
- hardware independent?
- efficiency re: space / reading
- self-describing?
- support annotations?
- extensible?
- public domain? proprietary? supported?
A word about sizes:

What did we say was the problem with text? It takes up (wastes) too much space. BUT......
The average book has about a million characters in it; that's about 1 MegaByte.

Compare against:
IMAGES:

1024 X 768 pixels = 786,432 pixels
786,432 pixels X 4 bytes per pixel = 3,145,728 bytes PER picture.
SOUND:

16-bit; 44,100 Hz sample rate; mono (.wav) = 88,200 bytes (~88K) / sec. of recorded sound.

VIDEO:

Let's say 30 frames / second (= 30 pictures) AND sound =
30 X 3,145,728 bytes PER picture + 88,200 =
94,371,840 + 88,200 =
94,460,040 bytes = 92,246K (/ 1024)
= 90 Meg! PER SECOND!
= 633 Gig for 2 hr. movie
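
The size arithmetic in this section can be checked with a few lines. This is a sketch under the same assumptions as the notes: uncompressed frames at 4 bytes per pixel, 16-bit sound at 44,100 Hz (taken as mono), 30 frames per second, and units in powers of 1024.

```python
width, height, bytes_per_pixel = 1024, 768, 4
frame_bytes = width * height * bytes_per_pixel   # one uncompressed picture

audio_bytes_per_sec = 44_100 * 2                 # 16-bit samples = 2 bytes each, mono

fps = 30
video_bytes_per_sec = fps * frame_bytes + audio_bytes_per_sec
movie_bytes = video_bytes_per_sec * 2 * 60 * 60  # 2 hours = 7200 seconds

book_bytes = 1_000_000                           # "average book" from the notes

print(frame_bytes)                                # bytes per picture
print(video_bytes_per_sec / 1024 ** 2)            # Meg per second of video
print(movie_bytes / 1024 ** 3)                    # Gig per 2-hour movie
print(video_bytes_per_sec / book_bytes)           # books' worth of text per second
```

By this count, one second of uncompressed video takes about as much space as 94 average books.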


CPSC 461: Copyright (C) 2003 Katrin Becker 1998-2000 Last Modified May 26, 2003 10:42 PM