- CPSC 461: Copyright (C) 2003 Katrin Becker Last Modified May 26, 2003 10:42 PM
- FILE FORMATS
Introduction
-
- - talked about different ways to organize data within a file
- - the underlying assumption has usually been that we are dealing with records of some sort, often text. This is the fundamental framework for almost all manipulation of data in files.
- What is Data?
- There are in fact many forms of data
- - different types have different requirements
- eg. data collected from simulations
- - data collected from our own data compression experiments (should have had 1000's of files)
- - astronomical studies
- - X-ray; MRI data
- - geological surveys
- - sensor readings
- - population studies
-
- There is a tremendous amount of data that exists in text form; there is also a tremendous amount of data that is impractical to store this way.
-
- We are now able to gather data and do analyses that were impossible only 5-10 years ago.
- We now have the memory capacity and compute power to deal with many kinds of media - and not just for storage and play-back.
We no longer wish to merely play tunes; we want to analyse them. We no longer just generate pictures for animations: we simulate fur and individual snowflakes.
-
- Some "Fundamental Truths"...
- TRUTH: Ultimately, EVERYTHING - all data (including text) is stored as numbers; more precisely, BIT PATTERNS
-
- TRUTH: A single bit pattern can be interpreted many different ways:
- eg. 01011101 can be:
- an unsigned binary number
- a 2's complement value
- floating point (IEEE)
- ASCII or Unicode
- a pixel
- a bitmap of spacing
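- A minimal Python sketch of this point, using the bit pattern from the example. The second pattern (11011101) is an assumption added here, not from the notes, to show where the unsigned and 2's complement readings part ways.

```python
import struct

# The bit pattern from the notes, held as one raw byte.
raw = int("01011101", 2).to_bytes(1, "big")

unsigned = struct.unpack("B", raw)[0]   # read as an unsigned 8-bit number
signed = struct.unpack("b", raw)[0]     # read as an 8-bit 2's complement value
char = raw.decode("ascii")              # read as an ASCII character

# Hypothetical second pattern with the top bit set: same bits, two answers.
raw_hi = int("11011101", 2).to_bytes(1, "big")
unsigned_hi = struct.unpack("B", raw_hi)[0]
signed_hi = struct.unpack("b", raw_hi)[0]

print(unsigned, signed, repr(char))   # 93 93 ']'
print(unsigned_hi, signed_hi)         # 221 -35
```

- Nothing in the byte itself says which reading is intended; the program (or the file format) has to supply that.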
-
- TRUTH: On the other hand, a single real-life value can be represented many different ways (maybe depending on how we intend to use it):
- eg. PI could be:
- the 'π' symbol
- "3.14159265" as text
- 332E3134313539323635h [hex of ASCII]
- 40490FDBh [hex of single FP]
- 3 [integer]
- 011 [binary int]
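- The hex values in this list can be reproduced with a short Python sketch. The variable names are illustrative only; the big-endian byte order (">f") is an assumption chosen so the bytes print in the same order as written above.

```python
import math
import struct

# "3.14159265" stored as text: one ASCII byte per character.
ascii_hex = "3.14159265".encode("ascii").hex().upper()

# PI stored as an IEEE single-precision float (4 bytes, big-endian).
single_hex = struct.pack(">f", math.pi).hex().upper()

# PI truncated to an integer, and that integer written in binary.
as_int = int(math.pi)
as_bin = format(as_int, "03b")

# Going the other way: the 4 float bytes only approximate PI.
(from_single,) = struct.unpack(">f", bytes.fromhex(single_hex))

print(ascii_hex)    # 332E3134313539323635
print(single_hex)   # 40490FDB
print(as_int, as_bin)
print(from_single)
```

- Note the cost of each choice: the text form takes 10 bytes, the single takes 4, and the integer has lost everything after the decimal point.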
-
- TRUTH: All programs work on some form of data
....so....
- BEGIN: We MUST find a way to transform real-life objects into numerical representations. If we do this badly, we can go no further.
- END: We must find a way to transform our results into a displayable form that we can interpret.
- Let's say we are interested in the genetics of purebred dogs and we want to find a way to predict whether or not a particular puppy will develop hip dysplasia.
- Possible inputs (information which must be transformed into "data"):
- - skeletal measurements (numbers?)
- - X-ray results (pictures?)
- - weight & growth rates (numbers?)
- - movement data (measurements?, pictures?, video?)
- all for the individual as well as relatives and ancestors
-
- We need to represent these in a way they can be compared against each other. How can we compare pictures against numbers?
- Once we have begun to analyze this data, we'll probably end up with many new numbers that we hope will tell us various things.
-
- Q: How to display them?
- columns of numbers? tables? images?
- Chances are, we'll want to keep these data for future analyses
-
- TRUTH: Raw numbers are meaningless without some sort of context.
-
- What are these?
-
- 180, 230, 6, 195, 210, 250, 215, 215, 215, 225, 190, 205, 210, 200, 195, 230, 215, 220
- They are data values. We need something to relate them to (What are they?). If I say these are yields of pot plants, that helps, but it's still not good enough. We need some way to give these values location.
- They are yields of various pot plants in grams.
-
- Now we have a framework in which we can work. Notice, though, that one of the values is WAY out of whack. So we go back and check to discover that that measurement is in the wrong units: they measured in ounces, rather than grams. No worries, we'll fix it:
-
- 180, 230, 180, 195, 210, 250, 215, 215, 215, 225, 190, 205, 210, 200, 195, 230, 215, 220
- We now have a 1-Dimensional dataset. We could start to draw conclusions from these values (never mind for the moment that we don't actually have enough data to say anything sensible - it doesn't stop most people). Given the context, we'd say nice things about the high values and nasty things about the low numbers. Here's our plot:
- Data Locations are independent variables
-
- Data Values are dependent variables
BUT!!!! Now we find out that these measurements were taken at various times (in other words, different values for "Days from Germination"). Once we add this data location, our dataset takes on a new dimension:
Days from Germination:

    85 days    100 days    115 days
    180        230         180
    195        210         250
    215        215         215
    225        190         205
    210        200         195
    230        215         220

Now our picture looks like this:
Independent Variables:
- - data locations
- - values unique to each data value
- - organize and locate the data
-
- Dependent Variables:
- data values
- what we are interested in
-
The # of data locations associated with each data value gives us the dimensionality of the dataset.
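- One way to picture this in Python is to key each data value by a tuple of its data locations; the length of the tuple is then the dimensionality. This is only a sketch: the sample index and row index are hypothetical locations introduced here, and it assumes the yields are listed row by row across the 85 / 100 / 115 day columns.

```python
yields_g = [180, 230, 180, 195, 210, 250, 215, 215, 215,
            225, 190, 205, 210, 200, 195, 230, 215, 220]
days = [85, 100, 115]   # Days from Germination

# 1-D: each value is located by a single (assumed) sample index.
one_d = {(i,): y for i, y in enumerate(yields_g)}

# 2-D: each value is located by (assumed row index, days from germination).
two_d = {(i // len(days), days[i % len(days)]): y
         for i, y in enumerate(yields_g)}


def dimensionality(dataset):
    """Number of data locations attached to each data value."""
    return len(next(iter(dataset)))


print(dimensionality(one_d), dimensionality(two_d))
print(two_d[(0, 100)])   # value in the first row, 100-day column
```

- The data values never changed between the two versions; only the number of locations used to find them did.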
-
- Let's say we add a variety:
- - could make it a data value or data location
- - if we make it a value, the dataset remains 2D
- - if we make it a location, the dataset becomes 3D
- - the format would likely be columnar
- We could add information about the growing medium, the amount of sunlight, and so on. Each new attribute would give us a new dimension, and this in turn would mean we need more samples in each category before we could say anything useful about them. We need to ensure each measure is accurate, uses the same units and is measured under the same conditions. For example, if the amount of daily sunlight were measured in hours by some and minutes by others, we may have no useful values (especially if we can't tell which is which). Similarly, if one group measured sunlight as time from sun-up to sun-down, and another group measured sunlight as only when the sun shone directly on the plants, we would end up with useless data, as there is no way to compare those two different measures or make any kind of sensible adjustments.
-
- Once we've gathered all our data, we will need to decide how to store it. It might be useful to include some information in the final data file that describes how the data are grouped, what measurements and units were used, etc. This is the kind of thing that goes into the header (meta-data). We aren't really restricted in terms of how our data are organized - as long as we have a way to describe that organization.
-
- Metafiles:
- - usually mixed text and binary
- all contain headers, at least some of which are text
- header describes data
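- A minimal sketch of the idea in Python: a text header that describes the data, followed by the data itself in binary. The layout (a 4-byte header length, a JSON header, then unsigned 16-bit values) and every field name are assumptions made up for this illustration; this is not any standard metafile format.

```python
import io
import json
import struct


def write_dataset(stream, values, meta):
    """Write a length-prefixed text header, then the values as 16-bit ints."""
    header = json.dumps(meta).encode("ascii")
    stream.write(struct.pack("<I", len(header)))          # header size in bytes
    stream.write(header)                                  # text: describes data
    stream.write(struct.pack(f"<{len(values)}H", *values))  # binary data


def read_dataset(stream):
    """Read the header first, then use it to interpret the binary part."""
    (header_len,) = struct.unpack("<I", stream.read(4))
    meta = json.loads(stream.read(header_len).decode("ascii"))
    count = meta["count"]
    values = list(struct.unpack(f"<{count}H", stream.read(2 * count)))
    return meta, values


yields_g = [180, 230, 180, 195, 210, 250]
meta = {"quantity": "yield", "units": "grams", "count": len(yields_g)}

buf = io.BytesIO()            # stands in for a file on disk
write_dataset(buf, yields_g, meta)
buf.seek(0)
meta_back, values_back = read_dataset(buf)
print(meta_back, values_back)
```

- Without the header, the binary part is just 12 bytes with no units, no count and no meaning.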
-
- Other Organizational Formats:
- matrix data
- 3D matrix data
- multi-dimensional matrix data
- polygonal
- structured grids
- There are many Standards....
- Take a typical (?) evening...
- - you make supper, watch TV, talk on the phone, read your email, drive to the store and buy something
- How many standards do we encounter?
- AC power
- electrical outlets
- light bulb sockets
- TV signals
- telephone plugs
- networks, communication
- gas
- car batteries
- gauges
- UPC codes.......
-
- Guess what is the most widely accepted general purpose data format? ASCII
- Advantages:
- can see the data
- can easily add comments
- precision is arbitrary (use as many digits as you need)
- Disadvantages:
- inefficient
- slow to read
- random access is awkward
- = problem when dataset is very large
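- A small Python sketch of the trade-off, using a few of the yield values from earlier. Storing them as unsigned 16-bit integers is an assumption made for the comparison; other binary layouts would give different sizes.

```python
import struct

values = [180, 230, 180, 195, 210, 250]

# ASCII: readable, but every digit and separator costs a byte.
text_bytes = ", ".join(str(v) for v in values).encode("ascii")

# Binary: fixed 2 bytes per value (assumed unsigned 16-bit, little-endian).
binary_bytes = struct.pack(f"<{len(values)}H", *values)

index = 3  # we want the 4th value

# Binary random access: the position is simply index * record size.
(fourth_binary,) = struct.unpack_from("<H", binary_bytes, 2 * index)

# ASCII random access: fields vary in width, so we must scan and parse.
fourth_text = int(text_bytes.decode("ascii").split(",")[index])

print(len(text_bytes), len(binary_bytes))
print(fourth_binary, fourth_text)
```

- With six values the difference is trivial; with millions of values the scanning and parsing is where the time goes.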
- NOW: What IS Data?
- Does it have to do with the bit patterns?
- Does it have to do with the 'values' represented?
- Is an executable file a data file?
Is a Java Byte-Code file a data file? (Hint: Who uses it?)
- What is data to a compiler? RTS? JVM? O/S?
-
- Main Data File Format Types:
- Documents (text; word-processor; LaTeX; PostScript; etc.)
- Scientific
- Graphics (raster; vector)
- Audio (sound & voice)
- Animation (moving pictures)
- Multimedia (video; combined)
- Electronic Data Interchange (EDI: formats for e-commerce)
- Low-Level Data Formats:
- 2's Complement Binary
- Floating Point (there are several standards)
- ASCII
- ISO 16-bit Unicode
-
- Evaluating file formats:
- - intended purpose (scientific, display....)
- - general / specific
- - what is the data organization? flat? sequential? hierarchical? relational? object-oriented?
- - ASCII / binary / mixed
- - hardware independent?
- - efficiency re: space / reading
- - self-describing?
- - support annotations?
- - extensible?
- - public domain? proprietary? supported?
A word about sizes:
-
- What did we say was the problem with text? It takes up (wastes) too much space. BUT......
- The average book has about a million characters in it; that's about 1 MegaByte.
-
- Compare against:
- IMAGES:
- 1024 X 768 pixels = 786,432 pixels
- 786,432 pixels X 4 bytes per pixel = 3,145,728 bytes PER picture.
-
- SOUND:
- 16-bit; 44,100 Hz sample rate (.wav) = ~88K / sec. of recorded sound.
-
- VIDEO:
- Let's say 30 frames / second (= 30 pictures) AND sound =
- 30 X 3,145,728 bytes PER picture + 88,200 =
- 94,371,840 + 88,200 = 94,460,040 bytes
- = 92,246K (K = 1024 bytes)
- = about 90 Meg! PER SECOND!
- = about 633 Gig for a 2 hr. movie (7,200 seconds)
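- The size arithmetic can be rechecked with a few lines of Python. This sketch assumes uncompressed frames and one channel of 16-bit sound (which is what the ~88K / sec. figure implies); the variable names are illustrative only.

```python
# IMAGES: one uncompressed frame.
width, height = 1024, 768
bytes_per_pixel = 4
frame_bytes = width * height * bytes_per_pixel      # 3,145,728

# SOUND: 16-bit samples at 44,100 Hz, one channel (assumed).
sample_rate = 44_100
bytes_per_sample = 2
audio_bytes_per_sec = sample_rate * bytes_per_sample  # 88,200

# VIDEO: 30 frames plus one second of sound, every second.
fps = 30
video_bytes_per_sec = fps * frame_bytes + audio_bytes_per_sec

# A 2 hour movie is 2 * 60 * 60 seconds.
movie_bytes = video_bytes_per_sec * 2 * 60 * 60

print(f"{video_bytes_per_sec:,} bytes per second")
print(f"{video_bytes_per_sec / 1024**2:.1f} Meg per second")
print(f"{movie_bytes / 1024**3:.1f} Gig for 2 hours")
```

- Compare any of these against the ~1 Meg for the whole book of text.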