CPSC 461: Copyright (C) 2003 Katrin Becker Last Modified May 26, 2003 10:42 PM
FILE FORMATS
Audio

Main Data File Format Types:
Documents (text; word-processor; LaTeX; PostScript; etc.)
Electronic Data Interchange (EDI : formats for e-commerce)
Scientific
Graphics (raster; vector)
Audio (sound & voice)
Animation (moving pictures)
Multimedia (video; combined)
Storage & retrieval of audio data goes back to????
(A: 1877 when Edison invented the phonograph)
90 years later (1967) we have the first digital audio device
CDs have been around for ~20 years; until then, the vast majority of sound was still recorded in analog form
Analog (continuous) -----> Digital (discrete)
Why?
The end goal is for us to end up with digitized sound in a file.
Digital Audio represents sound waves

What is sound?
A physical disturbance in a medium. It propagates in the medium as a pressure wave that moves atoms or molecules. Note the "medium" doesn't need to be air - whales hear just fine under water (us too).
Sound is a wave: changing its frequency changes what we hear (the pitch). It is a longitudinal wave => the disturbance is along the direction the wave travels. Electromagnetic and ocean waves are transverse => the undulations are perpendicular to the direction of the wave.

Waves (ALL WAVES) have 3 attributes: speed; amplitude; period (note: frequency is not an independent fourth attribute - it is the number of periods per unit time).

The speed of a wave depends on the medium and the temperature.
In the air,
at sea-level,
at 20 degrees C, the speed of sound is 343.8 m/s (1128 ft/sec.)

Infrasonic: 0 - 20 Hz
Human (audible frequencies): 20 Hz - 20,000 Hz
Full range: 16 Hz - 22 kHz
Ultrasonic: 20 kHz - 1 GHz
Hypersonic: 1 GHz - 10 THz
Length of one period in air (speed / frequency, using 343.8 m/s):
20 Hz: 1 period ~ 17.19 m
22 kHz: 1 period ~ 1.56 cm
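The period lengths quoted for these frequency ranges come from dividing the speed of sound by the frequency. A minimal sketch, assuming the 343.8 m/s figure given in these notes for air at 20 degrees C (the function and constant names are made up for illustration):

```python
# Speed of sound in air at sea level and 20 degrees C, as quoted in these notes.
SPEED_OF_SOUND_M_PER_S = 343.8

def wavelength_m(frequency_hz, speed=SPEED_OF_SOUND_M_PER_S):
    """Length of one period (wavelength) in metres: speed / frequency."""
    return speed / frequency_hz

if __name__ == "__main__":
    print(round(wavelength_m(20), 2))           # low end of human hearing, in metres
    print(round(wavelength_m(22000) * 100, 2))  # high end, in centimetres
```

A different medium or temperature only changes the speed argument; the division stays the same.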

What is amplitude? (Loudness) = displacement of air pressure from the mean. We sense sound when the air molecules hit the diaphragm of the ear and apply pressure. The molecules move back and forth tiny distances: this is experienced as volume.
We are sensitive to a very wide range: If we say the quietest sound we can detect has a value of '1'; then the loudest sound would be a cannon blast at a value of 10^11. This is very inconvenient, so sound is displayed on a logarithmic scale: 0 - 11. Now the scale is too small, so it gets multiplied by 10 -> 0 - 110 (or by 20 -> 0 - 220, when the ratio being compared is sound pressure rather than sound power), and TA-DA... decibels dB
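The logarithmic scaling described here can be sketched in a few lines. This assumes the usual decibel definitions (10 * log10 for a ratio of powers, 20 * log10 for a ratio of pressures or amplitudes); the function names are illustrative only:

```python
import math

def power_ratio_to_db(ratio):
    """Decibels for a ratio of sound powers (intensities)."""
    return 10 * math.log10(ratio)

def amplitude_ratio_to_db(ratio):
    """Decibels for a ratio of sound pressures (amplitudes)."""
    return 20 * math.log10(ratio)

if __name__ == "__main__":
    print(power_ratio_to_db(1))       # quietest detectable sound, relative to itself
    print(power_ratio_to_db(1e11))    # the cannon blast from the notes
    print(amplitude_ratio_to_db(10))  # ten times the pressure
```

The factor of 20 appears because power goes with the square of the amplitude, and squaring doubles the logarithm.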

What are acoustics? The science of sound.

A sound wave:

A sine wave:

Which is the period? Amplitude? Frequency?
Frequency is: reciprocal (1/x) of the period: # periods / second = hertz (cycles per second)

What is a modem (modulator / demodulator)? What purpose does it serve? Converts analog to digital and vice-versa.

Continuous Sound - Digital Audio

- first step involves converting sound waves to electrical impulses (microphones do that for us)

- next step involves converting these continuous (analog) electrical impulses into discrete signals (done by sampling the impulses at specific moments in time)
A Wave, Digitized:

When a sound is digitized, some information is ALWAYS lost.

In order to recognize a particular wave the sample rate must be at least 2X the frequency. [Why? - Nyquist Sampling Theorem : for lossless digitization, the sample rate must be at least twice the maximum frequency.]
Samples (snapshots; measurements) are taken at regular intervals. Sample rate is measured in Hertz (samples/second). This means if the highest frequency is 22,000 Hertz, we will need to sample at 44,000 samples/sec. Why did we choose 44,100?
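One way to see why the sample rate matters: a tone above half the sample rate does not disappear, it shows up as a lower "alias" frequency. The sketch below assumes an ideal sampler with no filtering in front of it, and the helper names are hypothetical:

```python
def can_capture(signal_hz, sample_rate_hz):
    """True if the sample rate is at least twice the signal frequency."""
    return sample_rate_hz >= 2 * signal_hz

def apparent_frequency(signal_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after ideal sampling.

    Tones above half the sample rate fold back to a lower frequency.
    """
    nearest_multiple = round(signal_hz / sample_rate_hz) * sample_rate_hz
    return abs(signal_hz - nearest_multiple)

if __name__ == "__main__":
    print(can_capture(22000, 44100), apparent_frequency(22000, 44100))
    print(can_capture(30000, 44100), apparent_frequency(30000, 44100))
```

With a 44,100 samples/sec rate, a 30,000 Hz tone would be indistinguishable from a 14,100 Hz tone, which is why frequencies above half the sample rate are normally filtered out before sampling.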

This means we will end up with 44,100 readings (samples) per second : in other words - 44,100 NUMBERS for each second of sound. If we have stereo (2 channels), we must double this (88,200). Then we must decide how many bits to allow for each sample. What are the implications of choosing an 8-bit sample depth? How many distinct samples can we represent? Must they be continuous (in a row)?
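To make the bit-depth question concrete, here is a rough sketch of what choosing a sample size means. It assumes samples arrive as floating-point values between -1.0 and 1.0 and uses plain uniform steps; real converters differ in detail, and the function names are invented for this example:

```python
def quantization_levels(bits):
    """Number of distinct values a sample of this many bits can hold."""
    return 2 ** bits

def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to an integer level 0 .. 2**bits - 1."""
    levels = quantization_levels(bits)
    clipped = max(-1.0, min(1.0, sample))  # anything louder is clipped
    level = int((clipped + 1.0) / 2.0 * levels)
    return min(levels - 1, level)          # keep +1.0 inside the top level

if __name__ == "__main__":
    print(quantization_levels(8), quantization_levels(16))
    # Two slightly different inputs can land on the same 8-bit level:
    print(quantize(0.5000, 8), quantize(0.5010, 8))
    # ...but on different 16-bit levels:
    print(quantize(0.5000, 16), quantize(0.5010, 16))
```

The levels are evenly spaced, so more bits means smaller steps between neighbouring levels and less rounding error per sample.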

There are some different classifications of sound (gee, just like images - maybe classifying things is a common strategy).

Computer-Generated (synthesized): cartoon-like?
Natural Source:
music
voice
everything else
What might be an advantage of distinguishing between these?
Typical voice = 500 Hz - 2,000 Hz (compare to 20 - 20,000 Hz)
Humans are MOST sensitive at 600-6000 Hz (Coincidence? I think NOT.)
For voice encoding we need a smaller range of frequencies than for music. A typical telephone samples @ 8 kHz (8000 samples/second * 8 bits/sample = 64 Kbits/sec; Windows Media Player also defaults to 64 Kbits/sec).
Question: Why is it often hard to distinguish between "s" and "f" on the phone?
Two sides to speech:
Speech generation: need to be able to represent and manipulate in a particular way.
Speech Analysis: different constraints / problems:
verification: this is me
identification: who is this
recognition: what is said
understanding: how it is said
Look:
Graphics (image generation) is to Vision (image analysis) what Speech & sound generation is to sound analysis. In one we are interested in how the humans like it, in the other case we are interested in doing measurements & calculations to extract new information (?)
Normally, samples are stored with 8 or 16 bits per sample - in some sense, the # of bits relates to the "step" size between distinct amplitude values.


DAT [Digital Audio Tape] records the entire wave form. CD-ROM uses PCM [Pulse Code Modulation] or DPCM [Differential Pulse Code Modulation]; this is essentially "raw" data
- typically 44100 samples/sec (DAT is 48000)
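The difference between PCM and DPCM can be illustrated with a toy example: PCM stores each sample value directly, while DPCM stores the change from the previous sample. The sketch below is a simplified, lossless version (plain differences with no predictor and no re-quantizing of the differences, which real DPCM coders may add), and the function names are hypothetical:

```python
def dpcm_encode(samples):
    """Turn PCM sample values into differences from the previous sample."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def dpcm_decode(deltas):
    """Rebuild the PCM sample values by adding the differences back up."""
    previous = 0
    samples = []
    for delta in deltas:
        previous += delta
        samples.append(previous)
    return samples

if __name__ == "__main__":
    pcm = [100, 102, 105, 105, 101]
    deltas = dpcm_encode(pcm)
    print(deltas)
    print(dpcm_decode(deltas) == pcm)
```

Because neighbouring samples of a smooth wave tend to be close together, the differences are usually small numbers, which can be stored in fewer bits than the raw values.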

Q: Given the typical sample rate, what is the highest frequency we can record? How does that relate to normal human hearing (20-20,000Hz)?
What goes into Audio?
- sample rate (determines highest frequency)
- #bits / sample: usually 8 or 16 (determines # distinct amplitude levels we can represent)
- # channels: 1 = mono; 2 = stereo
- encoding?
10 seconds of sound at 11 kHz (11,025 samples/sec); mono; 8 bits = ~108 KB of data (110,250 bytes)
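The size of uncompressed audio follows directly from the ingredients listed above. A small sketch, assuming raw samples with no header or encoding overhead, and taking "11 kHz" to mean the common 11,025 samples/sec rate (function names are illustrative):

```python
def bit_rate(sample_rate_hz, channels, bits_per_sample):
    """Bits per second of raw, uncompressed audio."""
    return sample_rate_hz * channels * bits_per_sample

def audio_size_bytes(seconds, sample_rate_hz, channels, bits_per_sample):
    """Bytes needed to store raw audio of the given length."""
    return seconds * bit_rate(sample_rate_hz, channels, bits_per_sample) // 8

if __name__ == "__main__":
    size = audio_size_bytes(10, 11025, 1, 8)
    print(size, "bytes =", round(size / 1024), "KB")      # 10 s, mono, 8 bits
    print(bit_rate(8000, 1, 8), "bits/sec")               # telephone-style voice
    print(audio_size_bytes(60, 44100, 2, 16), "bytes")    # 1 minute, stereo, 16 bits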


MIDI:
- not actually an audio format - communication standard
- does not include sound "samples"
- more like an Object-Oriented file (it is to digital audio what vector graphics are to raster)
- stores descriptions ( like pitch, duration, volume, beginning / end note; instrument specification; etc. )
- allows encoding of about 10 octaves
- grouped into messages (each encodes a musical event)
- has same advantages / disadvantages as vector images do.
When the user presses a key on a MIDI device, it creates a MIDI message: beginning of note; stroke intensity; etc. This can be saved, or transmitted to another MIDI device. When the key is released, a second message marks the end of the note. The MIDI standard identifies 128 instruments. It has two components: hardware support (sound generators, microprocessor, keyboard, control panel, auxiliary controllers, memory); and a data format.
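To show how small a MIDI event is compared with sampled sound, here is a sketch of the two messages produced by pressing and releasing one key. It assumes the standard 3-byte Note On (status 0x90) and Note Off (status 0x80) channel messages; the helper names and the choice of note are only for illustration:

```python
def note_on(channel, note, velocity):
    """3-byte Note On message: key pressed, with stroke intensity (velocity)."""
    _check(channel, note, velocity)
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note, velocity=0):
    """3-byte Note Off message: key released."""
    _check(channel, note, velocity)
    return bytes([0x80 | channel, note, velocity])

def _check(channel, note, velocity):
    # Channels are 0-15; note numbers and velocities are 7-bit values (0-127).
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0-15")
    if not 0 <= note <= 127 or not 0 <= velocity <= 127:
        raise ValueError("note and velocity must be 0-127")

if __name__ == "__main__":
    middle_c = 60  # MIDI note number for middle C
    print(note_on(0, middle_c, 100).hex())
    print(note_off(0, middle_c).hex())
```

A whole note, however long it is held, costs six bytes here, whereas one second of CD-quality samples costs well over a hundred thousand. The 128 possible note numbers are also where the "about 10 octaves" figure comes from.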

Almost every machine has its own audio formats
- they are NOT interchangeable
UNIX: .au
Windows: .WAV
MP3: fairly machine independent
Hearing Threshold: the point at which we can detect sound : depends on environment & frequency.
There are 2 kinds of masking effects on sound:
1. Temporal
2. Frequency (auditory)
Some knowledge of how humans hear sound is essential to permit compression at high rates.
MP3 uses two compression techniques: one lossy and one not.
It uses a perceptual codec.
Sound samples get broken into frames < 1 sec.
The signal is analysed for spectral energy distribution
- how to distribute bits to best account for the audio to be encoded
- max # of bits that can be allocated to each frame
- uses left-over space in other frames when possible
- compare frequency spread for each frame
- compared to mathematical models of human psychoacoustics
- don't need sounds we can't hear
- outside hearing range
- hidden or masked sounds (similar; loud/quiet)
- then Huffman coding
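The last step in the list, Huffman coding, is the lossless half of the scheme: values that occur often get short bit strings and rare values get long ones. The sketch below is a generic textbook Huffman coder applied to a made-up list of quantized values; it is not the actual MP3 procedure, which as far as these notes go is only described in outline:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a dict mapping each distinct symbol to its Huffman bit string."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: code built so far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, low = heapq.heappop(heap)    # two least frequent groups
        n2, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    quantized = [0, 0, 0, 0, 0, 1, 1, -1, 2]  # hypothetical values for one frame
    codes = huffman_codes(quantized)
    total_bits = sum(len(codes[v]) for v in quantized)
    print(codes)
    print(total_bits, "bits instead of", 2 * len(quantized))
```

Nothing is thrown away at this stage: the original values can be recovered exactly from the bit strings, unlike the earlier steps that discard sounds we cannot hear.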
