Fundamentals of Digital Audio
THE COMPUTER MUSIC AND DIGITAL AUDIO SERIES
John Strawn, Founding Editor
James Zychowicz, Series Editor

Digital Audio Signal Processing, edited by John Strawn
Composers and the Computer, edited by Curtis Roads
Digital Audio Engineering, edited by John Strawn
Computer Applications in Music: A Bibliography, Deta S. Davis
The Compact Disc Handbook, Ken C. Pohlman
Computers and Musical Style, David Cope
MIDI: A Comprehensive Introduction, Joseph Rothstein (William Eldridge, Volume Editor)
Synthesizer Performance and Real-Time Techniques, Jeff Pressing (Chris Meyer, Volume Editor)
Music Processing, edited by Goffredo Haus
Computer Applications in Music: A Bibliography, Supplement I, Deta S. Davis (Garrett Bowles, Volume Editor)
General MIDI, Stanley Jungleib
Experiments in Musical Intelligence, David Cope
Knowledge-Based Programming for Music Research, John W. Schaffer and Deron McGee
Fundamentals of Digital Audio, Alan P. Kefauver
The Digital Audio Music List: A Critical Guide to Listening, Howard W. Ferstler
The Algorithmic Composer, David Cope
The Audio Recording Handbook, Alan P. Kefauver
Cooking with Csound Part I: Woodwind and Brass Recipes, Andrew Horner and Lydia Ayers
Hyperimprovisation: Computer-Interactive Sound Improvisation, Roger T. Dean
Introduction to Audio, Peter Utz
New Digital Musical Instruments: Control and Interaction Beyond the Keyboard, Eduardo R. Miranda and Marcelo M. Wanderley, with a Foreword by Ross Kirk
Fundamentals of Digital Audio, New Edition, Alan P. Kefauver and David Patschke
Volume 22
Kefauver, Alan P.
Fundamentals of digital audio / by Alan P. Kefauver and David Patschke. -- New ed.
p. cm. -- (Computer music and digital audio series)
ISBN 978-0-89579-611-0
1. Sound--Recording and reproducing--Digital techniques. I. Patschke, David. II. Title.
TK7881.4.K4323 2007
621.389'3--dc22
2007012264
A-R Editions, Inc., Middleton, Wisconsin 53562
© 2007 All rights reserved. Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
1. Music--Data processing. 2. Music--Computer programs.
Contents
List of Figures ix
Preface to the New Edition xiii
Chapter One The Basics 1
Sound and Vibration 1
The Decibel 9
The Analog Signal 16
Synchronization 19
Chapter Two Analog-to-Digital Conversion 23
Chapter Three 45
Chapter Four 57
Chapter Five 63
Chapter Six 75
8mm, 4mm, and Digital Linear Tape (DLT) Storage Systems 91
Fixed-Head Tape-Based Systems 92
Chapter Seven 101
Chapter Eight 127
Chapter Nine 147
Chapter Ten 167
List of Figures
Chapter 1
Figure 1.1 A sound source radiating into free space.
Figure 1.2 A musical scale.
Figure 1.3 The musical overtone series.
Figure 1.4 The envelope of an audio signal.
Figure 1.5 A professional sound level meter (courtesy B and K Corporation).
Figure 1.6 Typical sound pressure levels in decibels.
Figure 1.7 The inverse square law.
Figure 1.8 The Robinson-Dadson equal loudness contours.
Chapter 2
Figure 2.1 Waveform sampling and the Nyquist frequency.
Figure 2.2 Waveform sampling at a faster rate.
Figure 2.3 Waveform sampling and aliasing.
Figure 2.4 Voltage assignments to a wave amplitude.
Figure 2.5 Comparison of quantization numbering systems.
Figure 2.6 Offset binary and two's complement methods.
Figure 2.7 A waveform and different types of digital encoding.
Figure 2.8 Block diagram of a PCM analog-to-digital converter.
Figure 2.9 Filter schematics.
Figure 2.10 Various filter slopes for anti-aliasing.
Figure 2.11 The sample-and-hold process.
Figure 2.12 A multiplexer block schematic.
Figure 2.13 Interleaving.
Chapter 3
Figure 3.1 Block diagram of a digital-to-analog converter.
Figure 3.2 Reshaping the bit stream.
Figure 3.3 Three error correction possibilities.
Figure 3.4 An 8-bit weighted resistor network converter.
Figure 3.5 A dual-slope integrating converter.
Figure 3.6 The effects of hold time on the digital-to-analog process.
Figure 3.7 Reconstruction of the audio signal.
Chapter 4
Figure 4.1 The effect of oversampling digital-to-analog converters.
Figure 4.2 PCM and PWM (1-bit) conversion.
Chapter 5
Figure 5.1 Compression/decompression in the digital audio chain.
Figure 5.2 A block diagram of lossless encoding and decoding.
Figure 5.3 A block diagram of an MPEG-1 Layer I (MP1) encoder.
Figure 5.4 A block diagram of an MPEG-1 Layer III (MP3) encoder.
Figure 5.5 A block diagram of an AAC encoder.
Figure 5.6 A block diagram of an AC-3 encoder.
Figure 5.7 ATRAC codec overview.
Chapter 6
Figure 6.1 a. Perspective view of tape wrap around a video head drum; b. top view of tape wrap around a video head drum showing details.
Figure 6.2 a. Track layout on an analog 3/4" helical scan videotape recorder; b. PCM-1630 processor and associated DMR-4000 U-matic recorder (courtesy Sony Corp.).
Figure 6.3 a. Tape wrap on a DAT recorder's head drum; b. track layout on a DAT recorder.
Figure 6.4 Channel coding for record modulation.
Figure 6.5 Frequency allocation for a. an 8mm and b. a Hi8 video recorder.
Figure 6.6 A Hi8mm-based eight-channel multitrack recorder (courtesy Tascam) showing a. the main unit and b. the controller.
Figure 6.7 An S-VHS-based eight-channel multitrack recorder (courtesy Alesis).
Figure 6.8 Cross-fading between the read and write heads on a digital tape recorder.
Figure 6.9 Track layout on a DASH 48- and 24-track digital tape recorder.
Figure 6.10 Cross-fading between the two write heads on a DASH digital tape recorder.
Figure 6.11 a. Transport layout on a Sony PCM-3324A digital tape recorder; b. a 48-track multitrack digital tape recorder (photo courtesy Sony Corp.).
Figure 6.12 Cross-fading between the read and write heads on a ProDigi digital multitrack recorder.
Chapter 7
Figure 7.1 Pit spacing, length, and width for the compact disc. Note how pit length defines the repeated zeroes.
Figure 7.2 Light reflected directly back maintains the 0 series while the transition scatters the beam, denoting a change. Dust particles on the substrate are out of focus.
Figure 7.3 Compact disc specifications.
Figure 7.4 The compact disc pressing process.
Figure 7.5 Compact Disc Interactive (CD-i) audio formats.
Figure 7.6 a. The front panel of a CD-R machine; b. the rear of the same machine showing the RS-232 interface as well as the AES/EBU, S/PDIF, and analog inputs and outputs (courtesy Tascam).
Figure 7.7 Super CD.
Figure 7.8 The recording system for the MiniDisc.
Figure 7.9 The playback system for the MiniDisc.
Figure 7.10 A professional magneto-optical recorder (courtesy Sony Corp.).
Figure 7.11 MiniDisc specifications.
Figure 7.12 A computer hard disk showing sectors and tracks.
Figure 7.13 A computer hard disk stack showing the concept of cylinders.
Figure 7.14 Various flash-memory products, from left to right: Secure Digital, CompactFlash, and Memory Stick (courtesy SanDisk).
Chapter 8
Figure 8.1 The assemble edit process.
Figure 8.2 A professional digital editor for tape-based digital recorders (photo courtesy Sony Corp.).
Figure 8.3 a. A 1-Terabyte RAID Level-0 configuration diagram; b. a 500-Gb RAID Level-1 configuration diagram.
Figure 8.4 a. The front panel of an 8-channel A-to-D (and D-to-A) converter; b. the back panel of the same unit. Notice the FireWire connections (photos courtesy Tascam).
Figure 8.5 A compressor software plug-in for a computer-based digital audio workstation.
Figure 8.6 A software instrument plug-in (specifically, a Hammond B3 emulator/virtual instrument) for use with a MIDI controller on a computer-based digital audio workstation.
Figure 8.7 Diagram of a computer-based digital audio workstation.
Figure 8.8 A DAW processor card for a PCI slot in a personal computer (photo courtesy Avid/Digidesign).
Figure 8.9 A multi-channel A-to-D (and D-to-A) converter that connects directly to a processor card located in a computer-based DAW (photo courtesy Avid/Digidesign).
Figure 8.10 An example of a control surface to manually adjust DAW software (courtesy Tascam).
Figure 8.11 An example of a stand-alone DAW unit (photo courtesy Tascam).
Chapter 9
Figure 9.1 Editing screen from Digidesign Pro Tools software.
Figure 9.2 Editing screen from Apple Logic Pro software.
Figure 9.3 Editing screen from Steinberg Nuendo software.
Figure 9.4 Mixing window from Digidesign Pro Tools software.
Figure 9.5 Mixing window from Apple Logic Pro software.
Figure 9.6 a. A portion of 2-track (stereo) material selected in Digidesign Pro Tools software; b. that same portion being duplicated and dragged into a new stereo track.
Figure 9.7 An edit window in Digidesign Pro Tools software showing a. nominal resolution of a waveform, b. zooming in on a portion of the waveform, and c. maximum magnification of the waveform for precise selection.
Figure 9.8 The cross-fade and -3dB down point of a digital edit.
Figure 9.9 The crossfade editing process in Digidesign Pro Tools software: a. selecting the portion of material to crossfade, and determining the precise options for the fade; b. the resulting waveform after the fade.
Figure 9.10 A magnetic tape splice with a linear slope.
Figure 9.11 The crossfade editor in Apple Logic Pro software.
Figure 9.12 a. An example of an input matrix for Digidesign Pro Tools software; b. an output matrix example for the same.
Figure 9.13 A multitrack project opened in Digidesign Pro Tools software. Note that although there are 18 tracks (some of them stereo, even), in this example the hardware supports only up to 8 analog outputs and inputs.
Figure 9.14 An equalization software plug-in for a software-based DAW.
Figure 9.15 A reverb software plug-in for a software-based DAW.
Back when I was in school, and it wasn't so long ago, being in a recording studio was an immense (and immersive) experience. Humongous mixers filled whole rooms, with just enough space left for a couple of floor-to-ceiling racks full of graphic equalizers, reverb units, noise reduction, gates, and all manner of sound modification devices. Another wall was filled with large multitrack tape recorders. I'm remembering the good old days as I sit on my deck, typing this on my laptop computer that has built into it a humongous mixer; equalizers, reverb units, gates, and all manner of sound modification devices; and the capability to record or export my finished audio programs in any number of professional formats. Whether because of the demand for quality and portability or because of its leaner data requirements, audio was at the forefront of the digital entertainment revolution. Audio and music today exist almost exclusively in digital form for three reasons: reduced manufacturing costs, Moore's Law, and the Internet. The Internet you know about: a sprawling, worldwide network of computers through which increasing numbers of people get their information and entertainment and do increasing amounts of their business. Moore's Law goes something like this: computers double in speed for the same cost every 18 months. In addition, the price-per-byte of memory continues to fall similarly from year to year. I remember spending $200 for 256 kilobytes of memory for my first computer (an 8 MHz model). Today, that same $200 would buy about two gigabytes, almost an 8,000x increase. What all this means is that even off-the-shelf computers today are more than adequate to record, process, edit, and listen to multiple channels of digital audio. Once available only to the most elite recording institutions, high quality digital audio recording is now accessible to almost everyone.
Consequently, unlike years ago, when high-quality equipment and restricted distribution dictated what would be recorded and how it would be heard, today anyone can be an artist, engineer, and label. Some people do well with this accessibility and others do not. Learning the fundamentals won't give you insight for making subjective production decisions, but it will give you a solid basis for making them. Don't get too comfortable with your
new high tech equipment and skills though, because the one fact about this industry is that everything will continue to change, like it or not, and faster than you would prefer. That makes understanding the basic concepts and principles underlying your equipment all the more important. And that is what this book hopes to accomplish.
ONE
The Basics
Velocity
The speed of energy transfer equals the velocity of sound in the described medium. The velocity of sound in air at sea level at 70 degrees Fahrenheit is 1,130 feet per second (expressed metrically as 344 meters per second). The velocity of sound depends on the medium through which the sound travels. For example, sound travels
(Figure 1.1: a sound source radiating into free space, showing a. the sound source, b. compression, and c. rarefaction.)
through steel at a velocity of about 16,500 feet per second. Even in air, the velocity of sound depends on the density of the medium. For example, as air temperature increases, air density drops, causing sound to travel faster. In fact, the velocity of sound in air rises about 1 foot per second for every degree the temperature rises. The formula for the velocity of sound in air is: V = 49 × √(459.4 + °F)
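As a quick sanity check of that formula, a minimal sketch (the function name is ours, not from the text):

```python
import math

def speed_of_sound_fts(temp_f):
    """Velocity of sound in air, in ft/s, using the chapter's formula
    V = 49 * sqrt(459.4 + degrees Fahrenheit)."""
    return 49.0 * math.sqrt(459.4 + temp_f)

# At 70 degrees F this gives about 1,127 ft/s, matching the quoted
# 1,130 ft/s within rounding, and the speed rises roughly 1 ft/s
# for each additional degree.
v70 = speed_of_sound_fts(70.0)
```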
Note in Figure 1.1 that as sound moves farther from its source, its waves become less spherical and more planar (longitudinal).
Wavelength
The distance between successive compressions or rarefactions (i.e., the sound pressure level over and under the reference pressure) is the wavelength of the sound that is produced. The length of the sound wave is itself not that useful for our purposes. Wavelength = Velocity / Frequency
However, the frequency of the sound is important. Because we know the velocity of sound, it is easy to determine the frequency when the wavelength is known. The simple formula can be changed to read: Frequency = Velocity / Wavelength
Thus, if the distance from compression to compression is 2.58 feet, the frequency is 440 cycles per second, or 440 hertz (Hz). The period of the wave, or the time it takes for one complete wave to pass a given point, can be defined by the formula Period = 1 / Frequency
Therefore, the period of this wave is 1/440 of a second. As the period becomes shorter, the frequency becomes higher. This sound, with a frequency of 440Hz and a period of 1/440 of a second, is referred to in musical terms as A, that is, the A above middle C on the piano. Although sounds exist both higher and lower and range varies from person to person, we'll assume generally that the range of human hearing is approximately 20Hz to 20,000Hz (20kHz).
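The wavelength, frequency, and period relationships above can be sketched as (helper names are ours):

```python
VELOCITY_FT_S = 1130.0  # sound in air at sea level, 70 degrees F (from the text)

def frequency_hz(wavelength_ft):
    return VELOCITY_FT_S / wavelength_ft   # Frequency = Velocity / Wavelength

def period_s(freq_hz):
    return 1.0 / freq_hz                   # Period = 1 / Frequency

# A 2.58-foot wavelength works out to roughly 438 Hz, which the text
# rounds to 440 Hz (the A above middle C); its period is 1/440 second.
a440_period = period_s(440.0)
```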
The A on the top line of the bass clef staff is an octave below 440Hz and has a frequency of 220Hz. An octave relationship is a doubling or halving of frequency. Figure 1.2 shows a musical scale with the corresponding frequencies.
Harmonic Content
Very few waves are a pure tone (i.e., a tone with no additional harmonic content). In fact, only a sine wave is a true pure tone. When an object (e.g., a bell) is struck, several tones at different frequencies are produced. The fundamental resonance tone of the bell is heard first, followed by other frequencies at varying amplitudes. The next tone is a doubling of the fundamental frequency (an octave above it), followed by a tone at three times the fundamental, heard a musical interval of a fifth above that octave. Each of these harmonics changes amplitude slightly over time, adding to the individual characteristics of the sound. For example, a bell with a fundamental frequency of 64Hz produces harmonics of 128Hz and 192Hz; the latter is a G, a fifth above the second C. Many other harmonics at varying amplitudes are produced, depending on the metallic composition of the bell. These harmonics are arranged in a relationship called the overtone series, and the combination of these harmonics, or overtones, gives sound its specific timbre, or tone coloration. The difference in harmonic content makes an oboe sound different from a clarinet. Although an oboe and a clarinet produce the same note with the same fundamental frequency, the number of overtones and the amplitude of each differs. The overtone series is most often notated in the musical terminology of octaves, fifths, fourths, and so on, but actually corresponds to successive additions of the fundamental frequency. Therefore, an overtone series based on the note C is, in hertz, 65, 130, 195, 260, 325, 390, 455, 520, 585, 650, 715, and so on. In musical terms this is a series of the fundamental followed by an octave, a perfect fifth, a perfect fourth, a major third, a minor third, another minor third, three major seconds, and a minor
second (see Figure 1.3). The frequency that is twice the frequency of the fundamental is called the second harmonic even though it is the first overtone. Confusion often exists concerning this difference in terminology. The third harmonic is the first non-octave relationship above the fundamental (it is the fifth), and as such, any distortion in this harmonic tone is often detected before distortion is heard in the tone a fifth below (the second harmonic). Many analog audio products list the percentage of third harmonic distortion found (because it is the most audible) as part of their specifications.
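Because harmonics are simply integer multiples of the fundamental, the overtone series is easy to generate; a minimal sketch (function name ours):

```python
def overtone_series(fundamental_hz, count):
    """Harmonics are integer multiples of the fundamental: f, 2f, 3f, ...
    Note the naming offset: the nth harmonic is the (n-1)th overtone."""
    return [fundamental_hz * n for n in range(1, count + 1)]

# The series on C from the text: 65, 130, 195, 260, 325, ...
c_series = overtone_series(65, 12)

# The bell example: a 64 Hz fundamental yields 128 Hz and 192 Hz.
bell = overtone_series(64, 3)
```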
(Figure 1.3: the overtone series, showing the fundamental through the 4th harmonic.)
(Figure 1.4: the envelope of an audio signal, from the initial attack through to the final decay.)
Rearranging this formula gives: Frequency = 1 / T. This means that if an instrument has an attack time of 1 millisecond, the equivalent frequency is 1 kilohertz (kHz). This fact is important to remember when trying to emphasize the attack of an instrument. The prudent engineer remembers this when applying equalization, or tonal change, to a signal. The initial decay, which occurs immediately after the attack on most instruments, is caused by the cessation of the force that set the tube, string, or medium into vibration. It is the change in amplitude between the peak of the attack and the eventual leveling-off of the sound, known as the sustain. The length of the sustain
varies, depending on whether the note is held by the player for a specific period of time (e.g., when a trumpet player holds a whole note) or on how long the medium continues to vibrate (medium resonance) before beginning the final decay, which occurs when the sound is no longer produced by the player or by the resonance of the vibrating medium. As the trumpet player releases a held note, the air column in the instrument ceases to vibrate, and the amplitude decays exponentially until it is no longer audible. Final decays vary from as short as 250 milliseconds to as long as 100 seconds, depending on the vibrating medium. However, not all frequencies or harmonics decay at the same rate. Most often, the high-frequency components of the sound decay faster than the low-frequency ones. This causes a change in the sound's timbre and helps define the overall sound of the instrument (or other sound-producing device).
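The attack/sustain/final-decay shape described above can be sketched as a toy envelope function. All the constants here are illustrative choices of ours, not measured values from the text, and the initial-decay stage is omitted for brevity:

```python
import math

def envelope(t, attack=0.01, sustain_level=0.6, release_at=1.0, decay_tau=0.3):
    """Toy amplitude envelope: linear attack to a peak of 1.0, a settled
    sustain level, then an exponential final decay after the note ends."""
    if t < attack:
        return t / attack                     # initial attack
    if t < release_at:
        return sustain_level                  # sustain
    # final decay: exponential, as for the released trumpet note
    return sustain_level * math.exp(-(t - release_at) / decay_tau)
```

With a 10 ms attack, the envelope is halfway up at 5 ms, holds at the sustain level, and falls exponentially after release.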
Masking
Many references can be found in the literature about the equal loudness contours (discussed later in this chapter) developed by Fletcher and Munson and later updated by Robinson and Dadson. These curves relate our perception of loudness at varying frequencies and amplitudes and apply principally to single tones. How many times have you sat in a concert hall listening to a recital in which one instrumentalist drowned out another? Probably more than once. When you consider the fact that a piano, even at half-stick (i.e., with the lid only partially raised), can produce a level of around 6 watts, whereas a violinist, at best, produces 6 milliwatts, it is easy to understand why the violinist cannot be heard all the time. Simply stated, loud sounds mask soft ones. The masking sound needs to be only about 6 decibels higher than the sound we want to hear to mask the desired sound. It makes perfect sense that a loud instrument will cover a softer one if they are playing the same note or tone, but what if one instrument is playing a C and another an A? Studies by Zwicker and Feldtkeller have shown that even a narrow band of noise can mask a tone that is not included in the spectrum of the noise itself. For example, a low-frequency sine wave can easily mask, or apparently reduce the level of, a higher sinusoidal note that is being sounded at the same time. This masking occurs within the basilar structure of the ear itself. The louder tone causes a loss in sensitivity in neighboring sections of the basilar membrane. The greatest amount of masking occurs above the frequency of the masking signal rather than below it, and the greater the amplitude of the masking signal, the wider the frequency range masked. However, when a note is produced by an instrument that has a complicated, dense sound spectrum (i.e., a note with a rich harmonic structure), that sound will usually mask sounds that are less complicated (less dense). 
In fact, many newer buildings that use modular, or carrel, types of office space are equipped to send constant low-level random wide-band noise through loudspeakers in their ceilings.
This masking signal keeps the conversations in one office carrel from intruding into adjacent office space. The level of the masking signal is usually around 45 decibels, and this (plus the inverse square law, which is discussed later) effectively provides speech privacy among adjacent spaces. This effect is used also by some noise reduction systems when, as the signal rises above a set threshold, processing action is reduced or eliminated. In addition, the masking effect is a critical part of the data-reduction systems used in some of the digital audio storage systems discussed later in this book.
Localization
A person with one ear can perceive pitch, amplitude, envelope, and harmonic content but cannot determine the direction from which the sound originates. The ability to localize a sound source depends on using both ears, often referred to as binaural hearing. Several factors are involved in binaural hearing, depending on the frequency of the sound being perceived. The ears are separated by a distance of about 6 1/2 or 7 inches so that sound waves diffract around the head to reach both ears. When the wavelength of sound is long, the diffraction effect is minimal and the comparative amplitude at each ear is about the same. However, when the wavelength is short, the diffraction effect is greater and the sound is attenuated at the farther ear. Because the sound has to travel farther relative to wavelength, there is a perceptual time difference factor as well. You may have noticed that it is easier to locate high-frequency sounds than low-frequency ones. In fact, low-frequency signals often appear to be omnidirectional because of this effect. We can say that high frequencies (above 1kHz) are localized principally by amplitude and time differences between the two ears. How, then, do we localize low frequencies? With all sound there is a measurable time-of-arrival difference between the two ears when the sound is to one side or the other. As the sound moves to a point where it is equidistant from both ears, these time differences are minimized. With longer wavelengths the time-of-arrival differences are less noticeable because the ratio of the time difference to the period of the wave is small. However, this creates phase differences between the two ears, allowing the brain to compute the relative direction of the sound. Therefore, we can say that low frequencies are localized by intensity and phase differences.
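The size of the interaural time difference is easy to estimate from the figures in the text. This is a deliberately crude straight-line model (it ignores the diffraction path around the head that the text describes):

```python
EAR_SPACING_FT = 7.0 / 12.0   # about 7 inches, per the text
SPEED_FT_S = 1130.0           # sound in air, per the text

# Largest time-of-arrival difference, for a sound directly to one side:
max_itd_s = EAR_SPACING_FT / SPEED_FT_S       # about 0.5 milliseconds

# For a 220 Hz tone, that delay is only a small fraction of one period,
# so the brain perceives it as a phase difference rather than a delay.
fraction_of_cycle = max_itd_s * 220.0
```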
Suppose two musicians stand before you, player A farther away on your left and player B nearer on your right. If both play the same note at the same amplitude, you will localize the sound to the right because it is louder. This amplitude difference is due to the inverse square law, which states that for every doubling of distance there is a 6-decibel loss in amplitude in free space (i.e., where there are no nearby reflecting surfaces). Now suppose that both players sustain their notes. If player A (on the left) increases his amplitude by 6 decibels, the sound levels will balance, but your ear-brain combination will insist that player B (on the right) is closer to you than player A. Although the levels have been equalized, you perceive the nearer player to be closer. You would think that as long as the levels are identical at both ears, the players would appear to be equidistant from you. According to Haas, "Our hearing mechanism integrates the sound intensities over short time intervals similar, as it were, to a ballistic measuring instrument." This means that the ear-brain combination integrates the very short time differences between the two ears, causing the sound with the shortest timing differences to appear louder and therefore closer. Haas used two sound sources and, while delaying one, asked a subject to vary the loudness of the other source until it matched the sound level of the delayed sound. He found that where the delay was greater than 5 but less than 30 milliseconds, the amplitude of the delayed source had to be 10 decibels louder than the signal from the nondelayed source for the two sounds to be perceived as equal. Beyond a delay of 30 milliseconds a discrete echo was perceived, and prior to 5 milliseconds the level needed to be increased incrementally as the delay lengthened.
THE DECIBEL
Earlier in this chapter we discussed the basic characteristics of sound, but one that was conspicuously absent was amplitude. The unit of measure that is normally used to define the amplitude of sound levels is the decibel (dB). A Bel is a large, rather cumbersome unit of measure, so for convenience it is divided into 10 equal parts and prefaced with deci to signify the one-tenth relationship. Therefore, a decibel is one tenth of a Bel. A sound level meter measures sound in the environment. A large commercial aircraft taking off can easily exceed 130dB, whereas a quiet spot in the summer woods can be a tranquil 30dB. Most good sound level meters are capable of measuring in the range of 0dB to 140dB. You might think that the 0dB level is the total absence of sound, but it is not. Actually, 0dB is the lowest sound pressure level that an average listener with normal hearing can perceive. This is called the threshold of hearing. The 0dB reference level corresponds to a sound pressure level of 0.0002 dynes per square centimeter (dynes/cm²), which, referenced to power or intensity in watts, equals 0.000000000001 watts per square meter (W/m²). Also referred to in this book is the threshold of feeling, which is typically measured as an intensity of 1W/m². A typical professional sound level meter is shown in Figure 1.5.
As you can see, the decibel must have a reference, which, when we discuss sound levels in an acoustic environment, is the threshold of hearing. In fact, the decibel is defined as 10 times the logarithmic relationship between two powers. The formula for deriving the decibel is: dB = 10 log (PowerA / PowerB)
where PowerA is the measured power and PowerB the reference power. We can use this formula to dene the amplitude range of human hearing by substituting the threshold of hearing (0.000000000001W/m2 ) for the reference power and using the threshold of feeling (1W/m2) as the measured power. The formula, with the proper values inserted, looks like this:
dB = 10 log (1 / 10⁻¹²) = 10 log (10¹²) = 120dB
Therefore, the average dynamic range of the human ear is 120dB. Figure 1.6 shows typical sound pressure levels, related to the threshold of hearing, found in our everyday environment. Note also the level called the threshold of pain. It is interesting to note that other values can be obtained using the power formula. For example, if we use a value of 2W in the measured power spot and a value of 1W in the reference power spot, we find that the result is 3dB. That is, any 2-to-1 power relationship can be defined simply as an increase in level of 3dB. Whether the increase in power is from 100W to 200W or from 2,000W to 4,000W, the increase in level is still 3dB. The inverse square law was mentioned earlier in this chapter. You might surmise that doubling distance would cause a sound level loss of one half, or -3dB. However, you must remember that sound radiates omnidirectionally from the source of the sound. Recall from your high school physics class that the surface area of a sphere is determined by the formula A = 4πr². It follows that when a source radiates to a point that is double the distance from the first, it radiates into four times the area instead of twice the area. This causes a 6dB loss of level instead of a 3dB loss. The formula for the inverse square law is: Level drop = 10 log (r₂/r₁)² = 20 log (r₂/r₁) = 6dB
where r1 equals 2 feet and r2 equals 4 feet. Figure 1.7 shows this phenomenon. Note that at a distance of 2 feet, the sound pressure level is 100dB. When the listener moves to a distance of 4 feet (i.e., twice the original distance), the level drops to 94dB.
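Both the decibel formula and the inverse square law reduce to one-line computations; a minimal sketch (function names ours):

```python
import math

def db_power(measured_w, reference_w):
    """dB = 10 log (PowerA / PowerB), the power form of the decibel."""
    return 10.0 * math.log10(measured_w / reference_w)

def inverse_square_drop_db(r1, r2):
    """Level drop in free space moving from distance r1 to r2:
    10 log (r2/r1)^2 = 20 log (r2/r1)."""
    return 20.0 * math.log10(r2 / r1)

dynamic_range = db_power(1.0, 1e-12)      # 120 dB: feeling over hearing
power_doubling = db_power(2.0, 1.0)       # ~3 dB for any 2:1 power ratio
drop = inverse_square_drop_db(2.0, 4.0)   # ~6 dB: 100 dB at 2 ft reads ~94 dB at 4 ft
```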
(Figure 1.6: typical sound pressure levels in decibels, from rocket engines at the top, through the threshold of pain, jackhammers, the threshold of feeling, a full symphony orchestra at fff, subway trains, and heavy truck traffic, down through the average full symphony orchestra, power lawn mowers, average factories, and chamber orchestras, to the threshold of hearing. The upper levels are marked as danger levels.)
(Figure 1.7: the inverse square law, with a sound source and measurement points at r1 = 2 feet and r2 = 4 feet.)
The contours are labeled phons, which range from 10 to 120; at 1,000Hz the phon level is the same as the sound pressure level in decibels. Therefore, the phon is a measure of equal loudness. Figure 1.8 has a curve labeled MAF, which stands for minimum audible field. This curve, equal to 0 phons, defines the threshold of hearing. Note that at low sound pressure levels, low frequencies must be raised substantially in level to be perceived at the same loudness as 1kHz. Frequencies above 5kHz (5,000Hz) need to be raised as well, although not as much. Looking at Figure 1.8, you can see that the discrepancies between lows, highs, and mid tones are reduced as the level of loudness increases. The 90-phon curve shows a variation of 40dB, whereas the 20-phon curve shows one of nearly 70dB. The sound level meter in Figure 1.5 has several weighting networks so that at different sound pressure levels the meter can better approximate the response of the human ear. The A, B, and C weighting networks correspond to the 40-, 70-, and 100-phon curves of the equal loudness contours.
(Figure 1.8: the Robinson-Dadson equal loudness contours, phon curves from 0 (MAF) to 120, plotted against frequency from 20Hz to 10,000Hz.)
Logarithms
As you may have noticed, the formula used to define the decibel applied logarithms, abbreviated log. Anyone involved in the audio engineering process needs to understand logarithms. In brief, the logarithm of a number is the power to which 10 must be raised to equal that number (not multiplied, but raised). The shorthand
notation for this is 10ˣ, where x, the exponent, indicates how many times 10 is multiplied by itself. Therefore, 10³ is 10 raised to the third power, or simply 10 to the third:
10¹ = 10
10² = 100
10³ = 1,000
10⁹ = 1,000,000,000
Numbers whose value is less than 1 can be represented with negative exponents, such as
0.1 = 10⁻¹
0.01 = 10⁻²
0.001 = 10⁻³
0.000001 = 10⁻⁶
Because we are talking about very large and very small numbers, prefix names can be added to terms such as hertz (frequency) and ampere (a measure of current flow) to indicate these exponents. Therefore, for large numbers,
1,000 cycles per second = 10³ hertz (or 1 kilohertz [1kHz])
10⁶ hertz = 1 megahertz (1MHz)
10⁹ hertz = 1 gigahertz (1GHz)
10¹² hertz = 1 terahertz (1THz)
For small numbers,
0.001 ampere (A) = 10⁻³A (or 1 milliamp [1mA])
10⁻⁶A = 1 microamp (1µA)
10⁻⁹A = 1 nanoamp (1nA)
10⁻¹²A = 1 picoamp (1pA)
Now you can see that the threshold of hearing, defined earlier as 0.000000000001W/m², appears in the formula as 1 × 10⁻¹²W/m². It is also helpful to know that when powers of 10 are multiplied or divided, you simply add or subtract the exponents. Therefore, 10⁶ × 10³ = 10⁶⁺³ = 10⁹, and 10¹² ÷ 10⁹ = 10¹²⁻⁹ = 10³. Logarithms of numbers also exist between, for example, 1 and 10 and 10 and 100, but we will leave those problems to the mathematicians. Today, the logarithm of any number can easily be found either by looking it up in a table or by pushing the log button on a calculator.
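These exponent rules can be checked directly (a small sketch; the variable names are ours):

```python
import math

# log10 recovers the exponent: 10 raised to log10(x) equals x.
exponent = math.log10(1000.0)     # 3.0, because 10^3 = 1,000

# Multiplying powers of ten adds exponents; dividing subtracts them.
product = 10**6 * 10**3           # 10^9
quotient = 10**12 // 10**9        # 10^3

# SI prefixes are simply names for these exponents.
threshold_of_hearing = 1e-12      # W/m^2, i.e. one picowatt per square meter
```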
Because we really want to know the decibel level referenced to volts, the formula should read:
dB = 10 log (EA² / EB²)
To remove the squares from the voltages in the formula we can rewrite the expression as: dB = 20 log (EA / EB)
Note now that any 2-to-1 voltage relationship will yield a 6dB change in level instead of the 3dB change of the straight power formula. There is always a standard reference level in audio circuits. This reference level, an electrical zero reference, is equivalent to the voltage found across a common resistance in the circuit. Therefore, we can compare levels by noting how many decibels the signal is above or below the reference level.
The dBm
The standard professional audio reference level is +4dBm. The dBm is a reference level in decibels relative to 1 milliwatt (mW). Zero dBm is actually the voltage drop across a 600-ohm resistor in which 1mW of power is dissipated. Using Ohm's law (P = E^2/R) we find that the voltage is 0.775 volts RMS. (This value is merely a convenient reference point and has no intrinsic significance.) The meters that were used on the original audio circuits when this standard was enacted were vacuum tube volt meters (VTVMs). As the demand for more meters grew (as stereo moved to multitrack), a less expensive meter was needed. An accurate meter, which needs a low impedance, would load down the circuit it was measuring and thereby cause false readings. To compensate for the loading effect of the meter, a 3.6K resistor was inserted in the meter path. Now the meter no longer affected the circuit it was measuring, although it read 4dB lower. When the meter read 0VU (volume units), the actual line level was +4dBm. The notation dBu is often found in the specifications in manuals that come with digital equipment. As was just mentioned, we know that the dBm is referenced to an impedance of 600 ohms. (This was derived from the early telephone company standards. Many of our audio standards originated with Ma Bell.) However, most circuits today are bridging circuits instead of the older style matching circuits, and the dBu is used as the unit of measure. Without going deeper into the subject of matching and bridging circuits, we can say that the dBu is equal to the dBm in most cases.
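The 0.775-volt figure falls right out of the power formula. A short Python sketch, solving P = E^2/R for E under the stated 1mW/600-ohm conditions:

```python
import math

# 0 dBm is 1 mW dissipated in 600 ohms; solving P = E^2/R for E
# gives the familiar 0.775 V RMS reference.
P = 0.001   # watts (1 mW)
R = 600.0   # ohms
E = math.sqrt(P * R)
print(round(E, 3))  # 0.775

# +4 dBm, the professional line level, expressed in volts:
e_plus4 = E * 10 ** (4 / 20)
print(round(e_plus4, 3))  # 1.228
```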
Consider an experiment where a calibration tone of 1kHz at 0VU is played back from an analog consumer device operating at a level of -10dBm and from a professional analog device operating at +4dBm. If those signals were both sent to an analog-to-digital converter, the digitized signal level would be the same from both machines. As this signal is played back from the storage medium, the digital-to-analog converter outputs the calibration tone and produces a level of -10dBm or +4dBm at the device's output, depending on the device. The same signal played back on either a consumer or professional device will produce the reference level output at the device's specified line level. If we were to compare the output of the two devices, the +4dBm machine would play back 14dB louder, but this difference is due to the output gain of the playback section amplifier. The classic VU meter is calibrated to conform to the impulse response of the human ear. The ear has an impulse response, or reaction time, that is defined as a response time and a decay time of about 300 milliseconds. This corresponds to the reaction time of the ear's stapedius muscle, which connects the three small bones of the middle ear to the eardrum. Therefore, any peak information shorter than 300 milliseconds will not be fully recognized by the meter. The VU meter is designed to present the average content of the signal that passes through it.
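The averaging behavior can be imitated in a few lines. This is a deliberately crude sketch (the update rate, window, and test signals are invented for the example), not a calibrated model of VU ballistics:

```python
# A crude averaging meter: the reading is a running average over a
# ~300 ms window, so a brief peak barely moves the needle.
SAMPLE_RATE = 1000               # 1 kHz "meter update" rate, for simplicity
WINDOW = int(0.3 * SAMPLE_RATE)  # 300 ms integration window

def meter_reading(samples):
    window = samples[-WINDOW:]
    return sum(abs(s) for s in window) / len(window)

steady = [1.0] * WINDOW                       # 300 ms steady tone
burst = [0.0] * (WINDOW - 10) + [1.0] * 10    # 10 ms peak

print(meter_reading(steady))  # 1.0 - a sustained tone reads at full value
print(meter_reading(burst))   # ~0.033 - the short peak barely registers
```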
digital meter means full quantization level; that is, all bits at full value (there is no headroom). Therefore, it is important to note that 0VU on the console (whether it is +4dBm, -10dBm, or another level) does not equal 0dB on the digital meter. In early cases, -12dB was chosen as the calibration point for digital metering, but with the advent of systems with higher quantization levels, -18dB is more often used. To differentiate between peak meters and digital meters, the term dBfs is used (fs stands for full scale). This implies that the meter is on some kind of digital machine where 0dBfs equals full quantization level (all bits are used). Manufacturers have used a variety of standards, and some allow several decibels of level above zero as a hidden headroom protection factor. Perhaps in the future a digital metering standard will be adopted that everyone can adhere to. In the meantime the prudent engineer will read the manual for the piece of equipment in use and be aware of its metering characteristics. Most professional equipment in production today uses the standard of 0VU = -18dBfs.
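The mapping between the analog and digital meters is simple arithmetic once the alignment point is fixed. A small Python sketch, assuming the common 0VU = +4dBu = -18dBfs alignment described above (the function name is invented):

```python
# Map an analog level onto the digital full-scale meter, assuming
# the common alignment 0 VU (+4 dBu) = -18 dBfs.
ALIGNMENT_DBFS = -18.0   # dBfs reading that corresponds to 0 VU
LINE_LEVEL_DBU = 4.0     # professional reference line level

def dbu_to_dbfs(level_dbu):
    """Offset an analog level in dBu by the calibration point."""
    return ALIGNMENT_DBFS + (level_dbu - LINE_LEVEL_DBU)

print(dbu_to_dbfs(4.0))    # -18.0: a 0 VU tone sits 18 dB below full scale
print(dbu_to_dbfs(22.0))   # 0.0: +18 dB above reference just reaches 0 dBfs
```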
SYNCHRONIZATION
Although not a basic characteristic or function of sound, time code is an important part of digital systems. Without time code, position locating and synchronization in the tape-based digital domain would be extremely difficult. Time code, as we know it today, was developed by the video industry to help with editing. In 1956, when videotape made its debut, the industry realized that the film process of cut-and-splice would not work in video. The images that were visible on film were not visible on videotape. Certain techniques (e.g., magnetic ink that allowed you to see the recorded magnetic pulses) were developed, but these did not prove satisfactory. Another technique was to edit at the frame pulse or control track pulse located at the bottom edge of the videotape. This pulse tells the head how fast to switch in a rotating-head system. In the 1960s, electronic machine-to-machine editing was introduced, providing precise machine control and excellent frame-to-frame match-up. However, the splice point was still found by trial and error.
Time Code
A system was needed that would uniquely number each frame so that it could be precisely located electronically. Several manufacturers introduced electronic codes to fulfill this task, but the codes were not compatible with one another. In 1969 the Society of Motion Picture and Television Engineers (SMPTE) developed a standard code that became recognized for its accuracy. That standard was also adopted by the European Broadcasting Union (EBU), making the code an international standard. The SMPTE/EBU code is the basis for all of today's professional video- and audiotape editing and synchronization systems.
The SMPTE time code is an addressable, reproducible, permanent code that stores location data in hours, minutes, seconds, and frames. The data consist of a binary pulse code that is recorded on the video- or audiotape alongside the corresponding video and audio signals. The advantages of this are (1) precise time reference, (2) interchangeability among editing systems, and (3) synchronization between machines. On analog-based tape machines, the code can be stored in two different ways: longitudinal time code (LTC), which is an audio signal that is stored on a separate audio track of the video or audio machine; and vertical interval time code (VITC), which is small bursts of video integrated into the main video signal and stored in the vertical blanking interval (i.e., between the fields and frames of the video picture).
LTC
Longitudinal time code is an electronic signal that switches from one voltage to another, forming a string of pulses, which can be heard as an audible warbling sound if amplified. Each 1-second-long piece of code is divided into 2,400 equal parts when used in the NTSC standard of 30 frames per second or into 2,000 equal parts when used with the PAL/SECAM system of 25 frames per second. Notice how each system generates a code word that is 80 bits long: PAL/SECAM: 2,000 bits per second divided by 25 frames per second equals 80 bits per frame; NTSC: 2,400 bits per second divided by 30 frames per second equals 80 bits per frame. Most of these bits have specific values that are counted only if the time code signal changes from one voltage to another in the middle of a bit period, forming a half-bit pulse, which represents a digital 1, whereas a full-bit pulse represents a digital 0. Following is one frame's 80-bit code:

Bits 0-3, 8, 9: Frame count
Bits 16-19, 24-26: Second count
Bits 32-35, 40-42: Minute count
Bits 48-51, 56, 57: Hour count
Bits 64-79: Sync word
Bit 10: Drop frame
Bit 11: Color frame
Bit 27: Field mark
The remaining eight groups of 4 bits are called user bits, which can be used to store data such as the date and reel number. Three bits are unused. A typical time code number might be 18:23:45:28. This code number indicates a position 18 hours, 23 minutes, 45 seconds, and 28 frames into the event. This could be on the fortieth reel of tape. Time code does not need to start over on each reel.
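The bit assignments above can be exercised with a short decoder. This is an illustrative Python sketch, not a production LTC reader (it ignores the sync word, user bits, and flag bits; the function names are invented):

```python
def bits_to_int(bits):
    """Interpret a slice of bits (least significant bit first) as an int."""
    return sum(b << i for i, b in enumerate(bits))

def decode_ltc(word):
    """Read hours, minutes, seconds, frames out of an 80-element bit list.
    A real reader would also verify the sync word in bits 64-79."""
    frames  = bits_to_int(word[0:4])   + 10 * bits_to_int(word[8:10])
    seconds = bits_to_int(word[16:20]) + 10 * bits_to_int(word[24:27])
    minutes = bits_to_int(word[32:36]) + 10 * bits_to_int(word[40:43])
    hours   = bits_to_int(word[48:52]) + 10 * bits_to_int(word[56:58])
    return hours, minutes, seconds, frames

# Build a word reading 18:23:45:28 and decode it back.
word = [0] * 80
for value, units_start, units_len, tens_start, tens_len in (
        (28, 0, 4, 8, 2), (45, 16, 4, 24, 3),
        (23, 32, 4, 40, 3), (18, 48, 4, 56, 2)):
    for i in range(units_len):
        word[units_start + i] = (value % 10 >> i) & 1
    for i in range(tens_len):
        word[tens_start + i] = (value // 10 >> i) & 1

print(decode_ltc(word))  # (18, 23, 45, 28)
```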
Bit 10, the drop-frame bit, tells the time code reader whether the code was recorded in drop-frame or non-drop-frame format. Black-and-white television has a carrier frequency of 3.6MHz, whereas color uses 3.58MHz. This translates to 30 frames per second as opposed to 29.97 frames per second. To compensate for this difference, a defined number of frames are dropped from the time code every hour. The offset is 108 frames (3.6 seconds). Two frames are dropped every minute of every hour except in the zeroth, tenth, twentieth, thirtieth, fortieth, and fiftieth minutes. Frame dropping occurs at the changeover point from minute to minute. Bit 11 is the color frame bit, which tells the system whether the color frame identification has been applied intentionally. Color frames are often locked as A-B pairs to prevent color shift in the picture. As mentioned earlier, user bits can accommodate data for reel numbers, recording date, or any other information that can be encoded into eight groups of 4 bits.
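The drop-frame counting rule can be turned into a conversion routine. The sketch below (function name invented for illustration) converts a running frame count at 29.97 frames per second into drop-frame timecode:

```python
def frames_to_dropframe(frame_number):
    """Convert a frame count at 29.97 fps into drop-frame timecode.

    Two timecode numbers are skipped at the start of every minute
    except minutes 00, 10, 20, 30, 40, and 50, so 108 numbers
    (3.6 seconds) are dropped per hour.
    """
    frames_per_10min = 17982   # 10 minutes of frames at 29.97 fps
    frames_per_min = 1798      # 1 nominal minute minus the 2 dropped numbers

    tens, rem = divmod(frame_number, frames_per_10min)
    extra = tens * 18          # 18 numbers skipped per full 10 minutes
    if rem >= 2:
        extra += 2 * ((rem - 2) // frames_per_min)
    fn = frame_number + extra
    return (fn // 108000 % 24, fn // 1800 % 60, fn // 30 % 60, fn % 30)

print(frames_to_dropframe(1799))   # (0, 0, 59, 29)
print(frames_to_dropframe(1800))   # (0, 1, 0, 2) - :00 and :01 are skipped
print(frames_to_dropframe(17982))  # (0, 10, 0, 0) - minute 10 is not dropped
```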
VITC
Vertical interval time code is similar in format to LTC. It has a few more bits, and each of the 9 data-carrying bit groups is preceded by 2 sync bits. At the end of each frame there are eight cyclic redundancy check (CRC) codes, which are similar to the codes used in digital recording systems. This generates a total of 90 bits per frame. The main difference between LTC and VITC is how they are recorded on tape, LTC being recorded on one of the videotape's longitudinal audio tracks or on a spare track of the audiotape recorder. Some specialized two-track recorders have a dedicated time code track between the two standard audio tracks. Playback and record levels should be between -10dB and +3dB (-3dB is recommended). This allows 12dB of headroom on high-output audiotape operating at a reference level of 370 nanowebers per meter (nWb/m), where +4dBm equals 370nWb/m of magnetic fluxivity. Time code appears similar to a 400Hz square wave with many odd harmonics. Time code is difficult to read at low speeds and during fast wind or rewind. Vertical interval time code was developed for use with 1-inch-tape-width helical scan SMPTE type-C video recorders, which were capable of slow-motion and freeze-frame techniques. During these functions, LTC is impossible to read. However, VITC is readable (as long as the video is visible on the screen) because the indexing information for each field/frame is recorded in the video signal during the vertical blanking interval. When viewed on a monitor that permits viewing of the full video signal, VITC can be seen as a series of small white pulsating squares at the top of the video field. Normally, VITC is recorded on two nonadjacent vertical blanking interval lines in both fields of each frame. Recording the code four times in each frame provides redundancy that lowers the probability of reading errors due to dropouts.
Because VITC is recorded as an integrated part of the analog videotape recorder's video track, it can be read by the rotating video heads of a helical scan
recorder at all times, even in freeze-frame or fast-wind modes. Video technology is explained more fully in Chapter 6. In today's digital video acquisition and editing environment, time code is still a valuable reference. Frequently, audio and video are now being handled digitally on the same computer system, so the need for mechanical synchronization of separate tape machines is gone; however, the importance of internal synchronization is still present. The need to identify each field of video is addressed by embedding the time code reference during acquisition into the data for each frame, whether it is recorded onto tape or on a data storage device. The time code information is part of the data stream that gets transferred concurrently with the audio and video, intertwining all three elements.
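Time code locates events only to the nearest frame, and a single frame spans a great many audio samples. A quick illustrative Python sketch (the sample rates and frame rates are common values chosen for the example):

```python
# Audio samples elapsing within one video frame at common sample
# rates and frame rates - a frame is a very coarse unit compared
# with a single sample period.
for sample_rate in (44100, 48000, 96000):
    for fps in (25, 30):
        per_frame = sample_rate / fps
        print(f"{sample_rate} Hz at {fps} fps: {per_frame:g} samples/frame")
```

At 48kHz and 30 frames per second, for instance, 1,600 samples pass within each frame, which is why sample-accurate alignment needs a finer reference than time code alone.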
Word Clock
SMPTE time code can only help synchronize devices up to its finite resolution of 1/30th of a second, though on some equipment it can be used down to 1/100th of a frame. When connecting digital devices that divide time into slices of 1/96,000th of a second (or smaller), a higher-resolution reference is needed to ensure that all the data are being sent and received, and that they are correctly interpreted by the destination device. Word clock differs from time code in that it doesn't stamp each sample point with another tag of data; rather, it is a constant signal that sets the reference sampling rate of the source device in order to avoid data errors and maximize performance in the digital domain. The word clock signal is integrated in the S/PDIF and AES/EBU digital audio signals (discussed in Chapter 10), which allows the destination device to correctly process the digital signal. Some facilities also use a separate master word clock device to dictate the word clock to all studio devices from a single source. If no master word clock is present, then the source device provides the reference clock to its destination. This ends our discussion of the basic characteristics of sound. An understanding of these concepts will help you with the information yet to come. We have touched only briefly on some very important areas, so you should consult the excellent general texts on sound and recording cited in the list of suggested readings at the end of this book.