Audio Theory

Prepared by: M.Y.A


Topics Covered
1. Introduction to Sound
2. Analog Audio
3. Digitizing Audio
4. Sampling and Sample Rate
5. Quantization and Bit Depth
6. Bit-Rates
7. The Nyquist Theorem
8. Digital Audio Formats
9. Common Audio File Formats
10. Audio Compression
11. Hardware Considerations
Introduction to Sound
Before we can begin to understand and analyze the specific features of digital audio, we
need to consider some fundamental questions: what is sound, and how does the human
ear perceive variations in pitch, volume and scale?
In order for sound to exist, there needs to be a medium, like air, water or a solid object,
that acts as the transmitter; a motion or disturbance in this medium that creates waves;
and some kind of receiver that detects these variations (see Figure 1). Without these
three essential parts, sound cannot exist. Therefore, for humans to hear sound, pressure
waves or changes in air pressure must be generated by some physical vibrating object,
such as a musical instrument or the vocal cords. These changes are then perceived via a
diaphragm, the eardrum, which converts the pressure waves into electrical signals that
are interpreted by the brain. This type of sound production is referred to as analog audio.

Figure 1
Analog Audio
An Analog Audio signal can be graphically represented as a waveform (see Figure 2).
A waveform is made up of peaks and troughs that are a visual representation of
wavelength, period, amplitude (volume level) and frequency (pitch).

The horizontal distance between two successive corresponding points on the wave, for
example two successive peaks, is referred to as the wavelength and is the length of one
cycle of the wave. The period of the wave refers to the amount of time that it takes for
the wave to travel one wavelength (see Figure 6 on the next page).

Amplitude is half the distance from the highest to the lowest point in a wave.
If this distance is large, the volume level is comparatively loud; conversely, if the
amplitude is small, then the volume level is low.
The number of cycles per second is referred to as frequency (see Figure 3).
Frequency is measured in hertz (Hz), a unit named after the German physicist Heinrich Hertz,
and indicates the number of cycles per second that pass a specified location in the waveform.
Frequency directly relates to the sound's pitch. Pitch, or key, is how the brain interprets the
frequency of the sound created: the higher the frequency, or the faster the sound vibrations occur,
the higher the pitch. Slower vibrations mean a lower frequency and a lower pitch.
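The small Python sketch below ties these terms together. It assumes the speed of sound in air is roughly 343 m/s (an assumed constant, not a value from the notes) and uses concert pitch A (440 Hz) as an illustrative tone.

SPEED_OF_SOUND = 343.0  # metres per second, approximate value in air at 20 degrees C

def period_seconds(frequency_hz: float) -> float:
    """Period: the time taken for one complete cycle."""
    return 1.0 / frequency_hz

def wavelength_metres(frequency_hz: float) -> float:
    """Wavelength: the distance the wave travels in one period."""
    return SPEED_OF_SOUND / frequency_hz

print(period_seconds(440))      # ~0.00227 s per cycle for a 440 Hz tone
print(wavelength_metres(440))   # ~0.78 m between successive peaks in air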

Analog signals are continuous and flexible, able to change at varying rates and sizes, which means
that analog audio is relatively unconstrained and unlimited. The flexible nature of analog sound, though
seemingly positive, is in fact its biggest disadvantage, as it makes the signal more susceptible to
degradation such as distortion and noise.

Figures 4, 5 and 6 provide a visual guide to the structure of a waveform.


Digitizing Audio
Computers cannot understand analog information. In order for an analog signal to be
understood by a digital device (such as a computer, CD player or DVD player), it first needs to be
digitized or converted into a digital signal by a device called an Analog to Digital
Converter or ADC (see Figure 7). Then for the human ear to be able to hear a digital
signal, it needs to be converted back to an analog signal. This is achieved using a Digital
to Analog Converter or DAC.

The conversion of an analog signal to a digital one requires two separate processes: Sampling and Quantization.
Sampling and Sample Rate

The analog signal is sampled, or 'measured' and assigned a numerical value, which the
computer can understand and store.
The number of times per second that the computer samples the analog signal is called
its Sample Rate or Sampling Frequency. While the basic unit used to measure
frequency or cycles per second is hertz, when sampling audio it is generally measured
in thousands of cycles per second or kilohertz (kHz).

An audio CD, for example, generally has a sampling rate of 44.1 kHz, that is, forty-four
thousand one hundred samples per second, while AM-radio-quality audio is typically sampled
at 11.025 kHz, or eleven thousand and twenty-five samples per second. The more samples taken,
the higher the quality of the digital audio signal produced.
Example: Play the two provided examples, 44100Hz.wav and 11025Hz.wav, to hear the
difference in sound quality between the two different sample rates.
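If you want to generate comparable test files yourself, the following sketch writes the same tone at the two sample rates using Python's standard-library wave module. The filenames and the 1 kHz tone are illustrative choices, not the class-provided files.

import math
import struct
import wave

def write_tone(filename, sample_rate, freq=1000.0, seconds=2.0):
    n_samples = int(sample_rate * seconds)
    # Build 16-bit mono PCM frames for a sine wave at the given frequency.
    frames = b"".join(
        struct.pack("<h", int(32767 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n_samples)
    )
    with wave.open(filename, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # samples per second
        wav.writeframes(frames)

write_tone("tone_44100.wav", 44100)    # CD-quality sample rate
write_tone("tone_11025.wav", 11025)    # AM-radio-quality sample rate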
Low sampling rates, below 40 kHz, can result in a static-like distortion caused by data loss.
This is referred to as Aliasing (see Figure 8). Aliasing can cause digitally reconstructed sound
to play back poorly. To avoid the aliasing effect, sampling needs to occur at a high enough rate
to ensure that the sound's fidelity is maintained, or anti-aliasing needs to be applied when the
audio is being sampled. An anti-alias filter can ensure that nothing above half the desired
sampling rate enters the digital stream, i.e. any frequencies above that limit are blocked.
Be aware that using anti-alias filters may, in turn, introduce further unwanted noise.

Analog audio is a continuous sound wave that develops and changes over time. After it is
converted to digital audio it is discontinuous, as it is now made up of thousands of samples
per second.
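A hedged sketch of the anti-alias idea, assuming NumPy and SciPy are available: scipy.signal.decimate applies a low-pass filter below the new Nyquist limit before discarding samples, whereas naively keeping every fourth sample lets high frequencies fold back into the audible band. The 6 kHz test tone is an illustrative input, not data from the notes.

import numpy as np
from scipy.signal import decimate

# One second of a 6 kHz test tone sampled at 44.1 kHz.
samples_44k = np.sin(2 * np.pi * 6000 * np.arange(44100) / 44100)

# Naive downsampling to 11.025 kHz: keep every 4th sample. Content above
# 5512.5 Hz (half of 11025 Hz) aliases, so the 6 kHz tone reappears as a
# false lower frequency of about 5 kHz.
aliased = samples_44k[::4]

# Filtered downsampling: decimate() low-passes the signal below the new
# Nyquist frequency before discarding samples, so the 6 kHz tone is
# removed rather than folded back.
clean = decimate(samples_44k, 4)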
Quantization and Bit Depth
Once an analog signal has been sampled, it is then assigned a numeric value in
a process called Quantization. The number of bits used per sample defines
the available number of values.
Bit is short for binary digit. Computers are based on a binary numbering system that uses
two digits: 0 and 1. This differs from the more familiar decimal numbering system that uses
ten digits.
This two-digit system means each additional bit doubles the number of values available: a
1-bit sample has 2 possible values (0 and 1), a 2-bit sample has 4 possible values (00, 01,
10 and 11), and so on (see Figure 9).
The number of bits used per sample is referred to as the Resolution or Bit Depth. This method of
measurement is used throughout digital technologies. You may already be familiar with bit depth
in digital graphics, where a 1-bit image is black and white, a web-safe or greyscale image is
8-bit, and an RGB image, with one byte (8 bits) allocated to each of the three colors, is 24-bit
in total.

Typically, audio recordings have a bit depth of either 8 or 16-bit and even 24-bit on
some systems. An 8-bit sample will allow 256 values, whereas a 16-bit sample will allow
65 536 values. The greater the bit depth, the more accurate the sound reproduction and
the better the sound quality.
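A minimal quantization sketch in Python, with an illustrative helper name: each continuous sample value (here scaled to the range -1.0 to 1.0) is mapped onto the nearest of the 2 to the power of bit-depth available integer levels.

def quantize(sample: float, bits: int) -> int:
    levels = 2 ** bits              # 8-bit -> 256 levels, 16-bit -> 65 536 levels
    max_level = levels // 2 - 1     # e.g. 127 for 8-bit, 32767 for 16-bit
    return round(sample * max_level)

print(2 ** 8, 2 ** 16)       # 256 and 65536 possible values
print(quantize(0.5, 8))      # 64  - the nearest 8-bit level to half full scale
print(quantize(0.5, 16))     # 16384 - the same value resolved much more finely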

An audio signal's Dynamic Range is the difference between its quietest and loudest levels
and is measured in decibels (dB). The larger the dynamic range, the greater the risk of
distorted sound, so audio files with a large dynamic range tend to require a greater bit depth
to maintain sound quality.

An 8-bit sample, with 256 values can recreate a dynamic range of 48 dB (decibels), which is
equivalent to AM radio, whereas a 16-bit sample can recreate a dynamic range of
96 dB, which is the equivalent of CD audio quality.
The dynamic range of the average human ear is approximately 0 to 96 dB (120 dB is the
pain threshold), so it is no coincidence that the standard bit depth for CD-quality audio is
16-bit.
Bit-Rates
The number of bits used per second to represent an audio recording is defined as the Bit-Rate.
In digital audio, bit-rates are stated in thousands of bits per second (kbps).
The bit-rate is directly associated with a digital audio file's size and sound quality. Lower
bit-rates produce smaller file sizes but inferior sound quality; higher bit-rates produce
larger files but better sound quality. An uncompressed audio track's bit-rate and
approximate file size can be calculated using the following formulas:

bit-rate (bits per second) = sample rate x bit depth x number of channels
file size (bytes) = bit-rate x duration in seconds / 8

When calculating bit-rate it is important to remember that:


• 8 bits = 1 byte
• 1024 bytes = 1 KB or a kilobyte
• 1024 kilobytes = 1 MB or a megabyte

The amount of 1024 is often rounded down to 1000 if strict accuracy is not required
(see Figure 9 to calculate file size).
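The short Python sketch below works through these formulas for CD-quality settings; the function names are illustrative, not part of any library.

def bit_rate_bps(sample_rate: int, bit_depth: int, channels: int) -> int:
    # bit-rate = sample rate x bit depth x number of channels
    return sample_rate * bit_depth * channels

def file_size_bytes(sample_rate: int, bit_depth: int, channels: int, seconds: float) -> float:
    # file size = bit-rate x duration / 8 (8 bits per byte)
    return bit_rate_bps(sample_rate, bit_depth, channels) * seconds / 8

rate = bit_rate_bps(44100, 16, 2)            # 1 411 200 bits per second (~1 411 kbps)
size = file_size_bytes(44100, 16, 2, 60)     # one minute of stereo CD audio
print(rate, size / (1024 * 1024))            # ~10.1 MB per minute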
The Nyquist Theorem

Audible Frequency refers to the range of frequencies that are detectable by the average
human ear. There is a direct correlation between the sample rate and the highest audible
frequency perceived by the ear. The relationship between sample rate and the highest
audible frequency is referred to as the Nyquist Theorem. The Nyquist Theorem, named
after Harry Nyquist, a Bell engineer who worked on the speed of telegraphs in the 1920s,
is a principle that is used to determine the correct sampling rate for a sound.

Essentially, the Nyquist Theorem states that a sound needs to be sampled at a rate that is
at least twice its highest frequency in order to maintain its fidelity or sound quality.
Therefore, a sample taken at 44.1 kHz will contain twice the information of a sample taken
at 22.05 kHz. Put simply, this means that the highest audible frequency in a digital
sample will be exactly half the sampling frequency.
Average human hearing, at best, covers a range from 20 Hz (low) to 20 kHz (high), so a
sample rate of 44.1 kHz should theoretically cover most audio needs. It is also the
standard for CD audio, which requires near optimum sound quality. Therefore, the higher
the sample rate, the better the quality of sound that is reproduced. However, this also
means that the higher the sample rate, the greater amount of audio data produced and
consequently the larger the file size. This means that there is a direct correlation
between the sample rate, the quality of sound and the file size of the audio file.
An example of how this affects the quality of digital audio is illustrated in Figure 10. A music
track that contains frequencies up to approximately 20 kHz, the highest audible frequency
perceived by the average human ear, needs to be sampled at 44.1 kHz in order to maintain
CD-quality sound fidelity. However, if the same track is sampled at a rate lower than 44.1 kHz,
e.g. 30 kHz, then according to the Nyquist Theorem the range between 15 kHz and 20 kHz will
be lost and the sound quality will deteriorate.
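A small illustration of this limit, using hypothetical helper functions: the highest frequency a given sample rate can represent is half that rate, and a tone above the limit folds back (aliases) to a false lower frequency.

def nyquist_limit(sample_rate: float) -> float:
    return sample_rate / 2

def aliased_frequency(tone_hz: float, sample_rate: float) -> float:
    # Valid for tones below the sample rate: anything above the Nyquist
    # limit folds back around it.
    limit = nyquist_limit(sample_rate)
    return tone_hz if tone_hz <= limit else sample_rate - tone_hz

print(nyquist_limit(44100))             # 22050.0 Hz - covers the 20 kHz hearing limit
print(nyquist_limit(30000))             # 15000.0 Hz - content from 15-20 kHz is lost
print(aliased_frequency(18000, 30000))  # 12000.0 Hz - an 18 kHz tone folds down to 12 kHz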
Sampling below the rate recommended by the Nyquist Theorem usually occurs when the sample
rate is determined by the transmission technology, for example telephone lines or the
bandwidth allocated to radio transmission, where low data rates and limited storage space are
prioritized over sound quality.

Figure 10: The Nyquist Theorem states that a waveform must be sampled at least twice per
cycle. The positive peak and the negative peak must both be captured in order to get a
true picture of the waveform.
Sampling rates are directly linked to the desired sound quality produced; therefore, different audio
types and delivery methods require different sampling rates (see Figure 11). Many applications do
not require a wide frequency range to produce an 'acceptable' level of sound quality. The highest
audible frequency in the human voice is approximately 10 kHz, which is equivalent to a sample rate of
20 kHz. Telephone systems, however, rely on the fact that even with a highest transmitted frequency of
4 kHz (a sample rate of 8 kHz), the human voice is perfectly understandable.
Sampling rates for radio broadcasts are also confined to frequencies that suit the required quality
of the sound produced. AM radio has been broadcast since the early 1900s, and in the 1920s it was
allocated specific frequencies. Due to the limited capabilities of radio and electronics at the time,
the frequencies for AM radio were relatively low.
Edwin Armstrong developed FM radio in the 1930s. His intention was to produce high-fidelity,
static-free broadcasts, which required higher frequencies. Although FM radio was available earlier,
it did not become truly popular until the 1960s.
The sampling rate used for CD is 44.1 kHz, or 44 100 samples per second. This relates directly to the
Nyquist Theorem, whereby in order to produce high quality sound the sample rate must be at least
twice the maximum audible frequency. So for a CD to reproduce audio up to a maximum frequency of
20 kHz, the upper limit of human hearing, it requires a sampling rate of at least 40 kHz. The standard
sample rate for CD, however, is set slightly higher, at 44.1 kHz.
Digital Audio Tape (DAT)
Developed in the late 1980s, DAT is still used by some sound recording studios for both direct
recording and data backup. DAT tapes resemble small cassette tapes in appearance but have the
capacity to record up to 2 hours of audio. The DAT recording process is similar to cassette
recording, but the quality can be compared to CD quality or higher, with three possible sampling
rates: 32 kHz, 44.1 kHz and 48 kHz. DAT recording is also discussed later in the Hardware
Considerations section of the notes.

Stereo and Mono


Audio is typically recorded in either Mono or Stereo.
A stereo signal is recorded using two channels and when played through headphones will
produce different sounds in each speaker. This allows for a more realistic sound because
it mimics the way that humans hear, therefore giving us a sense of space.

Mono signals, on the other hand, have identical sounds in each speaker, which creates a more
unnatural, 'flat' sound. This is a major consideration when digitising audio, in that it will take
twice as much space to store a stereo signal compared to a mono signal.
Digital Audio Formats
An audio file consists of two main components; a header and the audio data. The header
stores information in relation to Resolution, Sampling Rate and Compression Type.
Sometimes, a wrapper is also used which adds information about things such as license
management information or streaming capabilities (see Figure 12).

Digital audio files can be found in a huge variety of file formats but basically these files
can be divided into two main categories:

1. Self–Describing

2. RAW
Self-Describing formats are usually recognized by their file extension. The extension, which is
part of the file name, refers to the type and structure of the audio data within the file and
tells the user and the computer how to deal with the sound information.

RAW formats are files that are not compressed and contain little or no header information. They
rely on the sound software to correctly interpret the sample data, for example its sample rate,
bit depth and number of channels.

File formats are used for different purposes and they vary in terms of file sizes created.
Therefore, when choosing an audio file format, its function and eventual context need to
be considered. This is particularly important when working with audio files for the web.
Common Audio File Formats

1. Wave File Format (.wav)


This is Windows' native file format for the storage of digital audio data. Due to the
popularity of Windows, it is one of the most widely supported audio file formats on the
PC. WAV files are usually coded using PCM (Pulse Code Modulation), a digital scheme for
representing analog data. WAV files are typically uncompressed and therefore have large file
sizes; because the audio is stored as raw PCM, the format is often used for archiving or
storage. The audio data within a wave file is stored in a chunk, which consists of two
sub-chunks: a 'fmt ' chunk that stores the data format and a data chunk that contains the
actual sample data.
The WAV format supports a variety of bit depths and sample rates, as well as both mono and
stereo signals.
2. Audio Interchange File Format (.AIFF)
This is the standard audio file format used on Macintosh systems, although it can also be used
on other platforms. Like the WAV file format, the audio data within an AIFF file uses the Pulse
Code Modulation method of storing data in a number of different types of chunks. It is a binary
file format that is quite flexible, as it allows for the storage of both mono and stereo sampled
sounds and supports a variety of bit depths, sample rates and channels of audio.

3. MPEG – Encoded Audio (.MP3)


MPEG audio is a standard technology that allows compression of an audio file to
between one-fifth and one-hundredth of its original size without significant loss to sound
quality. The MPEG audio group includes MP2, MP3 and AAC (MPEG-2 Advanced Audio
Coding).
The most common, however, is MPEG Audio Layer 3, which has the file extension .mp3. MP3
compression makes it possible to transmit music and sound over the Internet in minutes; files
can be downloaded and then played by an MP3 player. There are several free MP3 players, but
many do not support streaming, and those that do often use different, incompatible methods of
achieving playback. MP3 files can be compressed at different rates, but the greater the
compression, the lower the sound quality.
MP3 technology uses a lossy compression method, which discards audio information that is not
detectable to the human ear. This means that 'unnecessary' information is deleted in the
compression process, resulting in a file that is a fraction of the size of the original WAV file
while the perceived quality remains virtually the same. The main disadvantage of MPEG compression
in software is that it can be a relatively slow process.
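A rough size comparison shows why MP3 suits internet delivery. The 4-minute track length and the 128 kbps encoding rate are illustrative assumptions, not figures from the notes.

seconds = 4 * 60
wav_bytes = 44100 * 16 * 2 * seconds / 8   # sample rate x bit depth x channels / 8
mp3_bytes = 128_000 * seconds / 8          # assumes a constant 128 kbps MP3 encoding

print(wav_bytes / (1024 * 1024))   # ~40 MB as uncompressed CD-quality WAV
print(mp3_bytes / (1024 * 1024))   # ~3.7 MB as MP3, roughly one-eleventh of the WAV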

4. Real Audio (.RA, .RM)


Real Audio is a proprietary form of streaming audio (described later) for the web from
Progressive Networks' RealAudio, which uses an adaptive compression technology that creates
extremely compact files compared to most other audio formats. The resulting bit rate can be
optimised for delivery over various low-to-medium connection speeds. Real Audio requires either
a Real Audio server or the use of metafiles, otherwise the files won't download and play.

Real Audio is a good choice for longer audio clip sounds because it lets you listen to them
in ‘real-time’ from your Web browser and the sound quality of the high bandwidth
compressions is good. Real Audio players can be included with a web browser or can be
downloaded from the web.
5. MIDI – Musical Instrument Digital Interface
MIDI, or Musical Instrument Digital Interface, is not an actual audio file format but
rather a music definition language and communications code that contains instructions to
perform particular commands. Rather than representing musical sound directly, MIDI files
transmit information about how music is produced. MIDI is a serial data language,
composed of MIDI messages, often called events, that transmit information about pitch,
volume and note duration to MIDI-Compatible sound cards and synthesizers.
Messages transmitted include:
• Start playing (Note ON)
• Stop playing (Note OFF)
• Patch change (eg change to instrument #25 - nylon string guitar)
• Controller change (eg change controller Volume to value from 0 to 127)
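A byte-level sketch of the messages listed above: MIDI is a stream of small binary messages rather than audio samples, which is why the files stay so small. The channel, note (60 = middle C) and velocity values here are illustrative choices.

channel = 0  # MIDI channels are numbered 0-15 on the wire (shown as 1-16 to users)

note_on        = bytes([0x90 | channel, 60, 100])  # start playing middle C, velocity 100
note_off       = bytes([0x80 | channel, 60, 0])    # stop playing middle C
patch_change   = bytes([0xC0 | channel, 24])       # program 24 = GM patch #25, nylon string guitar
control_change = bytes([0xB0 | channel, 7, 96])    # controller 7 (volume) set to 96 of 127

print(note_on.hex(), note_off.hex(), patch_change.hex(), control_change.hex())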
It was initially developed to allow sequencers to control synthesisers. Older
synthesisers were Monophonic, that is, they were only able to play one note at a time.
Sequencers could control those synthesisers by voltage, plus a trigger or gate signal that
indicated whether a key was up or down. Contemporary synthesisers are Polyphonic, enabling
them to play many notes at once, which is more complex. A single voltage was not enough to
define several keys, so the solution was to develop a special language: MIDI. MIDI files are
much smaller than other audio file formats, as they contain only performance information and
not the actual sound. The main advantage of MIDI is its small file size; the disadvantage is
the lack of direct control over the resulting sound.

To play MIDI files you need two things:


• Either a MIDI plug-in or a MIDI helper application and
• A MIDI device, which can take the form of a soundcard, an external MIDI playback box
or MIDI keyboard, or a software based MIDI device, such as the set of MIDI sounds that
comes with the current version of QuickTime.

These are the most common audio file formats in the current market, but in the past,
computers with sound capabilities developed their own proprietary file formats.

The following is a list of some of the current proprietary file formats:


• .SFR – Sonic Foundry
• .SWA – Shockwave
• .SMP – Turtle Beach
Audio Compression
Compression is the reduction in size of data in order to save storage space or transmission
time. Generally, compression can be divided into two main categories: Lossy and Lossless
compression. The main objective of both techniques is to decrease file size; however, this is
their only similarity.
Text documents can be compressed by extremely high percentages of the original file size, e.g.
on average 90%, but audio files can typically only be compressed to approximately 25-55% of
the original file size. Although this compression percentage may not seem ideal, it is very
useful when reducing audio file sizes that need to be transferred over the internet or when
archiving audio files.

Lossless audio compression (e.g. Monkey's Audio) is similar in concept to using a program like
WinZip to compress a document or program. The information within the audio file is minimized in
terms of file size whilst still maintaining the fidelity of the original data. This means that
the compressed file can be decompressed back to data identical to the original file, with no
loss of audio quality.
Lossy audio compression (e.g. MP3), on the other hand, does not maintain the identical fidelity
of the original audio file and, in fact, does not keep all of the audio data. Lossy compression
methods analyse the audio data in the file and then discard any information that seems
'unnecessary' in order to reduce the file size. This discarded information is not usually
discernible by the human ear and therefore does not alter the 'perceived' quality of the audio.

Any compressor will achieve varying compression ratios depending on the amount and type of
information to be compressed, and there are many different file formats available for both
lossless and lossy audio compression.

The web is the most obvious place where audio compression becomes of paramount importance.
Speed and efficiency are the two things the web relies on for effective data transfer from the
Internet pipeline to the end user's machine; therefore, the smaller the file size, the faster
the data is transferred.
There are several ways you can reduce the size of an audio file for delivery on the web. The
first and most obvious method is to consider the length of the track.
There will be a significant difference, for example, between 1 minute of recorded audio and
40 seconds (see Figure 14). The next consideration is the number of channels: does the track
need to be in stereo, or could it be converted to a mono recording? By converting the file to
only one channel you have already halved its original size and halved the download time.
Another way to reduce the file size is to change the bit depth, for example from a 16-bit track
to an 8-bit track. The final way to reduce the size of an audio file is to alter the sample
rate. The key to creating digital audio files for the web is to experiment with the various
recording settings in order to find an effective balance between sound quality, performance
and file size (see Figure 14).
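Reusing the uncompressed-size formula from the Bit-Rates section, the sketch below compares these reductions for one minute of audio; the helper name is illustrative.

def size_mb(sample_rate, bit_depth, channels, seconds):
    # uncompressed size = sample rate x bit depth x channels x duration / 8, in megabytes
    return sample_rate * bit_depth * channels * seconds / 8 / (1024 * 1024)

print(size_mb(44100, 16, 2, 60))   # ~10.1 MB - one minute of CD-quality stereo
print(size_mb(44100, 16, 1, 60))   # ~5.0 MB  - converted to mono (half the size)
print(size_mb(44100, 8, 1, 60))    # ~2.5 MB  - bit depth reduced to 8-bit
print(size_mb(22050, 8, 1, 60))    # ~1.3 MB  - sample rate halved to 22.05 kHz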
Hardware Considerations
1. Video Capture Cards

A video capture card is used together with a computer to pass frames from the video to
the processor and hard disk. When capturing video, ensure that all programs not in use
are closed, as video capture is one of the most system intensive tasks that can be
performed on a computer.

Most capture cards include options for recording either a microphone level or a line level
signal. A Microphone Level Signal is a signal that has not been amplified and has a voltage of
around 0.001 volts (one millivolt); not surprisingly, microphones usually generate microphone
level signals. A Line Level Signal has been through a preamplifier and has a voltage of around
1.0 volt (one full volt); it is generally produced by mixing decks, *Video Tape Recorders
(VTRs), tape players, DAT players etc. If your capture card has the option, you will be able to
choose which type of signal you are recording. Your capture card may have two different types
of connectors. The microphone input is usually (except when using Macintosh system
microphones) a 3.5 mm mini-jack stereo connector. The line input is usually a stereo RCA
connector or sometimes a three-pin XLR connector.
* Video Tape Recorders (VTRs) are professional recording and playback machines which use reels of magnetic tape.
2. Metering and Monitoring

Your capturing software should also allow you to see a graphic representation of sound
levels - it should display meters. There are different types of meters, which use a variety of
measurements and color codes. Regardless of the metering system used, you should always use
the meters to ensure that the incoming sound does not exceed the recording abilities of the
capture card. Unlike analog systems, which, due to the electrical nature of the signal and the
recording medium, tolerate sounds recorded at levels that clip or peak, digital systems do not.
Digital recorders can only record levels within their range capabilities; if the incoming level
exceeds the maximum, clipping occurs and the digital sound is distorted on playback.
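A minimal peak-metering sketch, assuming 16-bit samples with a full scale of 32767 and using illustrative input values: anything that reaches the digital ceiling (0 dBFS) will clip.

import math

def peak_dbfs(samples: list[int], full_scale: int = 32767) -> float:
    # Peak level relative to digital full scale, in decibels (dBFS).
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak / full_scale) if peak else float("-inf")

incoming = [1200, -30450, 32767, -15000]   # illustrative 16-bit sample values
level = peak_dbfs(incoming)
print(round(level, 1), "dBFS")             # 0.0 dBFS here - the signal is clipping
if level >= 0:
    print("Reduce the input level before recording.")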
3. Sound Cards and Sound Considerations

A sound card is a peripheral device that attaches to the motherboard in the
computer. This enables the computer to input, process and deliver sound. Sound cards
may be connected to a number of other peripheral devices such as:
• Headphones
• Amplified speakers
• An analog input source (microphone, CD player)
• A digital input source (DAT, CD-ROM drive)
• An analog output device (tape deck)
• A digital output device (DAT, recordable CD (CD-R)) (see Figure 15)
The core of the sound card is the audio processor chip and the CODECs. In this
context, CODEC is an acronym for COder/DECoder. The audio processor manipulates the
digital sound and, depending on its capabilities, is responsible for converting sample rates
between different sound sources or adding sound effects. Although the audio processor works
in the digital domain, at some point, unless you have speakers with a digital input, you will
need to convert the sound back into analog.
Similarly, many of the sound sources that you want to input to your computer will begin
as analog and therefore need to be converted into digital. A sound card therefore needs
some way to convert the audio. DACs (digital to analog converters) and ADCs (analog to
digital converters) are required to convert these audio types and many audio cards have
chips that perform both of these functions. They are also known as CODECs due to their
capability to encode analog to digital and decode digital to analog.
The other factors that can influence the functionality and usability of the sound card are its
device driver, along with the number and type of input and output connectors (see Figure 16).
4. DAT Recording

DAT (Digital Audio Tape) is used for recording audio on to tape at a professional level
of quality. A DAT drive is a digital tape recorder with rotating heads similar to those found
in a video deck (see Figure 17). Most DAT drives can record at sample rates of 44.1 kHz
(the CD audio standard) and 48 kHz.
Recording on DAT is fast and simple. It is as simple as choosing what you want, setting
the levels and pressing record. DAT has become the standard archiving technology in
recording environments for master recordings. Digital inputs and outputs on professional
DAT decks allow the user to transfer recordings from the DAT tape to an audio
workstation for precise editing. The compact size and low cost of the DAT medium makes
it an excellent way to compile the recordings that are going to be used to create a CD
master.
5. MiniDisc Players

MiniDisc was developed by Sony in the early 1990s as portable equipment that combines the
storage qualities of CD with the recordability of cassettes. MiniDisc players are very cost
effective and run on mains power or on rechargeable batteries, which last for approximately
14 hours of play time.
While CD-ROMs and DVDs use optical technology, and floppy disks and hard drives use magnetic
technology, MiniDisc uses a combination of both to record data. Therefore, care should be taken
to protect MiniDiscs from strong magnetic fields. Just like a computer's hard drive, the audio
data is recorded digitally and in fragments - this is called Non-Linear recording.
MiniDiscs use sample rates of 48 kHz, 44.1 kHz or 32 kHz. They use compression to enable them
to record the equivalent of a full-sized CD onto the 64 mm disc. This compression, called ATRAC
(Adaptive Transform Acoustic Coding), incorporates noise reduction and has a compression ratio
of approximately 1:5. Similar to MP3, it reduces data by encoding only frequencies audible to
the human ear.
6. Microphones

Computers with built-in microphones are not usually considered to be high-fidelity devices.
When dealing with audio production, the adage 'garbage in, garbage out' applies.
In essence, nothing can fix poorly recorded sound. If your audio is going to be
compressed, or its sample rate and bit depth are reduced, then it is very important to
record clear, dynamic sounds. Choosing a good microphone is very important. There are a
variety of microphones available on the market, each offering different sound qualities
that are outlined in the following section, but firstly, let’s discuss how microphones work.
