DIGITIZATION AND EDITING
Sampling and Quantization
Digital sampling is the term used in digital audio technology for the process of converting a continuously varying electrical voltage into a sequence of numbers for computer storage. An analog-digital (A-D) converter samples the relative voltage of the analog signal at equally-spaced moments of time, and then stores each value sequentially. The number of times the analog signal is evaluated over one second is termed the sampling rate, which is measured in terms of frequency. The audio sampling rates most commonly encountered in multimedia audio are 11025, 22050 and 44100 Hz, meaning that the analog signal is evaluated either 11025, 22050 or 44100 times a second. The process of assigning a digital numerical value proportional to the voltage level at each sampled moment is termed quantization; typically, 8- or 16-bit digital words are used to represent each sampled voltage level.
A good analogy for the difference between an analog signal and digital sampling is the comparison between a traditional clock with hour, minute, and second hands, and a digital clock with a numeric display. On a traditional clock, time is represented continuously by the motion of the second hand. This is equivalent to the continuous analog voltage variation that is found at the output of a microphone or a loudspeaker. Contrasting this, a digital clock display changes in increments of the smallest value of time shown (usually every second). The display doesn’t indicate anything “between the cracks;” the passage of time is displayed only in increments of the sampling rate of once a second.
To understand how sampling and quantization work together, consider the analogy of a black and white movie camera. This will sample an “analog” visual scene at a rate of 24 frames a second; nothing that occurs in the time interval in-between the frames will be captured. Furthermore, within each sampled frame of film, the color spectrum is quantized into a particular grayscale value from black to white.
Another way of understanding quantization is in terms of a questionnaire used in a survey, for instance to determine how you feel about a political candidate or snack food. Complex “analog responses” in the form of a qualitative opinion are seldom solicited. Instead, one is given a discrete set of quantized responses from which to choose. For instance, a “true-false” or “yes-no” questionnaire is equivalent to a quantization granularity of 2. The possibility of “maybe” is lost because the resolution is too coarse with such a “two-alternative forced choice” paradigm.
Now, let’s apply these concepts to audio. Recall from Figure 5.1 that the overall process involves analog-digital conversion for storage and subsequent digital-analog conversion for playback. The implication is that the signal during playback will have only as much detail during D-A conversion as the amount of detail used in recording via A-D conversion. Just as you can never extract color out of a black and white film (you can only artificially “colorize” it), you can never get any frequency or intensity information out of a recorded audio signal that was not captured in the initial A-D conversion.
In Figure 6.1, a 100 Hz analog sine wave is shown in red; the analog x and y axis values are also in red, at the top and at the right of the figure. The equivalent digital values are shown as blue x and y axis values at the bottom and the left of the figure. One period of a 100 Hz waveform takes 0.01 seconds to complete; the variation in intensity of the analog voltage is shown as ranging over ±1. The blue lines represent equally-incremented sequential values of the analog waveform, resulting from sampling the waveform at a rate of 2000 Hz. Twenty “snapshots” (sample values) of the analog signal’s intensity are measured during .01 second, which are (more than) sufficient for the digital audio system to accurately reconstruct the waveform upon subsequent D-A conversion.
FIGURE 6.1. An analog waveform (red) and its digital representation (blue).
In Figure 6.1, the signal is quantized into the range of numbers available from a signed-integer, 16-bit representation. This means we can represent the upper peak voltage value of the analog sine wave at 1.0 volts with the largest signed integer, 32767, and the lower extreme with –32767 (or –32768 in 2's complement encoding). Variations within the voltage range of ±1 map to proportional digital values in the range ±32767. For instance, if an analog voltage amplitude at a single moment in time was 0.25, then the nearest integer digital value would be 8192 (0.25 × 32767 = 8191.75, which rounds to 8192). Figure 6.2 below shows the stream of numbers that result. Note the similarity between the symmetrical pattern of the numerical representation and the graphical representation of the waveform.
FIGURE 6.2. Sampled version of the sine wave shown in Figure 6.1. At a sampling rate of 2,000 Hz, 20 samples per cycle of a 100-Hz sine wave would be obtained, corresponding to the blue values on the y axis of Figure 6.1.
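The sampling-and-quantization process shown in Figures 6.1 and 6.2 can be sketched in a few lines of Python. This is a hypothetical illustration; the function and constant names are ours, not part of any audio library.

```python
import math

SAMPLE_RATE = 2000   # Hz, as in Figure 6.1
FREQUENCY = 100      # Hz sine wave
MAX_INT = 32767      # largest signed 16-bit integer

def sample_and_quantize(freq, rate, n_samples):
    """Sample a +/-1 volt sine wave and quantize each value to 16 bits."""
    samples = []
    for n in range(n_samples):
        voltage = math.sin(2 * math.pi * freq * n / rate)  # analog value
        samples.append(round(voltage * MAX_INT))           # quantized integer
    return samples

# Twenty samples cover one 0.01-second period of the 100 Hz wave,
# reproducing the symmetrical number stream of Figure 6.2.
one_cycle = sample_and_quantize(FREQUENCY, SAMPLE_RATE, 20)
```

The peak analog value of 1.0 volt maps to 32767 and the trough to –32767, exactly as described above.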
The following examples will let you quickly hear the effect of different sampling rates:
A bit is a binary integer that can have the value of 0 or 1; the range of numerical values is 2 to the power of the number of bits. Now consider a hypothetical 2-bit A-D converter, which would have a quantization range of (2^2 =) 4 values. Each sampled analog voltage value is assigned one of the 4 different combinations possible with two bits: 00 01 10 11. If we have a voltage range of ±1 as in Figure 6.1, the relationships between the voltages and their quantized values shown in Figure 6.3 would apply:
FIGURE 6.3. 2-bit quantization.
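As a sketch of how such a coarse converter might assign codes, the following divides the ±1 volt range into four equal bins. This is an assumed mapping for illustration; the exact bin edges shown in Figure 6.3 may differ.

```python
def quantize_2bit(voltage):
    """Map a voltage in [-1, 1] to one of the four 2-bit codes 0..3.

    The range is split into four equal bins (an assumed mapping;
    real converters differ in how they place the bin edges).
    """
    code = int((voltage + 1.0) / 2.0 * 4)  # scale [-1, 1] onto [0, 4)
    return min(code, 3)                    # clamp +1.0 into the top bin

# Each of the four codes corresponds to one two-bit pattern: 00 01 10 11.
codes = [quantize_2bit(v) for v in (-1.0, -0.4, 0.4, 1.0)]
```

Anything between two bin edges collapses to the same code, which is exactly the loss of “maybe” described in the questionnaire analogy.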
With an 8-bit system, there are (2^8 =) 256 possible values; see Figure 6.5 below. Note that the voltage range assigned to each value is smaller, since there is a larger range of numbers that can be assigned. This is adequate for representing some sounds, such as speech or electronic sounds in a game, but no more than adequate; 16-bit quantization is vastly superior and is the standard for “CD quality sound.”
FIGURE 6.4. Decreasing the quantization of a visual image: 8-, 4- and 2-bit color.
FIGURE 6.5. 8-bit quantization.
FIGURE 6.6. Disk storage requirements for one second of linear (uncompressed) audio.
The trade-off between disk storage and sound quality needs to be considered for any large project; soundfiles have a way of taking over most of your available hard disk space. Note that a minute of stereo "CD quality" sound (at 16-bit quantization, 44.1 kHz sample rate) requires about 10.1 megabytes of storage. Some software also keeps a backup copy of the soundfile being edited so that one can “undo” edits, thereby occupying additional storage space.
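The arithmetic behind such a storage table is simple enough to sketch (the function name here is ours, chosen for illustration):

```python
def storage_bytes(seconds, sample_rate, bits, channels):
    """Bytes of disk space needed for linear (uncompressed) audio."""
    return seconds * sample_rate * (bits // 8) * channels

# One minute of stereo "CD quality" sound: 16-bit words, 44.1 kHz, 2 channels.
cd_minute = storage_bytes(60, 44100, 16, 2)
# 10,584,000 bytes, or about 10.1 megabytes (counting 2^20 bytes per megabyte)
```

Halving the sample rate, the word size, or the channel count each halves the storage requirement, which is why the lower-quality formats in the table are attractive for large projects.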
FIGURE 6.7. A sine wave, viewed by zooming in on the soundfile waveform display.
FIGURE 6.8. The same sine wave as in Figure 6.7, zoomed out.
FIGURE 6.9. RCA pair connectors.
FIGURE 6.10. Software
Once a test recording has been made, the waveform can be checked to see whether any signal is present and whether the peak value has been exceeded. It will be obvious when some sort of waveform is present. If nothing seems visible, and you’ve checked your connections, try zooming in on the y axis on a portion of the waveform where the sound source was silent, or amplifying a portion of the silent section. Most systems will have some residual noise present if the microphone or other device is actually hooked up (see Figure 6.11).
FIGURE 6.11. Zooming in by amplifying a selected section of a waveform, to determine if system noise is reaching the input. This illustration indicates that some sort of noise is reaching the A-D converter from an external source. If amplification had resulted in no change, one could conclude that no signal was reaching the A-D converter.
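One way to make this check quantitative is to compute the rms level of a nominally silent region. The sketch below assumes the samples are available as a plain list of integers:

```python
import math

def rms(samples):
    """Root-mean-square level of a list of sample values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# A truly dead input yields an rms of exactly zero; low-level system
# noise shows up as a small but nonzero rms in a "silent" region.
dead_input = [0, 0, 0, 0]
noisy_silence = [3, -2, 4, -3]   # hypothetical residual noise samples
```

A nonzero rms in the silent section tells you the signal chain is live, just as amplifying the selection does visually.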
Waveform clipping was introduced previously in Chapter 2 (see Figure 2.5) in the discussion of dynamic range. The dynamic range of the recording system must be accommodated via proper setting of the input level, so that the dynamic range of the input signal, including its peak values, does not exceed the range of the recording device. As shown previously in Chapter 2 (Figure 2.4), the challenge of recording and playback is to accommodate mismatched dynamic ranges from one medium to another. Practically speaking, an analog or digital compressor can be used to narrow the dynamic range of a sound source; compression is discussed in detail in Chapter 7.
FIGURE 6.12. The flat top of a signal peak, indicative of a clipped waveform in a non-normalized soundfile.
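A simple way to check a soundfile for clipping is to count samples pinned at the converter's full-scale value. The data below is hypothetical, chosen only to illustrate the idea:

```python
def count_clipped(samples, peak=32767):
    """Count samples pinned at the converter's full-scale value.

    A run of consecutive samples at +/-peak produces the flat-topped
    waveform shown in Figure 6.12.
    """
    return sum(1 for s in samples if abs(s) >= peak)

# Hypothetical snippet of a 16-bit recording driven past full scale:
recorded = [12000, 30000, 32767, 32767, 32767, 28000, 9000]
```

A nonzero count means the input level was set too high and the take should be re-recorded at a lower level.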
Cropping the waveform involves isolating the usable portion of the recording. There should be no “dead space” at the beginning or ending of the soundfile. First, zoom out and mouse over what visually appears to be the usable portion of the waveform, and then listen to the result (see Figure 6.13). Once the region has been isolated, the idea is to not cut off any of the first sound’s initial attack, nor any of the last sound’s decay.
Click here for an example of a speech soundfile that has been cropped too narrowly, trimming off some of the attack;
Click here for an example of a soundfile that has been cropped just at the start of the speech; and
Click here for an example of a soundfile that has been cropped so that the sound of inhaling before the word is preserved. This may or may not be desirable for the final recording.
The “fine tuning” of the beginning and ending of the moused-over section can be accomplished by zooming in, and then adjusting the selection slightly, listening as you proceed. Some software packages have a trim or crop function that deletes any part of the sound file that hasn’t been selected. Otherwise, two separate steps are needed to delete the dead spaces; mousing over those sections, and then using the delete key (or its equivalent).
FIGURE 6.13. Cropping the waveform by isolating the usable portion of the recording.
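The two-step deletion described above can be sketched as a single trim function. The threshold value here is a hypothetical noise floor, chosen for illustration:

```python
def crop(samples, threshold=50):
    """Trim leading and trailing 'dead space' below a small threshold.

    threshold is a hypothetical noise floor in sample units; values at
    or below it are treated as silence.
    """
    start = 0
    while start < len(samples) and abs(samples[start]) <= threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return samples[start:end]

# Dead space on both ends of a short recording is removed, but
# low-level values inside the usable portion are kept intact.
trimmed = crop([0, 2, -1, 800, -650, 30, 900, 1, 0])
```

A real trim or crop command works the same way, but with the start and end points chosen by ear and eye rather than by a fixed threshold, so that no attack or decay is cut off.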
Note that using normalization on a set of soundfiles doesn’t mean that they will have the same loudness. This is because the peak intensity values of two soundfiles may be the same while their rms values are widely different. If a soundfile has an intermittent transient at peak intensity (e.g., caused by a momentary short in the input connection), the normalization will be based on this spurious value. Normalization is also not a substitute for getting as much level as possible in the initial recording, since the process amplifies both the signal and the noise in the soundfile.
Click here to listen to a normalized recording of a soundfile that was made with a relatively “hot” input level; and click here to listen to a normalized recording of a soundfile that was made with an inadequate input level. Note that this second example contains much more noise in the recording.
In many cases, it is best to normalize at a later stage of production. For instance, if you have 10 narration soundfiles and you want them to be more or less equally loud, it’s sometimes better to 1) balance out the non-normalized files for equal loudness by ear, using a software gain adjustment, 2) find the file with the highest peak value and normalize that file only, 3) determine the amount of gain used to normalize that file, and then 4) apply that gain level to all of the rest of the files.
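The four-step workflow above can be sketched as follows, with the soundfiles represented as hypothetical lists of 16-bit sample values:

```python
def normalization_gain(samples, peak=32767):
    """Gain factor that would raise the file's loudest sample to full scale."""
    loudest = max(abs(s) for s in samples)
    return peak / loudest

def apply_gain(samples, gain):
    return [round(s * gain) for s in samples]

# Hypothetical narration files, already balanced by ear (step 1):
files = {"take1": [100, -2000, 1500], "take2": [16000, -9000, 4000]}

# Steps 2 and 3: find the file with the highest peak and compute the
# gain that normalizes that file only.
loudest_file = max(files, key=lambda name: max(abs(s) for s in files[name]))
gain = normalization_gain(files[loudest_file])

# Step 4: apply that same gain to every file, preserving their balance.
normalized = {name: apply_gain(s, gain) for name, s in files.items()}
```

Because a single gain factor is shared, the loudness balance established by ear in step 1 survives, while the loudest file just reaches full scale.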
Figure 6.14 shows a waveform display of the spoken word "Lovesickness". Click here to listen to the sound. By editing within the middle of the word, we can transform the sound into the word "sickness."
What we will attempt to do is to edit between the point of the "v" sound in love and the "s" sound that begins "sickness." The first step will be to mouse over various sections of the waveform and then play them back. In Figure 6.14, we have written the approximate locations of the different sounds. Note that the pattern of the waveform seems to change with each particular vowel sound.
Now say the word "love" slowly and observe what happens with your mouth and tongue as you say the word. The "l" part involves touching the tongue to the back of the teeth, and the "o" occurs in the release (please click here to listen to this). This particular phoneme is quite distinct from the "v" part of the sound, where the upper teeth are brought into contact with the bottom lip (please click here to listen to this). Compare this to what happens when you say the word "lumber" slowly; the "lum" part begins with the same "lu" sound, but the lips are brought together rather than making contact with the teeth. Now look again at the waveform in Figure 6.15, where we have zoomed in on the transition point. You can see that the waveform looks similar in the "lu" and "v" sections. This is because when you say the word "lovesickness" you spend less time on the "v" sound, instead proceeding directly onto the "ss" sound. But it is very obvious where the "s" sound begins, since it is more noise-like.
This is obviously the region at which to make our splice. But to determine an exact location, the best technique is to find the zero crossing point of the waveform. Figure 6.16 illustrates this in detail. Splicing at a zero crossing point avoids clicks, and allows the end of one waveform to merge cleanly with the beginning of another whose start point has also been cut at a zero crossing point.
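Finding zero crossing points can be sketched as a search for sign changes between adjacent samples (a hypothetical helper, not a function from any particular editing package):

```python
def zero_crossings(samples):
    """Indices where the waveform crosses zero (sign change between samples).

    Splicing at these points avoids the clicks caused by a sudden
    jump in level at the edit point.
    """
    crossings = []
    for i in range(1, len(samples)):
        if samples[i - 1] < 0 <= samples[i] or samples[i - 1] >= 0 > samples[i]:
            crossings.append(i)
    return crossings

# A short waveform that swings positive, then negative, then positive,
# crossing zero twice along the way:
points = zero_crossings([5, 3, -2, -6, -1, 4, 7])
```

Many editors offer a "snap to zero crossing" option that does exactly this adjustment when you set a selection boundary.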
FIGURE 6.14. The waveform “Lovesickness”, zoomed out. Click here to listen to it.
FIGURE 6.15. The waveform “Lovesickness”, zoomed in.
FIGURE 6.16. Editing waveforms at the zero crossing point avoids clicks and allows the end of one waveform to be more easily spliced to the start of another.
Every computer hardware and software system will have peculiarities unique to that system, and most provide adequate information within their manuals. Be sure to practice editing and recording before jumping right into a major project, using both eyes and ears, so as to make a connection between the visual display and the actual sound.