Steve Aukstakalnis and David Blatner
Virtual Sound
From "Silicon Mirage - The Art and Science of Virtual Reality", by Steve Aukstakalnis and David Blatner, 1992, Peachpit Press Inc., Berkeley, USA

This chapter, taken from one of the most widely read introductory handbooks on virtual reality, offers a brisk overview of the state of the art in digitally generated three-dimensional sound, presenting the main laboratories engaged in the research, the major problems still to be solved, the possible applications, and the progress made toward creating a virtual sound space.

Computer-Generated Sound

While over the past thirty years a host of computer graphics experts have been pushing the boundaries of what computers can display, audio experts have been stretching computers in order to create more realistic (and surrealistic) sounds. In fact, if you listen to popular music radio stations, you may have heard their progress without even knowing it. Synthesizers and audio recording are now so advanced that we can create, edit, and play back sound through an almost entirely digital process. And digital means perfect.

Perfect sound means that you cannot tell the difference between the original and the copy; nor the difference between the copy and a computer-synthesized version. In this section, we'll explore how we can capture and create sound, and how computers are opening a whole new realm of three-dimensional sound. Then we'll look at the application of these sound fields in virtual realities.

Stereo Sound

A channel of sound carries information for one speaker. For example, each speaker in your home sound system or in a pair of headphones receives one channel of sound from the amplifier. In a monaural (mono) system, like that on an old record or from a television program, there is only one channel of sound; it can be sent to two or more speakers, but the speakers all get the same information. Stereo sound, on the other hand, carries two channels of sound: the left channel and the right channel.

Imagine that you're wearing a pair of headphones and listening to a recorded tape of someone speaking. If the majority of the sound is coming from the left side, the voice appears to be on your left (see "Interaural intensity differences" earlier in this chapter). If the two channels' volume levels switch, the voice appears to be on your right. If the sound is coming equally from both speakers, the sound appears to be directly in the middle of your head. Note that if sound is only coming from one side or the other, the brain no longer localizes the sound as coming from the left, right, or "inside."
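To make the left/right volume balance concrete, here is a minimal sketch of how a single mono signal might be split into two channels. The function names and the constant-power pan law are our own illustrative assumptions; the text above only says that the relative loudness of the two channels shifts the apparent position of the voice.

```python
import math

def pan_gains(pan):
    """Return (left_gain, right_gain) for a pan position in [-1.0, 1.0].

    -1.0 is fully left, 0.0 is centered, 1.0 is fully right.
    Assumption: a constant-power pan law, chosen only for illustration.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0 .. pi/2
    return math.cos(angle), math.sin(angle)

def pan_mono(samples, pan):
    """Turn a list of mono samples into (left, right) stereo pairs."""
    left_gain, right_gain = pan_gains(pan)
    return [(s * left_gain, s * right_gain) for s in samples]

# A voice panned mostly to the left: the left channel is louder.
stereo = pan_mono([0.5, -0.25, 0.1], pan=-0.6)
```

Swapping the sign of `pan` swaps the two gains, which corresponds to the voice jumping from your left to your right.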

3-D Sound

With basic recording and playback techniques, such as most stereo sound, any sense of sound localization is restricted to directly to the left, directly to the right, and somewhere in between. This "somewhere in between" may sound like it's coming from all around; however, when discussing three-dimensional sound localization, there's very little difference between sound appearing inside the head and sound appearing everywhere at once.

How, then, can we create a sound that appears from a particular place in our environment? How can we create a sound that appears to be behind us or above us or anywhere else in three-dimensional space? This problem is much like creating a three-dimensional visual effect even though we only have two eyes. All the subtle cues described in "Sound Localization" earlier in this chapter, such as interaural intensity differences and acoustic shadow, must be used to carefully craft the two sound channels that reach our ears.

Note that we've casually been using the word "appear" when talking about sound. Even though we don't see sound, VR researchers commonly speak as though we do. For example, sound is "displayed" with audio displays. We don't know where this tendency to use visual metaphors for aural experiences started, but it was long before virtual reality. In fact, you're probably familiar with comments such as, "Do you see what I'm saying?" and "Watch what happens when I turn the volume up."

Recording Sound in 3-D

Fred Wightman and Doris Kistler have conducted extensive testing in displaying three-dimensional sound at the University of Wisconsin. A typical experiment goes as follows. A subject is seated in an anechoic (nonechoing) chamber. Next, small probe microphones are placed deep within each of the subject's ears, close to the eardrum. A tone is then played through one of 144 speakers positioned around the person's head, and the sound is recorded through the microphones (see Figure 4-10). This recording captures the tone after it has been affected by the head, the pinna, and the auditory canal. Another tone is played and recorded, and then another, and so on.

The second stage of the experiment is removing the microphones and playing the recorded tones back to the subject through a pair of headphones. As if by magic, the tones actually appear as though they are coming from the three-dimensional environment around the subject! Note that when a subject listens to a recording made from their own ears, they have no problem localizing the sound. However, when a recording made from one person's ears is played back for a different person, the listener experiences a large number of inaccuracies in localization. Once again, this shows that our pinnae and how we hear three-dimensional sound are truly as personal as our fingerprints.

Of course, recording and playing back three-dimensional sound isn't limited to simple tones. Take, for example, the demonstration that Chris Currell, a leader in computer-generated sound and the founder of Virtual Audio Systems, conducts. He leads someone into a room and seats them in a chair, placing a pair of high-quality headphones on their head and a blindfold over their eyes. He then proceeds to walk around the room, telling the person about the system, how it works, and how the demo will proceed. All of a sudden, the door opens and four or five people loudly come into the room, interrupting the demonstration. Obviously the demo can't continue, so Chris tells the person to take the blindfold and headphones off. And, much to the subject's surprise, the room is empty. In fact, the room had been empty ever since the blindfold had been put on, and the entire discourse and interruption had been recorded.

While this recording/playback process sounds like the answer to the dream of three-dimensional virtual sound, it really isn't. Just as movies (even 3-D movies) aren't really three-dimensional and interactive, a recorded audio tape can never truly be interactive either. In reality, when you turn around, sounds that were behind you are now in front of you. But when listening to a prerecorded audio tape with headphones, sounds that are "behind" you always sound behind you, no matter where you turn or look.

Virtual Sound

The trick to creating a truly three-dimensional, interactive sound space is, of course, to use computers to generate sounds in real time rather than rely on sounds that are prerecorded. Most of us rely so heavily on our visual system that it's hard to even consider what a sound space is. However, imagine a room with a table and couch in it. On the table is a small radio, tuned to your favorite station. Wherever you are in the room, some of the sound reaches you directly from the radio's speaker and some of the sound reaches you after rippling through the air, bouncing off the walls and the couch.

In a virtual reality, you must be able to move anywhere in that room and maintain the sense that the music is coming from the radio on the table. For example, if you turn away from the radio, the sound should be behind you, and so on. To do this, the computer must use a combination of position/orientation tracking and exquisitely difficult math.

Creating an Audio Earprint

We learned earlier that the curves and folds of the pinna are as unique as fingerprints. As acoustic energy reaches our ears, the pinnae color it, helping us to localize the sound in our environment. Researchers, including Klaus Genuit and Hans Gierlich, of Germany, and the aforementioned Fred Wightman and Doris Kistler, have created mathematical models that represent the various sound modifications we rely on to hear sound in three dimensions. These models, which we think of as audio "earprints," are called head-related transfer functions (HRTFs).

Researchers can feed these head-related transfer functions, which are developed using techniques similar to the 144-speaker sphere described above, into a computer in the form of mathematical equations. The computer then acts as a filter: digitized sounds are generated by the computer or come in from another source, they get filtered using the appropriate HRTFs, and then are sent on to the headphones or speakers.
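The filtering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: we represent each ear's HRTF in its time-domain form (usually called a head-related impulse response, or HRIR), the three-sample responses below are invented toy values rather than measured data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono signal into a (left, right) pair of ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Toy impulse responses (assumed values, not measurements): the right ear
# hears the sound two samples later and at half the level, as it might
# for a source off to the listener's left.
hrir_left = [1.0]
hrir_right = [0.0, 0.0, 0.5]

left, right = spatialize([1.0, 2.0], hrir_left, hrir_right)
# left  == [1.0, 2.0]
# right == [0.0, 0.0, 0.5, 1.0]
```

A real-time system has to do this for every block of incoming samples and with much longer impulse responses, which is why dedicated signal-processing hardware was needed.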

Convolvotron. Some of the most substantial work in this area has been conducted by NASA's Ames Research Center in collaboration with Scott Foster's Crystal River Engineering, of Groveland, California. The result of their research is a set of computer plug-in boards called the Convolvotron. The Convolvotron is an extremely powerful audio digital signal processor (DSP) that changes (convolves) an analog sound source using the HRTF to create a three-dimensional sound effect.

Sound that is computer synthesized or from an external source (like a compact disk) can be filtered through the Convolvotron and placed in space around the listener. For example, you could place a virtual wailing saxophone in one corner of a room and a drum in another part of the room, and then move them around. The three-dimensional audio cues are generated by the Convolvotron's filters and controlled by computer software.

Note that only 144 sound positions are measured in the audio sphere experiment. While it would appear that the Convolvotron could only create sounds in those same 144 positions, in fact any number of positions can be synthesized by interpolating values between each of the positions using linear weighting functions. In this way, the sound-space resolution is significantly higher than 144 positions in 360-degree space, and sound that moves from one place to another moves smoothly rather than jerking from one spot in space to another.
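As a rough illustration of linear weighting, the sketch below blends two measured impulse responses to approximate an unmeasured direction between them. It is a one-dimensional simplification that we are assuming for clarity (interpolating along azimuth only, between two neighbors); the text does not describe the exact scheme the Convolvotron uses, and the function name and sample values are hypothetical.

```python
def interpolate_hrir(azimuth, az_a, hrir_a, az_b, hrir_b):
    """Linearly blend two equal-length impulse responses.

    az_a and az_b are the measured azimuths (in degrees) on either side
    of the requested azimuth; hrir_a and hrir_b are their responses.
    """
    if az_a == az_b:
        raise ValueError("the two measured azimuths must differ")
    if len(hrir_a) != len(hrir_b):
        raise ValueError("impulse responses must have the same length")
    weight = (azimuth - az_a) / (az_b - az_a)
    if not 0.0 <= weight <= 1.0:
        raise ValueError("azimuth must lie between the measured positions")
    return [(1.0 - weight) * a + weight * b for a, b in zip(hrir_a, hrir_b)]

# Halfway between measurements at 0 and 30 degrees (toy values):
blended = interpolate_hrir(15.0, 0.0, [1.0, 0.0], 30.0, [0.0, 1.0])
# blended == [0.5, 0.5]
```

Because the weight changes continuously with the requested angle, a moving source glides between measured positions instead of snapping from one to the next.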

The Convolvotron can simulate up to four sound sources, either moving or static. A Polhemus magnetic positioning system (see "Position/Orientation Tracking" in Chapter 2, Virtual Immersion) tracks the user's movement, and information is passed to the computer to adjust the sound appropriately. The effect is that the three-dimensional audio environment is held stable as the user moves within it, and the combination of the three-dimensional sound and the position/orientation tracking creates an extremely realistic environment for the listener.
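The bookkeeping that keeps a sound source fixed in the room as the listener turns can be sketched as follows. This is a simplified, assumed model: two dimensions only, with the head's yaw as the sole orientation value, and the function name and coordinate conventions are ours rather than anything taken from the Polhemus or Convolvotron systems.

```python
import math

def relative_azimuth(listener_xy, listener_yaw_deg, source_xy):
    """Angle of the source relative to where the listener is facing.

    Assumed conventions: yaw is measured counterclockwise from the +x
    axis; the result is in degrees in the range [-180, 180), where 0
    means straight ahead and positive values are to the listener's left.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    world_bearing = math.degrees(math.atan2(dy, dx))
    relative = world_bearing - listener_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0

# A radio two meters in front of a listener who faces along +x.
radio = (2.0, 0.0)
ahead = relative_azimuth((0.0, 0.0), 0.0, radio)     # straight ahead
behind = relative_azimuth((0.0, 0.0), 180.0, radio)  # listener turned around
```

Each time the tracker reports a new position or orientation, the relative angle is recomputed and used to choose the filters for that source, so the radio stays put even though the listener's head does not.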

Virtual Audio Processing System. Virtual Audio Systems is creating a system for what will probably be the first mass-market application of three-dimensional sound: entertainment. Its Virtual Audio Processing System (VAPS) mixes the worlds of noninteractive binaural recording and Convolvotron-like signal processing to generate both live and recorded three-dimensional sound fields. Its literature states that "VAPS is used for recording music, sound effects, or dialog for a stereo format such as compact disk, VCR tape, video disk or broadcasting."

The concept is that in the not-so-distant future we will be able to listen to music and watch movies that include three-dimensional audio effects. A plane flying overhead will really sound as if it's overhead; an MTV guitarist might walk offscreen behind your couch; or a symphony might sound as if you were right onstage, near the violin section one moment and near the timpani the next. One of the amazing things about three-dimensional sound is that you can use the cocktail party effect in a way that you can't with simple stereo sound. As we mentioned earlier, if you record a party from the middle of a crowded room and try to listen to it later, all you'll hear is a barrage of sound; you can't pull conversations out of the crowd, the way you could if you were actually there. However, when the sound you're listening to is recorded or processed using a system like VAPS, you can actually direct your attention to a single conversation, just as if you were actually among a group of people. With a little imagination, you can see how the film and music business might be clamoring to move in this direction.

Once again, note that these recorded three-dimensional sounds are noninteractive. If you get up out of your chair while listening to a 3-D CD through headphones, the saxophone player you heard behind you remains behind, even if you turn to "face" him. The audio environment, although 3-D, does not remain stable vis-a-vis your position in it. However, with position tracking and real-time computer-generated sound, stable virtual sound spaces could be possible.

Virtual Audio Systems president Chris Currell claims that he is able to take a step further in the three-dimensional audio process and do away with headphones in some situations. By using transaural cross-talk cancellation techniques, his system (as well as some others, such as those developed by Roland Corporation) can stop right-channel information coming out of an ordinary stereo speaker from entering the left ear and vice versa. The more speakers used, the better the three-dimensional effect; but even two speakers will work. Nonetheless, headphones ultimately produce the best sound because they actually hold down the pinna and play the sound directly into the ear, avoiding double-coloration effects.

Complex Acoustic Environments

In most of the above experiments, binaural recordings made with the probe microphones were created within anechoic rooms. But when was the last time you were in an anechoic room? Researchers realized that if the goal was to create realistic three-dimensional sound, then they must study and recreate the complex acoustic cues found in reverberant environments. In other words, if you want to create sounds like those heard inside a room, the subtle echoes that the sound waves make must be taken into account.

Researchers have taken two paths to reach this effect. The first is to create head-related transfer functions through a similar probe-microphone recording process. The HRTFs that result are considerably more complex than the simple anechoic versions and carry with them proportionally larger computational requirements. Nonetheless, when you throw enough computing power at them, even these equations can be solved fast enough for real-time simulation.

Recently, a second method has come into use, similar to the three-dimensional computer-graphics method of ray tracing. Ray tracing is a process of tracing the path of light as it bounces off reflective surfaces. In ray tracing graphic images, each pixel in the image is traced to a corresponding point on an object and is then reflected to a light source, another reflective object, or off into space. A similar process is used in creating virtual sound.

In ray tracing sound, extra sound sources are placed behind reflective surfaces (like walls, ceilings, and so on). The computer then figures out how loud these reflective sound sources should be at any given moment. The result is that sound appears to come from the reflector surface.
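Placing an extra source behind a wall can be sketched as mirroring the real source across that wall and then treating the mirror image as an ordinary, quieter, slightly delayed source. Everything specific below is an assumption made for illustration: a two-dimensional room, a single flat wall at a fixed x position, amplitude falling off as one over distance, and a made-up absorption figure. The function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second, roughly, in room-temperature air

def mirror_across_wall(source, wall_x):
    """Mirror a 2-D source position across a vertical wall at x = wall_x."""
    x, y = source
    return (2.0 * wall_x - x, y)

def reflection(source, listener, wall_x, absorption=0.3):
    """Return (image_position, delay_seconds, gain) for one wall bounce.

    absorption is the assumed fraction of amplitude lost at the wall.
    """
    image = mirror_across_wall(source, wall_x)
    distance = math.dist(image, listener)
    delay = distance / SPEED_OF_SOUND
    gain = (1.0 - absorption) / max(distance, 1e-9)
    return image, delay, gain

# Radio at x = 1 m, listener at x = 2 m, wall at x = 5 m.
image, delay, gain = reflection((1.0, 0.0), (2.0, 0.0), wall_x=5.0)
# The image source sits at (9.0, 0.0), seven meters from the listener.
```

Moving the wall only changes `wall_x`, which moves the image source and so changes the echo's delay and loudness. Repeating this for every wall gives a first approximation of how a room sounds.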

Both visual and aural ray tracing are extremely computation intensive. So as computers get more powerful and the algorithms for creating these three-dimensional illusions are refined, our ability to create dynamic three-dimensional environments will increase significantly. For example, researchers like Crystal River Engineering's Scott Foster can now create rooms and move walls around, giving the impression that the ceiling is lowering or that the room is getting longer, and so on. In the future, complex three-dimensional models, such as an entire office building or a busy street, will be modeled along with the extremely subtle yet powerful sound cues that accompany them.

The Move Toward Sound

Five years ago, Foster thought it would be at least ten years before he'd be able to create interactive virtual sound so convincing as to be indistinguishable from reality. Three years later, the technology had moved so fast that he cut several years off his projection. Now researchers are finding that they can fool some of the people some of the time, and the prognosis for fooling the rest of us in the near future is good.

The development of three-dimensional interactive sound is farther along than its visual counterpart. Not only has the ability to create realistic (and hyperrealistic) sounds come a long way, but so has their inclusion in common technology. Personal computer manufacturers are recognizing the importance of sound and are beginning to incorporate it in their core operating systems. As time passes, there is little doubt that computer-generated sound will become as integral to computer technology as voice and music became to film almost seventy years ago.