Capturing Audio Invisibly, Using Sound to Guide the Visual Experience, and Dealing with Multiple Delivery Formats

Margarita Mix, a FotoKem company, has added 360-degree sound-mixing capabilities to its slate of audio post services in Santa Monica and Hollywood. What does that mean? Right now, it means bringing a lot of know-how to bear on an emerging post-production challenge: mixing sound for a spherical environment that can only be properly experienced while wearing a head-mounted display (HMD). Different audio formats and varying creative approaches to the VR experience multiply the possibilities. We asked Senior Technical Engineer Pat Stoltz, who recently worked with director Art Haynie of Big Monkey Films on an Eagles of Death Metal concert film mixed in the Dolby Atmos VR audio format, about emerging best practices in VR audio workflow.

(To hear from director Art Haynie on some of the decisions he made when shooting and mixing the concert, read our companion article.) 

Think Carefully About Audio Capture. It's easy to forget that shooting 360-degree experiences means you can no longer hide your crew and equipment just outside the frame. In 360-degree video, there is no frame. "You can't have a boom operator there in the room because he's going to be seen," Stoltz says. "Everything is seen in VR. So you have to hide the cables, and you have to hide the microphones, and you have to hide the people who are on set directing and recording. If the audio is left to be recorded by the on-board microphones on the cameras, 90 percent of the time that will be completely unusable. So there needs to be greater attention to putting wireless microphones on the actors and acquiring a good-quality recording on set. You can't have extra cameras everywhere, cable runs have to be hidden or wireless, and any people supporting the production, from the scriptwriter to the sound mixer, have to be hidden from view."

Audio Cues Can Be as Important as Visual Cues. "Audio has to be thought of during the scripting of the piece that you're working on," Stoltz advises. "You have to use it as part of the whole scene. You're going to want to draw the gaze of the viewer to particular areas of the 360-degree environment. How do you do that? You can do it through visual stepping stones that take you over to a specific area, or you can do it via audio cues. You hear something behind you, so you turn to see what you just heard, and then the visual cue kicks in. So you have to think about audio as a very creative aspect of your production."
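To make that idea concrete, here is a rough sketch of what "placing" a cue behind the viewer looks like mathematically: a mono sound panned into first-order Ambisonic B-format using the standard encoding equations. The function and its parameters are illustrative assumptions for this article, not Margarita Mix's tooling or any particular product's API.

```python
# A mono "look behind you" cue panned into first-order B-format
# (FuMa channel order: W, X, Y, Z). Illustrative sketch only.
import numpy as np

def encode_fuma(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to a direction in the 360-degree sound field.

    azimuth_deg:   0 = front, 90 = left, 180 = directly behind
    elevation_deg: 0 = ear level, 90 = straight up
    Returns a (4, n) array of W, X, Y, Z channels.
    """
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono / np.sqrt(2.0)              # FuMa pads W by -3 dB
    x = mono * np.cos(az) * np.cos(el)   # front/back axis
    y = mono * np.sin(az) * np.cos(el)   # left/right axis
    z = mono * np.sin(el)                # up/down axis
    return np.stack([w, x, y, z])

# One second of a 1 kHz cue placed directly behind the viewer:
sr = 48000
t = np.arange(sr) / sr
cue = 0.5 * np.sin(2 * np.pi * 1000 * t)
bformat = encode_fuma(cue, azimuth_deg=180, elevation_deg=0)
```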

[Image: Margarita Mix VR overlay]

Audio Workflow for VR Is Evolving. Stoltz says Margarita Mix works primarily with an equirectangular picture file, which is a flattened representation of a 360-degree video. That is projected on screen with a superimposed grid whose reference markers indicate where the 0-, 90-, and 180-degree points fall in the spherical space. "You'll place your objects in the mix with that in mind," Stoltz says. "Unfortunately, you can't mix while wearing an HMD because you can't see what you're doing. So you place things in software based on quadrants [of the sphere], or if you have an actor there, you can just place their dialog directly over the performer. And once you get that mixed in and the levels all set, you go to a different computer that's locked to Pro Tools or whatever DAW you're using, put on the goggles, and view it and dial it in there. It's a cumbersome, back-and-forth workflow, but it's coming along." Stoltz imagines a future where mixers wear augmented-reality goggles, like Microsoft's still-in-development HoloLens, that would allow them to see the spherical content and their mixing console simultaneously.
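The grid Stoltz describes follows directly from the equirectangular projection: azimuth maps linearly across the image's width, elevation across its height. Here is a small sketch of that mapping; the function name, conventions, and frame size are assumptions for illustration, not the studio's actual software.

```python
# Mapping a direction on the sphere to a pixel on the flattened
# equirectangular frame -- the geometry behind the reference grid.
def sphere_to_equirect(azimuth_deg, elevation_deg, width, height):
    """azimuth_deg in -180..180 (0 = center of frame, the forward gaze);
    elevation_deg in -90..90 (0 = the horizon line)."""
    x = (azimuth_deg + 180.0) / 360.0 * width    # azimuth spans the width
    y = (90.0 - elevation_deg) / 180.0 * height  # elevation spans the height
    return int(x) % width, min(int(y), height - 1)

# On a 3840x1920 frame, 0 degrees lands mid-frame, +/-90 on the quarter
# lines, and 180 wraps around to the frame's edges:
for az in (0, 90, 180, -90):
    print(az, sphere_to_equirect(az, 0, 3840, 1920))
```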

VR Deliverables Are … Complicated. The Ambisonic B-format, a representation of a 360-degree sound field, can be delivered in one of two streaming standards: FuMa or AmbiX. (The most visible difference between the two is the ordering of the mix's four channels: AmbiX uses WYZX while FuMa uses WXYZ. The two also normalize the W channel differently.) Earlier this year, Facebook acquired Two Big Ears, whose proprietary .tbe file requires an embedded decoder for audio metadata on the playback computer or device, as does Dolby Atmos VR. And a new format from the German government-backed Fraunhofer Institute, Cingo, is on the horizon. The multiplicity of standards is not ideal, but Stoltz hopes delivery will get simpler. "For Jaunt, you're going to need Dolby Atmos. For Facebook, you're going to need a .tbe file. And for YouTube or Vimeo and other streaming services, you're going to need a FuMa or AmbiX file," he explains. "Currently, you have to wrap those individually [in separate MP4 containers]. For the future, I'm hoping you can wrap all four of those into that original MP4 and deliver that, and whatever software you have will automatically read the correct file. But right now we have to specifically wrap for whatever format we're mixing for."
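Since the FuMa/AmbiX difference comes up in every delivery, here is a minimal sketch of the first-order conversion: reorder the channels and compensate for FuMa's -3 dB pad on W. This illustrates the published channel conventions and is not any vendor's delivery tool.

```python
# First-order B-format conversion between the two streaming standards.
import numpy as np

def fuma_to_ambix(fuma):
    """FuMa (W, X, Y, Z) to AmbiX (W, Y, Z, X): reorder and un-pad W."""
    w, x, y, z = fuma
    return np.stack([w * np.sqrt(2.0), y, z, x])

def ambix_to_fuma(ambix):
    """AmbiX (W, Y, Z, X) back to FuMa (W, X, Y, Z)."""
    w, y, z, x = ambix
    return np.stack([w / np.sqrt(2.0), x, y, z])
```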

Head-Tracking or Non-Head-Tracking? One of the benefits of Dolby Atmos VR, as well as the new Fraunhofer format, is support for what's known as higher-order Ambisonics, which includes the ability to specify, via metadata, that certain sounds should stay locked to the viewer's head while others remain fixed in the scene, shifting naturally as the viewer looks around. "Let's say you're snowboarding down a mountain, wearing headphones with light music playing," Stoltz suggests. "When you rotate your head, you don't want the music to rotate and change orientation. You want it to stay in your ears. But other skiers going by you, or people talking? When you turn your head, you want that orientation to change. That's the difference between head-tracking and non-head-tracking. But if you're delivering a FuMa or AmbiX file, you don't get the non-head-tracking option. Everything is head-tracking, all the time."
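The reason a plain FuMa or AmbiX file is always head-tracked is that the player simply rotates the entire sound field against the listener's head orientation before rendering it to the ears. Here is a hypothetical first-order sketch of that yaw rotation, with a separate head-locked bed (Stoltz's music example) mixed in after the rotation so it stays put in the ears; function names and structure are assumptions, not a real player's API.

```python
# Rotating a first-order FuMa sound field for head yaw, then adding a
# head-locked bed that skips the rotation. Hypothetical sketch only.
import numpy as np

def rotate_yaw(bformat, yaw_deg):
    """Counter-rotate the scene so its sources stay fixed in the world."""
    w, x, y, z = bformat
    a = np.radians(yaw_deg)
    # Turning your head by +yaw swings every fixed source the other way;
    # W (omnidirectional) and Z (vertical) are unaffected by a pure yaw.
    x_r = x * np.cos(a) + y * np.sin(a)
    y_r = y * np.cos(a) - x * np.sin(a)
    return np.stack([w, x_r, y_r, z])

def render_scene(scene, head_locked_bed, yaw_deg):
    # Skiers and dialog rotate with the head; the music bed stays in the ears.
    return rotate_yaw(scene, yaw_deg) + head_locked_bed
```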

If audio for VR remains complex, it's the job of experts (like pro sound mixers) to simplify: to cut through the white noise of the various formats and options and clearly define the creative choices open to the filmmakers they collaborate with. Stoltz says it's worth the effort. "It really takes the creative embellishing of a visual piece to the next level," he says. "When you, as a viewer, put those goggles on, you are the director. No two people are going to have the same experience. You're always going to be looking somewhere else, and hearing different things. Creatively, the director and mixer have to focus on what they want their viewer to experience, and how to guide them to what they want them to focus on at any given time in the piece.

"Before, you had editors that cut the visual piece, and the audio just drew you in. You sit back and watch it as a third-person. But when you put goggles on for VR, that is your reality. And audio plays an important role in representing the reality you are in."