Digital Audio Clock Accuracy When Recording Audio for Video with Long Takes

From time to time, I get a call from a customer who recorded audio-for-picture for a concert or reality show with “prosumer” audio gear using MetaCorder and wonders why the audio drifts when post-synced to picture. This article is an attempt to explain why recording audio for picture requires a high level of clock accuracy, often more than is available from prosumer audio gear, and to describe ways to achieve it.
One of the challenges when designing any location multi-track rig is balancing ergonomics, stability, audio quality and timebase accuracy. The nature of reality television production, with its long, unscripted recording, requires a special emphasis on timing accuracy. This is not just because these long takes need to be post-synchronized to picture, but also because of the nature of the Broadcast Wave file itself and how it stores and represents time code as an abstract calculation.

Digital audio is recorded by sampling the level of the analog audio waveform at regular intervals. The rate at which the audio is sampled is known as the “sampling rate”, and the standard for film and video production is generally 48 kHz. At a sampling rate of 48 kHz, the audio is sampled 48 times every millisecond (ms), or roughly once every 20.8 microseconds. Any inaccuracy in the sample rate clock results in audio drift. When audio alone is recorded, minor drifts are virtually undetectable. When audio must be synchronized against picture, however, even minor drifts can create chaos in the editing room: for any given shot there will be slightly more or slightly less audio than picture, depending on the direction of the drift.

Compounding this problem is the way in which time code is stored in the Broadcast Wave file: unlike “analog” linear time code, which is recorded continuously on its own track, BWF files have a single “time stamp” at the beginning of the file, which is the number of samples past midnight (00:00:00:00). When the file is played back on a digital workstation, time code is calculated by adding the number of samples contained in the time stamp to the number of samples that have been played back in the file at that given point. If the audio was recorded with a drifting timebase, then the calculated time code will drift relative to the camera time code, and this drift will increase the farther into the audio file the workstation plays back. This is true regardless of the accuracy of the time code clock: the time code is represented accurately only once, when the record button is pressed. The longer the take (and therefore the more drift), the more inaccurate the time code will be when the file is played back.
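The calculation a workstation performs can be sketched in a few lines of Python. This is only an illustration, not how any particular workstation is implemented: the function names are hypothetical, and it assumes 48 kHz audio and non-drop 30 fps time code so that one frame is exactly 1,600 samples.

```python
SAMPLE_RATE = 48_000  # samples per second
FPS = 30              # assumption: non-drop 30 fps, so 1 frame = 1,600 samples


def samples_to_timecode(samples_since_midnight, sample_rate=SAMPLE_RATE, fps=FPS):
    """Convert a sample count past midnight into HH:MM:SS:FF."""
    samples_per_frame = sample_rate // fps
    total_frames = samples_since_midnight // samples_per_frame
    frames = total_frames % fps
    total_seconds = total_frames // fps
    seconds = total_seconds % 60
    minutes = (total_seconds // 60) % 60
    hours = (total_seconds // 3600) % 24
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


def playback_timecode(time_stamp, samples_played):
    """Time code at a playback position: file time stamp plus samples played."""
    return samples_to_timecode(time_stamp + samples_played)


# A file whose time stamp corresponds to 10:00:00:00.
time_stamp = 10 * 3600 * SAMPLE_RATE

# One hour of real time recorded by a clock running 50 ppm fast:
# the file holds 8,640 more samples than a perfect clock would have produced.
samples_in_file = 3600 * SAMPLE_RATE + 8_640

print(samples_to_timecode(time_stamp))                 # 10:00:00:00
print(playback_timecode(time_stamp, samples_in_file))  # 11:00:00:05
```

The camera reads 11:00:00:00 at the moment the last sample was recorded, but the workstation calculates 11:00:00:05 for it, because the only thing it can count is samples.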

When you multiply the amount of audio and video recorded by the number of cameras and number of tracks being recorded in a given reality production, even small drifts become enormously costly (at least in terms of time) to constantly correct. Of course, other things go wrong during production that can cause time code issues as well, and in my opinion that makes it all the more imperative to start with a completely stable timebase for the audio recording.

Let’s look at how drift plays out with a high-end prosumer mixer/FireWire audio interface, such as a Mackie Onyx 1640 or PreSonus StudioLive. Both of these products offer excellent value, combining good sound quality and ease of use with a low price. However, they lack one important feature: the ability to sync to an external clock such as wordclock. The end user is therefore required to use the device’s internal sample rate clock. The component that drives the sample rate in any digital audio interface is a crystal-controlled oscillator, and its accuracy is measured in parts per million, or ppm. Every manufacturer has to make compromises to get their product to market, and since these interfaces were designed mainly for music recording, the crystal typically chosen for this purpose is spec’d with an accuracy of 50 ppm, which is good enough for even high-end music recording. But when using such a product to record sound for picture, that number tells a very different story:

An oscillator with an accuracy of 50 ppm translates to a timebase drift of up to 0.05 ms for every second recorded, or 2.4 samples per second at 48 kHz. Remember, MetaCorder rigs are often left recording for a few hours at a time, but for the sake of simplicity, let’s say that the rig is making a one-hour recording. Multiplied out, 2.4 samples per second becomes 8,640 samples per hour. Since there are about 1,600 samples per video frame, this equates to 5.4 frames per hour of drift, and that’s both audio and time code drift. Of course, some individual units may be more accurate than 50 ppm (the spec indicates the maximum oscillator drift), but without the ability to sync to an external source, the Mackie and PreSonus mixers tie the customer’s hands.
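The arithmetic above is easy to rerun for other accuracy specs or take lengths. Here is a minimal Python sketch; the function name is my own, and it assumes 48 kHz audio and 1,600 samples per frame (30 fps), as in the example above.

```python
def worst_case_drift(ppm, hours=1.0, sample_rate=48_000, fps=30):
    """Worst-case drift for a clock with the given ppm accuracy spec.

    Returns (samples per second, total samples, total video frames).
    """
    samples_per_second = sample_rate * ppm / 1_000_000
    total_samples = samples_per_second * 3600 * hours
    samples_per_frame = sample_rate / fps  # 1,600 at 48 kHz and 30 fps
    total_frames = total_samples / samples_per_frame
    return samples_per_second, total_samples, total_frames


per_second, samples, frames = worst_case_drift(ppm=50, hours=1)
print(f"{per_second:.1f} samples/s, {samples:.0f} samples, {frames:.1f} frames")
# 2.4 samples/s, 8640 samples, 5.4 frames
```

Changing hours to 3, for a typical long reality take, gives a worst case of a little over 16 frames.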

There are a few ways to ensure that your audio recordings will be accurate enough for use with picture:

1) Use a master wordclock generator with high accuracy and low jitter. Two examples are the Rosendahl Nanosync HD and the Brainstorm Electronics DCD-8. Both devices have the added benefit of being able to sync not just to external word clock but to external video sources as well. The Nanosync can natively generate video sync signals and timecode (ensuring perfect phase accuracy between video, timecode and word clock), while the DCD-8 can optionally generate video sync; these features are perfect for multicamera video shoots. The Rosendahl Nanosync HD specifies a crystal accuracy of 0.5 ppm, which translates to a drift of 0.054 frames per hour, or about 1 frame in 18 hours: 100 times more accurate than the typical mixer with a built-in FireWire interface. The Rosendahl or Brainstorm would then supply wordclock to the audio interface. Just remember to set the device to external sync!

2) Along the lines of option number one, you can use a professional audio recorder designed for film and television production to supply wordclock. The Sound Devices 788T, for example, is specified with a crystal capable of being tuned to 0.2 ppm.

3) Use an audio interface designed with film and television applications in mind. The Metric Halo 2882 and ULN-2 2d interfaces, for example, are specified with an accuracy of 5 ppm (and are often more accurate in practice). The RME Fireface 800 with the TCO (video and time code) option is another example, and is the only interface that can natively generate sampling rates of 47952 and 48048 Hz, which is useful in some film workflows.
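To put the accuracy figures quoted in this article side by side, the short Python sketch below works out how long each clock could run before its worst-case drift reaches one video frame. The labels are just for display, the ppm values are the spec figures quoted above, and 30 fps is assumed as before.

```python
def hours_until_one_frame(ppm, fps=30):
    """Recording time (in hours) before worst-case drift reaches one frame."""
    frame_duration = 1 / fps               # seconds per video frame
    drift_per_second = ppm / 1_000_000     # seconds of drift per second recorded
    return frame_duration / drift_per_second / 3600


# ppm figures as quoted in this article (spec values, not measurements)
clocks = [
    ("Typical music-oriented interface", 50),
    ("Metric Halo 2882 / ULN-2", 5),
    ("Rosendahl Nanosync HD", 0.5),
    ("Sound Devices 788T (tuned)", 0.2),
]

for name, ppm in clocks:
    print(f"{name:34s} {ppm:5.1f} ppm  {hours_until_one_frame(ppm):6.2f} hours per frame")
```

At 50 ppm the worst case reaches a full frame after about 11 minutes; at 0.5 ppm it takes more than 18 hours. The result does not depend on the sampling rate, only on the ppm figure and the frame rate.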

Remember, no matter what your workflow is, the most important step is to test it. For some productions, any audio drift can be dealt with by simply varispeeding the audio to match the picture. Other productions may find that solution intolerable, and require frame-accurate audio and timecode.