Gecko:MediaStreamLatency
Minimizing the end-to-end latency of WebRTC is very important. So is minimizing the latency of Web Audio, for example when a script wants to start playing an AudioBuffer immediately. What do we need to do to MediaStreamGraph to support this?
On this page I'm trying to work out estimates of the current latency of our MediaStreams setup, and how we can improve the situation.
Current State
The MediaStreamGraph runs every 10ms (modulo scheduling issues); let's call this T_graph. Let's assume the graph processing time per tick is T_proc, hopefully just a few ms; let's say 3ms. On my laptop the cubeb callback in nsBufferedAudioStream runs every 25ms; let's call this T_cubeb.
On the WebRTC "send" side, we have a microphone producing callbacks periodically (T_mike, 4ms in current code, I think). (I'll ignore driver processing delay, which we can't control.) That data is immediately appended to the input queue of a SourceMediaStream. At the next MediaStreamGraph tick, that data is moved to the output side of the SourceMediaStream or, probably soon, a wrapping TrackUnionStream. At that moment it can be observed by a MediaStreamListener. Let's define the overall send latency T_send as the longest delay between a sample being recorded and reaching the network stack via a MediaStreamListener. Assuming pessimal timing of the MediaStreamGraph tick, that's (for the first sample in a chunk)
T_send = T_mike + T_graph + T_proc
In my example, this is 4 + 10 + 3 = 17ms. The ideal latency would be just T_mike = 4ms.
(Note that we are using a TrackUnionStream, and that will cause mic data to be delayed further if there's any buildup of data due to underruns or clock-rate mismatches. Bug 884365 will bypass the MediaStreamGraph on the input side for any stream sourced via getUserMedia, so we should get close to T_mike (4ms). Also note that T_mike may be considerably more than 4ms; on Android it might be as much as 40ms or more. However, we have no direct control over T_mike.)
On the "receive" side, at each MediaStreamGraph tick we pull the latest data from the network stack (via NotifyPull), and copy it to the end of the output nsAudioStream. The worst-case latency before data is picked up by the libcubeb callback (ignoring scheduling problems) happens when a sample has to wait T_graph for the graph to tick, then wait for graph processing, and then we append the sample to the nsAudioStream buffer and have to wait for at most T_cubeb for it to be consumed and played. (I'll ignore driver processing time, which we can't control.)
T_recv = T_graph + T_proc + T_cubeb
In my example, this is 38ms. The ideal latency would be to pull the latest audio from the network stack in each libcubeb callback, i.e. T_cubeb = 25ms.
Playing an AudioBuffer has the same latency analysis.
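To keep the arithmetic in one place, here is a minimal sketch of the latency budget above, written out as code. The constants are just the example values assumed on this page, not measurements.

 // Illustrative only: the example latency budget from this page.
 #include <cstdio>

 int main() {
   const int T_mike  = 4;   // microphone callback period (ms)
   const int T_graph = 10;  // MediaStreamGraph tick interval (ms)
   const int T_proc  = 3;   // assumed graph processing time per tick (ms)
   const int T_cubeb = 25;  // cubeb callback period on my laptop (ms)

   // Send side: a sample waits up to a full graph tick, then graph
   // processing, before a MediaStreamListener can observe it.
   const int T_send = T_mike + T_graph + T_proc;   // 17 ms

   // Receive side: wait for the graph tick, then processing, then up
   // to a full cubeb period before the sample is consumed and played.
   const int T_recv = T_graph + T_proc + T_cubeb;  // 38 ms

   printf("T_send = %d ms, T_recv = %d ms\n", T_send, T_recv);
   return 0;
 }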
Desirable State
I propose running MediaStreamGraph processing every 10ms normally, but *also* running it during the libcubeb callback to "catch up" and produce the rest of the data needed when the libcubeb callback fires. On Windows 7, putting a sleep(10ms) in the libcubeb callback doesn't seem to hurt. On the receive side, this means the data returned to the libcubeb callback will always include the latest audio available from the network stack, reducing the latency to T_cubeb + T_proc = 28ms.
Note: There are issues with this due to the WebRTC/NetEQ API - jesup
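To make the proposal concrete, here is a rough sketch of what such a cubeb data callback could look like. RunCatchUpIteration and ReadOutput are hypothetical names, not existing MSG entry points:

 // Hypothetical sketch: the cubeb callback drives the MediaStreamGraph.
 long DataCallback(cubeb_stream* aStream, void* aUser,
                   void* aBuffer, long aFrames) {
   MediaStreamGraph* graph = static_cast<MediaStreamGraph*>(aUser);

   // Run a "catch up" graph iteration right now, so the mix includes
   // the latest audio available from the network stack (with the
   // NotifyPull calls ordered as late as possible, per below).
   graph->RunCatchUpIteration(aFrames);

   // Copy the freshly produced mix straight into cubeb's buffer,
   // instead of going through a queue that adds latency.
   graph->ReadOutput(static_cast<float*>(aBuffer), aFrames);
   return aFrames;
 }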
In fact we can do a little better by moving our NotifyPull calls to run as late as possible during MediaStreamGraph processing. If the network audio is not itself an input to other processing, the NotifyPull(s) can happen after any Web Audio processing has been done. This effectively reduces T_proc to almost nothing; let's call it 1ms, so we've reduced our latency to 26ms, almost the minimum allowed by libcubeb.
Script playing AudioBuffers won't be as amenable to that optimization, since we need to process all messages from the DOM, so we take the full T_proc and have a latency of 28ms.
On the send side, to minimize latency we really should change the input API to be pull-based. The current code for Windows actually pulls from Win32 every 4ms on a dedicated thread. If we let the MediaStreamGraph do the pull instead, we can reduce latency to T_graph + T_proc. As above, we can minimize T_proc for streams that aren't involved in other processing, so we can get a latency of T_graph + T_proc = 11ms on the send side. To do better, we'd have to lower T_graph or identify cases where we can provide a "fast path" that gives the network-stack consumer access to samples as soon as they're queued on the SourceMediaStream. However, I think we should focus on trying to achieve the goals already laid out here.
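For illustration, a pull-based capture API could look something like this sketch; the interface is hypothetical and ignores sample-rate and channel negotiation:

 #include <cstddef>

 // Hypothetical pull-based input API: instead of the platform backend
 // pushing mic data into a SourceMediaStream every T_mike, the
 // MediaStreamGraph pulls exactly what it needs at each iteration.
 class AudioInputSource {
 public:
   virtual ~AudioInputSource() {}

   // Called on the graph thread during each iteration; fills aBuffer
   // with the most recent aFrames frames of microphone data.
   virtual void Pull(float* aBuffer, size_t aFrames) = 0;
 };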
Latency targets per platform
Linux
We have two backends in cubeb:
- ALSA
  - Pretty rudimentary. Most distros are not using ALSA directly anymore; however, some people still choose to do so.
- PulseAudio
  - Advertised as being as good as ALSA (I could confirm that on my setup, but read below). It gives us automatic latency adjustment in case of underrun, and we can query the actual latency (as opposed to the requested latency), so we can react (see the sketch below).
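For reference, this is roughly how the actual latency can be queried from PulseAudio, assuming the stream was connected with PA_STREAM_AUTO_TIMING_UPDATE and PA_STREAM_INTERPOLATE_TIMING so that timing info stays current:

 #include <pulse/pulseaudio.h>
 #include <cstdio>

 // Report the actual (not requested) latency of a playback stream, so
 // we can detect when a buggy PulseAudio lets the latency grow.
 void ReportLatency(pa_stream* aStream) {
   pa_usec_t latency;
   int negative;
   if (pa_stream_get_latency(aStream, &latency, &negative) == 0) {
     // |negative| only matters for capture streams; playback latency
     // is normally positive.
     printf("actual latency: %.1f ms\n", latency / 1000.0);
   }
 }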
Both have been tested on a ThinkPad W530 and a ThinkPad T420; the latency is great (sub-10ms). I just got a 300-euro netbook I can test with, but I need a 32-bit build to check. I am sure that minimal latency relies heavily on good interaction between:
- the sound card's kernel driver
- ALSA
- PulseAudio
I believe my setup happens to have a good driver (plus a fast CPU and such), and this will be completely different with other hardware and software environments.
I know that some PulseAudio releases in the wild are buggy and make the latency grow; we should make sure we are able to detect that.
Other browsers achieve 40ms; we want to be at least as good.
Windows
We have only one backend in cubeb, using WinMM, which is not intended to be low-latency at all. We need a new backend using WASAPI so we can achieve good performance.
WASAPI latency is in the 30ms range, and can go lower if:
- we are running on decent audio hardware
- we use the Windows API to raise the audio thread's priority (see the sketch below)
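Raising the audio thread's priority goes through MMCSS (avrt.h); a minimal sketch, to be called from the thread that services the audio stream:

 #include <windows.h>
 #include <avrt.h>  // link against avrt.lib

 // Register the calling thread with MMCSS as a "Pro Audio" task so the
 // scheduler gives it the boost a low-latency audio thread needs.
 HANDLE RegisterProAudioThread() {
   DWORD taskIndex = 0;
   HANDLE task = AvSetMmThreadCharacteristicsW(L"Pro Audio", &taskIndex);
   // On failure |task| is NULL and the thread keeps its normal priority.
   return task;
 }

 void UnregisterProAudioThread(HANDLE aTask) {
   if (aTask) {
     AvRevertMmThreadCharacteristics(aTask);
   }
 }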
It is possible to go lower if we request exclusive access to the hardware, which is something WASAPI exposes in its API.
On my ThinkPad T420, I could bring the latency down to 512 frames at 48kHz (around 10ms), using a DAW that has a WASAPI backend. I tried increasing the CPU load and could make it underrun a bit, but nothing terrible (it would go unnoticed during a WebRTC call). At 1024 frames, I could not detect underruns by ear.
WASAPI is available on Windows Vista, Windows 7, and Windows 8. People using a stock Windows XP with a normal sound card won't get great latencies. The only way around it is to use a super-low-level API that takes exclusive access of the hardware (basically bypassing the system's mixer and talking directly to the kernel). There seem to be two options for this: DirectSound with a hardware mixer, or Direct Kernel Streaming (DirectKS Sample Application http://www.microsoft.com/en-us/download/details.aspx?id=18989) (http://en.wikipedia.org/wiki/Windows_legacy_audio_components#KMixer). These APIs are used by Winamp, Foobar2000, etc.
MacOS
No problems at all; we can bring the latency super low (12.5ms is what we do at the moment; lower is doable). No noticeable glitches, the latency is stable, and everything works fine under high CPU load without underruns.
Android
For Android < 4.1, we won't be able to do anything great; see <http://code.google.com/p/android/issues/detail?id=3434>. 100+ms of latency is expected, and there is nothing we can do to lower that.
For Android >= 4.1, on some devices (currently the Galaxy Nexus, Nexus 4 and Nexus 10, i.e. high-end devices where the specs are controlled by Google), you can achieve around 8ms using what is called the FastMixer. Basically, you make a trade-off: you give up resampling (that is, you are bound to use the hardware's preferred sample rate), effects, and other goodies. You also have to use a certain buffer size. We don't care about those constraints, because we just want a PCM interface, so it is all great.
The good thing is, we can detect at runtime whether the device can run at low latency, so we can do this right now (see the sketch below).
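One way to do the runtime check from C++ is to ask PackageManager for the android.hardware.audio.low_latency feature over JNI; a sketch, assuming the JNIEnv and application Context are already available:

 #include <jni.h>

 // Returns true if the device advertises low-latency audio
 // (android.hardware.audio.low_latency, the FastMixer-capable path).
 bool DeviceSupportsLowLatencyAudio(JNIEnv* aEnv, jobject aContext) {
   jclass contextClass = aEnv->GetObjectClass(aContext);
   jmethodID getPM = aEnv->GetMethodID(
       contextClass, "getPackageManager",
       "()Landroid/content/pm/PackageManager;");
   jobject pm = aEnv->CallObjectMethod(aContext, getPM);
   jclass pmClass = aEnv->GetObjectClass(pm);
   jmethodID hasFeature = aEnv->GetMethodID(
       pmClass, "hasSystemFeature", "(Ljava/lang/String;)Z");
   jstring feature =
       aEnv->NewStringUTF("android.hardware.audio.low_latency");
   jboolean result = aEnv->CallBooleanMethod(pm, hasFeature, feature);
   aEnv->DeleteLocalRef(feature);
   return result == JNI_TRUE;
 }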
B2G
No idea.
Plan of action
These are the things we can do to get better overall latency:
- Write a WASAPI backend for cubeb. This is needed so we can bring Windows' latency close to Linux's and Mac's. Important because most of our users are on Windows, and because our current backend, while working on every single Windows desktop OS released after 1995, is neither CPU-efficient nor low-latency capable.
- Drive the MSG with the cubeb callback. This folds the MSG latency into the callback (giving us a 0-10ms latency improvement). More importantly, this allows us to stop using the BufferedAudioStream, which theoretically adds anywhere between 0 and 1 full second of latency, depending on how full it is. In practice, it adds as much latency as the delay between the moment the first |BufferedAudioStream::Write| happens and the moment the first callback is called, because the callback is very likely to consume data as fast as it is written; otherwise the situation would not be sustainable. We can win around (T_bas + T_msg) ms by doing that, where T_bas is the latency of the BufferedAudioStream (around 120ms on my system when running a simple Web Audio demo) and T_msg is the 10ms of the MSG. Some open questions that don't have answers yet:
- How do we deal with Android having terrible latency? The callback won't get called every 10ms. Maybe we can get around that by running multiple iterations of the graph in a single callback until the frame count requested by the callback is satisfied, and putting the remaining frames in a little buffer (see the sketch after this list).
- Can something down the graph (gUM, WebRTC) block (as in, wait on a monitor or something)? We want to avoid that.
- Implement [1], so that authors can make trade-offs between battery life, CPU consumption and latency. We would put the same attribute on the AudioContext or something.
- Figure out the correct way to obtain the time difference between the graph time and the audio presentation time, which would be approximately:
fixed_output_latency = driver_buffer_frames * driver_buffer_count / samplerate
(for example, 480 frames * 2 buffers / 48000 Hz = 20ms)
FWIW, some people on the Web Audio mailing list requested that for better video/audio sync [2].
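A sketch of the multi-iteration idea from the Android question above; the GraphDriver type, DrainLeftovers and RunOneIteration are all hypothetical:

 // Hypothetical: satisfy a large callback request by running the graph
 // in fixed-size iterations and buffering whatever is left over.
 long DataCallback(cubeb_stream* aStream, void* aUser,
                   void* aBuffer, long aFrames) {
   GraphDriver* driver = static_cast<GraphDriver*>(aUser);
   float* out = static_cast<float*>(aBuffer);

   // First use any frames left over from the previous callback.
   long filled = driver->DrainLeftovers(out, aFrames);

   while (filled < aFrames) {
     // Each graph iteration produces a fixed amount (e.g. 10ms worth);
     // RunOneIteration writes at most aFrames - filled frames here and
     // stores any excess in the little leftover buffer.
     filled += driver->RunOneIteration(out + filled, aFrames - filled);
   }
   return aFrames;
 }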
[1]: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2013-April/039338.html
[2]: https://www.w3.org/Bugs/Public/show_bug.cgi?id=20698#c19
Bugs to file
TODO