Talk:MediaStreamAPI
Some basic feedback:
I don't understand what the 'streamaudio' attribute is for
In the examples it seems to be used to tell the original video element not to play its own audio, and instead audio is streamed (and optionally filtered) to a separate audio element, while the original video still plays. If that's the design, isn't a similar attribute needed to hide the video if that's what's being processed?
roc: No. You can use CSS (e.g. display:none) to hide video. There is no CSS property for muting audio.
Would the attributes' use be more obvious if they were called 'muteaudio' and 'mutevideo'? Is there not already CSS for that?
roc: There's already a 'muted' attribute, but I don't want to use that here because it's often exposed directly to the user. For example if the user is playing a video with some effects processing, the "mute" UI for the video needs to alter the audio that feeds into the effects processor.
It seems like a complicated way to do simple things. Maybe this is an argument for the mediaresource element, as a non-displaying source. Could the graph output be fed back to the original element for display? Could that be the default?
roc: I don't understand how "feeding back the graph output" would work. <mediaresource> seems like overkill at this point, so I've removed it for now. I've added a captureStream() call that automatically sets the streamaudio attribute (which I've renamed captureAudio).
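For illustration, roughly what that pattern might look like in an example; captureStream() is the call roc mentions, but createProcessor(), the Worker argument, and assigning a stream directly to src are assumptions about the proposal rather than settled API:

  // Hypothetical sketch: capture a <video> element's audio, filter it in a
  // worker, and play the result through a separate <audio> element while the
  // video keeps rendering its frames. captureAudio is set implicitly by
  // captureStream(), so the video no longer plays its own audio directly.
  var video = document.getElementById("v");
  var out = new Audio();

  var captured = video.captureStream();                              // per roc's comment above
  var filtered = captured.createProcessor(new Worker("filter.js"));  // name assumed

  out.src = filtered;                                                // assumed: streams usable as src
  out.play();
  video.play();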
There's no way for workers to signal latency to the scheduler
If the worker just writes out timestamped buffers, it can dynamically signal its latency, and the scheduler can make its own decision about what constitutes an underrun. However, in realtime contexts (conferencing and games) it is helpful to optimize for the lowest possible latency. To assist this, it helps if the workers (and internal elements like codecs and playback sinks) can advertise their expected latency. The scheduler can then sum these over the pipeline to determine a more aggressive minimal time separation to maintain between the sources and sinks.
One possible resolution is just to use an aggressive schedule for all realtime streams, and leave it to developers to discover in testing what they can get away with on current systems.
roc: I'll add API so that Worker processors can set their delay.
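For example, the worker-side hook might end up looking something like this; onprocessmedia, setDelay(), audioSamples and writeAudio() are placeholders for whatever roc specifies, not confirmed names:

  // filter.js, run inside the Worker. The processor reports that it adds
  // roughly 20 ms of latency, so the scheduler can sum delays across the
  // pipeline and pick a tighter source-to-sink separation.
  onprocessmedia = function (event) {
    if (typeof event.setDelay === "function") {
      event.setDelay(0.020);                 // seconds of added latency (units assumed)
    }
    event.writeAudio(event.audioSamples);    // pass the input through unchanged
  };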
Processing graph only works on uncompressed data
This depends quite a bit on the details of the StreamEvent objects, which are as yet unspecified, but it sounds as if there's a filter graph, with data proceeding from source to sink, but with only uncompressed data types. This is a helpful simplification for initial adoption, but it also severely limits the scope of sophisticated applications.
Most filter graph APIs have a notion of data types on each stream connection, so encoders, decoders and muxers are possible worker types, as well as filters that work on compressed data; anything that doesn't need to talk directly to hardware could be written in JavaScript. As it stands, the API seems to disallow implementations of compressed stream copying and editing (for things like efficient frame duplication/dropping to maintain sync), keyframe detection, codecs written in JavaScript, and feeding compressed data obtained elsewhere into the graph.
roc: I think this could be added to the framework outlined here. I don't plan to address these use-cases at this time.
<mediaresource> as an HTML element
There is no reason to have a media resource object in the DOM as an element. It is not a presentation element, and it is used only from JavaScript. It would be better to implement a media resource as a regular object:
interface MediaResource { ... }
Its usage would be similar to XMLHttpRequest: call the open method with url and async parameters. The object could also have a canPlayType method to avoid issuing requests for unsupported media types.
It makes sense to have the MediaResource (and Audio) objects available in a Worker's global scope. That would let a worker load media, process it, and then play it.
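A rough sketch of the suggested usage; the constructor, the open() signature and canPlayType() here are the shape proposed in this comment, not anything specified yet:

  // Hypothetical usage, by analogy with XMLHttpRequest, and equally valid
  // inside a Worker if MediaResource and Audio are exposed there.
  var resource = new MediaResource();
  if (resource.canPlayType("audio/ogg") !== "") {   // skip unsupported media types
    resource.open("audio/sample.ogg", true);        // url and async flag; argument order assumed
    resource.onload = function () {
      var player = new Audio();
      player.src = resource;                        // feed the loaded media to a sink
      player.play();
    };
  }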
roc: I've removed mediaresource for now, until we have compelling use-cases that require it.
Graph cycles
Blocking all the streams involved in a cycle until the cycle is removed seems like an undesirable behavior. It will mean that in cases where a bug introduces a cycle in a dynamically-constructed graph, the end user simply gets no audio, or worse, part of the audio he expects is missing but the rest is playing. This will make it easy for these sorts of bugs to slip by, and make it difficult to debug them. I would argue that forming a cycle should throw an error immediately so that the point where the cycle is formed can be found easily, and a developer can then walk back and examine the graph to understand why he has formed a cycle. kael
roc: I think you're right, for the cases where the cycle is created due to an API call. Depending on how things evolve there might be cases where we can't detect at the time of an API call that a cycle will be created, and in that case we would have to block instead.
roc: Well, based on your other feedback I added a 'start time' to the addStream call and now we have a situation where a graph cycle can be created after the addStream call has returned. So I'm going to not have that throw exceptions for cycles after all. We'll have to use Web Console error reporting to inform developers that a cycle has been detected.
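To make the problem concrete, a sketch of how a cycle can appear only after both calls have returned once addStream() takes a start time; the stream variables, the argument order and the direction of each connection are all assumptions:

  // a and b are two streams in the graph, created elsewhere. Each call looks
  // acyclic when it is made, but once both scheduled connections take effect
  // the graph contains the cycle a -> b -> a, so it can only be reported
  // asynchronously (e.g. in the Web Console) rather than by throwing at call time.
  var t = 0;                     // some shared media time, stand-in value
  a.addStream(b, t + 1.0);       // b feeds into a starting at t+1s
  b.addStream(a, t + 2.0);       // a feeds into b starting at t+2s: the cycle closes at t+2s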
Miscellany
Having a readonly attribute named 'rewind' strikes me as confusing. It reads like a verb. It being a relative value is also kind of awkward; if the author is always going to subtract it from the current stream offset, it'd be much better if we just actually *exposed* the 'real time' as an attribute that already factors things in. People who want a relative value can then subtract the 'currentlyMixingTime' from the 'currentRealTime', or something like that.
roc: we could expose two times, but in most cases the author would just have to compute the difference and use that. I'm not sure that's an improvement.
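In code, the two options differ only in which side does the subtraction (the attribute names are the ones used in this thread, not settled API):

  // As drafted: a relative 'rewind' attribute that the author subtracts from
  // the current stream offset ('currentTime' is a stand-in name here).
  var realTime = stream.currentTime - stream.rewind;

  // Suggested alternative: expose two absolute times and let the author
  // derive the relative value instead.
  var rewound = stream.currentRealTime - stream.currentlyMixingTime;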
The way that writeAudio simply 'writes' at the current location of the audio cursor makes me nervous. I feel like it doesn't do enough to make it clear exactly what's going on in the various use cases (rewound stream, prebuffering samples before they're played, etc), especially because we attempt to promise that all streams will advance in sync with real time. Maybe this is okay because rewinding is opt-in, though.
roc: I would prefer not to add an extra parameter just so that the API can throw if the author passes in the wrong value. If we can find another use for it, sure. And yeah, rewind is an advanced feature so I don't care so much if it's tricky to use.
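A sketch of the cursor semantics being discussed, using the same placeholder event shape as above; the second writeAudio() call shows what prebuffering ahead of the cursor would look like:

  onprocessmedia = function (event) {
    var samples = event.audioSamples;              // hypothetical input buffer
    for (var i = 0; i < samples.length; i++) {
      samples[i] *= 0.5;                           // trivial in-place gain as the "filter"
    }
    // Everything passed to writeAudio() lands at the stream's current write
    // cursor; a second call appends after the first, which is how
    // prebuffering samples before they are played would look.
    event.writeAudio(samples);
    event.writeAudio(new Float32Array(samples.length));   // prebuffer a block of silence
  };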
The way in which multichannel data is laid out within the sample buffer does not seem to be clearly specified, and there's lots of room for error there. Furthermore, if what we go with is interleaving samples into a single buffer (ABCDABCDABCD), I think we're leaving a lot of potential performance wins on the floor. I can think of use cases where, if each channel were specified as its own Float32Array, it would be possible to efficiently turn two monaural streams into a stereo audio mix without having to manually copy samples with JavaScript. Likewise, if we allow a 'stride' parameter for each of those channel arrays, interleaved source data still ends up costing nothing, which is the best of both worlds. This is sort of analogous to the way binding arrays in classic OpenGL works, and I feel like it's a sane model.
roc: I think I'll just go with non-interleaved for now. That's what Chrome's Web Audio API does. We want to restrict the input format as much as possible to make it easier to write processing code, e.g. we don't want author processing code to have to deal with arbitrary strides.
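A sketch of the difference in plain typed arrays; nothing here depends on the eventual API shape:

  var frames = 1024;
  var leftMono  = new Float32Array(frames);   // samples from mono stream A
  var rightMono = new Float32Array(frames);   // samples from mono stream B

  // Non-interleaved stereo: the two existing channel buffers are simply
  // handed over side by side, with no copying and no stride bookkeeping.
  var stereoChannels = [leftMono, rightMono];

  // Interleaved stereo (ABABAB...) needs an explicit per-sample loop instead.
  var interleaved = new Float32Array(frames * 2);
  for (var i = 0; i < frames; i++) {
    interleaved[2 * i]     = leftMono[i];
    interleaved[2 * i + 1] = rightMono[i];
  }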
I don't like seeing relative times in APIs when we're talking about trying to do precisely timed mixing and streaming. StreamProcessor::end should accept a timestamp at which processing should cease, instead of a delay relative to the current time. Likewise, I think Stream::live should be a readonly attribute, and Stream should expose a setLive method that takes the new liveness state as an argument, along with a timestamp at which the liveness state should change. It'd also be nice if volume changes could work the same way, but that might be a hard sell. Explicit timing also has the large benefit that it prevents us from having to 'batch' audio changes based on some threshold - we can simply accept a bunch of audio change events and apply them at the desired timestamp (give or take the skew that results from whatever interval we mix at internally). This is much less 'magical' than the batching we suggest, and it is also less likely to break mysteriously if some other browser vendor does their batching differently.
roc: I'll change it to use absolute times. I don't think liveness needs a time parameter since it's not something you're likely to change dynamically. The batching will still be needed though; it's not "magical", the idea that HTML5 tasks are atomic is normal for the Web platform.
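For comparison, the commenter's suggested shape in code; note that roc adopted absolute times for end() but not a time parameter on liveness, and every name here is hypothetical:

  // stream and processor are a Stream and StreamProcessor created elsewhere.
  var t = stream.currentTime;       // whatever clock the graph exposes (see the next point)

  processor.end(t + 5.0);           // stop at an absolute timestamp, not "5 s from now"
  stream.setLive(false, t + 5.0);   // suggested: liveness change scheduled the same way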
It would be nice if we could try to expose the current playback position as an attribute on a Stream, distinct from the number of samples buffered. The 'currentTime' attribute is ambiguous as to which of the two 'current times' it actually represents, so it would be nice to either make it clear (preferably with a more precise name, but at least with documentation), or even better, expose both values.
roc: Actually I don't want to expose either! We should have a global "media time" instead that people can use as a base for their time calculations. Maybe this should be the same as the animation time...
In many of the examples provided, we're capturing the stream of an <audio> tag, and then our worker is presumed to spit it right out to the soundcard. I would feel happier about this API if the <audio> tag represented two components - an audio input stream (the content used as the src of the <audio>) and an audio playback stream (the samples being sent to the soundcard when the <audio> element is 'playing'). This would make it much clearer when the captured stream is actually going out to the soundcard, and when it's being consumed silently and used for some other purpose. This also makes it clearer whether the user's volume setting (exposed in the <audio> ui) will affect the audio data - the samples themselves will be unscaled by volume, but anything written to the <audio> tag's associated 'output stream' will be scaled by that tag's current volume. kael
roc: All the examples actually play their output through an <audio> element. I think the proposal pretty much already works the way you want it to!
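Putting roc's point in code: the destination <audio> element is the only thing that reaches the output device, so its volume UI scales what is heard without touching the captured samples (createProcessor() and assigning a stream to src remain assumptions, as above):

  var source = document.getElementById("music");    // acts only as a sample source
  var sink = new Audio();                           // the element that actually plays

  var captured = source.captureStream();            // source no longer plays directly
  var processed = captured.createProcessor(new Worker("effects.js"));

  sink.src = processed;                             // assumed: a stream can be a src
  sink.volume = 0.5;                                // scales only the audible output
  sink.play();
  source.play();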