User:David.humphrey/Audio Data API 2


Defining an Enhanced API for Audio (Draft Recommendation)

Abstract

The HTML5 specification introduces the <audio> and <video> media elements, and with them the opportunity to dramatically change the way we integrate media on the web. The current HTML5 media API provides ways to play and get limited information about audio and video, but gives no way to programmatically access or create such media. We present a new extension to this API, which allows web developers to read and write raw audio data.

Authors
Other Contributors
  • Thomas Saunders
  • Ted Mielczarek
  • Felipe Gomes
Status

This is a work in progress. This document reflects the current thinking of its authors, and is not an official specification. The original goal of this specification was to experiment with web audio data on the way to creating a more stable recommendation. The authors hoped that this work, and the ideas it generated, would eventually find their way into Mozilla and other HTML5 compatible browsers. Both of these goals are within reach now, with work ramping up in this Mozilla bug, and the announcement of an official W3C Audio Incubator Group chaired by one of the authors.

The continuing work on this specification and API can be tracked here, and in the bug. Comments, feedback, and collaboration are all welcome. You can reach the authors on irc in the #audio channel on irc.mozilla.org.

Version

This is the second major version of this API (referred to by the developers as audio13)--the previous version is available here. The primary improvements and changes are:

  • Removal of mozSpectrum (i.e., native FFT calculation) -- will be done in JS now.
  • Added WebGL Arrays (i.e., fast, typed, native float arrays) for the event framebuffer as well as mozWriteAudio().
  • Native array interfaces instead of using accessors and IDL array arguments.
  • No zero padding of audio data occurs anymore. All frames are exactly 4096 elements in length.
  • Added mozCurrentSampleOffset()
  • Removed undocumented position/buffer methods on audio element.
  • Added mozChannels, mozRate, mozFrameBufferLength to the loadedmetadata event.

Demos written for the previous version are not compatible, though they can be updated quite easily. See details below.

API Tutorial

We have developed a proof-of-concept, experimental build of Firefox (builds provided below) which extends the HTMLMediaElement (i.e., affecting both <video> and <audio>) and implements the following basic API for reading and writing raw audio data:

Reading Audio

Audio data is made available via an event-based API. As the audio is played, and therefore decoded, each frame is passed to content scripts for processing after being written to the audio layer--hence the name, AudioWritten. Playing and pausing the audio both affect the streaming of this raw audio data.

Consumers of this raw audio data register two callbacks on the <audio> or <video> element in order to consume this data:

<audio src="song.ogg"
       onloadedmetadata="audioInfo(event);"
       onaudiowritten="audioWritten(event);">
</audio>

The LoadedMetadata event is a standard part of HTML5, and has been extended to provide more detailed information about the audio stream. Specifically, developers can obtain the number of channels and the sample rate of the audio. This event is fired once as the media resource is first loaded, and is useful for interpreting or writing the audio data.

The AudioWritten event provides two pieces of data. The first is a framebuffer (i.e., an array) containing sample data for the current frame. The second is the time (in milliseconds) for the start of this frame.

The following is an example of how both events might be used:

var channels,
    rate,
    frameBufferLength;

function audioInfo(event) {
  channels          = event.mozChannels;
  rate              = event.mozRate;
  frameBufferLength = event.mozFrameBufferLength;
}

function audioWritten(event) {
  var samples = event.mozFrameBuffer;
  var time    = event.mozTime;

  for (var i=0, slen=samples.length; i<slen; i++) {
    // Do something with the audio data as it is played.
    processSample(samples[i], channels, rate);
  }
}

Complete Example: Visualizing Audio Spectrum

This example calculates and displays FFT spectrum data for the playing audio:

[Image: Fft.png -- screenshot of the FFT spectrum visualization]

<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Spectrum Example</title>
  </head>
  <body>
    <audio src="song.ogg"
           controls="true"
           onloadedmetadata="loadedMetadata(event);"
           onaudiowritten="audioWritten(event);"
           style="width: 512px;">
    </audio>
    <div><canvas id="fft" width="512" height="200"></canvas></div>

    <script>
      var canvas = document.getElementById('fft'),
          ctx = canvas.getContext('2d'),
          channels,
          fft;

      function loadedMetadata(event) {
        channels = event.mozChannels;

        var rate              = event.mozRate,
            frameBufferLength = event.mozFrameBufferLength;

        fft = new FFT(frameBufferLength / channels, rate);
      }

      function audioWritten(event) {
        var fb = event.mozFrameBuffer,
            signal = new Float32Array(fb.length / channels),
            magnitude;

        for (var i = 0, fbl = fb.length / 2; i < fbl; i++ ) {
          // Assuming interleaved stereo channels,
          // we need to split and merge them into a stereo-mix mono signal
          signal[i] = (fb[2*i] + fb[2*i+1]) / 2;
        }

        fft.forward(signal);

        // Clear the canvas before drawing spectrum
        ctx.clearRect(0,0, canvas.width, canvas.height);

        for (var i = 0; i < fft.spectrum.length; i++ ) {
          // multiply spectrum by a zoom value
          magnitude = fft.spectrum[i] * 4000;

          // Draw rectangle bars for each frequency bin
          ctx.fillRect(i * 4, canvas.height, 3, -magnitude);
        }
      }

      // FFT from dsp.js, see below
      var FFT = function(bufferSize, sampleRate) {
        this.bufferSize   = bufferSize;
        this.sampleRate   = sampleRate;
        this.spectrum     = new Float32Array(bufferSize/2);
        this.real         = new Float32Array(bufferSize);
        this.imag         = new Float32Array(bufferSize);
        this.reverseTable = new Uint32Array(bufferSize);
        this.sinTable     = new Float32Array(bufferSize);
        this.cosTable     = new Float32Array(bufferSize);

        var limit = 1,
            bit = bufferSize >> 1;

        while ( limit < bufferSize ) {
          for ( var i = 0; i < limit; i++ ) {
            this.reverseTable[i + limit] = this.reverseTable[i] + bit;
          }

          limit = limit << 1;
          bit = bit >> 1;
        }

        for ( var i = 0; i < bufferSize; i++ ) {
          this.sinTable[i] = Math.sin(-Math.PI/i);
          this.cosTable[i] = Math.cos(-Math.PI/i);
        }
      };

      FFT.prototype.forward = function(buffer) {
        var bufferSize   = this.bufferSize,
            cosTable     = this.cosTable,
            sinTable     = this.sinTable,
            reverseTable = this.reverseTable,
            real         = this.real,
            imag         = this.imag,
            spectrum     = this.spectrum;

        if ( bufferSize !== buffer.length ) {
          throw "Supplied buffer is not the same size as defined FFT. FFT Size: " +
                bufferSize + " Buffer Size: " + buffer.length;
        }

        for ( var i = 0; i < bufferSize; i++ ) {
          real[i] = buffer[reverseTable[i]];
          imag[i] = 0;
        }

        var halfSize = 1,
            phaseShiftStepReal,	
            phaseShiftStepImag,
            currentPhaseShiftReal,
            currentPhaseShiftImag,
            off,
            tr,
            ti,
            tmpReal,	
            i;

        while ( halfSize < bufferSize ) {
          phaseShiftStepReal = cosTable[halfSize];
          phaseShiftStepImag = sinTable[halfSize];
          currentPhaseShiftReal = 1.0;
          currentPhaseShiftImag = 0.0;

          for ( var fftStep = 0; fftStep < halfSize; fftStep++ ) {
            i = fftStep;

            while ( i < bufferSize ) {
              off = i + halfSize;
              tr = (currentPhaseShiftReal * real[off]) - (currentPhaseShiftImag * imag[off]);
              ti = (currentPhaseShiftReal * imag[off]) + (currentPhaseShiftImag * real[off]);

              real[off] = real[i] - tr;
              imag[off] = imag[i] - ti;
              real[i] += tr;
              imag[i] += ti;

              i += halfSize << 1;
            }

            tmpReal = currentPhaseShiftReal;
            currentPhaseShiftReal = (tmpReal * phaseShiftStepReal) - (currentPhaseShiftImag * phaseShiftStepImag);
            currentPhaseShiftImag = (tmpReal * phaseShiftStepImag) + (currentPhaseShiftImag * phaseShiftStepReal);
          }

          halfSize = halfSize << 1;
        }

        i = bufferSize/2;
        while(i--) {
          spectrum[i] = 2 * Math.sqrt(real[i] * real[i] + imag[i] * imag[i]) / bufferSize;
        }
      };
    </script>
  </body>
</html>

Writing Audio

It is also possible to set up an audio element for raw writing from script (i.e., without a src attribute). Content scripts can specify the audio stream's characteristics, then write audio frames using the following methods:

mozSetup(channels, sampleRate, volume)

// Create a new audio element
var audioOutput = new Audio();
// Set up the audio element for a 2-channel, 44.1 kHz audio stream, with volume set to full.
audioOutput.mozSetup(2, 44100, 1);

mozWriteAudio(buffer)

// Write samples using a JS Array
var samples = [0.242, 0.127, 0.0, -0.058, -0.242, ...];
audioOutput.mozWriteAudio(samples);

// Write samples using a Typed Array
var samples = new Float32Array([0.242, 0.127, 0.0, -0.058, -0.242, ...]);
audioOutput.mozWriteAudio(samples);

mozCurrentSampleOffset()

// Get current position of the underlying audio stream, measured in samples written.
var currentSampleOffset = audioOutput.mozCurrentSampleOffset();

Since the AudioWritten event and the mozWriteAudio() method both use Float32Array, it is possible to take the output of one audio stream and pass it directly (or process it first and then pass it) to a second:

<audio id="a1" 
       src="song.ogg" 
       onloadedmetadata="loadedMetadata(event);"
       onaudiowritten="audioWritten(event);"
       controls="controls">
</audio>
<script>
var a1 = document.getElementById('a1'),
    a2 = new Audio();

function loadedMetadata(event) {
  // Mute a1 audio.
  a1.volume = 0;
  // Setup a2 to be identical to a1, and play through there.
  a2.mozSetup(event.mozChannels, event.mozRate, 1);
}

function audioWritten(event) {
  // Write the current frame to a2
  a2.mozWriteAudio(event.mozFrameBuffer);
}
</script>

Audio data written using the mozWriteAudio() method needs to be written at a regular interval, in equal portions, in order to stay a little ahead of the current sample offset of the hardware (which can be obtained with mozCurrentSampleOffset()), where a little means something on the order of 500ms of samples. For example, if working with 2 channels at 44100 samples per second, a writing interval of 100ms, and a pre-buffer of 500ms, one would write an array of (2 * 44100 / 10) = 8820 samples on each interval, and keep writing until the total number of samples written reaches (currentSampleOffset + 2 * 44100 / 2).
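
Restating that arithmetic as a sketch (assuming an audioOutput element already set up with mozSetup() as in the earlier snippets; the variable names here are illustrative only):

// Illustrative only: compute how much to write per interval, and how far
// ahead of the hardware position to stay, for 2 channels at 44100 samples per second.
var channels    = 2,
    rate        = 44100,
    intervalMs  = 100,   // write every 100ms
    prebufferMs = 500;   // stay roughly 500ms ahead of playback

// Samples per write: 2 * 44100 / 10 = 8820.
var portionSize = channels * rate * intervalMs / 1000;

// Keep writing until this many samples have been written in total:
// currentSampleOffset + 2 * 44100 / 2 = currentSampleOffset + 44100.
var targetOffset = audioOutput.mozCurrentSampleOffset() +
                   channels * rate * prebufferMs / 1000;

The complete tone generator example below applies the same pattern with a single channel.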

Complete Example: Creating a Web Based Tone Generator

This example creates a simple tone generator, and plays the resulting tone.

<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Audio Write Example</title>
  </head>
  <body>
    <input type="text" size="4" id="freq" value="440"><label for="freq">Hz</label>
    <button onclick="start()">play</button>
    <button onclick="stop()">stop</button>

    <script type="text/javascript">
      var sampleRate = 44100,
          portionSize = sampleRate / 10, 
          prebufferSize = sampleRate / 2,
          freq = undefined; // no sound

      var audio = new Audio();
      audio.mozSetup(1, sampleRate, 1);
      var currentWritePosition = 0;

      function getSoundData(t, size) {
        var soundData = new Float32Array(size);
        if (freq) {
          var k = 2* Math.PI * freq / sampleRate;
          for (var i=0; i<size; i++) {
            soundData[i] = Math.sin(k * (i + t));
          }
        }
        return soundData;
      }

      function writeData() {
        while(audio.mozCurrentSampleOffset() + prebufferSize >= currentWritePosition) {
          var soundData = getSoundData(currentWritePosition, portionSize);
          audio.mozWriteAudio(soundData);
          currentWritePosition += portionSize;
        }
      }

      // initial write
      writeData(); 
      var writeInterval = Math.floor(1000 * portionSize / sampleRate);
      setInterval(writeData, writeInterval);

      function start() {
        freq = parseFloat(document.getElementById("freq").value);
      }

      function stop() {
        freq = undefined;
      }
    </script>
  </body>
</html>

DOM Implementation

nsIDOMNotifyAudioMetadataEvent

Audio metadata is provided via custom properties of the media element's loadedmetadata event. This event occurs once when the browser first acquires information about the media resource. The event details are as follows:

  • Event: LoadedMetadata
  • Event handler: onloadedmetadata

The nsIDOMNotifyAudioMetadataEvent interface is defined as follows:

interface nsIDOMNotifyAudioMetadataEvent : nsIDOMEvent
{
  readonly attribute unsigned long mozChannels;
  readonly attribute unsigned long mozRate;
  readonly attribute unsigned long mozFrameBufferLength;
};

The mozChannels attribute contains the number of channels in this audio resource (e.g., 2). The mozRate attribute contains the number of samples per second that will be played, for example 44100. The mozFrameBufferLength attribute contains the number of samples that will be returned in each AudioWritten event. This number is a total for all channels (e.g., 2 channels * 2048 samples = 4096 total).
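
As a rough illustration of how these values relate (a sketch only; the handler and variable names are not part of the API), the per-channel frame length and the approximate duration of each frame can be derived as follows:

// Hypothetical handler for the extended loadedmetadata event.
function audioInfo(event) {
  var channels          = event.mozChannels,          // e.g., 2
      rate              = event.mozRate,              // e.g., 44100
      frameBufferLength = event.mozFrameBufferLength; // e.g., 4096 (total for all channels)

  // Samples per channel in each AudioWritten frame, e.g., 4096 / 2 = 2048.
  var samplesPerChannel = frameBufferLength / channels;

  // Approximate duration of one frame in seconds, e.g., 2048 / 44100 ~= 0.046s.
  var frameDuration = samplesPerChannel / rate;
}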

nsIDOMNotifyAudioWrittenEvent

Audio data is made available via the following event:

  • Event: AudioWritten
  • Event handler: onaudiowritten

The nsIDOMNotifyAudioWrittenEvent interface is defined as follows:

interface nsIDOMNotifyAudioWrittenEvent : nsIDOMEvent
{
  // mozFrameBuffer is really a Float32Array, via dom_quickstubs
  readonly attribute nsIVariant    mozFrameBuffer;
  readonly attribute unsigned long mozTime;
};

The mozFrameBuffer attribute contains a typed array (Float32Array) holding the raw audio data (float values) obtained from decoding a single frame of audio. For stereo audio, this data is interleaved in the form [left, right, left, right, ...]. All audio frames are normalized to a length of 4096. Note: this size may change in future versions of this API in order to more properly deal with sample rate and channel variations.

The mozTime attribute contains an unsigned integer representing the time in milliseconds since the start.
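
For illustration, a minimal AudioWritten handler might de-interleave the framebuffer into separate channel arrays using the layout described above (a sketch only, assuming a stereo stream; the function and variable names are not part of the API):

function audioWritten(event) {
  var fb     = event.mozFrameBuffer,  // interleaved Float32Array
      frames = fb.length / 2,         // assuming 2 channels
      left   = new Float32Array(frames),
      right  = new Float32Array(frames);

  // Split the interleaved [left, right, left, right, ...] data
  // into one array per channel.
  for (var i = 0; i < frames; i++) {
    left[i]  = fb[2 * i];
    right[i] = fb[2 * i + 1];
  }

  // event.mozTime gives the frame's start time in milliseconds.
  var seconds = event.mozTime / 1000;
}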

nsIDOMHTMLAudioElement additions

Audio write access is achieved by adding three new methods to the HTML media element:

void mozSetup(in long channels, in long rate, in float volume);

void mozWriteAudio(array); // array is Array() or Float32Array()

unsigned long long mozCurrentSampleOffset();

The mozSetup() method allows an <audio> element to be set up for writing from script. This method must be called before mozWriteAudio() can be called, since an audio stream has to be created for the media element. It takes three arguments:

  1. channels - the number of audio channels (e.g., 2)
  2. rate - the audio's sample rate (e.g., 44100 samples per second)
  3. volume - the initial volume to use (e.g., 1.0)

The choices made for channels and rate are significant, because they determine the frame size you must use when passing data to mozWriteAudio(). That is, you must either pass an array with 0 elements--similar to flushing the audio stream--or pass enough data for each channel specified in mozSetup().

The mozSetup() method, if called more than once, will create a new audio stream (destroying the existing one, if present) with each call. Thus it is safe, though unnecessary, to call it more than once.

The mozWriteAudio() method can be called after mozSetup(). It allows audio data to be written directly from script. It takes one argument:

  1. array - this is a JS Array (i.e., new Array()) or a typed float array (i.e., new Float32Array()) containing the audio data (floats) you wish to write. It must be 0 or N (where N % channels == 0) elements in length, otherwise a DOM error occurs.
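
As a sketch of what those length rules mean in practice for a two-channel stream (the sample values below are arbitrary):

var audioOutput = new Audio();
audioOutput.mozSetup(2, 44100, 1);  // 2 channels: writes must be a multiple of 2

// Valid: 4 elements, interleaved as [left, right, left, right].
audioOutput.mozWriteAudio(new Float32Array([0.1, 0.1, -0.1, -0.1]));

// Valid: an empty array, similar to flushing the audio stream.
audioOutput.mozWriteAudio(new Float32Array(0));

// Invalid: 3 elements is not a multiple of 2 channels, so a DOM error occurs.
// audioOutput.mozWriteAudio(new Float32Array([0.1, 0.1, -0.1]));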

The mozCurrentSampleOffset() method can be called after mozSetup(). It returns the current position (measured in samples) of the audio stream. This is useful when determining how much data to write with mozWriteAudio().

All of mozWriteAudio(), mozCurrentSampleOffset(), and mozSetup() will throw exceptions if called out of order.
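
For example, calling the methods out of order might be handled as in the following sketch (illustrative only, not prescribed by the API):

var out = new Audio();

try {
  // Calling mozWriteAudio() or mozCurrentSampleOffset() before mozSetup()
  // throws, because no audio stream exists for the element yet.
  out.mozWriteAudio(new Float32Array(2));
} catch (e) {
  // Expected: the stream must be created with mozSetup() first.
}

out.mozSetup(2, 44100, 1);
out.mozWriteAudio(new Float32Array(2));     // OK after mozSetup()
var offset = out.mozCurrentSampleOffset();  // OK after mozSetup()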

Additional Resources

A series of blog posts documents the evolution and implementation of this API: http://vocamus.net/dave/?cat=25. Another overview by Al MacDonald is available here.

Obtaining Code and Builds

A patch is available in the bug, if you would like to experiment with this API. We have also created builds you can download and run locally:

NOTE: the API and implementation are changing rapidly. We aren't able to post builds as quickly as we'd like, but will put them here as changes mature.

A version of Firefox combining Multi-Touch screen input from Felipe Gomes and audio data access from David Humphrey can be downloaded here.

JavaScript Audio Libraries

  • We have started work on a JavaScript library to make building audio web apps easier. Details are here.
  • dynamicaudio.js - An interface for writing audio with a Flash fallback for older browsers.

Working Audio Data Demos

A number of working demos have been created, including:

NOTE: If you try to run demos created with the original API using a build that implements the new API, you may encounter bug 560212. We are aware of this, as is Mozilla, and it is being investigated.

Demos Working on Current API

Demos Needing to be Updated to New API

Third Party Discussions

A number of people have written about our work, including: