SpeechRTC - Speech enabling the open web

SpeechRTC and Mozilla

Speech recognition is almost a standard feature on any modern handset, and the goal of SpeechRTC is to bring it to Firefox OS and other Mozilla products by creating a scalable and flexible platform focused on delivering a great experience to users and empowering developers, finally offering full support for the Web Speech API and other tools.

Mozilla Hacks blog post

This is the current status of Web Speech API support across browsers

Starting point

SpeechRTC is already used in two published Firefox OS apps[1], and has proved to run both online and offline on any device running 1.3, even Unagi devices. So the fast track is to first integrate it into FxOS through some OS-level integrated apps, to build the foundations, and then release the Web Speech API to developers in sequence.

The Client

Online Mode: In online mode, audio is captured, encoded to Opus through MediaRecorder, and streamed over WebSockets to a Node.js application on the server that handles the connection with the decoder. There are also methods to switch the language model when necessary.
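
A minimal sketch of this capture path, assuming a hypothetical wss://speech.example.org endpoint and omitting error handling:

    // Capture microphone audio, encode it to Opus via MediaRecorder, and
    // stream the chunks over a WebSocket. The endpoint URL is an assumption.
    const socket = new WebSocket('wss://speech.example.org/recognize');

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      // Ask MediaRecorder for an Opus-capable container.
      const recorder = new MediaRecorder(stream, {
        mimeType: 'audio/webm;codecs=opus'
      });

      // Ship each encoded chunk to the server as soon as it is available.
      recorder.ondataavailable = (event) => {
        if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(event.data);
        }
      };

      recorder.start(250); // emit a chunk roughly every 250 ms
    });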

Offline Mode: In offline mode, audio is captured as PCM from gUM and streamed to a web worker "thread", which processes it and handles decoding with the decoder API ported to JS by Emscripten. Like online mode, offline mode supports language model switching. Although this proved to work, the ideal approach is to run the decoder in a separate C++ process and communicate with it through IPC, so it can run even on phones with constrained CPUs.
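
A minimal sketch of the worker hand-off, assuming a hypothetical decoder-worker.js that wraps the Emscripten-compiled decoder:

    // Capture raw PCM with the Web Audio API and post each buffer to a web
    // worker. "decoder-worker.js" and its message shape are assumptions.
    const worker = new Worker('decoder-worker.js');
    const audioCtx = new AudioContext();

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      const source = audioCtx.createMediaStreamSource(stream);
      const processor = audioCtx.createScriptProcessor(4096, 1, 1);

      // Copy each PCM frame out of the audio thread and hand it to the worker.
      processor.onaudioprocess = (event) => {
        const pcm = event.inputBuffer.getChannelData(0);
        worker.postMessage({ type: 'pcm', samples: new Float32Array(pcm) });
      };

      source.connect(processor);
      processor.connect(audioCtx.destination);
    });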

The Server

In online mode, the Node.js application that receives audio and grammars from peers is responsible for handling the connection with the voice server, which decodes Opus to PCM and passes it to the decoder when performing recognition, or switches the language model when requested. Some have argued in the past that decoding should also run on Node, but I decided it would be better for the project to decouple it, since we may need to use different decoders and voice servers that may not run on JavaScript.
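
A rough sketch of that relay, assuming the ws npm package and a voice server listening on a local TCP port; the port numbers and the control-message framing are illustrative, not the actual protocol:

    // Relay audio and control messages between browser peers and the voice
    // server. Ports and the set-grammar message format are assumptions.
    const { WebSocketServer } = require('ws');
    const net = require('net');

    const wss = new WebSocketServer({ port: 8080 });

    wss.on('connection', (client) => {
      // One upstream connection to the voice server per peer.
      const upstream = net.connect(5050, 'localhost');

      client.on('message', (data, isBinary) => {
        if (isBinary) {
          // Opus audio: forward untouched; the voice server decodes it to PCM.
          upstream.write(data);
        } else {
          // Text frames carry control messages, e.g. a language model switch.
          upstream.write(
            JSON.stringify({ cmd: 'set-grammar', body: data.toString() }) + '\n');
        }
      });

      // Relay recognition results back to the browser.
      upstream.on('data', (result) => client.send(result.toString()));
      client.on('close', () => upstream.end());
    });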

In offline mode, decoding is currently done in a web worker, but the ideal is to run it in a separate standalone process and communicate through IPC. This can dramatically reduce the overhead of running in JavaScript and enable even the $25 phone to have offline speech recognition.

The speech decoder

Decoder

Third-party licensing is extremely costly (deals are usually measured in millions) and leads to an unwanted dependency. Writing a decoder from scratch is tough, and requires highly specialized engineers who are difficult to find.

The good news is that great open source toolkits exist that we can use and enhance. I am a long-time supporter of and contributor to CMU Sphinx, which has a number of quality models in different languages openly available. Plus, pocketsphinx can run very fast and accurately when well tuned, for both FSG and LVCSR language models.

For LVCSR we can also consider Julius and benchmark it, since it has great, proven results.

Automatic retraining

We should also build scripts to automatically adapt the acoustic model to each user's own voice, to constantly improve recognition both for that individual and for the service as a whole.

Privacy

Some have argued with me about privacy in online services. In the ideal scenario, online recognition is actually required only for LVCSR, while FSG can be handled offline if architected correctly. I think letting users choose whether or not to let us use their voice to improve the models is how other OSes handle this issue.

Offline and online

The same speech server can be designed to run both online and offline, leaving the responsibility for handling transmission to the middleware that handles the connections with the front end.

Web Speech API

After we build both the online and offline backends in a scalable way, we connect them to the Web Speech API implementation already present in Gecko and release the API to developers, automatically picking up support for every web app already developed with the Web Speech API that currently runs only on Chrome.
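
For reference, this is the developer-facing surface we would be enabling; a minimal sketch using the standard interfaces, with an example JSGF grammar:

    // The standard Web Speech API that this work exposes in Gecko; the JSGF
    // grammar below is only an example.
    const recognition = new SpeechRecognition();
    const grammars = new SpeechGrammarList();
    grammars.addFromString(
      '#JSGF V1.0; grammar colors; public <color> = red | green | blue;', 1);

    recognition.grammars = grammars;
    recognition.lang = 'en-US';

    recognition.onresult = (event) => {
      // Print the top transcript of the first result.
      console.log(event.results[0][0].transcript);
    };

    recognition.start();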

GSoC Progress

Bug tree on Bugzilla

Builds

Weekly summary

  • Week 1
    • Bonding, discussion with the mentor about the adopted architecture, and an introduction to Gecko
  • Week 2
    • Setup of the environment for Firefox compilation and debugging tools
    • Start of pocketsphinx integration with Gecko
  • Week 3
    • Integrating pocketsphinx with Gecko and Web Speech API layer
  • Week 4
    • Pocketsphinx already integrated with Gecko. Coding the integration with the Web Speech API C++ layer
  • Week 5
    • Pocketsphinx integrated and first decodes already happening. Still working to finish full integration, grammar generation, profiling, etc.
  • Week 6
    • Pocketsphinx integrated and decoding from file with language model switching on Linux and Mac
  • Week 7
    • Patching the build and packaging system to generate builds for B2G and Desktop.
  • Week 8
    • Patch pocketsphinx to load grammars in-memory, and tests on Desktop and Flame.
  • Week 9
    • Tests on different devices and accents, and updates to the models and pocketsphinx sources.
  • Week 10
    • Patch the pocketsphinx recognition service to switch grammars and decode speech entirely in-memory on a thread
  • Week 11
  • Week 12
    • Work with the reviewers to approve the patch for landing
    • Write the mochitests
    • Change SpeechGrammarList to use Promises

Trello

We currently have a board on Trello with live task status: https://trello.com/b/UWXblmKb/webspeech-api

GitHub

Follow the repo: https://github.com/andrenatal/gecko-dev-speech

Mindmap

Mindmap with the big picture

[1] Demos, links, and references