HTML5 Speech API
What and Why?
The HTML Speech Incubator group has proposed the implementation of speech technology in browsers in the form of uniform, cross-platform APIs that can be used to build rich web applications. The API itself consists of two components -
- Speech Input API
- Text to Speech API
The incubator group is still discussing the APIs and their features. The latest draft can be found here: [1]
The group's mailing lists can be found here: [2]
1. Speech Input API
The speech input API aims to provide an alternative input method for web applications without requiring a keyboard or other physical device. This API can be used to input commands, fill input elements, give directions, etc. It is based on the SpeechRequest proposal [3].
The API consists of two main components -
- A Media Capture API to capture a stream of raw audio. Its implementation is platform-dependent; Mac, Windows, and Linux are to be supported first, with support for Android added later.
- A streaming API to asynchronously stream microphone data to a speech recognition server and get the results back, similar to how XMLHttpRequest works. The API should support both local and remote engines, or a combination of the two, depending on the available network connection. A rough usage sketch follows this list.
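To make the intended shape concrete, here is a minimal TypeScript sketch of how a page might drive such a streaming request. Nothing here is implemented in any browser: the SpeechRequest class, its method and event names, and the recognizer URL are all assumptions loosely modelled on XMLHttpRequest and the description above.

```typescript
// Hypothetical sketch: SpeechRequest is not implemented in any browser;
// the shape below simply mirrors XMLHttpRequest as the proposal suggests.
declare class SpeechRequest {
  open(method: "POST", recognizerUrl: string): void; // configure the target recognizer
  send(): void;   // start capturing microphone audio and streaming it
  abort(): void;  // stop capture and cancel the pending recognition
  onresult: ((event: { emma: Document; confidence: number }) => void) | null;
  onerror: ((event: { message: string }) => void) | null;
}

// Example: stream audio to a remote recognizer and fill a text input
// with the recognized text (the endpoint URL is a placeholder).
function dictateInto(input: HTMLInputElement): void {
  const req = new SpeechRequest();
  req.open("POST", "https://example.org/recognize");
  req.onresult = (e) => {
    // The proposal returns results as an EMMA document; for brevity we
    // just take the text content of the whole document.
    input.value = e.emma.documentElement.textContent ?? "";
  };
  req.onerror = (e) => console.error("speech recognition failed:", e.message);
  req.send(); // the browser would show its recording indicator at this point
}
```

The idea is that the browser routes the audio captured by the media capture component to the recognizer itself, so the page only ever sees recognition results, never raw audio.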
Security/Privacy Issues
- A speech input session should be allowed only with the user's consent. Consent could be obtained via a doorhanger notification.
- The user should be notified when audio is being recorded, possibly with a record symbol somewhere in the browser UI itself, such as the URL bar or status bar.
API Design -
The API will look like the interface described in the SpeechRequest proposal.
- The developer should be able to specify a grammar (using SRGS) for the speech, which is useful when the set of possible commands is limited. The recognition response would be in the form of an EMMA document.
- The developer should be allowed to set a threshold for accuracy and sensitivity to improve performance.
- The developer should be able to choose which speech engine to use.
- The developer should be able to start and stop recognition, handle errors, and manage multiple requests as required, as sketched below.
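The design points above could surface as developer-facing options along these lines; the property names (grammarUri, confidenceThreshold, engineUri) and the URLs are purely illustrative assumptions, not part of any published interface.

```typescript
// Hypothetical sketch: every name below is an assumption based on the
// design points above, not a published interface.
declare class SpeechRequest {
  grammarUri?: string;          // SRGS grammar restricting the expected phrases
  confidenceThreshold?: number; // drop hypotheses scored below this value (0..1)
  sensitivity?: number;         // endpointing/microphone sensitivity (0..1)
  engineUri?: string;           // which recognition engine or service to use
  open(method: "POST", recognizerUrl: string): void;
  send(): void;
  abort(): void;
  onresult: ((event: { emma: Document }) => void) | null;
  onerror: ((event: { message: string }) => void) | null;
}

// Configure a request for a small command vocabulary, e.g. media controls.
const recognizerUrl = "https://example.org/recognizer";     // placeholder engine
const req = new SpeechRequest();
req.grammarUri = "https://example.org/media-commands.srgs"; // placeholder grammar
req.confidenceThreshold = 0.6;
req.engineUri = recognizerUrl;
req.open("POST", recognizerUrl);
req.onresult = (e) => {
  console.log("recognized:", e.emma.documentElement.textContent);
};
req.onerror = (e) => console.warn("speech error:", e.message);
req.send(); // start listening

// Stop listening and cancel any pending recognition when the user asks to.
document.getElementById("stop-listening")?.addEventListener("click", () => {
  req.abort();
});
```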
2. Text to Speech API
The text to speech API will be based on Google's proposal ([4]). This API can be used for speech translation, turn-by-turn navigation, dialog systems, etc.
API Design -
- The API will introduce a new element, <tts>, that extends HTMLMediaElement. It will be similar to how the <audio> and <video> tags are implemented; a script-level sketch follows this list.
- A playback UI should allow the user to start, stop, and disable text to speech. The currently spoken word can be highlighted.
- The developer should be able to specify the language and playback position, start and stop playback, and handle errors programmatically.
- The API itself should be independent of the underlying speech synthesizer. If speech synthesis is not supported, appropriate fallback text should be displayed.
- Which speech engines will be supported is yet to be decided.
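As a rough illustration of the intended developer experience, the sketch below drives a hypothetical <tts> element from script. The element, its text property, and the fallback behaviour are assumptions extrapolated from the design points above; no shipping browser implements them.

```typescript
// Hypothetical sketch: no browser ships a <tts> element. This interface just
// mirrors HTMLMediaElement as the proposal suggests; `text` is an assumption.
interface HTMLTtsElement extends HTMLMediaElement {
  text: string; // the text to be spoken
}

function speak(text: string, lang = "en-US"): void {
  // Today createElement("tts") returns an unknown element; the cast is
  // purely to illustrate the intended interface.
  const tts = document.createElement("tts") as HTMLTtsElement;
  tts.lang = lang;
  tts.text = text;
  tts.onerror = () => {
    // Fallback when synthesis is unsupported: display the text instead.
    const fallback = document.createElement("p");
    fallback.textContent = text;
    document.body.appendChild(fallback);
  };
  document.body.appendChild(tts);
  void tts.play(); // start speaking, just like <audio>/<video> playback
  // tts.pause() and tts.currentTime would give stop and position control.
}

speak("Turn left in two hundred meters.");
```

Because <tts> would extend HTMLMediaElement, playback control (play, pause, currentTime) and error handling would come largely for free from the existing media element machinery, which is much of the appeal of this design.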
Tentative Schedule
First Half
(May 24th - June 7th) - Implementing the media capture API
(June 8th - June 14th) - Implementing the SpeechRequest API using Google's speech recognition server, along with unit tests for the same.
(June 15th - June 30th) - Low activity due to exams.
(July 1st - July 8th) - Finish any remaining SpeechRequest coding.
(July 8th - July 13th) - Tying up loose ends, documentation, code review. By the end of this period, I would like to have the Speech Input API fully working.
(July 13th - July 16th) - Mid-term evaluations
If time permits, I'll look at native speech engines and how they can be implemented.
Second Half
(July 17th - July 24th) - Research and decide on possible speech synthesis engines
(July 25th - Aug 8th) - Work on the API implementation and unit tests.
(Aug 8th - Aug 15th) - Tying up loose ends, documentation, code review.
(Aug 16th - Aug 22nd) - Bug fixing and miscellaneous tasks. Committing code to the Mozilla repositories and Google Code.
(Aug 23rd) - Firm "pencils down" date.