WebAL - Interactive audio for browsers

This document is a proposal to the standards-makers and browser manufacturers to produce a "WebAL" audio subsystem within web browsers - analogous to the WebGL graphics API standard.

At the time of writing, there is really only one mechanism supported within web browsers for producing sounds: the HTML5 <audio> tag. Failing that, one must resort to plugins - the most likely being Flash.

What's wrong with using the HTML5 <audio> tag/API?

If all you need to do is replay a piece of music streamed from your server - then nothing. The HTML5 audio markup and corresponding JavaScript API appear strongly oriented toward streaming large audio files over the Internet, a task for which they are reasonably well suited.

However, when it comes to producing compelling games, simulations and other interactive content for the web, the demands placed on the audio system are vastly different and HTML5 audio becomes nearly useless. Pushing the limits of what <audio> can do exposes numerous fatal flaws:

  1. The markup/API in the HTML5 specification is only loosely described. There is much that goes unsaid (e.g. how many sounds can you play at once?).
  2. Not one of the mainstream browsers actually implements 100% of what the specification describes.
  3. Within the subset of the standard that these browsers do claim to support - there are many bugs which have gone unfixed for far too long (over a year in many cases).
  4. There appears to be no support forum of any kind where HTML5 audio experts can be found to answer questions. After two months of strenuous efforts and several seemingly-promising contacts, I have yet to have a single communication with anyone who knows anything about the standard or its implementation.
  5. Even if the specified API were more tightly described and implemented perfectly, it would still be inadequate for aggressively interactive applications such as games.

Together, this speaks of an unloved corner of HTML5 that has been implemented to the minimal extent needed to stream single music tracks - with zero ongoing support.

If there are to be games on the Internet without Flash - we need to take drastic action.

What do games and simulations need?

At the barest minimum, an interactive application will typically need the following features:

  • The ability to have some kind of background or "ambient" sound track (e.g. music, chirping crickets, the sound of the ocean) looping without a break. Firefox provides no viable mechanism for this other than to inform you that a track has finished playing so that you can re-trigger it. But that process takes time - during which there will be an unacceptable break in the audio. Chrome does honor the "looping" command in the HTML5 spec - but it does so with a considerable break in the sound track. An acceptable implementation would have to guarantee automatic looping such that the first sample of the sound is played immediately after the last with no delay whatever.
  • The ability to trigger a short sound with almost no latency. The HTML5 audio system has a complex set of commands you can use to control how a sound is "preloaded" - however, by some bizarre logic, the one truly important option...to completely preload the sound into memory and keep it there...is missing! Both Firefox and Chrome appear to stream data no matter what - so there is always a considerable delay between "pulling the trigger" and hearing the gun go "BANG!". Given the tiny amount of memory that a short sound sample might occupy (compared to, say, a photograph or a WebGL texture), this is an unforgivable omission.
  • The ability to reliably play some number of sounds simultaneously. The <audio> specification makes no mention whatever of what happens if you try to play multiple sounds at once - much less provides a means to find out what the maximum number actually is (it is surely not infinite!) - or how you control the use of available numerical precision and range during the mixing of multiple sounds. Browsers do seem to be able to play multiple sounds - but the behavior is sporadic and unspecified. Games can manage the number of sounds they are playing - but they need control over that in order that (for example) an ambient cricket chirp doesn't mask the sound of your gun going off when you pull that trigger. (A sketch of how an existing native API expresses these three requirements follows this list.)
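
To make these three requirements concrete, here is a rough sketch of how an existing native API - OpenAL, which is proposed below as the basis for "WebAL" - expresses them in C. The PCM pointers, sizes and sample rate are placeholders, and an output device/context is assumed to have been opened already; this is illustrative, not a finished design.

 /* Minimal sketch: gapless looping, preloaded low-latency triggering and
  * independent simultaneous voices, using core OpenAL 1.1 calls.
  * Assumes alcOpenDevice()/alcCreateContext() have already been called. */
 #include <AL/al.h>

 void setup_game_audio(short *ambientPCM, int ambientBytes,
                       short *bangPCM,    int bangBytes)
 {
   /* 1. Gapless ambient loop: the whole sample is resident in a buffer and
    *    the implementation loops it sample-accurately - no script re-trigger. */
   ALuint ambientBuf, ambientSrc;
   alGenBuffers(1, &ambientBuf);
   alBufferData(ambientBuf, AL_FORMAT_MONO16, ambientPCM, ambientBytes, 44100);
   alGenSources(1, &ambientSrc);
   alSourcei(ambientSrc, AL_BUFFER, ambientBuf);
   alSourcei(ambientSrc, AL_LOOPING, AL_TRUE);
   alSourcePlay(ambientSrc);

   /* 2. Low-latency trigger: the gunshot is fully preloaded into memory now;
    *    a later alSourcePlay(bangSrc) starts it with no streaming delay. */
   ALuint bangBuf, bangSrc;
   alGenBuffers(1, &bangBuf);
   alBufferData(bangBuf, AL_FORMAT_MONO16, bangPCM, bangBytes, 44100);
   alGenSources(1, &bangSrc);
   alSourcei(bangSrc, AL_BUFFER, bangBuf);

   /* 3. Simultaneous sounds: each source is an independent voice, and the
    *    game controls their relative levels so one doesn't mask another. */
   alSourcef(ambientSrc, AL_GAIN, 0.4f);
   alSourcef(bangSrc,    AL_GAIN, 1.0f);
 }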

Additionally, a "high end" application would greatly benefit from:

  • The ability to control the frequency of replay and volume of sounds dynamically in order to simulate (for example) doppler shift.
  • The ability to control reverb/echo of sounds in order to adjust the audio to the virtual space in which it's being played.
  • The ability to place monophonic sounds anywhere in the stereo or 5.1 surround-sound space (see the sketch after this list).
  • MIDI-file support (a much more bandwidth-efficient way to support game music that also permits dynamically created music that can shift to suit the game state).
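
As a rough sketch of how most of these items look in OpenAL 1.1 today: each source carries a 3D position and velocity from which the implementation derives spatialization and doppler shift, and pitch/gain can be set directly. Reverb/echo is available through the EFX extension rather than the core API, and MIDI is not covered by OpenAL at all. The source handle and coordinates below are placeholder values.

 /* Sketch of per-source spatialization and doppler control in core OpenAL.
  * 'src' is assumed to be an already-created source; values are placeholders. */
 #include <AL/al.h>

 void position_sound(ALuint src, float x, float y, float z,
                     float vx, float vy, float vz)
 {
   alDopplerFactor(1.0f);                      /* overall doppler strength      */
   alSource3f(src, AL_POSITION, x, y, z);      /* place the mono sound in 3D    */
   alSource3f(src, AL_VELOCITY, vx, vy, vz);   /* used to compute doppler shift */
   alSourcef(src, AL_PITCH, 1.0f);             /* replay frequency can be set...*/
   alSourcef(src, AL_GAIN,  1.0f);             /* ...as can volume, per source  */

   /* The listener (the audio "camera") has its own position and orientation. */
   ALfloat orientation[6] = { 0.0f, 0.0f, -1.0f,   /* "at" vector */
                              0.0f, 1.0f,  0.0f }; /* "up" vector */
   alListener3f(AL_POSITION, 0.0f, 0.0f, 0.0f);
   alListenerfv(AL_ORIENTATION, orientation);
 }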

What is proposed here

Clearly we need something entirely new to the browser world. The <audio> tag would be exceedingly difficult to repair and enhance at this point. Agreeing on a new specification would be a nightmarish task if we had to develop it from scratch and get everyone to agree to it.

What is needed is the adoption of an existing, widely-accepted and "open sourced" sound specification - much as WebGL was developed from OpenGL-ES by the Khronos Group and various other interested parties (Mozilla, Apple, Google, etc). The parties involved wouldn't have to make a million tiny decisions - simply agreeing that the existing API is what we want is sufficient to allow rapid progress.

Ideally, we would mirror the approach of taking an existing standard, "wrapping" it with JavaScript bindings and tweaking it for the web's needs for security and networkability. That model has proven extremely successful for WebGL - and we should emulate it here.

I call this hypothetical API "WebAL" (Web-based Audio Library). I propose that it be based on the existing and widely used OpenAL library.

Why OpenAL and not something else?

There is an existing standard that the Khronos Group manages called "OpenSL" (SL==Sound Library) - and it has a version called "OpenSL-ES" that is intended for the mobile marketplace (just as OpenGL-ES is for graphics). However, unlike OpenGL, OpenSL is not widely used - although OpenSL-ES is becoming popular for some cellphone applications. OpenSL also lacks many of the higher-level features present in the "wish list" for games and simulation, above.

A much more popular (and practical) standard for the desktop is "OpenAL" (AL==Audio Library). This standard has been around for a very long time and is widely implemented and used in hundreds of commercial games and simulations across PCs and game consoles. OpenAL is probably the number one choice for these applications - and it implements every one of the "wish list" items above.

In discussion with the OpenAL people, it seems that the OpenAL 1.1 specification is moderately well formalized - although not as well as OpenGL or OpenSL - but it's good enough to make a superb starting point. Because we know that it is widely used, we also know that it is feature-complete - unlike <audio>. OpenSL has probably never been used in a commercial game - and it would be a much harder sell to get game developers to use it. OpenAL is also a higher-level specification than OpenSL-ES. Features like doppler shift and spatialized stereo can be built on top of OpenSL-ES, but they aren't a part of it. Doing those things in software in JavaScript would be impractical at best - so these things do need to be included in the API.

The "Soft OpenAL" implementation is claimed to be easily portable onto an OpenSL or OpenSL-ES implementation as it already uses many different 'back end' interfaces. There are also hardware-accelerated implementations of OpenAL for several different sound cards.

So:

  1. Take the OpenAL specification, and apply the Khronos Group's level of formality to it - without changing much of what it is or how it works.
  2. For PC-based applications, have the browser either find a "native" OpenAL driver (such as Creative sound cards support) - or use the "Soft OpenAL" library.
  3. For embedded systems such as cellphones, layer OpenAL on top of OpenSL-ES.
  4. Provide JavaScript bindings for all of the OpenAL API (the sketch after this list shows the kind of native calls such bindings would wrap).
  5. Provide loaders for at least Ogg/Vorbis and "Raw" audio formats.
  6. Provide interfaces from the "typed array" mechanism in WebGL to enable blocks of raw audio to be efficiently accessed or created via JavaScript.
  7. Tie down any security issues such as when audio is loaded from a site other than the one originating the web page - just as is already managed for WebGL textures.
  8. Build example programs and a test suite as we go.
  9. Do the initial phases quickly - and get early versions into daily builds of Firefox and WebKit so that developers can beat on it to find the holes.
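
To make steps 2-6 concrete, here is a rough sketch (in C, error handling omitted) of the native OpenAL calls that the JavaScript bindings would wrap: open a device and context - whether a native driver or the "Soft OpenAL" library - upload a block of raw PCM (the native analogue of filling a buffer from a typed array), and play it.

 /* Sketch of the core device/context/buffer/source calls a WebAL binding
  * would expose. The PCM block stands in for data decoded from Ogg/Vorbis
  * or supplied as "raw" audio via a typed array. */
 #include <AL/al.h>
 #include <AL/alc.h>

 int play_raw_block(short *pcm, int bytes, int sampleRate)
 {
   ALCdevice *dev = alcOpenDevice(NULL);        /* default output device */
   if (!dev) return 0;
   ALCcontext *ctx = alcCreateContext(dev, NULL);
   alcMakeContextCurrent(ctx);

   ALuint buf, src;
   alGenBuffers(1, &buf);
   alBufferData(buf, AL_FORMAT_MONO16, pcm, bytes, sampleRate);
   alGenSources(1, &src);
   alSourcei(src, AL_BUFFER, buf);
   alSourcePlay(src);
   return 1;
 }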

In short, repeat - as closely as possible - the work done on WebGL to produce a "WebAL".

What about the <audio> stuff?

The existing audio tag works reasonably well for playing long sounds - such as music tracks - that benefit from streaming. Unlike WebGL, which is built as a layer atop the existing <canvas> system, the relationship here would be reversed: browser writers would be well advised to rework the half-finished/broken audio tag support and rebuild it on top of the foundations provided by OpenAL/SL. The existing audio features should be considered a specialization of WebAL. This would permit things like placing streaming audio sources on moving objects - or placing them out in the stereo/surround-sound field.
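
As a rough sketch of how streamed <audio>-style playback could sit on top of OpenAL: decoded chunks are queued onto a source with alSourceQueueBuffers() and recycled as they finish, while the source itself remains an ordinary OpenAL source that can be positioned in the stereo/surround field. The decode_next_chunk() helper below is hypothetical - it stands in for whatever decoder the browser provides.

 /* Sketch of streaming by buffer-queueing. Call periodically while the
  * stream plays; 'src' already has a few queued buffers and is playing. */
 #include <AL/al.h>

 extern int decode_next_chunk(short *out, int maxBytes);  /* hypothetical decoder */

 void pump_stream(ALuint src, short *scratch, int chunkBytes)
 {
   ALint done = 0;
   alGetSourcei(src, AL_BUFFERS_PROCESSED, &done);
   while (done-- > 0)
   {
     ALuint buf;
     alSourceUnqueueBuffers(src, 1, &buf);               /* reclaim a finished buffer */
     int got = decode_next_chunk(scratch, chunkBytes);   /* refill from the stream    */
     if (got > 0)
     {
       alBufferData(buf, AL_FORMAT_STEREO16, scratch, got, 44100);
       alSourceQueueBuffers(src, 1, &buf);               /* hand it back to the mixer */
     }
   }
 }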

Who does this? Who pays for it? When will it happen?

I have no idea.

I would hope that this would be a natural follow-on for the groups who have come together so successfully to build WebGL - and that the cooperative mechanisms that have achieved this feat could be extended or replicated to solve the audio problem.

We need this soon. If we are to have a future of high-end interactive applications on the web - as promised by WebGL on the graphics side - then audio support cannot be far behind. Fortunately, I believe that the OpenAL API is sufficiently similar to OpenGL that we could put together a draft standard and a rough implementation in short order. OpenAL was specifically designed to be as similar to OpenGL as possible. Many of the issues that have surrounded the philosophy of WebGL would carry over perfectly to WebAL. Both would use the same 4x4 matrix support - both would have loaders that work from a URL - both would use the same 'typed array' mechanisms. Issues of where we stand vis-a-vis extensions are already understood.

The OpenAL specification

Version 1.1 is the latest:

 http://connect.creativelabs.com/openal/Documentation/Forms/AllItems.aspx 

There are additional places where extensions to the spec reside.

Implementations of OpenAL

The main implementations are the one built into Mac OS X, OpenAL-Soft (open source - any platform), and the Creative driver for Windows.

Conclusion

I hope everyone who reads this can understand the need - and find this proposal as compelling as I do.

Interested Parties

  • Myself: steve@sjbaker.org - Professional graphics engineer working in games and simulation for 25 years. Also was an early proponent of OpenAL. Wrote the "ALUT" companion library.
  • The OpenAL developer list: openal-devel@opensource.creative.com
  • The WebGL developer list: public_webgl@khronos.org