Media Capture API: Helping Web developers directly import image, video, and sound data into Web apps

Last March, we released a prototype implementation of the audio portion of a working draft of the W3C Media Capture API on HTML5 Labs. This prototype publicized some proposed API enhancements described in section 6.1 of Microsoft’s HTML Speech XG Speech API Proposal. We have now updated the prototype to include the image and video capture features described in the proposal, which support scenarios we’ve heard are important for Web developers, and to incorporate your feedback on audio.

As more and more consumers use mobile devices to take still pictures, videos, and sound clips, Web developers increasingly need support to capture and upload images, video, and sound from their Web sites and applications. A usable and standardized API for media capture means Web sites and apps will be able to access these features in a common way across all browsers in the future.

During this past year, the effort to standardize media capture has intensified. The WebRTC working group was formed and combined scenarios that support basic video and audio capture with the ability to share that media in real-time communication scenarios. A broad interest in both of these scenarios from industry partners and browser vendors alike shifted focus away from the Media Capture API and brought the WebRTC draft spec to the forefront.

This past November, we took our experience with the development of this prototype and interest in media capture for the browser to the W3C's technical plenary meeting (TPAC). Travis Leithead shared some of our feedback with the Device APIs (DAP) Working Group and we continued existing discussions within the HTML Speech Incubator Group. One result of our engagement was the formation of a media capture joint task force in order to bring the best of local media capture and real-time communication scenarios together. We are actively participating in the task force and support the getUserMedia approach to capture.

With the release of this prototype, we give Web developers early access to photo, video and audio media capture APIs in the browser. We anticipate evolving the prototype to share implementation feedback and experience with the new media capture task force. The end goal remains to create the best possible standard for the benefit of the whole Web community.

Let’s also look back at our earlier proposals, explain why we believe the scenarios are still important, and describe why we implemented them in this new version of the prototype:

Privacy of device capabilities

The prototype allows enumeration of the capture device's capabilities (its supported modes). In the old W3C Device Capture API spec, privacy-sensitive information about the device could be leaked to an application because the navigator.device.capture.supported* properties could be accessed without user intervention. Our prototype moves these APIs to an object that is only available after the user gives permission. The W3C's current getUserMedia API does not support enumeration of device capabilities, but we believe enumeration is valuable to Web developers and should be provided in a similarly privacy-sensitive manner.
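The idea of exposing capabilities only after permission is granted can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's actual API: the names CaptureMode, CaptureDevice, supportedImageModes, PermissionPrompt, and makeRequestDevice are all hypothetical and were invented for this example.

```typescript
// Hypothetical shapes for illustration only; not the prototype's real types.
interface CaptureMode {
  type: string;
  width?: number;
  height?: number;
}

interface CaptureDevice {
  // Capabilities live on the device object, so a page cannot read them
  // before the user has granted access to the device.
  readonly supportedImageModes: readonly CaptureMode[];
}

// Stand-in for whatever UI or stored preference the user agent consults.
type PermissionPrompt = () => Promise<boolean>;

// Builds a request function that hands out the device (and therefore its
// capability data) only when the permission prompt resolves to true.
function makeRequestDevice(
  askUser: PermissionPrompt,
  device: CaptureDevice
): () => Promise<CaptureDevice> {
  return async function requestDevice(): Promise<CaptureDevice> {
    const granted = await askUser();
    if (!granted) {
      throw new Error("Permission denied: no device and no capability data");
    }
    return device;
  };
}
```

In this sketch there is no global property to probe, so a page that never asks for permission, or whose request is declined, learns nothing about the hardware.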

Multiple Devices

The W3C's current getUserMedia API is designed to support multiple devices, either via hints to the API or through user preference. This is an improvement for a scenario that was not supported in the old W3C Device Capture API spec.

Our current prototype also supports a conceptually similar design: navigator.device.openCapture() which returns asynchronously with the capture device the user prefers (through preference or UI).

Direct Capture

In the old Media Capture API spec, the Capture.capture[Image|Video|Audio] operations launch an asynchronous UI that returns one or more captures. This means the user has to do something in the Web app to launch the UI and then initiate the capture, which makes it impossible to build capture UI directly into the Web application. Not only would this be unusable for a speech recognition application, but it also places unnecessary user interface constraints on other media capture scenarios.

Our prototype and the current getUserMedia capture API directly capture from the device and return a Blob. Note that for privacy reasons some user agents will choose to display a notification in their surrounding chrome or hardware to make it readily apparent to the user that capture is occurring, together with the option to cancel the capture.

Streaming

For applications like speech recognition, captured audio needs to be sent directly to the recognition service. However, the current getUserMedia API design only supports capturing to Blobs, which delays the ability to process the recorded data.

Our prototype allows starting a capture asynchronously and returning a Stream object containing the captured data. Support for Streams would also be useful in video recording scenarios. For example, using a capture stream, an app could stream a recording to a video sharing site as it is recorded.
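The difference between Blob-style and stream-style capture can be illustrated with a small simulation. This is only a sketch under assumptions: SketchCaptureStream, BufferedRecorder, and forwardLive are hypothetical names made up for this example and do not correspond to the prototype's Stream object or to any W3C interface.

```typescript
type Chunk = Uint8Array;
type ChunkListener = (chunk: Chunk) => void;

// Hypothetical stand-in for a capture stream; not the prototype's Stream type.
class SketchCaptureStream {
  private listeners: ChunkListener[] = [];

  // Register a listener that is called for every chunk as soon as it exists.
  onChunk(listener: ChunkListener): void {
    this.listeners.push(listener);
  }

  // Called by the simulated capture device whenever new data is available.
  push(chunk: Chunk): void {
    for (const listener of this.listeners) {
      listener(chunk);
    }
  }
}

// Blob-style consumer: buffers everything and yields data only at the end.
class BufferedRecorder {
  private chunks: Chunk[] = [];

  constructor(stream: SketchCaptureStream) {
    stream.onChunk((chunk) => {
      this.chunks.push(chunk);
    });
  }

  // Concatenates all buffered chunks; meaningful once capture has finished.
  finish(): Uint8Array {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    return out;
  }
}

// Stream-style consumer: forwards each chunk immediately, for example to a
// recognition service or an upload endpoint supplied by the caller.
function forwardLive(stream: SketchCaptureStream, send: ChunkListener): void {
  stream.onChunk(send);
}
```

With forwardLive, the send callback sees the first chunk while the capture is still running, whereas BufferedRecorder has nothing to offer until finish() is called.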

Preview

In the case of video capture, live preview within the application is important and something that was missing from the old Media Capture API spec.

Both our prototype and the W3C's getUserMedia API allow a preview of the recording to be created with URL.createObjectURL(). This URL can then be used as the value for the src attribute on an <audio> or <video> element.
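A minimal sketch of the preview step might look like this. URL.createObjectURL() and URL.revokeObjectURL() are standard calls; the helper names attachPreview and releasePreview and the MediaElementLike type are assumptions introduced for this example, and the sketch assumes the captured media is available as a Blob.

```typescript
// Minimal structural type so the sketch works with an <audio> or <video>
// element, or with any object that has a writable src property.
interface MediaElementLike {
  src: string;
}

// Creates an object URL for the captured media and points the element at it.
// Returns the URL so the caller can release it later.
function attachPreview(element: MediaElementLike, media: Blob): string {
  const url = URL.createObjectURL(media);
  element.src = url;
  return url;
}

// Releases the object URL once the preview is no longer needed.
function releasePreview(url: string): void {
  URL.revokeObjectURL(url);
}
```

In a page, element would typically be the result of document.querySelector("video"), and releasePreview would be called when the preview is closed so the underlying data can be freed.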

End-Pointing

For applications like speech recognition, it's important to know when the user starts and stops talking. For example, if the app starts recording but the user doesn't start talking, the app may wish to indicate that it can't hear the user. More importantly, when the user stops talking, the app will generally want to stop recording, and transition into working on the recognition results. This sort of capability may also be of some use during non-speech scenarios, to provide prompts to users who are recording videos.

Neither the old Media Capture API nor the current getUserMedia approach supports end-pointing.

To support key speech and voice scenarios, we recommend adding end-pointing capability. The prototype provides this feature so that Web developers can experiment with it and provide feedback, which will also be useful to the W3C.
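To make the notion of end-pointing concrete, here is a deliberately simple energy-threshold sketch. It is not the prototype's algorithm, which this post does not describe; the function names rms and findEndPoints, the default threshold of 0.1, and the default of three quiet frames are all assumptions chosen for illustration. Samples are assumed to be PCM values in the range -1 to 1, grouped into fixed-size frames.

```typescript
interface EndPoints {
  speechStart: number | null; // index of the first loud frame, if any
  speechEnd: number | null;   // index of the first frame of trailing silence, if any
}

// Root-mean-square level of one frame of samples.
function rms(frame: readonly number[]): number {
  if (frame.length === 0) {
    return 0;
  }
  let sum = 0;
  for (const s of frame) {
    sum += s * s;
  }
  return Math.sqrt(sum / frame.length);
}

// Scans frames in order. Speech starts at the first frame whose level reaches
// the threshold. Speech ends at the start of the first run of `hangoverFrames`
// consecutive quiet frames after that, so short pauses do not end the capture.
function findEndPoints(
  frames: readonly (readonly number[])[],
  threshold = 0.1,
  hangoverFrames = 3
): EndPoints {
  let speechStart: number | null = null;
  let quietRun = 0;
  for (let i = 0; i < frames.length; i++) {
    const loud = rms(frames[i]) >= threshold;
    if (speechStart === null) {
      if (loud) {
        speechStart = i;
      }
      continue;
    }
    if (loud) {
      quietRun = 0;
    } else {
      quietRun++;
      if (quietRun >= hangoverFrames) {
        return { speechStart, speechEnd: i - hangoverFrames + 1 };
      }
    }
  }
  return { speechStart, speechEnd: null };
}
```

An app could use a null speechStart to tell the user it cannot hear anything, and a non-null speechEnd as the cue to stop recording and move on to recognition.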

Looking Forward

We are supportive of the getUserMedia API, and note that it incorporates many of the points of feedback previously submitted. To avoid confusion about the future direction of media capture at the W3C, the DAP working group has officially deprecated the old Media Capture API, which now redirects to the media capture joint task force's current deliverable.

In addition to a prototype plugin that exposes the modified APIs, we have added to this package a functional demo app that makes use of them.

Building this prototype and listening to your feedback will help Microsoft and the other browser developers build a better and more interoperable Web. We look forward to continuing this discussion in W3C and helping to finalize the specifications.

—Claudio Caldato, Principal Program Manager, Interoperability Strategy Team

Comments

  • Anonymous
    December 09, 2011
    Finally some news on this subject. I've waited with developing mobile apps till something like this came along. Now, if only all mobile browsers implemented this tonight..

  • Anonymous
    December 09, 2011
    @Marcus Unfortunately, the internet is just a very small portion of where video is used. Outside the browser, the only viable option is H264 if you want to do something meaningful with video. That format has settled down as the industry-norm. We don't want to be thrown back to the dark ages, there are too many format wars out there already.

  • Anonymous
    December 09, 2011
    Does XG stand for anything? Something Generation? Extended Generality? X-Treme Gerontology? Tell me it's X-Treme Gerontology.

  • Anonymous
    December 09, 2011
    XG is the super secret acronym that stands for "Incubator Group" in W3C parlance. If you are among the initiated, you just know. If you aren't, it would take almost three clicks (and a better attention span than a teaspoon) to find out. Which is why it's still super secret.

  • Anonymous
    December 09, 2011
    Awesome, great news! And please keep insisting on a stream-based API; we need some movement into live media on the web, not just plain old YouTube-style progressive download HTML5 media tags. Thanks.

  • Anonymous
    December 09, 2011
    Could this in some way allow direct copy/paste of images into rich web editing pages? For example, taking a screenshot and pasting it into an email in OWA would be great and immediately useful. If you're not working on a technology that would allow this, please get cracking!

  • Anonymous
    December 10, 2011
    You need to give a shot at visual studio 2011b (beta is free at download.microsoft.com) to create some dynamic movies using ecma-javascript. JS is far more diversified than action script!

  • Anonymous
    December 10, 2011
    @Marcus How dare you suggest WebM as a format over h.264? WebM is not even a standard. The VP8 video codec in WebM is a proprietary format fully owned by Google. h.264 is a real standard. It was developed by an industry standard consortium and handed to the independent ISO/IEC standards organization. It is fully unacceptable that WebM should be used on the web whilst Google has not handed the VP8 codec to an independent standards organization. And even then we should understand that WebM is a less efficient codec and has no support in hardware decoding, and as such the consumption of WebM video uses a lot more energy than h.264 video. If all of the internet would use WebM in the same quality video as we use now, then it would cost more than 1000 MWh a year extra energy for computers just to play the WebM video. Also WebM would require extra server and bandwidth cost, which is also environment unfriendly. WebM is waste.

  • Anonymous
    December 10, 2011
    @A_Zune: h.264 is an industry standard, no question about it. It is, however, unusable in a web environment due to its restrictive licensing scheme. On the other hand, WebM's licensing allows it to be used in any browser, for any content in any use. See the GIF controversy a while back for a precedent, before a time when any browser was open-source; now that Chrome and Firefox represent more than 45% of the web browser market, you can forget about ever enforcing h.264 as a media standard. The complexity of h.264 is also troublesome: most mobile devices for example only accept its Baseline profile, which is not much better (if not worse) than h.263 (yeah, DivX) while desktops can push out and play the Main profile (the real deal in h.264 is CABAC). On the other hand, WebM's single profile isn't much more computationally intensive than h.264's Baseline, while approaching Main's quality. Moreover, its origins in VP8 make it very appropriate for streaming capture: it is actually better at still, independent shots than h.264.

    Last, and to give you an idea of how "good" h.264 vs. WebM is on the Web, I took a 5-minute video made using a camcorder on a tripod, containing movements, different lightings, and some panning and zooming, sized at 480x360 px, and packed it using the leading implementation of h.264 (x264) and the latest released version of WebM, with comparable quality settings:

      • h.264 Baseline took up 17 MB, with compression taking 2 minutes on a 2 GHz dual core, making it a 475 Kbit/second stream. While it is usable as an upstream rate in ideal conditions, raising the video's resolution would start pushing it. The CPU time shows that a dedicated chip or a good CPU can handle it.
      • h.264 Main took up 8 MB, with compression taking almost 4 minutes, making it a 224 Kbit/second stream. Half of the above, but for twice the CPU power. And, as said, it can't be read by mobile devices.
      • WebM took up 8.5 MB, with compression taking 2.7 minutes, making it a 238 Kbit/second stream. It is thus computationally close to Baseline, while taking barely more bandwidth than Main.

    Shown side by side, all 3 videos had very similar quality, with ringing or washouts only visible by frame freezing and zooming on all 3, in differing places depending on the strengths and weaknesses of the formats. h.264's intraframes had a tendency to be sharper, but WebM's keyframes were cleaner. Note that on such a sensitive thing as video, results can vary a lot; but real-time capture restricts what can be done by quite a lot. Using special effects incrustation in real time, for example, may show h.264 in a better light. Not everybody is up to doing that though, judging by what people upload to YouTube. And, anyway, the implementation will probably end up being codec-independent due to the fact that it would need to be rewritten from scratch the day a new, better codec comes out.

    --- Resubmitted: IE team, you should really ask for a better working comment system.

  • Anonymous
    December 11, 2011
    Can we please get Microsoft to focus on the underlying matter that is most important. h.264 is not viable for the open web - stop pretending it will ever fly as a viable option and focus on a format that will. As for your media capture thing (whatever this article post was about) please indicate that it will support an open format for Audio and Video - being SILENT in this blog's comments is equivalent to admitting you have no intention of meeting the needs of developers that are creating sites and applications for the web far and wide on desktops, tablets and smartphones. Keep in mind that although Microsoft still has almost 50% of the desktop browser share (globally), MICROSOFT HAS LESS THAN 1% OF THE MOBILE AND TABLET MARKET and therefore has ZERO power in trying to push undesired formats and codecs on developers worldwide. Hear us now - loud and clear WE HAVE NO INTEREST in h.264 for video formats. Please indicate ASAP what your intentions are for supporting a video format appropriate for the web. Thanks, Signed All Web Developers!

  • Anonymous
    December 11, 2011
    I'll start using WebM as soon as Google reimburses me for the $5000 in investments in equipment (vcr, cameras, video, software) I have to replace them for their WebM technology.

  • Anonymous
    December 11, 2011
    ...and I add to the previous comment: And second, after reimbursing me, gives me, in a legally binding way, protection from all possible fines, penalties and other hassles that may arise from using patented technology in their WebM technology. Harry

  • Anonymous
    December 11, 2011
    Neil, When someone signs as "All Web Developers!", people will be less inclined to take you seriously. With that said, I support your cause. Here is a similar case which makes me lose faith in humanity: arstechnica.com/.../is-apple-is-using-patents-to-hurt-open-standards.ars

  • Anonymous
    December 11, 2011
    You can argue all you want until you're blue in the face on WebM vs. h.264 etc. but you're missing the point. 1.) The Web needs a sustainable format. 2.) h.264 CAN'T BE IT! - IT DOESN'T MEET THE NEEDS of the OPEN WEB 3.) h.264 has already failed, by only being supported in 1 browser! (en.wikipedia.org/.../HTML5_video) there isn't and won't be any cross browser support for this format 4.) WebM is at least a lot closer to the ideal web format (it might not be perfect, but heck! there's people that read this blog that have the influence to make it work... just get in a room and work out a solution already!) 5.) The Internet is now stuck waiting, AGAIN! - Microsoft is the only company and IE is the only browser that are not actively, publicly working towards the common goal of an interoperable format. 6.) Microsoft was told during IE9 beta development that h.264 would not fly at all and would harm HTML5's success - here we are now in the midst of HTML5's rise to fame and we are handcuffed by Microsoft's unwillingness to accept open standards. Thanks Microsoft! Not only did you make us suffer with IE6, but now you make us suffer with IE9+. Stop breaking "the future" of the Internet Microsoft! vic

  • Anonymous
    December 11, 2011
    Harry Richter, I'm glad I made you laugh. But unfortunately, Microsoft HAS turned into a patent troll. I guess all these suits it received over the years taught it the wrong lesson. Case in point: the suit against B&N. I refer you to: www.geekwire.com/.../microsoft-cites-new-patents-vs-android (there are many other sources, this came up after a quick search). In that view, EVERY browser is infringing on these precious patents. It means, Microsoft willing, every user browsing the web would pay Microsoft for this patented privilege. A note to Microsoft employees reading this forum: does no one object to this? Or are you afraid for your job? This is 2011, you know.

  • Anonymous
    December 12, 2011
    Re: Comments on the Blogs Comment system. We have been listening and I realize it's taken a very long time, but today we've rolled out another set of fixes around the comments issues you have been reporting. Hopefully you've seen a lack of duplicate comments the past few weeks. Today we've pushed out bits to fix the submission of the comments along with the time comments can take to load. If you still see issues, please use the comment form here (blogs.msdn.com/b/seanjenkin/contact.aspx) and report them to me. I'm not watching every IE blog to know when you note issues so please help yourself by getting the feedback to the correct place. :) Thanks!

  • Anonymous
    December 12, 2011
    Thanks @Sean Jenkin - as you are likely aware the IE blog has had issues posting comments for as long as anyone can recall.  I'm glad to hear someone is paying attention and attempting to fix it. It is also good to see an official contact for the comment form issue blogs.msdn.com/b/seanjenkin/contact.aspx as up until now we (a) didn't know there was a contact to use and (b) our only option was to use this comment form which just created a vicious catch 22 cycle. now before I click post... the "Vegas Baby" part of me wants to place a wager.  I can see that the post button has not been replaced with a simple form.submit(); (the quick and easy foolproof fix) so I'm going to go all in on... "we thought we fixed it but in reality Community Server is just not a good tool for serious blogging" - and thus expecting a failure. [CTRL]+[C] (saving a copy before clicking submit), crossing fingers, and...

  • Anonymous
    December 13, 2011
    @meni 12 Dec 2011 11:31 AM: I'd say aggressive defense of their IP (using the patent system as designed and maintained; that's why it needs some reform). And considering Google's attitude, this situation was quite predictable... No, Google didn't help their partners at all by ignoring potential patent disputes and relevant patents. BTW: Check out these articles: http://arst.ch/rul (Apple may be using patent troll to do its legal dirty work) or http://arst.ch/ru5 (Is Apple using patents to hurt open standards?)
