Architectural Framework For Browser-Based Real-Time Communications
Jonathan Rosenberg, Matthew Kaufman, Magnus Hiie, Francois Audet (Skype). September 21, 2010.
Introduction
Real-time communications (RTC) remains one of the few, if not the only, classes of desktop applications that is not yet possible using the native capabilities of the web browser. These applications run natively on the desktop, or are powered by plugins. The functionality provided by these desktop clients is rich and complex, ranging from the user interface, to real-time notifications, to call signaling and call processing, to instant messaging and presence, and of course the real-time media stack itself, including codecs, transport, firewall and NAT traversal, security, and so on. Given the breadth of functionality in today's desktop RTC clients, careful consideration needs to be paid to how that functionality manifests in the browser. What functionality lives within the browser itself? What functionality lives on top of it, either in client-side Javascript or within servers? What protocols are spoken by the browser itself? What protocols can be implemented within the Javascript? What protocols need to be standardized, and which do not? Pictorially, the question is what protocols, APIs, and functionality reside within the box marked "Browser RTC Functionality" in Figure 1. Indeed, the central question is what functionality resides in that box, as that functionality will ultimately dictate the protocols that interface to it and the APIs which control it.
Our answer is a media component model: the browser RTC functionality is decomposed into a set of loosely coupled components, each of which performs some aspect of the real-time processing. Each component has APIs which allow that component to be configured (with sensible defaults where appropriate), along with APIs that allow applications to gather information and statistics about the performance of that module.
[Figure 1: Browser RTC functionality in context. Servers speak on-the-wire protocols and HTTP/WebSockets; Javascript/HTML/CSS applications drive the browser through RTC APIs and other APIs; the browser sits on native OS services and itself speaks on-the-wire protocols.]
The modules would include the codec itself, the acoustic echo canceller (AEC), the jitter buffer, audio and video pre-processing modules, and network transport components (including encryption and integrity protection of media) which speak specific transport protocols (such as the Real-Time Transport Protocol (RTP)). The media component model is purposefully minimalistic. It opts for maximizing the functionality that lives outside of the browser itself, within Javascript or servers. In particular, only functionality which is real-time and which cannot be done using Javascript or server functionality resides within the browser itself. As explained in [Benefits of the Media Component Model], this facilitates innovation, differentiation, and development velocity: the key characteristics that have made the web what it is. As an example, a codec component implementing SILK [SILK ID] might be represented by a Javascript object with properties that mirror the configuration settings of the codec itself: the sample rate (one of narrowband, mediumband, wideband or super-wideband), the packet rate (number of frames per packet), the bitrate (which can vary between 6 and 40 kbps), a slider that adjusts the packet loss resilience, a Boolean which indicates whether inband FEC should be used, and another Boolean which indicates whether to apply silence suppression. Of course, all of these parameters might have reasonable defaults so that non-expert programmers can just make it work. However, an advanced programmer could force a mode or change a setting as needed. After all, the SILK codec itself makes these parameters tunable exactly because there is no one right value; the correct setting depends on the application scenario and the needs of the developer.
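To make this concrete, such a codec component might look like the following sketch. The constructor and property names are illustrative assumptions, not a proposed standard API; only the parameter set mirrors SILK's tunables as described above.

```javascript
// Hypothetical sketch of a SILK codec component exposed to Javascript.
// Property names are assumptions for illustration, not a standard API.
function createSilkCodec(overrides) {
  const defaults = {
    sampleRate: "wideband",    // one of: narrowband, mediumband, wideband, super-wideband
    framesPerPacket: 1,        // packet rate (frames per packet)
    bitrate: 24000,            // bits per second, in the 6000..40000 range
    lossResilience: 0.0,       // the packet-loss-resilience "slider", 0.0..1.0
    inbandFEC: false,          // whether to use inband FEC
    silenceSuppression: false  // whether to apply silence suppression
  };
  // Non-experts take the defaults; advanced programmers override what they need.
  return Object.assign({}, defaults, overrides);
}

const simple = createSilkCodec();                   // defaults "just work"
const tuned  = createSilkCodec({ bitrate: 12000,   // tuned for a lossy mobile link
                                 inbandFEC: true,
                                 lossResilience: 0.5 });
```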
[Figure 2: Federated calling. Each provider's browser runs a JS/HTML/CSS client that signals to its own Web/SIP server over HTTP/WebSockets; the two Web/SIP servers federate via SIP, while media flows directly between the browsers.]
In this example, a call is placed between two different providers. They use a SIP-based interface to federate with each other. However, each of their respective browser-based clients signals to its server using proprietary application protocols built on top of HTTP and WebSockets. For example, provider A might offer simple calling services, and have a very simple web services interface for placing calls:

http://calling.providerA.com/[email protected]&myIP=1.2.3.4:4476

which takes only the called party and local IP/port as arguments. Provider A's server infrastructure (some combination of web and SIP servers, built in any way it likes) uses the identity of the target, along with previously-known information on the capabilities of the caller's browser learned through a web-services registration, to generate a SIP INVITE. This arrives at provider B's server infrastructure, which alerts its browser-based client of the incoming call. Provider B might be an enterprise service provider, and offer much richer features and signaling. Provider B uses a WebSocket interface to the browser, providing it the identity of the caller, the list of available codecs, and so on. B's service provider offers web-services-based APIs for answering the call, declining it, sending it to voicemail, redirecting it to another number, parking it, and so on. APIs within the browser allow each side to instruct the browser to send media, including selection of media types and codecs.

In this model, there is no SIP in the browser. It is our view that SIP has no place within the browser. SIP is an application protocol providing call setup, registration, codec negotiation, chat and presence, amongst other features. For each and every new feature that is desired to run between a SIP client and a SIP server, a new standard must be defined and then implemented. The feature set is indeed vast, considering the wealth of potential endpoints, ranging from simple consumer voice-only clients, to richer videophones, to voice and video multiparty conferencing (including content sharing), to low-end enterprise phones, to high-end executive/admin phones, to contact center endpoints, and beyond. Each of those requires more and more SIP extensions in order to function. As an example, the BLISS working group in the IETF was formed [BLISS charter] to tackle some basic business phone features, including line sharing, park, call queuing, and automated call handling. Each of these individual features requires one or more specifications, and needs to be designed to meet the needs of all of the participants in the process. There are two important consequences of this. First, the requirement of standardization acts as a huge deterrent to innovation. Indeed, in many ways, it is anathema to the very notion of how the web is supposed to work.
In the web model, the provider can define arbitrary content to render to users, craft arbitrary UI, and define arbitrary messaging from the browser back to the server, all without standardization or changes to the web browser. Google does not need to wait for the browsers to implement IMAP in order to provide mail service. Facebook does not need the browser to have XMPP or SIP to enable presence and instant messaging. Why is call processing any different? Why should Skype, or any other real-time communications provider, be constrained by standardized application protocols? Each provider should be able to design and innovate what it needs, and not be constrained by the functionality of application protocols burned into the browser. The second consequence is that interoperability will suffer dramatically. Interoperability between SIP clients and SIP servers is relatively poor, working only for basic call setup, teardown, and basic features. Important concepts like configuration remain poorly standardized and almost never interoperate. The web has certainly had interoperability problems, but nothing like those seen between SIP phones and servers, where in many cases features simply do not and cannot work. Interoperability is improved when there are fewer standards, not more. Instead of adding SIP and its myriad extensions to the browser (the current SIP hitchhiker's guide [RFC5411] includes 140 references, most of which are SIP extensions), application providers can use the tools that are already there, HTTP and WebSockets, and then define whatever signaling functions they desire on top of that, without interoperability consequences.
SIP remains important as glue between service providers, and between server infrastructure within provider networks. However, in a web context, there is simply no need for SIP support in the browser.
Enabling Innovation
One of the reasons why the Web has been successful as a user interface platform is the short turnaround time to deploy new versions of web-based services. Often, these new versions are experiments that vary small details which are important to making the service successful. It is the fine granularity of user interface elements in HTML and related technologies that allows this experimentation with details.
As there is no agreed-upon configuration of real-time audio/video communication technologies that always delivers the best result, we think it is essential to give application developers the same benefit of short turnaround time and the ability to experiment with details. Therefore, the real-time communication primitives offered by user agents to web applications/services should be fine-grained enough to allow for enhanced configurations and possibly new scenarios. Also, the interfaces to these primitives should allow gathering real-world data, in enough detail on how the primitives are operating, to enable the feedback loop of deploy-measure-reconfigure-redeploy.

One of the areas where perhaps the most innovation can be expected is signaling; one only needs to look at the plethora of standards around SIP. Asking user-agent vendors to implement all these standards is a sure way to make the common denominator across user agents marginal. Instead, the browser already has a programmability model (JavaScript) that can handle all these use cases, and more, provided the programming environment has access to the underlying media components, as we propose here.

Drawing parallels again from user interface development, there is an unresolved question of what should be executed by the user agent and what by the web servers (e.g., validation). A similar gray boundary between the client and the server exists in the field of real-time communications. Therefore, we propose to leave standardization of signaling out of scope for this activity, and let web service providers define signaling as they see fit.
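The deploy-measure-reconfigure-redeploy loop described above can be sketched as follows. The statistics and configuration property names are assumptions for illustration; the point is that the adaptation policy lives in Javascript, where the service provider can change it at will, not in the browser.

```javascript
// Hypothetical sketch of the measure-reconfigure step of the feedback loop.
// The stats and config property names are illustrative assumptions.
function reconfigure(stats, config) {
  // Example policy, chosen freely by the page author: when observed packet
  // loss is high, enable inband FEC and trade some bitrate for resilience.
  if (stats.packetLoss > 0.05) {
    return Object.assign({}, config, {
      inbandFEC: true,
      bitrate: Math.max(6000, config.bitrate - 4000)
    });
  }
  return config; // otherwise leave the deployed configuration unchanged
}

const current = { inbandFEC: false, bitrate: 24000 };
const next = reconfigure({ packetLoss: 0.08 }, current);
```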
An alternative approach for adaptive multi-bitrate video streaming was recently adopted by the Flash Player. The video object simply has an API for receiving bits to be played back. The script engine (and thus the script author, usually through the use of a pre-existing library) becomes responsible for determining which bits to download and which bits to pass to the video object. This enables adaptive multi-bitrate HTTP streaming video, but it also enables any number of other uses, many of which were not even contemplated by the providers of that API. It also means that upgrades to this logic come in the form of new script libraries, and not in the form of an upgrade to the Flash Player itself.

We advocate a similar approach here whenever it is possible. With the exception of the passing of real-time data to and from the media components (which we believe must communicate directly in order to meet real-time latency constraints), we advocate placing all of the logic outside of the browser itself and instead into the hands of the page author through JavaScript APIs. These APIs may be more complex to use in some cases, but they minimize the implementation effort on the part of the browser vendor and can provide functionality that has not yet been contemplated. An example of this might be the peer-to-peer NAT traversal problem. Rather than having an API for "browser, please use ICE [RFC5245] to open a connection to another peer", we would instead have APIs like "browser, please send an ICE-compatible STUN [RFC5389] probe to the following candidate address". This allows the actual logic, the sequencing, and the choice of what to implement at the client and what to offload to the server to be in the hands of the JavaScript developer.
We expect that libraries implementing common functionality (such as ICE, which could be built on top of this) will become readily and freely available, and so in short order the extra work required for a page author to work with these lower-level APIs becomes insignificant.
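The difference in granularity can be sketched as follows. Here `sendStunProbe` stands in for the assumed low-level browser primitive ("send an ICE-compatible STUN probe to this candidate"); the sequencing logic that a monolithic ICE API would bury inside the browser instead lives in script, where a library author can change or replace it freely.

```javascript
// Hypothetical sketch: connectivity-check sequencing written in script against
// an assumed low-level primitive, rather than a monolithic "connect with ICE" API.
async function probeCandidates(sendStunProbe, candidates) {
  // The page author (or a library, e.g. a full ICE implementation built on
  // this primitive) owns the ordering, pacing, and selection policy.
  for (const c of candidates) {
    if (await sendStunProbe(c.ip, c.port)) {
      return c; // first responsive candidate wins in this simple policy
    }
  }
  return null; // no candidate answered the probe
}
```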
Interoperability means working with reality, not just standards. As such, it is important that browsers support basic RTP transport for voice and the G.711 codec. Furthermore, they should interoperate with network-based session border controllers, which are the most commonly deployed technique for NAT traversal in existing networks. They should also support media security, based on SRTP [RFC3711].