Network Working Group                                          E. Burger
Internet-Draft                                  Cantata Technology, Inc.
Expires: December 7, 2006                                   June 5, 2006

          Media Server Control Language and Protocol Thoughts

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at

   This Internet-Draft will expire on December 7, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).


Abstract

   IP multi-function Media Server control is a problem that has slowly
   bubbled up in importance over the past four years.  A driver in the
   IETF is the set of requirements generated by the XCON framework.
   Many approaches have been proposed.  Some of these proposals are
   device-control-oriented, such as H.248.  Others are server-oriented,
   using SIP and application-oriented markup.  Before rushing headlong
   into a framework for a solution, it is time to step back and try to
   understand just what the scope of the problem is.  Once consensus is
   reached, we can then move forward with a framework for a solution.

Burger                  Expires December 7, 2006                [Page 1]

Internet-Draft                MSCL Thoughts                    June 2006

   This document describes a number of existing approaches and proposals
   to solve the Application Server - Media Server protocol problem,
   along with their characteristics, benefits, and drawbacks.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Factors  . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     2.1.  Media Resource Model . . . . . . . . . . . . . . . . . . .  4
     2.2.  Number of Protocol Messages for a Given Operation  . . . .  5
     2.3.  Network Topology . . . . . . . . . . . . . . . . . . . . .  5
     2.4.  Protocol Layer Integrity . . . . . . . . . . . . . . . . .  6
     2.5.  Computer Science Issues  . . . . . . . . . . . . . . . . .  7
     2.6.  Deployment Scale . . . . . . . . . . . . . . . . . . . . .  9
     2.7.  Compatibility with SIP Model . . . . . . . . . . . . . . . 10
     2.8.  Security Issues  . . . . . . . . . . . . . . . . . . . . . 10
   3.  Transport Protocols  . . . . . . . . . . . . . . . . . . . . . 11
     3.1.  Pure Device Control  . . . . . . . . . . . . . . . . . . . 11
     3.2.  Pure SIP . . . . . . . . . . . . . . . . . . . . . . . . . 11
     3.3.  SIP With TCP Side Channel  . . . . . . . . . . . . . . . . 12
     3.4.  SIP With INFO  . . . . . . . . . . . . . . . . . . . . . . 13
     3.5.  SIP With SUBSCRIBE/NOTIFY  . . . . . . . . . . . . . . . . 14
     3.6.  SIP With MEDIA . . . . . . . . . . . . . . . . . . . . . . 14
   4.  Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     4.1.  H.248  . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     4.2.  MSCML  . . . . . . . . . . . . . . . . . . . . . . . . . . 15
     4.3.  MOML/MSML  . . . . . . . . . . . . . . . . . . . . . . . . 18
   5.  Recommendations  . . . . . . . . . . . . . . . . . . . . . . . 20
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 21
   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 21
   8.  Informative References . . . . . . . . . . . . . . . . . . . . 22
   Appendix A.  Contributors  . . . . . . . . . . . . . . . . . . . . 24
   Appendix B.  Acknowledgements  . . . . . . . . . . . . . . . . . . 24
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 25
   Intellectual Property and Copyright Statements . . . . . . . . . . 26

1.  Introduction

   An IP multi-function Media Server is a network server that provides
   media processing services to the network.

   There are two models for media resource servers.  One models the
   media resource server as a box of low-level resources, such as RTP
   mixers, transcoders, audio play and record resources, video play and
   record resources, tone detection and generation resources, and
   resources to connect, or "plumb" the resources together.  The other
   model is that of a server that offers announcement services,
   interactive voice response (IVR) services (including speech
   recognition and speech synthesis modalities), interactive video
   response (IVVR) services, basic mixing services, and enhanced mixing
   services.

   In general, when we say "multi-function Media Server", we are
   referring to the server model.

   As the IP Media Server evolved from a box of low-level resources into
   a first-class server in the Internet, the protocol interfaces to
   control the IP Media Server evolved, as well.  When people thought of
   the media server as a box of low-level resources, device control
   protocols like H.248 [1] seemed appropriate.  At the time, the
   primary model for control of a media server was from a "SoftSwitch",
   or Media Gateway Controller.  The principal application was for
   playing announcements and collecting a small number of digits.  The
   Media Gateway Controller already implemented a device control state
   machine to control the Media Gateways.  Moreover, the Media Gateway
   Controller implemented some form of Gateway Control protocol to
   control the Media Gateways.  Thus it was logical to assume that a
   device control protocol, more specifically H.248 from the IETF
   perspective, would be appropriate for media resource control.

   Although the "SoftSwitch" (traditional telephony) model (and market)
   was an early driver for the need for media resources, within two
   years it was clear that the primary consumer of media resources would
   be Internet-oriented applications.  Developers create and deploy
   these applications on Internet Application Servers, using Internet
   and Web tools and protocols.  These Application Servers have no need
   to control Media Gateways, and thus do not generally have
   implementations of device control protocols such as H.248.  Moreover,
   Application Servers were much more likely to have HTTP [2] and SIP
   [3] and use stimulus-markup, client-server application architectures.

   RFC3087 [4] introduced the concept of addressing services as if they
   were users in SIP.  This meant that it was possible to address
   specific resources from an application simply by sending the session
   to a "user" at a media server.  However, RFC3087 did not provide any
   mechanism to achieve Internet-wide interoperability.  What was needed
   was some sort of naming convention to address the various services
   available at the media server.  The netann [5] specification provides
   such a naming convention.

   Recalling the functions of a multi-function IP Media Server, the
   netann specification is directly sufficient for announcements and
   simple conferencing.

   For Interactive Voice Response (IVR), VoiceXML [6] provides a
   standard method for defining voice (and now video) dialogs.  However,
   there is a need to inform the IP multifunction media server that the
   request is for the VoiceXML service and the URI of the initial
   document.  The netann specification provides this definition.
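
   The naming convention can be sketched as follows.  This is a minimal
   illustration, assuming the user parts and parameters commonly
   associated with the netann convention ("annc" with a "play="
   parameter, "conf=" with a conference identifier, and "dialog" with a
   "voicexml=" parameter).  The host names and document URLs are
   hypothetical, and URI escaping of parameter values is omitted.

```python
def annc_uri(host: str, prompt_url: str) -> str:
    """Announcement service: user part 'annc', prompt named by play=."""
    return f"sip:annc@{host};play={prompt_url}"


def conf_uri(host: str, conference_id: str) -> str:
    """Simple conferencing: the user part carries the conference identifier."""
    return f"sip:conf={conference_id}@{host}"


def dialog_uri(host: str, vxml_url: str) -> str:
    """VoiceXML service: user part 'dialog', initial document named by voicexml=."""
    return f"sip:dialog@{host};voicexml={vxml_url}"


# Hypothetical media server and content locations, for illustration only.
MEDIA_SERVER = "ms.example.net"

for uri in (
    annc_uri(MEDIA_SERVER, "http://prompts.example.net/welcome.wav"),
    conf_uri(MEDIA_SERVER, "weekly-staff-call"),
    dialog_uri(MEDIA_SERVER, "http://apps.example.net/menu.vxml"),
):
    print(uri)
```

   In each case the application selects the service purely by the
   Request-URI of the INVITE; no additional control protocol is needed.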

   What is missing is a method for enhanced conference control.

   By enhanced conference control, we mean facilities for creating sub-
   mixes, recording the mix or a leg, playing media into a mix or leg,
   altering the gain on a leg or the mix as a whole, defining which
   media is eligible for the mix, and so on.

   To date, there have been several proposals, experimental protocols,
   and de facto standards to address the enhanced conference control
   problem.  Factors influencing these protocols include the
   application's media resource model (raw resources versus service
   server), the desire to leverage existing protocol infrastructure
   (such as using SIP Registrars for resource discovery, SIP Proxies for
   resource location, scale, and availability), and the expectations of
   Internet-scale deployment sizing.  The following sections examine
   these factors and then look at the various proposals to address them.

   As a side note, two XML-based, SIP-transported media server control
   markup languages command approximately 100% of the market: MSCML [16]
   and MOML [17].

2.  Factors

2.1.  Media Resource Model

   As the Introduction indicated, many new applications use the Internet
   model for media resources.  That is, applications request media
   services from an Internet-oriented, IP multi-function Media Server.
   However, some legacy applications, as well as application developers
   more comfortable with a telco-oriented approach, would like to model
   the media processing function as a set of low-level resources.

   There is no question that with a low-level model, one has the full
   flexibility to address any possible requirement.  For example,
   creating a sidebar conference is simply the manipulation of some
   mixer resources and plumbing the selected RTP streams (possibly
   through transcoders) to the mixer resources.  Likewise, one can play
   a prompt to a leg by disconnecting the leg from the mixer, allocating
   a media player, plumbing the media player to the RTP port that
   represents the leg, directing the media player to play the prompt,
   deallocating the media player, and finally re-plumbing the RTP stream
   to the mixer.
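
   The prompt-playing sequence above can be sketched as a series of
   low-level commands.  This is only an illustration: the verbs, the
   argument names, and the DeviceControlLog class are hypothetical and
   do not correspond to H.248 or any other real protocol.  The point is
   the number of separate commands the application must issue and keep
   in order.

```python
class DeviceControlLog:
    """Records each low-level command an application would have to send."""

    def __init__(self):
        self.commands = []

    def send(self, verb, **args):
        self.commands.append((verb, args))


def play_prompt_to_leg(ctl, leg_rtp_port, mixer_id, prompt):
    """Play a prompt to one leg using only low-level resource operations."""
    player = "player-1"  # hypothetical resource name
    ctl.send("disconnect", source=leg_rtp_port, sink=mixer_id)
    ctl.send("allocate_player", player=player)
    ctl.send("connect", source=player, sink=leg_rtp_port)
    ctl.send("play", player=player, prompt=prompt)
    ctl.send("deallocate_player", player=player)
    ctl.send("connect", source=leg_rtp_port, sink=mixer_id)


ctl = DeviceControlLog()
play_prompt_to_leg(ctl, leg_rtp_port=49170, mixer_id="mixer-7",
                   prompt="welcome.wav")
for verb, args in ctl.commands:
    print(verb, args)
print("commands issued:", len(ctl.commands))
```

   Each of the six commands is a point where the application must
   handle failure and restore the previous plumbing.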

   Conversely, with an Internet server model, applications request media
   manipulation using protocols appropriate for applications.  For
   example, media streams are addressed using application constructs,
   such as SIP dialog identifiers.  Rather than specifying a sidebar by
   manipulating RTP streams directly, the application specifies which
   legs the Media Server is to place into a sidebar.  In fact, as we
   will show below, one can specify complex topologies, such as
   Agent/Supervisor/Mark, with fewer messages than using a device
   control protocol.

2.2.  Number of Protocol Messages for a Given Operation

   The number of protocol messages required for a given set of
   operations is a factor that can potentially affect the scale of the

   Too many messages can result in bandwidth problems at the media
   server control interface, packet handling problems at either the
   media server or application server, and stack processing problems at
   either the media server or application server.

   Conversely, optimizing for the number of messages can result in
   complex protocols with a very large number of verbs.  This conflicts
   with the engineering principle of offering a simple protocol with a
   small number of verbs.

2.3.  Network Topology

   In determining the control mechanism, we need to examine the control
   topology.  Namely, will there be a one-to-one mapping of Application
   Servers to Media Servers?  Will there be a one-to-many mapping of
   Application Servers to Media Servers?  Will there be a many-to-one
   mapping of Application Servers to Media Servers?  Or can there be a
   many-to-many mapping of Application Servers to Media Servers?  The
   answers to these questions help determine whether there should be a
   single control channel per Media Server, a single control channel per
   Application Server, a single control channel per session, or a single
   control channel per leg.

   Since control channels consume operating system resources, fewer
   control channels use fewer operating system resources.  Of course,
   overall system resource utilization is more complex than simply how
   many channels there are at a given node.  For example, on most
   operating systems, message routing is done in kernel space with
   pointer manipulation.  However, once in application space, message
   routing is often done with buffer copying.

   Another aspect influencing the cardinality of control channels is
   protocol layer integrity.  We will examine this point in the next
   section.

2.4.  Protocol Layer Integrity

   There are many fundamental principles driving the IETF model of
   layered protocols.  For example, a single TCP socket uses fewer
   system resources than ten thousand TCP sockets.  Given that, why do
   we have FTP, TELNET, SMTP, NNTP, MGCP, etc.?  It would appear to be
   much more efficient to establish a single TCP socket between the
   hosts and multiplex the different protocols over that socket.  One of
   the reasons we do not do this is that while we would save on memory
   and kernel processing on the TCP socket, we end up spending memory
   and kernel processing resources on demultiplexing the TCP stream to
   direct the stream to the appropriate application process in user
   space.

   Likewise, one could multiplex a given protocol over a single channel.
   In this case, the decision comes down to programming model.  For
   example, in the FTP case, it is easier to manage the media and
   control separately over separate channels.  Many implementations of
   FTP have the FTP server daemon spawn separate FTP server processes
   to handle requests.  In this way the FTP server process can be quite
   simple and straightforward.

   Another approach physically multiplexes multiple requests onto a
   single port but establishes separate logical sessions.  One protocol
   that uses this model is SIP.  All requests go to a single port
   (usually 5060), yet in the protocol data unit (PDU), we have a dialog
   identifier that identifies which dialog the message belongs to.
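
   This logical demultiplexing can be sketched as follows.  The sketch
   assumes the usual SIP notion of a dialog identifier (the Call-ID
   plus a local tag and a remote tag).  The DialogDemux class, its
   method names, and the sample values are hypothetical and stand in
   for whatever a real SIP stack does internally.

```python
from collections import defaultdict
from typing import NamedTuple


class DialogId(NamedTuple):
    call_id: str
    local_tag: str
    remote_tag: str


class DialogDemux:
    """Routes messages that arrive on one port to per-dialog inboxes."""

    def __init__(self):
        self._inbox = defaultdict(list)

    def deliver(self, call_id, from_tag, to_tag, body):
        # For a request received by this node, the To tag is the local
        # tag and the From tag is the remote tag.
        key = DialogId(call_id, local_tag=to_tag, remote_tag=from_tag)
        self._inbox[key].append(body)
        return key

    def messages(self, key):
        return list(self._inbox[key])


demux = DialogDemux()
leg_a = demux.deliver("call-1@as.example.com", "as-111", "ms-aaa", "play prompt")
leg_b = demux.deliver("call-2@as.example.com", "as-222", "ms-bbb", "mute")
demux.deliver("call-1@as.example.com", "as-111", "ms-aaa", "stop prompt")

print(demux.messages(leg_a))
print(demux.messages(leg_b))
```

   All three messages arrive at the same port, yet each dialog sees
   only its own traffic.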

   The control channel per session model maintains protocol layer
   integrity by allowing the kernel to do appropriate routing of
   requests to the application.

   Multiplexing the control channel requires special considerations.

   If there is a limit of a single control channel at the Media Server,
   then, by definition, there can be only a single Application Server
   controlling it.  This works in a device control model, such as H.248
   [1], where a Media Gateway Controller controls an entire Media
   Gateway.  In order to allow multiple clients to control the server,
   one must "virtualize" the server.  That is, the server presents what
   looks to the client as an entire, self-contained server, while in
   fact those self-contained servers are actually logical partitions of
   the physical server.

   Depending on the server function, such partitioning may be easy or
   extremely complex.  Let us consider the case of a SIP Application
   Server.  A SIP Application Server, or Back-to-Back User Agent
   (B2BUA), looks to the world like a whole bunch of SIP User Agent
   Servers.  This is not too difficult to manage, as the SIP User Agent
   Servers all generally look alike.  On the other hand, consider a SIP
   Media Server.  The SIP Media Server often has a fixed number of
   different types of resources, such as announcement players,
   conference bridges, recorders, and so on.  Partitioning these
   resources can be exceedingly complex.

   Some applications benefit from a single control channel model.  For
   example, the classic SoftSwitch model and the current IMS model
   assume that all media processing requests go through a single network
   element that, in the words of TRON, is a "Master Control Program."
   While many from the telco world are comfortable with having a large,
   centralized system, many in the IETF have found time and time again
   that a single central server rarely meets the requirements for
   Internet scale.  Other methods, such as server farms and alternate
   return contact addresses, enable theoretically infinite scale.

2.5.  Computer Science Issues

   Two issues to consider when using a device control protocol are how
   long it takes to create an application and the quality of the work
   product.  Two factors influencing these issues are the program length
   and cyclomatic complexity.

   There is an interesting result from 30 years of programmer
   productivity studies.  It turns out that with the exception of the
   introduction of compilers, visual editors, and visual debuggers,
   programmer productivity has been relatively constant, at 10 to 50
   lines of code delivered per day.  Thus, reducing the number of lines
   of code required for a given function is an important tactic to
   achieve the goal of improving either the time-to-market or robustness
   of an application.  This is one of the reasons why we code in Java,
   C++, VB, etc., instead of assembly language.

   Cyclomatic complexity measures the number of branches and function
   calls in a given application.  Again, 15 years of research have shown
   a strong correlation between cyclomatic complexity and the difficulty
   of testing and the likelihood of bugs in fielded code.  This is an
   intuitive result: more branches means more test cases, or the
   corollary, that more branches means more code that testing will miss.
   However, the empirical results are more impressive: the higher the
   cyclomatic complexity, the more errors found in the field.
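
   To make the metric concrete, here is a rough sketch that estimates
   cyclomatic complexity for a piece of Python source.  It uses the
   common approximation of counting decision points and adding one.
   It does not count function calls, which the informal description
   above also mentions.  The function name and the sample source are
   illustrative only.

```python
import ast

# Node types treated as decision points in this approximation.
_DECISIONS = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Estimate complexity as the number of decision points plus one."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISIONS):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra short-circuit branches.
            score += len(node.values) - 1
    return score


SAMPLE = '''
def route(msg):
    if msg.kind == "play" and msg.leg:
        return "player"
    for leg in msg.legs:
        if leg.muted:
            continue
    return "mixer"
'''

print(cyclomatic_complexity(SAMPLE))
```

   Every additional branch in the application's control logic raises
   this number, and with it the number of paths that testing must
   cover.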

   Here is a concrete example of how this plays out in practice.  iSCSI
   [7] defines how one can, over IP, read and write blocks on a disk.
   One could then ask, "Why do we access data bases using data base-
   oriented protocols, like TDS [8]?"  After all, one can do all the
   manipulation one needs for a data base application at the disk block
   level.  Moreover, one can virtualize the target disk, so the
   application does not have to have direct control over physical disk

   We would offer the answer is obvious.  Data base application
   developers think and operate at the table access level.  They don't
   care about disk blocks, B-Trees, indices, and so on.

   One could argue that supplying a client library that hides the data
   base-centric operations from an application would hide the low-level
   nature of a disk access protocol from the application.  That is, it
   would present an application-layer interface to the application.  We
   suggest that protocol layer integrity comes into play here as well.
   In particular, embedding data base code in the client means that one
   cannot have any data base innovation at the server.  Everything
   occurs at, and is bound to, the client.

   Clearly there is a need for a low-level disk access protocol.  That
   is what drove the iSCSI effort.  However, application developers need
   a file access protocol like NFS [10]; data base application
   developers need a high-level data base access protocol; mail
   application developers need a mail transfer protocol like SMTP [11];
   and so on.

   A similar situation exists in the media processing milieu.  The IETF,
   with the ITU-T, has created a media gateway control protocol, H.248
   [1].  Although designed for the media gateway control problem, H.248
   has capabilities for controlling arbitrary media functions, albeit at
   a very low level.  H.248, and THE MODEL IT REPRESENTS, assumes a
   master/slave, low-level device control programming model.  This is
   analogous to direct disk block manipulation for data access, as
   represented by iSCSI.  Features accessible via H.248 or protocols in
   the style of H.248 include audio players, audio recorders, RTP
   termination and origination, mixers, tone detectors and generators,
   and plumbing primitives.

   High-level media processing protocols have been proposed, modeling a
   media resource server as just that, a server that offers multimedia
   processing functions.  Services offered by media servers include IVR,
   conference mixing, announcements, interactive video, and so on.

   Consider the choice of terms: an H.248 device offers "features"
   while a media server offers "services".  Section 3 examines the
   different protocol proposals in detail.

2.6.  Deployment Scale

   Just how many sessions do we need at any given Media Server?  First,
   let us consider a Media Server that would handle ALL calls on the

   Take a population of seven billion people.  Let us assume that every
   person calls one other person, on average, once every week.  That
   means we are looking at 1 billion calls per day.  Calculating the
   maximum number of simultaneous calls, let us assume that in any given
   populated time zone, up to 1/12th of the population of the world is
   actively making calls.  The assumption here is that the time zones
   dividing the Pacific and Atlantic Oceans are essentially unpopulated
   (sorry Greenland and Alaska), while the time zones covering Europe
   have a relatively high teledensity.  We make this assumption as we
   assume that busy hour will rotate around the Earth for a given

   With these assumptions, there are about 83 million calls per day in a
   given time zone.  Since, for most applications, 15% of calls occur
   during the busy hour, we are looking at 12.5 million simultaneous
   calls.
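
   The arithmetic behind these figures can be reproduced with a few
   lines.  The constants are the assumptions stated above; the variable
   names are illustrative.  Treating every busy-hour call as
   simultaneous is itself an upper bound.

```python
WORLD_POPULATION = 7_000_000_000
CALLS_PER_PERSON_PER_WEEK = 1
TIME_ZONE_SHARE = 1 / 12   # share of the world active in one time zone
BUSY_HOUR_SHARE = 0.15     # share of a day's calls in the busy hour

calls_per_day = WORLD_POPULATION * CALLS_PER_PERSON_PER_WEEK / 7
calls_per_zone_per_day = calls_per_day * TIME_ZONE_SHARE
busy_hour_calls = calls_per_zone_per_day * BUSY_HOUR_SHARE

print(f"calls per day, worldwide:  {calls_per_day:,.0f}")
print(f"calls per day, one zone:   {calls_per_zone_per_day:,.0f}")
print(f"calls in the busy hour:    {busy_hour_calls:,.0f}")
```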

   Now it is time for a reality check.  Just how many simultaneous
   sessions will any given Application Server or Media Server really
   need to handle?  In the above example, we found an upper limit of
   12.5 million simultaneous sessions ASSUMING ALL CALLS IN THE WORLD GO
   THROUGH THE APPLICATION.  That is a pretty hefty assumption.

   What if we worked it backward?  Let us assume that a single
   Application Server and Media Server provided voice messaging to the
   entire world.  Again, let us start with a population of seven billion
   people.  With a ratio of 200 subscribers per session, we get 35
   million sessions.  Taking time zones into account, we would be
   looking at about 2 million simultaneous sessions.

   What is the point of these calculations?  It is that the argument
   that one must have a single control channel to effectively scale
   services is a bit disingenuous.  Namely, if an Application Server
   will be handling, say, 100 million users, only a small percentage
   will be using the service at any given time.  Moreover, if one
   architects the Application Server as a single node, it will have to
   handle hundreds of thousands of inbound connections anyway.  If you
   can handle a few hundred thousand simultaneous connections, you can
   probably handle another two or three hundred thousand.  To put this
   into perspective, 100,000
   inbound connections represents well over 2 entire IP port address

2.7.  Compatibility with SIP Model

   Various proposals offer to use SIP in some way.  The question is,
   will one use SIP within the acceptable use of SIP, or will one use it
   "because it is there"?

   For example, does a given protocol proposal leverage the SIP routing
   infrastructure, or is it intended for a point-to-point deployment?
   Does the server offer SIP-level services, or is it simply using SIP
   to transport, or tunnel, device control commands?  Does the protocol
   preserve layer integrity, by using references in the SIP domain, or
   does it require references to the SDP [9] or IP domain?

   One measure of a proposal's compatibility with the SIP model is its
   compatibility with SIP Proxies, as defined by RFC3261 [3].  For
   example, does the proposal require SDP
   manipulation?  If so, how deep does the manipulation need to be?
   Clearly, any SDP manipulation makes the protocol incompatible with
   SIP Proxies - SDP modification requires the use of a back-to-back
   User Agent (B2BUA).  Is the B2BUA simply inserting an m-line in the
   SDP to plumb a control channel?  Is the B2BUA parsing the SDP to
   determine RTP addresses and media types?

   The best outcome would be compatibility with pure proxies, as this
   has the highest chance of avoiding compatibility issues in the
   future.

2.8.  Security Issues

   One issue is who is allowed to manipulate what at the Media Server.
   For services like announcements, IVR, and IVVR, a straightforward
   security model is to have commands come on the same SIP dialog as
   the one that established the media connection.  Clearly, if you can
   create
   the connection, you have some kind of relationship with the end
   point, if you are not the requesting end point itself.

   Other relationships get more complicated.  For example, if we have a
   single control pipe from the Application Server, everything is OK if
   there is only one Application Server.  This is the model for H.248.
   However, if we have more than one Application Server, then we have to
   ensure a separation of the resources of one Application Server from
   those of another.

   One solution for this problem is to partition the Media Server into
   multiple virtual Media Servers, each one dedicated to a given
   Application Server.  This is a suggested model in H.248.  However, as
   mentioned above in Section 2.4, this may be difficult for server-
   centric Media Servers.

3.  Transport Protocols

3.1.  Pure Device Control

   H.248 [1] is the IETF/ITU-T media gateway control protocol.  H.248
   provides generic session establishment machinery and gateway internal
   resource interconnection.  H.248 packages define various resources,
   including tone detectors, tone generators, audio recorders, and
   fixed-function audio prompt resources.

   H.248 uses SDP for session negotiation, but it is considerably
   different than SIP's SDP offer/answer [12] protocol.

   H.248 assumes a single media gateway controller per media gateway.
   H.248 uses a single TCP, UDP, or SCTP pipe between the controller and
   the gateway.

   Most H.248 implementations use text encoding over the wire.  For
   those that are enamored with XML PDUs, H.248 does have an ASN.1 [13]
   encoding.  This means one can use XER [14] to have an XML wire
   encoding.

3.2.  Pure SIP

   Using the netann [5] convention, one can perform basic media
   services, such as announcements and basic mixing.  However, SIP does
   not provide the necessary controls for enhanced conferencing, such as
   gain control, identification of preferred speakers (if they speak,
   they have priority in the mix, even if they are not the loudest),
   creating sidebar and other topologies (such as Coach/Agent/Mark), and
   so on.

   Note that Pure SIP uses a single TCP or SCTP socket.  However, there
   is a separate SIP session per leg.

3.3.  SIP With TCP Side Channel

   MRCPv2 [15] is an example of a media processing protocol that uses a
   TCP side channel.  In MRCPv2, the client uses SIP to route to a
   speech server, uses SIP's SDP offer/answer [12] protocol to negotiate
   the media codecs, and specifies the protocol machinery for
   establishing a side channel transfer protocol, such as TCP or TLS,
   for the actual MRCPv2 PDUs.

   The MRCPv2 server hands back a unique session identifier to the
   client.  All subsequent messages relating to a given MRCPv2 session
   include the session identifier.  This means one can share the side
   channel between multiple client instances on the requesting node.
   MRCPv2 allows the client to request channel reuse or to request a new
   channel at session establishment time.  Correspondingly, the MRCPv2
   server can insist on a side channel per session, rather than sharing
   the side channel amongst sessions.

   The MRCPv2 model has the benefit of using the SIP protocol machinery
   for session establishment.  This includes using the SIP security
   mechanisms to authorize the association of the side channel with the
   media channel.

   MRCPv2 itself has the drawbacks of having a totally different state
   machine.  The MRCPv2 state machine is optimized for speech services
   like speech recognition and speech synthesis.  Moreover, the methods
   are incompatible with the needs for conference control.

   In addition, the MRCPv2 approach rules out the use of the protocol by
   SIP Proxies, as the B2BUA must modify the SDP to insert the SDP
   m-line for the control channel.
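
   The kind of SDP modification involved can be sketched as follows.
   This is a minimal illustration, not an implementation of MRCPv2: the
   helper function is hypothetical, and the control m-line and its
   attributes are written in the general style of MRCPv2 examples and
   should be treated as assumptions.  The addresses and ports are
   placeholders.

```python
from typing import List


def add_control_mline(sdp: str, control_lines: List[str]) -> str:
    """Return a copy of the SDP body with a control media section appended."""
    lines = [line for line in sdp.splitlines() if line]
    lines.extend(control_lines)
    return "\r\n".join(lines) + "\r\n"


# SDP as it might arrive from the original caller (placeholder values).
ORIGINAL_SDP = "\r\n".join([
    "v=0",
    "o=- 2890844526 2890844526 IN IP4 client.example.com",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
]) + "\r\n"

# Illustrative control-channel section; treat every value as an assumption.
CONTROL_SECTION = [
    "m=application 9 TCP/MRCPv2 1",
    "a=setup:active",
    "a=connection:new",
]

modified = add_control_mline(ORIGINAL_SDP, CONTROL_SECTION)
media_sections = [l for l in modified.splitlines() if l.startswith("m=")]
print(media_sections)
```

   Because the message body changes in transit, the element making the
   change cannot be a SIP Proxy; it has to act as a B2BUA.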

   One might ask, "If all we are doing is establishing a TCP connection
   to control the media server, what do we need SIP for?"  This is a
   reasonable question.  The key is to be using SIP for media session
   establishment.  If we are using SIP for media session establishment,
   then we need to ensure the URI used for session establishment
   resolves to the same node as the node for session control.  Using the
   SIP routing mechanism, and having the server initiate the TCP
   connection back, ensures this works.  For example, the URI
   sip:myserver.example.com may resolve to
   sip:server21.farm12.northeast.example.net, whereas the URI
   http://myserver.example.com may resolve to
   http://server41.httpfarm.central.example.net.  That is, the host part
   is NOT NECESSARILY unambiguous.
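
   The ambiguity can be illustrated with a toy lookup keyed on both
   the scheme and the host.  The ROUTES table is a hypothetical
   stand-in for whatever DNS and load-balancing steps apply to each
   scheme.  It reuses the example names from the text above.

```python
# Hypothetical resolution results, keyed by (scheme, host).
ROUTES = {
    ("sip", "myserver.example.com"): "server21.farm12.northeast.example.net",
    ("http", "myserver.example.com"): "server41.httpfarm.central.example.net",
}


def resolve(uri: str) -> str:
    """Return the node a URI resolves to; the scheme is part of the key."""
    scheme, _, rest = uri.partition(":")
    host = rest.lstrip("/").split("/", 1)[0]
    return ROUTES[(scheme, host)]


sip_node = resolve("sip:myserver.example.com")
http_node = resolve("http://myserver.example.com")
print(sip_node)
print(http_node)
print("same node:", sip_node == http_node)
```

   A control connection opened independently of the SIP session, using
   only the host name, may therefore land on a different node than the
   one holding the media session.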

3.4.  SIP With INFO

   Two proposals have been put forward that use the SIP dialog for the
   side channel.  Both use the INFO method.  They are MSCML [16] and
   MSML [18].

   MSCML uses the SIP Require and Content-Type headers to ensure
   interoperability and preservation of SIP semantics.  MSCML correlates
   the commands received on the dialog with the dialog's media streams.
   In the case of enhanced conferences, where there are global commands
   such as conference size, playing to the entire conference, or
   recording the entire conference, MSCML has the concept of a
   Conference Control Leg.  The Conference Control Leg is not associated
   with any media dialog.  However, it is a SIP dialog in the normal
   sense.

   MSML relies on a private (non-Internet) agreement between the
   Application Server and Media Server to know the context of the INFO
   messages.  MSML tunnels SDP-layer information over the established
   dialog; in the case of media processing, it uses a secondary markup,
   MOML [17].  MOML is a device control protocol, with primitives
   similar to H.248.

   Deployed versions of MOML/MSML do not take advantage of SIP, for
   example by referencing entities with SIP dialog properties, using SIP
   semantics for control, or transparently correlating SIP dialogs with
   RTP streams.  However, the current version of the MSML specification
   does suggest using the SIP Dialog identifier to identify media
   sessions.

   We will touch upon the content of what goes over the side channel in
   Section 4.

   Using the SIP dialog for the side channel has the benefit of using
   the SIP routing network for getting the messages to locate and follow
   (in the mobility case) the UAS and UAC.  In particular, proxies that
   are important for routing can Record-Route, while proxies that are
   not needed other than for session establishment can choose not to
   Record-Route.  Thus the transport of side channel commands places
   only a small burden on the SIP routing network.

   Note that there are a few problems resulting from the use of INFO.
   First, there are no throttling mechanisms, other than that provided
   by the underlying transport mechanism (TCP or Connection-Mode SCTP).
   If you are using UDP, you are out of luck.  Second, even in the case
   of MSCML, which is well behaved in that the SIP protocol machinery
   guarantees that both the UAS and UAC will interoperate and
   understand the semantics of the MSCML INFO messages, the stacks can
   still get other, ill-behaved INFO messages that they may not
   understand.  Third, even though this has never happened in the real
   world, there is a theoretical problem that INFO message handling may
   overwhelm a proxy.  In practice, one sizes one's proxies to the total
   traffic they need to handle.  Moreover, only active element proxies,
   such as Edge Proxies, need to Record-Route.  That said, this might be
   a problem in the future.

   The following sections explore alternatives that use the SIP Dialog.

3.5.  SIP With SUBSCRIBE/NOTIFY


   As outlined in the expired draft, INFO Considered Harmful [19], the
   events framework (SUBSCRIBE/NOTIFY) addresses all of the problems
   with INFO.  Namely, event packages must offer throttling mechanisms,
   all event packages identify themselves and thus globally
   interoperate, and even stupid proxies that Record-Route everything
   often decide not to Record-Route SUBSCRIBE and NOTIFY messages.

   Of course, SUBSCRIBE/NOTIFY really, really, really should not
   (actually, most of us, including me, say "MUST NOT") reuse the SIP
   dialog directly associated with the media session.  This means we
   lose the auto-correlation feature that we have by using the INFO

   There is a subtler, yet arguably more important problem with using
   SUBSCRIBE/NOTIFY.  Namely, the semantics of SUBSCRIBE are, "tell me
   (monitor) what is going on at the device."  Typical uses for
   SUBSCRIBE are for presence [20] (what is the state of the user?), MWI
   [21] (what is the state of the message store?), and KPML [22] (what
   is the state of the key press buffer?).  No package changes the state
   of the UAS.  Using SUBSCRIBE, for example, to play a prompt or to
   change the configuration of a mixer, most definitely changes the
   state of the UAS.

3.6.  SIP With MEDIA

   Another approach outlined in INFO Considered Harmful [19] is to
   introduce a new method.  This was the route taken by PUBLISH [23], as
   it was not quite NOTIFY.

   Properly defined, a new method can safely share the SIP dialog.
   Moreover, it would satisfy the auto-correlation properties used by,
   for example, MSCML.  Lastly, the semantics would be well defined,
   addressing the issues raised by INFO Considered Harmful.

4.  Models

4.1.  H.248

   H.248 [1] provides:
   1.  A single control channel between Application Server and Media
       Server.
   2.  The possibility for an XML transport encoding.
   3.  Total control of media resources, at the assembly language level.

   The first item is of use to those who would want a single control
   channel and socket per Application Server.  The second item is of use
   to those who love XML.  The third item leaves room for new
   capabilities.  That is, since the Application explicitly
   defines the application-level semantics of media processing at the
   media layer, future Applications can define future, unanticipated

   The drawbacks of H.248 are:
   1.  Layer violation et al.
   2.  Market adoption
   The first item touches upon virtually every issue raised in
   Section 2.  By definition, H.248 is a low-level device control
   protocol.  That means more lines of code for a given function, higher
   complexity for a given function, no compatibility with the SIP model
   (everything becomes an MGC), and the Application Server must dive
   deep into SDP and the media layer to do basic operations.

   The second item, while not in itself a determining factor in the
   IETF, is important to note as a leading indicator.  For many of the
   reasons noted above, neither Application Server developers nor Media
   Server developers desire H.248 as an Application Server - Media
   Server protocol.  Moreover, none of the major media server
   manufacturers offer or plan to offer H.248-based media servers.  In a
   sense, the market has spoken about this option, even in light of the
   1999 declaration (well before there were any enhanced media services)
   by 3GPP that H.248 would be the media server (MRFP) interface.

4.2.  MSCML

   MSCML [16] provides:
   1.  Automatic correlation, including security associations, between
       the control channel and the media session.
   2.  Preservation of SIP semantics, including being SIP Proxy
   3.  Operations and all semantics are at the SIP dialog layer.
   4.  Application Servers can be relatively simple, as addressing of
       media processing commands is straightforward: send the command
       down the associated SIP media dialog.

   5.  Establishing a media session is straightforward: INVITE the Media
       Server to a session.
   6.  Strict adherence to the philosophy espoused by, among other
       places, the Application Interaction Framework [24].

   The drawbacks of MSCML include:
   1.  Even though MSCML properly uses INFO, using INFO in itself has
       theoretical problems with non-interoperating devices.
   2.  By relying on SIP dialogs, the Application Server uses multiple
       SIP dialogs to control, for example, an enhanced conference on
       the Media Server.
   3.  By taking the application layer approach, MSCML requires one to
       two more protocol messages than a device control approach.
   The first issue is a result of using INFO.

   The second issue is more interesting.  For example, in the enhanced
   conference case, that is, where one needs to play or record into the
   entire conference, one has to set up an additional SIP dialog, the
   Conference Control Dialog, per conference.  In the extreme case of
   two-party conferences, this increases the number of SIP dialogs by
   50%.  Of course, few two-party scenarios require the enhanced
   conferencing features, and thus would not increase the number of
   dialogs.  However, if one did need those features, then the dialog
   expansion would occur.
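
   The 50% figure follows from simple arithmetic, sketched below.  The
   sketch assumes one media dialog per participant and exactly one
   Conference Control Dialog per conference; dialog_overhead() is a
   hypothetical helper, not something MSCML defines.

```python
def dialog_overhead(participants):
    """Fractional increase in SIP dialogs caused by adding one
    Conference Control Dialog to a conference.

    Assumes one media dialog per participant.
    """
    if participants < 1:
        raise ValueError("a conference needs at least one participant")
    media_dialogs = participants   # assumption: one dialog per participant
    control_dialogs = 1            # one control dialog per conference
    return control_dialogs / media_dialogs


for n in (2, 5, 20):
    # Prints: 2 50%, then 5 20%, then 20 5%
    print(n, f"{dialog_overhead(n):.0%}")
```

   Under these assumptions the overhead is 1/n, so it shrinks quickly
   as the conference grows; the two-party case is the worst case.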

   The third issue refers to the situation where the Application Server
   wants to place the caller into a conference, but the application
   needs to interact with the caller before the application knows which
   conference to place them into.  In the MSCML model, the application
   has to INVITE the caller into a dialog (VoiceXML) or IVR session,
   determine the address of the conference, and then re-INVITE or REFER
   the caller into the conference.

   Of course, if one uses a low-level device control markup rather than
   an application-level markup like VoiceXML, then the number of
   protocol messages to implement a voice dialog will swamp the extra
   redirect message.

   Interestingly, MSML and MSCML exchange the same number of messages to
   do the same task.

   The re-INVITE model offers total flexibility, in that the application
   never has to change if the modality of the IVR step changes.  For
   example, the IVR step could be to a low-cost audio media resource,
   which then places the caller into a high-cost, 30fps, continuous
   presence video bridge.

    Application Server                            Media Server
             |                                          |
             |INVITE sip:dialog@ms.example.net          |
             |;voicexml=http://as.example.net/get-id    |
             |                                          |
             |200 OK                                    |
             |                                          |
             |ACK                                       |
             |                                          |
             |GET http://as.example.net/cgi-bin/get-id  |
             |                                          |
             |(VoiceXML script)                         |
             |                                          |
             |POST (result)                             |
             |                                          |
             |REFER sip:conf=12345@ms.example.net       |
             |                                          |
             |202 ACCEPTED                              |
             |                                          |
             |NOTIFY                                    |
             |                                          |
             |200 OK (NOTIFY)                           |
             |                                          |
             |                                          |

   The downside of the re-INVITE model is that it involves the endpoint
   in the SDP renegotiation.  This puts an additional burden on the
   Application Server and caller device to relay and act upon the

   The REFER model does not involve the calling endpoint.  However, it
   does have one additional protocol message.

    Application Server                            Media Server
             |                                          |
             |INVITE sip:dialog@ms.example.net          |
             |;voicexml=http://as.example.net/get-id    |
             |                                          |
             |200 OK                                    |
             |                                          |
             |ACK                                       |
             |                                          |
             |GET http://as.example.net/cgi-bin/get-id  |
             |                                          |
             |(VoiceXML script)                         |
             |                                          |
             |POST (result)                             |
             |                                          |
             |REFER sip:conf=12345@ms.example.net       |
             |                                          |
             |202 ACCEPTED                              |
             |                                          |
             |NOTIFY                                    |
             |                                          |
             |200 OK (NOTIFY)                           |
             |                                          |
             |                                          |

4.3.  MSML


   MSML [18] provides:
   1.  As of the -04 draft, a SIP Dialog addressing scheme.
   2.  Arbitrarily complex mixing topologies, on a par with H.248.
   3.  With MOML [17], the audio prompt, record, DTMF detection, and
       other functions of H.248, with the addition of access to speech
   4.  Switching between IVR and conferencing can be done without a re-
       INVITE or REFER.

   The drawbacks of MSML include:

   1.  The application has to be aware of and manipulate the media
       resource plumbing.
   2.  With most operations on a par with H.248, why not use H.248?
   3.  The MSML model assumes everything resides in a single server,
       especially with respect to the audio/video example given above.

    Application Server                            Media Server
             |                                          |
             |                                          |
             |INVITE sip:dialog@ms.example.net          |
             |;moml=cid:foobratz12@ms.example.net *     |
             |                                          |
             |200 OK                                    |
             |                                          |
             |ACK                                       |
             |                                          |
             |GET http://as.example.net/cgi-bin/get-id  |
             |                                          |
             |(VoiceXML script)                         |
             |                                          |
             |POST (result)                             |
             |                                          |
             |INFO (MSML <result>)                      |
             |                                          |
             |200 OK                                    |
             |                                          |
             |INFO (MSML <join>)                        |
             |                                          |
             |200 OK                                    |
             |                                          |
             |                                          |

   * The MSML specification does not state how to start a session.  We
   assume that one starts a MOML session and then sends a <msml>
   document.  The URI of the VoiceXML script, and the programming logic
   necessary to start that script, is embedded in the MSML document sent
   to the Media Server.
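
   To check the claim that MSML and MSCML exchange the same number of
   messages, one can transcribe the MSCML REFER flow and the MSML flow
   from the figures above and count.  The lists below are a
   transcription only; the figures do not label message direction, so
   none is recorded.  differing_steps() is a hypothetical helper.

```python
# Messages transcribed from the MSCML (REFER) and MSML figures.
REFER_FLOW = [
    "INVITE", "200 OK", "ACK",
    "GET", "(VoiceXML script)", "POST (result)",
    "REFER", "202 ACCEPTED", "NOTIFY", "200 OK (NOTIFY)",
]
MSML_FLOW = [
    "INVITE", "200 OK", "ACK",
    "GET", "(VoiceXML script)", "POST (result)",
    "INFO (MSML <result>)", "200 OK", "INFO (MSML <join>)", "200 OK",
]


def differing_steps(a, b):
    """Return (index, a_msg, b_msg) for each position where flows differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(len(REFER_FLOW), len(MSML_FLOW))   # 10 10
for step in differing_steps(REFER_FLOW, MSML_FLOW):
    print(step)
```

   Both flows come to ten messages.  They are identical through the
   POST of the result and differ only in the last four steps, where
   REFER/NOTIFY is replaced by two INFO transactions.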

5.  Recommendations

   This section is in the spirit of getting a conversation started.
   Everything here is opinion.  Feel free to argue.

   First of all, it is clear there is interest in a standard for the
   Application Server - Media Server protocol in the Internet community.
   The adoption of MOML/MSML in the developer community and MSCML in the
   developer and vendor community is an existence proof of the utility
   of, and need for, such a protocol.

   The official impetus for this work is the XCON Media Server
   Requirements [26].  However, in spite of the fact we have VoiceXML
   for application level IVR specification and H.248 for low-level IVR
   specification, people keep asking for IVR with conferencing, as
   evidenced by the XCON requirements.  The problem is this IVR
   functionality bleeds out, and thus we need to ensure it is well
   thought out before just tossing something in there.

   There is a desire to leverage the SIP protocol machinery for media
   session establishment, namely the SIP Offer/Answer protocol.

   Application developers want to see the Media Server as a server that
   offers application-level media processing.  That is, they want to
   model the Media Server as a server that offers IVR, conference
   mixing, and other application-level media processing services.

   If application developers want low-level, DSP-level media
   manipulation, they already have an IETF protocol, H.248.

   If application developers want a single control channel (total,
   including session establishment) from the Application Server to the
   Media Server, they already have an IETF protocol, H.248.

   If application developers want an XML transport encoding for a low-
   level protocol or a single control channel, they already have an IETF
   protocol, H.248.

   Assuming developers do not want H.248, what are the options?

   INFO probably isn't it.

   That leaves two directions to go.  The first is to stick with the SIP
   Dialog model of MSCML and the other is to stick with the side channel
   model of MRCPv2.

   The former would indicate a new method, such as MEDIA.  The latter
   would indicate a new establishment procedure, such as described in
   the other MSRP [25].

   What does all this mean?


   It is easy to identify protocol abuse in the determination of the
   control channel.  However, even if we have a decent control channel
   establishment mechanism, sending the wrong kind of messages down that
   channel can render the protocol less than useful.

   For example, it is great to use SIP to route messages to a media
   server.  However, if those messages emulate H.248, but encoded in
   XML, it would be much more efficient and cleaner, and would avoid
   the layer violation, to simply use H.248.  You can even get H.248 in
   XML!
   Just please, please, please, don't transport it in SIP or a SIP side

      NOTE: This is one of the reasons I pulled out of [25] at the last
      minute.  What goes into the pipe is as important as the pipe

6.  Security Considerations

   One issue is who is allowed to manipulate what at the Media Server.
   For services like announcements, IVR, and IVVR, a straightforward
   security model is to have commands come on the same SIP dialog as
   the one that established the media connection.  Clearly, if you can
   create the connection, you have some kind of relationship with the
   end point, if you are not the requesting end point itself.
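
   A minimal sketch of this same-dialog rule follows.  It assumes the
   Media Server keys each media session by its SIP dialog identifier
   (Call-ID, local tag, remote tag) and rejects commands arriving on
   any other dialog.  The MediaServer class and its methods are
   hypothetical names for illustration, not part of any proposal
   discussed here.

```python
from typing import NamedTuple


class DialogId(NamedTuple):
    """A SIP dialog is identified by Call-ID plus local and remote tags."""
    call_id: str
    local_tag: str
    remote_tag: str


class MediaServer:
    """Toy model: commands are honored only on the establishing dialog."""

    def __init__(self):
        self._sessions = {}   # DialogId -> list of accepted commands

    def establish(self, dialog):
        """Record that this dialog set up a media session."""
        self._sessions[dialog] = []

    def command(self, dialog, cmd):
        """Accept cmd only if the dialog established a media session."""
        if dialog not in self._sessions:
            return False
        self._sessions[dialog].append(cmd)
        return True


ms = MediaServer()
owner = DialogId("abc@as.example.net", "tag-1", "tag-2")
ms.establish(owner)
print(ms.command(owner, "play prompt"))                       # True
print(ms.command(DialogId("other", "x", "y"), "play prompt"))  # False
```

   The authorization decision needs no extra credentials: possession
   of the dialog is the proof of relationship.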

   Other relationships get more complicated.  For example, if we have a
   single control pipe from the Application Server, everything is OK if
   there is only one Application Server.  This is the model for H.248.
   However, if we have more than one Application Server, then we have to
   ensure a separation of the resources from one Application Server from

   One solution for this problem is to partition the Media Server into
   multiple virtual Media Servers, each one dedicated to a given
   Application Server.  This is a suggested model in H.248.  However, as
   mentioned above in Section 2.4, this may be difficult for server-
   centric Media Servers.
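
   The partitioning idea can be sketched as follows.  The sketch
   assumes resources are countable "ports" split evenly among
   Application Servers; both the unit and the even split are
   assumptions made for illustration, as is the PartitionedMediaServer
   class itself.

```python
class PartitionedMediaServer:
    """Toy model: one virtual Media Server per Application Server."""

    def __init__(self, total_ports, app_servers):
        # Assumption: an even split; a real deployment could weight it.
        share = total_ports // len(app_servers)
        self._free = {name: share for name in app_servers}

    def allocate(self, app_server, ports):
        """Take ports from this Application Server's partition only."""
        if self._free.get(app_server, 0) < ports:
            return False
        self._free[app_server] -= ports
        return True

    def free_ports(self, app_server):
        return self._free.get(app_server, 0)


ms = PartitionedMediaServer(100, ["as1", "as2"])
print(ms.allocate("as1", 50))   # True: as1 uses its whole partition
print(ms.allocate("as1", 1))    # False: as1 cannot borrow from as2
print(ms.free_ports("as2"))     # 50: as2 is unaffected
```

   The isolation comes at the cost of stranded capacity: as1 is
   refused even though half of the server is idle.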

7.  IANA Considerations

   As this is an Informative exploration, there are no IANA

8.  Informative References

   [1]   Groves, C., Pantaleo, M., Anderson, T., and T. Taylor, "Gateway
         Control Protocol Version 1", RFC 3525, June 2003.

   [2]   Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
         HTTP/1.1", RFC 2616, June 1999.

   [3]   Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
         Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP:
         Session Initiation Protocol", RFC 3261, June 2002.

   [4]   Campbell, B. and R. Sparks, "Control of Service Context using
         SIP Request-URI", RFC 3087, April 2001.

   [5]   Burger, E., Van Dyke, J., and A. Spitzer, "Basic Network Media
         Services with SIP", RFC 4240, December 2005.

   [6]   Burnett, D., Hunt, A., McGlashan, S., Porter, B., Lucas, B.,
         Ferrans, J., Rehor, K., Carter, J., Danielsen, P., and S.
         Tryphonas, "Voice Extensible Markup Language (VoiceXML) Version
         2.0", W3C REC REC-voicexml20-20040316, March 2004.

   [7]   Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
         Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
         RFC 3720, April 2004.

   [8]   Sybase, Inc., "TDS 5.0 Functional Specification Version 3.4",
         URL http://www.sybase.com/content/1013412/tds34.pdf,
         August 1999.

   [9]   Handley, M. and V. Jacobson, "SDP: Session Description
         Protocol", RFC 2327, April 1998.

   [10]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
         C., Eisler, M., and D. Noveck, "Network File System (NFS)
         version 4 Protocol", RFC 3530, April 2003.

   [11]  Klensin, J., "Simple Mail Transfer Protocol", RFC 2821,
         April 2001.

   [12]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with
         Session Description Protocol (SDP)", RFC 3264, June 2002.

   [13]  Telecommunication Standardization Sector of International
         Telecommunication Union, "Abstract Syntax Notation One (ASN.1):
         Specification of basic notation", ITU-T Recommendation X.680,
         July 2002.

   [14]  Telecommunication Standardization Sector of International
         Telecommunication Union, "ASN.1 encoding rules: XML Encoding
         Rules (XER)", ITU-T Recommendation X.693, December 2001.

   [15]  Burnett, D. and S. Shanmugham, "Media Resource Control Protocol
         Version 2 (MRCPv2)", draft-ietf-speechsc-mrcpv2-09 (work in
         progress), December 2005.

   [16]  Van Dyke, J., "Media Server Control Markup Language (MSCML) and
         Protocol", draft-vandyke-mscml-08 (work in progress), May 2006.

   [17]  Saleem, A. and G. Sharratt, "Media Objects Markup Language
         (MOML)", draft-melanchuk-sipping-moml-06 (work in progress),
         October 2005.

   [18]  Melanchuk, T. and G. Sharratt, "Media Sessions Markup Language
         (MSML)", draft-melanchuk-sipping-msml-05 (work in progress),
         March 2006.

   [19]  Rosenberg, J., "The Session Initiation Protocol (SIP) INFO
         Method Considered Harmful", draft-rosenberg-sip-info-harmful-00
         (work in progress), January 2003.

   [20]  Rosenberg, J., "A Presence Event Package for the Session
         Initiation Protocol (SIP)", RFC 3856, August 2004.

   [21]  Mahy, R., "A Message Summary and Message Waiting Indication
         Event Package for the Session Initiation Protocol (SIP)",
         RFC 3842, August 2004.

   [22]  Burger, E., "A Session Initiation Protocol (SIP) Event Package
         for Key Press Stimulus (KPML)", draft-ietf-sipping-kpml-07
         (work in progress), December 2004.

   [23]  Niemi, A., "Session Initiation Protocol (SIP) Extension for
         Event State Publication", RFC 3903, October 2004.

   [24]  Rosenberg, J., "A Framework for Application Interaction in the
         Session Initiation Protocol (SIP)",
         draft-ietf-sipping-app-interaction-framework-05 (work in
         progress), July 2005.

   [25]  Boulton, C. and T. Melanchuk, "Media Server Request Protocol",
         draft-boulton-media-server-control-00 (work in progress),
         June 2005.

   [26]  Even, R., "Requirements for a media server control protocol",
         draft-even-media-server-req-00 (work in progress),
         January 2005.

Appendix A.  Contributors

   I cannot share blame with anyone on this one.

Appendix B.  Acknowledgements

   Brooks Gelfand in 1985 made the quote, "If you cannot do it in
   assembly language, you cannot do it at all," during an argument I was
   having with another engineer about the relative merits of C versus

   The catalyst for this document was the very hard and dedicated work
   of Chris Boulton, Tim Melanchuk, and me to bang out and argue over
   the other MSRP draft, starting in April of 2005 and lasting through
   the very end of June.

Author's Address

   Eric Burger
   Cantata Technology, Inc.
   18 Keewaydin Dr.
   Salem, NH  03079-2839

   Phone: +1 603 890 7587
   Fax:   +1 603 457 5944
   Email: eburger@cantata.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at

Disclaimer of Validity

   This document and the information contained herein are provided on an

Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


   Funding for the RFC Editor function is currently provided by the
   Internet Society.
