[Search] [pdf|bibtex] [Tracker] [Email] [Nits]

Versions: 00 01                                                         
Internet Engineering Task Force                                   SIP WG
Internet Draft                              Rosenberg/Mataga/Schulzrinne
draft-rosenberg-sip-app-components-00.txt        dynamicsoft/Columbia U.
November 15, 2000
Expires: May 2001

          An Application Server Component Architecture for SIP


   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet- Drafts as reference
   material or to cite them other than as work in progress.

   The list of current Internet-Drafts can be accessed at

   The list of Internet-Draft Shadow Directories can be accessed at


   An application server is defined as an entity that is capable of
   providing advanced features to users. Examples of features include
   call forwarding, call screening, debit card calling, web interactive
   voice response, etc. However, the set of functions needed to enable a
   broad range of such applications is quite large - it includes speech
   recognition, DTMF recognition and digit collection, text-to-speech
   synthesis, database interfacing, audio and video coding and decoding,
   audio and video bridging and mixing, and signaling, to name a few.
   Supporting such a large set of functions on the same box presents a
   major challenge. To solve this problem, the industry is proposing a
   decomposition of the application server into two components - a media
   server that handles the media component, and an application server
   that handles the call control, data, and signaling. The interface
   that has been proposed between these two elements is a control
   mechanism along the lines of MGCP or Megaco. In this paper, we
   propose an orthogonal decomposition, which breaks an application

Rosenberg/Mataga/Schulzrinne                                  [Page 1]

Internet Draft               AS Components             November 15, 2000

   server into application server components. Each component represents
   a application server in its own right, but it provides a well defined
   component that by itself may be a complete, but simpler, application.

1 Introduction

   An observable trend in VoIP systems is the continuing decomposition
   of monolithic elements into component subparts, with the
   corresponding development of standardized interfaces between
   components. This kind of decomposition can be observed in the
   MGCP/megaco [1] gateway decomposition of a large gateway into a
   signaling gateway (SG), media gateway (MG) and media gateway
   controller (MGC), often referred to as a softswitch. Following that
   decomposition, the softswitch was further decomposed into a pure call
   control component (still referred to as a softswitch) and an
   application server (AS), which provides features and services. The AS
   was then decomposed, breaking it into a signaling piece (still
   referred to as an application server), and a media server (MS), which
   provides the media components of applications. Protocols like MGCP
   [2] and Megaco [3] have been proposed as the interface between an AS
   and MS.

   This paper proposes an additional decomposition of an application
   server into application server components (ASCs). This decomposition
   is orthogonal to the MS/AS decomposition, and differs significantly
   in its goals and benefits. The primary motivation is the recognition
   that most complex (and interesting) applications require a common set
   of core pieces - speech recognition and text-to-speech, translation
   services, conference servers, messaging servers, etc. Each of these
   components is complex and a full-fledged application in its own
   right. In most cases, a complex application really doesn't care about
   the details of the operation of the component. In many cases, these
   components run on separate servers, and often, would be provided by
   separate providers. What is needed, then, is a well-defined,
   distributed interface to these application server components. Here,
   we motivate a distributed decomposition of applications into
   components, and then show why, for many of these, the interface is
   ideally suited for a distributed, session establishment and
   termination interface that follows a standardized pattern of
   addressing and parameter passing. We believe the Session Initiation
   Protocol (SIP) [4] is ideally suited for such an interface.

2 Why Decompose

   The first question to address is "why decompose an application

Rosenberg/Mataga/Schulzrinne                                  [Page 2]

Internet Draft               AS Components             November 15, 2000

   Decomposition is the act of breaking a large, monolithic system into
   a number of smaller compoents that interact according to specified
   behaviors. Decomposition of large components offers a number of

        Scale. As systems need to serve more and more users, there are
             two approaches to scaling up. One is to buy increasingly
             faster hardware, so that the monolithic servers can keep up
             with increasing use. The second is to distribute the work
             across components, so that multiple servers perform the
             work. Distribution is fundamentally cheaper, since the cost
             of large monolithic systems increases exponentially with
             capacity, compared to the linear increase in cost with
             multiple, smaller units. Distribution of work can be done
             through load balancing, where each server remains
             homogeneous, but the work is spread across numerous
             servers, or it can be done through specialization, where
             the work is split into separate functions, and each
             function placed on a separate server. Specialization is
             ideal in cases where the work has different requirements
             for it to be completed. As an example, a component of an
             application may require special purpose hardware. This
             component can distributed to a specialized processor, with
             a normal off the shelf processor handling the more generic
             software tasks. Several of the components that we are
             describing fit into this category (such as the TTS server).

        Sharing of resources. By decomposing a server into components, a
             many-to-many interaction between them becomes possible.
             This means that one component can provide services to many
             other components. This provides for sharing of resources,
             which ultimately results in cost reduction.

        Expertise. Building a complex application requires expertise in
             call control, media services, compression, web, speech
             recognition, etc. It is highly unlikely that one
             organization will have enough expertise in all of these to
             build them all. By decomposing an application server into
             subpieces, organizations with expertise in one particular
             piece can build that one. The result is that the complete
             system can be composed of best in breed components.

        Speed of deployment. By decomposing, upgrading existing
             applications and deploying new ones becomes simpler. The
             decomposition provides isolation. This isolation means that

Rosenberg/Mataga/Schulzrinne                                  [Page 3]

Internet Draft               AS Components             November 15, 2000

             one component can be changed or improved without affecting
             others. That makes it easy to add new features to an
             application, or to deploy a new one by using components
             already deployed.

   Decomposition does have its drawbacks. Primary amongst them is
   security. In general, the more boxes in a system, and the more they
   interact with each other, the more complex the security is. As a
   result, any distributed system has inherently more complex security
   issues. Another drawback is reliability. A system with multiple
   boxes, where the system requires all boxes to work in order to
   function, is less reliable than a system with a single box which must

3 Tightly Coupled Decomposition

   As an example of decomposition, it has been proposed to break the
   application server into a signaling and control component (the AS),
   plus a media server component (the MS). This decomposition is shown
   in Figure 1.

   Calls arrive at the AS component over SIP. The AS then accesses the
   MS using MGCP, and learns the IP address and port where the media for
   the call can be sent. This is returned in the 200 OK response by the
   AS. The AS then begins to instruct the MS to perform specific
   functions - collect digits, play tones and announcements, and to
   report the digits and tones back to the AS for further processing.
   Typically, the MGCP interface between the two devices is fairly
   "busy"; there is a lot of messaging for complex applications.

   In this model, there is a tightly coupled relationship between the MS
   and AS. The MS cannot function without the AS, and the AS needs to
   perform tight, low-level controls over the detailed operation of the
   media server.

   To some degree, breaking of an application server into these two
   components represents an implementation detail of how one builds a
   large, monolithic application server. It is not generally possible
   for the two components to be owned by separate providers. In fact, it
   has yet to be shown that complete interoperability and integration is
   possible with two components from different vendors, let alone
   different providers.

   This decomposition also does not provide a true separation of
   function. Most applications that require media interaction (IVR,
   credit card and debit card, etc.) have very cleanly separated media
   phases and signaling phases. The details of the media interactions

Rosenberg/Mataga/Schulzrinne                                  [Page 4]

Internet Draft               AS Components             November 15, 2000

                         .                  .
                         . +-------------+  .
                         . |             |  .
               SIP       . |             |  .
              -------------+      AS     |  .
                         . |             |  .
                         . |             |  .
                         . |             |  .
                         . +-------------+  .
                         .        |         .
                         .        |         .
                         .        |         .
                         .        |MGCP     .
                         .        |         .
                         .        |         .
                         .        |         .
                         . +-------------+  .
                         . |             |  .
                         . |             |  .
               RTP       . |             |  .
              -------------+     MS      |  .
                         . |             |  .
                         . |             |  .
                         . +-------------+  .
                         .                  .
                          Complete Application

   Figure 1: MGCP-based decomposition

   are usually not important to the signaling component, and vice a
   versa. As an example, consider a debit card application. The
   application starts with the user making a call. As part of the call
   processing, interaction is needed with the user via the media stream
   to determine the debit card number. The precise set of menu
   operations and interactions used to obtain this number aren't
   important to the call/signaling processing piece; only the result
   (the number), is important. Once the number is returned, media
   processing ceases, and data and call processing commence. The debit
   card is looked up in a subscriber database, and if enough time

Rosenberg/Mataga/Schulzrinne                                  [Page 5]

Internet Draft               AS Components             November 15, 2000

   remains, the call is completed. The signaling component monitors the
   call, and when the card has run out of minutes, the call is

   Consider the case where the application provider decides that the
   menus presented for debit card collection are confusing, and they
   need to be changed. This change really affects the media processing
   only; ideally, we would like to have no change whatsoever in the data
   processing and signaling part of the application. However, in the
   decomposition afforded by MGCP, the AS component contains both the
   signaling and call control, in addition to the control of the IVR
   menus and and processing. Thus, the AS needs to be updated, even
   though what has changed is really an IVR component.

   The MGCP decomposition also presents a burden for software developers
   on the AS. They need to understand, and program, the detailed
   interactions with the MS that are provided by MGCP, in addition to
   the detailed signaling and data processing operations. The developers
   will also need to build and manage the low level state representing
   the controlled entity, which can be painful. The result is longer
   development times, less code reuse, and slower innovation.

   It has been argued that one of the benefits of the MGCP decomposition
   is that it offloads the "burden" of call control from the media
   server. However, from a complexity standpoint, the MGCP processing
   required is probably on par with (if not more than), the simple
   amount of call control and SIP processing needed if SIP were used

   From a reliability perspective, an MGCP style decomposition is less
   desirable. Since the components are strongly coupled, the system will
   fail so long as any of the pieces fail. Failure can also be
   introduced because of additional network resources needed for
   communications between the boxes. The result is that the MGCP
   decomposition may actually increase the probability of failure, as
   compared to no decomposition at all.

   Another decomposition that has been proposed is to break a proxy into
   a routing and call control component, plus a services component. The
   interface between the two is then a transactional interface for
   services, similar in concept to INAP, based upon state transitions
   within a call model. This is another form of tight coupling, since it
   requires the services component to have detailed knowledge of the
   operational model of the call control component. We believe that this
   decomposition is limiting, for the same reasons the AS/MS
   decomposition is limiting.

4 The Decoupled Model

Rosenberg/Mataga/Schulzrinne                                  [Page 6]

Internet Draft               AS Components             November 15, 2000

4.1 Architecture

   As a result of this, we see the master/slave decomposition as being
   ideal for a single vendor to build a large system. However, this
   decomposition does not solve the other distribution needs we have
   motivated above. As a result, we propose that the AS be decomposed
   into an application component responsible for coordinating the
   overall execution of the application (called the controller), and
   application server components that provide pieces of the overall
   application. These components are only loosely coupled with the
   coordinating application server. The loose coupling implies that the
   interaction between them is the same as the interaction between the
   user and the coordinating application server, which is, in turn, the
   same as the interation between the application server components and
   other application server components. The components can easily be
   from separate vendors, and the interactions support the needed
   security and routing features to allow them to be owned by separate
   providers, even.

   The architecture is shown in Figure 2.

   The goal of the decoupling is to break the application into as
   coarse-grained pieces as possible. Each component (the coordinator
   included) should need to know as little as possible about the
   detailed operations performed by other components. A coarse-grained
   decomposition means that there is a clean and simple break in the
   functionality provided by the components. This enables significantly
   simpler interfaces between those components.

   Each component is really interested in passing a request for service
   to another, letting the other component perform its task, and then
   getting the final result of the task back as an output. From a
   software engineering perspective, this represents the classic
   function call; the call signaling component is making a function call
   to the media part. It is interested only in the return value - the
   debit card number, for example - and does not really care about the
   implementation of it. From a protocol perspective, this is a classic
   client-server system. The client makes a request of the server, and
   the server does whatever it needs to do to return the final response.
   The problem more closely resembes the client-server system than the
   function call, however. This is because we need the interaction to be
   across the network, rather than between code within the same process.
   This is because one of the key concepts here is that components can
   be provided by separate service providers.

   In such a model, where does the state for the sessions live? Here, we
   define a session as the complete set of interactions amongst all

Rosenberg/Mataga/Schulzrinne                                  [Page 7]

Internet Draft               AS Components             November 15, 2000

   components for the delivery of the service. Thus, a session might
   span multiple protocols, and even multiple calls. Not surprisingly,
   session state is distributed amongst the components, and the
   distribution follows the architectural model of Figure 2. The top
   level server, the controller, maintains the high level pieces of
   state that deal with overall delivery of the service, and the state
   required to coordinate the interactions with the component servers.
   Each component server maintains only the state needed to execute
   their component, and to manage interactions with components below
   them. A component server does not know about the complete service
   being delivered, and does not know about sibling servers. This aspect
   of our model - hierarchical distribution of session state, leads to
   one of the primary benefits of the architecture - ease of
   development. Someone building a new application by reusing existing
   components only needs to manage the high level state for delivery of
   the service. State related to the details of operation of one of the
   components - timings between digits in an IVR server, for example, is
   not relevant to the coordinator, and does not need to be managed.

   The difference between classic RPC or client/server interactions and
   the interactions between the components here is that the relationship
   between the components represents a long lived association (i.e., a
   session), during which a session level service is being provided,
   rather than a simple input/output service. As an example, consider a
   component providing continuous real-time text-to-speech translation
   services. The application coordinator that wishes to use this service
   acts as a client, initiating a request for service to the server (in
   this case, the TTS server). However, the text is not passed as an
   "argument" to the TTS server, it is continually streamed for the
   duration of an active session, and the TTS server would continuously
   stream back the speech version of the text, which is the output of
   the service.

   Another example is a voice messaging server. The messaging server
   provides basic services like message drop, message retrieve, and
   message management. Each of these represent procedures that can be
   executed by a client component. To drop a message, for example, the
   client component would initiate a session with the messaging server.
   A prompt would be played over that session, something like "please
   record your message for Joe now", and then the component takes the
   media input stream, records it, and saves it. When it is done, the
   session is terminated.

   In some cases, the session may require a "side channel" over which
   intermediate data is passed, needed to control the session
   interactions from that point forward. IVR is the classic example. In
   some cases the coordinating application server can kick off the IVR
   script, and then only get back the final result - a menu option, a

Rosenberg/Mataga/Schulzrinne                                  [Page 8]

Internet Draft               AS Components             November 15, 2000

                            |           |
                            |           |
                            |  AS       |
                            |           |
                            |           |
                   SIP,     --    \    ---
                    RTP?  --       \      ----      SIP,
                        --         \          ----   RTP?
                      --            \ SIP,        ----
                    --               \ RTP?           ----
                  --                 \                    --
            +----------+        +-----\----+         +----------+
            |          |        |          |         |          |
            |          |        |          |         |          |
            |          |        |          |         |          |
            |  ASC     |        |    ASC   |         |   ASC    |
            |          |        |          |         |          |
            |          |        |          |         |          |
            +----------+        +----------+         +----------+
                                       \                    /
               /                        \\  SIP,           /
              / SIP,                      \  RTP?        //
             /   RTP?                      \\           / SIP,
            /                                \         /   RTP?
           /                               +----------+
     +----------+                          |          |
     |          |                          |          |
     |          |                          |          |
     |          |                          |   ASC    |
     |   ASC    |                          |          |
     |          |                          |          |
     |          |                          +----------+

   Figure 2: Decoupled Architecture

Rosenberg/Mataga/Schulzrinne                                  [Page 9]

Internet Draft               AS Components             November 15, 2000

   credit card number, or what have you. In other cases, the
   coordinating component may need to get intermediate results, so that
   it can guide the operation of the IVR moving forward. This requires a
   companion control channel that provides data output from the
   component server back to the client, and then returns further high
   level instructions from the client back to the server.

   There is a thin line in some cases between this control channel and
   the tightly coupled interactions of a master-slave MGCP relationship.
   However, the loosely coupled nature of the interaction can be
   maintained by using coarse-grained data passing over a distributed
   client-server protocol, such as HTTP or Corba.

   From this architectural description, it is clear that a client-server
   session establishment protocol, which allows for passing of
   parameters that describe service, is the ideal mechanism to
   coordinate the interaction between components. Clearly, SIP is
   perfect in such a role.

   Following the example above, an IVR application server component
   would be completely responsible for the execution of the IVR piece of
   an application, including both the media and the signaling call
   control. It would know the menus to maneuver through, and it would
   know when to collect digits and present prompts. The coordinating
   application server would request service from the IVR component by
   initiating a call to it (possibly using third party call control [5]
   to direct the media directly to the IVR without passing through
   itself; more on that below). The application component takes the
   media from the incoming call, running it against the IVR application.
   When the IVR is done, the final result - in this case, the credit
   card number, is passed back to the coordinating AS, possibly throug
   an HTTP POST operation. The coordinating AS then terminates the call
   with the IVR.

4.2 Benefits of the Decoupling

   This decoupled interaction between components provides several
   important benefits:

        Separation of Businesses. The decoupled interaction between
             components is needed to allow the components to be provided
             by separate providers. Master-slave control interactions do
             not work well across service providers, let alone across
             vendors. By allowing separate providers to offer the
             components, new businesses can be created that specialize
             in the piece they are providing.

Rosenberg/Mataga/Schulzrinne                                 [Page 10]

Internet Draft               AS Components             November 15, 2000

        Rapid Development. Since the components can easily be placed in
             separate boxes from separate vendors, or even in separate
             providers, we achieve a separation of function that allows
             each piece to be developed in complete isolation. We also
             get reuse of components for new applications. This allows
             for rapid service creation.

        Better Interoperability. It can be argued that the decoupled
             interaction between components is more like to be
             interoperable that a master-slave mechanism. This is
             largely based on the assumption that a master-slave
             interaction requires a lot more messaging and exchange
             between the components, whereas the decoupled client-server
             mechanism requires less. The fewer information that passes
             back and forth, the easier it is to interoperate.

        Architectural Flexibility. The loose coupling of the components
             means that a server, such as a conferencing application or
             IVR, need not be implemented as an actual server. Rather,
             complex networks of components, with proxies providing
             routing of requests in arbitrarily complex ways, can be
             built to provide a service. Since the interaction is SIP,
             the application controller accessing the service doesn't
             know whether it is communicating with a single server or a
             network built in this fashion. That allows ASPs flexibility
             in how they can construct their service networks.

        Reliability The loose coupling of the components improves
             reliability compared to a tight coupling. Thats because the
             system can probably still continue to operate in the
             failure of a single component. For example, if a TTS server
             fails during a session, an application server can use a
             server from a completely different provider, or it can use
             a media server instead, converting the text to VoiceXML
             scripts. Depending on the service, the TTS component could
             possible be skipped altogether. Note, however, that the
             reliability is still not as good as a monolithic system.
             Having ten identical boxes each running a complete set of
             services is better than spreading the service across ten
             boxes, where some subset cause total failure.

5 Architecture for the Interfaces

   Up to now, we have been fairly vague about exactly how such an

Rosenberg/Mataga/Schulzrinne                                 [Page 11]

Internet Draft               AS Components             November 15, 2000

   interface would work in practice. We have argued that it is SIP, but
   not described in detail how SIP is actually used for this function.

   SIP (along with SDP [6]) clearly provides the facilities for
   initiation and termination of the sessions between the controller and
   components, and for specification of the media addresses to and from
   which media is sent. However, SIP leaves a lot of flexibility in
   terms of naming, additional message content, session duration, and
   control. Here, we discuss each of these in turn.

5.1 Naming

   In any remote procedure call system, a key component is naming. The
   identified resource must be properly addressed so that the underlying
   message passing system can properly determine where the request
   should go.

   The same is true in SIP. Messages are routed based on the request
   URI, as it serves as the primary naming tool for routing messages. In
   its application to AS component interaction, the request URI serves
   as the primary tool to identify the resource to which the session is
   addressed. A critical piece of defining a session level service that
   can be accessed by SIP is defining the naming of the resources within
   that service. This point cannot be understated.

   As an example, consider a conferencing service. In this case, the
   primary resource that is being accessed is a mixing service. We would
   like to have a way to identify which conference is being addressed by
   any given call. All calls for the same conference are all bridged
   together. By default, the bridging would operate in an N-1
   configuration (that is, each user receives a mixed media stream that
   represents all of the other users besides themself). Conferences can
   be set up in two ways - ad-hoc, which are not pre-established at all,
   and exist so long as there is a participant in them, and scheduled,
   where they exist for a certain period of time.

   One might imagine that a conferencing service breaks its URI
   namespace into two pieces - one piece that represents ad-hoc
   conferences, and another that represents scheduled conferences. Ad-
   hoc conferences are addressed using a URI of the form <conference
   ID>.adhoc@conferences.com. All users who initiate a call to the URI
   sip:as9dahas89.adhoc@conferences.com are bridged together. The
   conference state is established when the first call to a conference
   occurs, and destroyed when the last call terminates. In contrast,
   scheduled conferences might be named by <conference
   id>.scheduled@conferences.com, so that a call to
   sip:conference12.scheduled@conferences.com allows a user access to a
   pre-arranged conference.

Rosenberg/Mataga/Schulzrinne                                 [Page 12]

Internet Draft               AS Components             November 15, 2000

   There are several benefits to naming ad-hoc conferences vs. scheduled
   ones in this fashion. The primary one is convenience; the name makes
   it the type of conference apparent to any entities that are
   interested. Secondly, it can avoid certain misconfigurations. Let's
   say there are no conventions for naming of ad-hoc versus scheduled
   conferences. I am asked to join a scheduled conference
   (conf2321@conferences.com), but I mis-type the URL in my browser
   (conf2123@conferences.com). I don't want this to drop me into an ad-
   hoc conference where I sit for 15 minutes thinking others will
   eventually join. If ad-hoc conferences are named differently, a call
   to cond2123@conferences.com is never going to be an ad-hoc
   conference, and so my call will be rejected immediately.

   For an application server to use a conferencing service as a
   component, the AS must know the URI namespace conventions used to
   identify the various conferences. The above information, for example,
   would be provided by the conferencing provider to its customers.

   This same concept of using the request URI as a service identifier
   has been described in detail for voicemail systems [7].

   The great advantage of using the request URI as a service identifier
   comes because of the combination of two facts. First, unlike in the
   PSTN, where numbers are limited, URIs come from an infinite space.
   They are plentiful, and they are free. Secondly, the primary function
   of SIP is call routing through manipulations of the request URI. In
   the traditional SIP application, this URI represents people. However,
   the URI can also represent services, as we propose here. This means
   we can apply the routing services SIP provides to routing of calls to
   services. The result - the problem of service invocation and service
   location becomes a routing problem, for which SIP provides a scalable
   and flexible solution. Since there is such a vast namespace of
   services, we can explicitly name each service in a finely granular
   way. This allows the distribution of services across the network. In
   the conferencing example above, since we have separated the names of
   ad-hoc conferences from scheduled conferences, we can program proxies
   to route calls for ad-hoc conferences to one set of servers, and
   calls for scheduled ones to another, possibly even in a different
   provider. In fact, since each conference itself is given a URI, we
   can distribute conferences across servers, and easily guarantee that
   calls for the same conference always get routed to the same server.

   This is in stark contrast to conferences in the telephone network,
   where the equivalent of the URI - the phone number - is scarce. An
   entire conferencing provider generally has one or two numbers.
   Conference IDs must be obtained through IVR interactions with the
   caller, or through a human attendant. This makes it difficult to
   distribute conferences across servers all over the network, since the

Rosenberg/Mataga/Schulzrinne                                 [Page 13]

Internet Draft               AS Components             November 15, 2000

   PSTN routing only knows about the dialed number.

   Care must be taken not to push this concept too far. Naming of
   services should not become so fine-grained that all parameters
   associated with the service simply become encoded into the request
   URI as well. The right level of granularity can be determined based
   on routing. If a service is represented by multiple URLs, but
   requests for each of those URLs are always routed in the same way,
   the naming is too fine-grained.

5.2 Additional Message Content

   Sometimes, connecting to a service requires the service to know
   additional information that is not appropriate for the request URI.
   As an example, the conferencing server might need to know the name,
   address, phone number, company, and email address of the
   participants, which it converts to speech and uses as an announcement
   when the user joins and leaves the bridge.

   This kind of content can easily be carried in the body of the SIP
   messages used to establish and manage the session with the service.
   For simple data, SIP headers may be appropriate. In the conferencing
   example above, the conferencing service might mandate that a vCard be
   attached to all INVITEs, in order to provide that information.

   When existing data formats (like a vCard) are not defined to provide
   the needed information, it can be encoded in an XML document, for
   example, and carried along in the INVITE.

   Each service would need to specify the content that it needs in order
   to process the session invitation.

5.3 Session Duration

   The duration of the session that is established with a server depends
   entirely on the nature of the service. For example, for a conference,
   the initiation of the call begins the mixing service for that user,
   and the termination of the call results in that user leaving the

   For an IVR service, the INVITE request begins the interaction with
   the service. Once the INVITE transaction completes, the IVR would
   play out the initial prompt, and begin collecting data from the
   caller. How the IVR terminates depends on its usage. When the
   initiator of the service is an application server, we would argue
   that in almost all cases, it should be the responsibility of the
   controller to determine when the interaction is complete (and thus
   terminate the call with a BYE). However, when the initiator is an end

Rosenberg/Mataga/Schulzrinne                                 [Page 14]

Internet Draft               AS Components             November 15, 2000

   user, the IVR will usually be the one to terminate the session. We
   discuss IVR interactions in more detail below in Section 6.1.

5.4 Third Party Call Control

   Third party call control, as defined in [5], plays an integral role
   in this architecture.

   In many cases, the controller orchestrating a service wishes to
   invoke the resources of an IVR or conferencing server. However, the
   AS is not the actual source of the media that drives the IVR. The
   source of the media is the end user that initiated the call to the
   controller. What is needed, then, is a way for the AS to call the IVR
   or conferencing server, and pass it the media information of the end
   user. Similarly, the media address of the IVR server (described in
   the SDP from the media server), needs to be passed to the end user
   that initiated the call. By using third party call control, an
   application server can direct the media of the end user to and from
   the components that it is using to provide the application. Once one
   service is complete, the controller can move the media to a different
   component. SIP re-INVITEs also allow the controller to request the
   caller to send multiple media streams, one, for example, containing
   only DTMF and tones. This allows for DTMF control of services without
   carrying DTMF in SIP itself.

   Figure 3 shows how we use a component server to collect DTMF input
   for a service; specifically, a simple (and perhaps useless) service
   that allows a caller to press '1' to indicate that they want to put
   the call on hold. The service is, in principal, useless, since hold
   is so common that the end user can do this themselves. However, it is
   useful for example purposes.

   The caller sends an INVITE request to the called party (1), which is
   routed to a server handling calls for the domain of the called party.
   In this case, the server is an application server. The AS decides
   that it would like to offer the caller advanced services based on
   DTMF events sent mid-call. As a result, it decides to invoke the
   services of a media server component. The AS will use third party
   call control mechanisms to have the caller send any DTMF related
   media to the media server, in addition to sending its media to the
   called party. To accomplish this, the AS sends an INVITE to the media
   server (2), with an indication that the media stream is send only
   (this is accomplised using the sendonly SDP attribute [6]). The
   request URI of this INVITE binds that session to a service that looks
   for any in-band DTMF, and reports it back to the AS through an HTTP
   GET or POST operation. In section 6.1, we show how this is easily
   done with a VoiceXML driven IVR server.

Rosenberg/Mataga/Schulzrinne                                 [Page 15]

Internet Draft               AS Components             November 15, 2000

   The media server responds with a 200 OK (3) that contains SDP with
   the address where the media should be sent to. The application server
   ACKs this response (4), and holds on to that SDP. The AS then proxies
   the original INVITE request (5), and the called party answers the
   call (6). This acceptance is proxied upstream (7), and then
   acknowledged (8,9). At this point, media is flowing between the
   caller and called party (10). The next step for the AS is to get a
   stream of DTMF digits to flow from the caller to the media server. To
   do this, it sends a re-INVITE to the caller (11). This re-INVITE
   contains the same SDP as the response (6) from the called party, but
   with the addition of a new media line. This media line is audio, and
   contains a single codec, the RTP payload format for DTMF and tones
   [8]. The connection address and port are from the SDP returned from
   the media server. This tells the caller to send an additional media
   stream to the media server, using only the DTMF codec. The result is
   that RTP packets are sent only when the caller presses a button on
   the phone.

   The caller accepts this re-INVITE (12), and the AS acknowledges it
   (13). Now, DTMF only RTP is flowing between the caller and the media
   server (14). At some point later, the caller presses the 1 key
   (which, for example, might imply call hold). This is processed by the
   media server, and the result is an HTTP request being sent to the AS
   (15). The HTTP request contains the value of the collected digit. The
   AS receives this request, and knows that the user keyed in a 1.
   Recognizing this input as call hold, the AS sends a re-INVITE to the
   called party (17). The SDP in this re-INVITE is the same as the SDP
   in the original INVITE from the called party (1), except that the
   connection address is set to zero, indicating call hold. The called
   party accepts the re-INVITE (18), and this is ACKed by the AS (19).
   The called party is now on hold.

   Note that the call flow remains unchanged if the stimulus were based
   on voice recognition instead of DTMF. The only difference would be
   that a general purpose codec, such as G.711, would be used instead of
   RFC 2833 for communications between the caller and the media server.
   This achieves an important unification. Independent of the type of
   stimulus - voice, DTMF, or, in fact, direct http requests from the
   caller (if they were using a softphone), the service execution code
   is unchanged.

   Others have proposed that DTMF digits be carried in SIP directly from
   the caller to the AS [9,10].  However, this approach does not work
   for anything beyond DTMF, while our approach works for DTMF, speech,
   and web interfaces. Another drawback of the DTMF-in-SIP approach is
   that all entities on the call signaling path will receive any DTMF
   digits dialed by the called party. Furthermore, since the caller
   doesn't know if there is an entity interested in DTMF, it is required

Rosenberg/Mataga/Schulzrinne                                 [Page 16]

Internet Draft               AS Components             November 15, 2000

    Caller          Coordinator         Media Server       Callee
      |                |                  |                 |
      |(1) SIP INV     |                  |                 |
      |--------------->|(2) SIP INV       |                 |
      |                |----------------->|                 |
      |                |(3) 200 OK        |                 |
      |                |<-----------------|                 |
      |                |(4) SIP ACK       |                 |
      |                |----------------->|                 |
      |                |(5) SIP INV       |                 |
      |                |----------------------------------->|
      |                |(6) 200 OK        |                 |
      |(7) 200 OK      |<-----------------------------------|
      |<---------------|                  |                 |
      |(8) SIP ACK     |                  |                 |
      |--------------->|(9) SIP ACK       |                 |
      |                |----------------------------------->|
      |(10) RTP        |                  |                 |
      |                |                  |                 |
      |(11) SIP INV    |                  |                 |
      |<---------------|                  |                 |
      |(12) 200 OK     |                  |                 |
      |--------------->|                  |                 |
      |(13) SIP ACK    |                  |                 |
      |<---------------|                  |                 |
      |(14) RTP        |                  |                 |
      |...................................|                 |
      |                |                  |                 |
      |                |(15) HTTP GET     |                 |
      |                |<-----------------|                 |
      |                |(16) 200 OK       |                 |
      |                |----------------->|                 |
      |                |                  |                 |
      |                |(17) SIP INV      |                 |
      |                |------------------+---------------->|
      |                |(18) 200 OK       |                 |
      |                |<-----------------+-----------------|
      |                |(19) SIP ACK      |                 |
      |                |------------------+---------------->|
      |                |                  |                 |
      |                |                  |                 |
      |                |                  |                 |
      |                |                  |                 |
      |                |                  |                 |

   Figure 3: Call Flow for DTMF Enabled Hold Service

Rosenberg/Mataga/Schulzrinne                                 [Page 17]

Internet Draft               AS Components             November 15, 2000

   to send DTMF within SIP messages all the time, even if no entity is

   There have been proposals for adding a subscription/notification
   mechanism on top of this to avoid this problem. However, this further
   complicates the system by adding a requirement for the caller to
   support a subscription and notification service just for DTMF.

   Our approach fits well within the existing SIP framework, and
   requires no additional work from the end users. Furthermore, it
   transparently supports multiple application server components
   receiving DTMF. This is because an AS is able to send a DTMF stream
   to a component by adding a new media line to the list of media
   streams being sent by the caller. The list of media streams being
   sent by the caller is observed by each AS through the initial INVITE,
   along with any subsequent re-INVITEs which might modify it. Consider
   the situation with two application servers, A and B, depicted in
   Figure 4. The original call setup starts with the caller, flows
   through A, then B, then the called party. At some point later, A
   sends a re-INVITE (10) to the caller, adding a media stream, just as
   described in Figure 3. The SDP in this INVITE will be the same as
   provided by the caller in message (1), plus the additional DTMF
   stream. Note that this re-INVITE does not pass through B. Now, B
   decides to add a media stream for DTMF. So, it sends a re-INVITE
   (13). This goes first to A. As far as A is concerned, this re-INVITE
   is from the called party. A computes the difference between what it
   believes the called party should perceive as the set of media
   streams, and what is in the re-INVITE (13). This difference (the
   additional DTMF stream added by B) is added to the SDP that A had
   sent to the caller previously (10), and the result is sent in a re-
   INVITE to the caller (14). This SDP now contains the media streams
   meant for the actual called party, along with two DTMF streams; one
   for A, and one for B. The caller thus sends DTMF to both servers.

   A further advantage of our approach is that the DTMF can even be sent
   using multicast, since it is being sent in RTP rather than as part of
   SIP. This allows for tremendous scalability, if needed, in the number
   of entites receiving the DTMF streams.

5.5 Side Channels

   Side channels are used for passing of events from the application
   server components back to the client, and for passing control
   commands from the client to the application server component.

   Unfortunately, side channels complicate the simple session level
   interface between components. It is our belief, at least for the

Rosenberg/Mataga/Schulzrinne                                 [Page 18]

Internet Draft               AS Components             November 15, 2000

   components described here, that only minimal side channels are
   needed. Specifically, the only service below that requires one to be
   effective is the IVR service, for which HTTP forms an ideal side
   channel. If the side channel becomes so complex as to introduce
   extensive synchronization, bandwidth, and transactional issues, the
   relationship between the components becomes tightly coupled once
   more, and the benefits we are espousing here begin to disappear.

   As such, we believe that a reasonable side channel for decoupled
   server interactions is defined as follows:

        o The event reporting and control components have no real time

        o Event reporting from the component back to the client
          accessing it are infrequent; specifically, the intervals are
          much larger than the round trip times between the client and
          the component.

        o Control from the client to the component is infrequent;
          specifically, the intervals are much larger than the round
          trip times between the client and component.

        o Event reporting is coarsely granular, so that the client does
          not need to explicitly subscribe to specific events in order
          to avoid be overwhelmed with data.

        o The amount of data passed in both the events and in the
          control is small.

        o There are no requirements for transaction support.

   Note that protocols like MGCP and megaco do not meet these
   requirements, as they require tight timing, synchronization, and
   explicit subscriptions. HTTP, as used in VoiceXML, however, does meet
   these requirements.

6 Patterns for Accessing Components

   In this section, we propose a set of patterns that define the
   interaction of a controller with an application server component.
   These patterns manifest themselves in the description of the service
   invoked when a session is initiated, a discussion of the naming
   conventions of the service, and a description of any back channel
   used for control and data passing.

6.1 Interactive Voice Response Services

Rosenberg/Mataga/Schulzrinne                                 [Page 19]

Internet Draft               AS Components             November 15, 2000

       Caller            A                B              Callee
         |               |                |                 |
         |(1) SIP INV    |                |                 |
         |-------------->|(2) SIP INV     |                 |
         |               |--------------->|(3) SIP INV      |
         |               |                |---------------->|
         |               |                |(4) 200 OK       |
         |               |(5) 200 OK      |<----------------|
         |(6) 200 OK     |<---------------|                 |
         |<--------------|                |                 |
         |(7) SIP ACK    |                |                 |
         |-------------->|(8) SIP ACK     |                 |
         |               |--------------->|(9) SIP ACK      |
         |               |                |---------------->|
         |(10) SIP INV   |                |                 |
         |<--------------|                |                 |
         |(11) 200 OK    |                |                 |
         |-------------->|                |                 |
         |(12) SIP ACK   |                |                 |
         |<--------------|                |                 |
         |               |                |                 |
         |               |(13) SIP INV    |                 |
         |(14) SIP INV   |<---------------|                 |
         |<--------------|                |                 |
         |(15) 200 OK    |                |                 |
         |-------------->|(16) 200 OK     |                 |
         |               |--------------->|                 |
         |               |(17) SIP ACK    |                 |
         |(18) SIP ACK   |<---------------|                 |
         |<--------------|                |                 |
         |               |                |                 |
         |               |                |                 |
         |               |                |                 |
         |               |                |                 |
         |               |                |                 |

   Figure 4: Multiple Application Servers and DTMF

   We have touched upon the basics of the interaction between a
   controller and an IVR server. The controller initiates a call to the
   server, the server executes some kind of IVR service, and data is
Rosenberg/Mataga/Schulzrinne                                 [Page 20]

Internet Draft               AS Components             November 15, 2000

   A number of questions still need to be answered, however:

        1.   How is the IVR service identified?

        2.   How can the controller specify the details of the dialog
             the IVR carries out with the user?

        3.   How does data from the IVR get passed back to the

        4.   How is intermediate control performed (e.g., to interrupt
             or reset IVR based on some event at the controller, in this

   We believe that VoiceXML [11] represents the ideal partner for SIP in
   the development of distributed IVR servers. VoiceXML is an XML based
   scripting language for describing IVR services at an abstract level.
   VoiceXML supports DTMF recognition, speech recognition, text-to-
   speech, and playing out of recorded media files. The results of the
   data collected from the user are passed to a controlling entity
   through an HTTP form POST operation. The controller can then return
   another script, or terminate the interaction with the IVR server.

   From a naming perspective, the primary issue is how a request URI is
   associated with a script to invoke when the call is answered. We see
   three primary mechanisms:

        1.   There is a one-to-one binding of the address in the request
             URI to a script to execute. These bindings are published by
             the provider of the IVR service.

        2.   The initial script to execute is actually carried as
             content in the body of the SIP INVITE request. The request
             URI indicates that the desired service is execution of
             content in the request (i.e., sip:executebody@servers.com).

        3.   The initial script to execute is fetched by the VoiceXML
             server; the URL to fetch it from is passed in the SIP
             INVITE message that initiates the IVR session. This can be
             accomplished either with the application/uri MIME type as a
             body, or using the new *-Info headers [12] which provide
             references to content to fetch.

   We believe that the third approach is probably the best one. SIP is
   not the ideal transfer mechanism. Passing a URI allows a far better
   transfer tool, namely HTTP, to be used to actually fetch the script
   back from the controller.

Rosenberg/Mataga/Schulzrinne                                 [Page 21]

Internet Draft               AS Components             November 15, 2000

   HTTP is then also used to pass back form data from the IVR to the
   controller. The results of the HTTP POST can also contain additional
   VoiceXML scripts to execute. It represents the side channel discussed
   in section 5.5

   Note that in some cases, there needs to be interactions between the
   HTTP server that receives the HTTP POST requests, and the controller
   that initiates and terminates the SIP sessions with the IVR. This is
   the case when the data collected by the VoiceXML server is used to
   guide signaling behavior. For example, a pre-paid calling application
   might use the IVR to collect the users PIN code. The PIN code is
   looked up, and the number of minutes remaining is determined. This
   amount of time must be known to the SIP controller, as it will need
   to hang up the call once this time expires. Some kind of session
   sharing mechanism is needed between the SIP controller and the HTTP
   server in this case.

   Figure 5 shows the interaction between an application server acting
   in a coordinating role, and an IVR server component. In this example,
   consider an application where the user makes a call, but the system
   needs additional information to determine where to forward it to. The
   user is prompted for the info, and once the name of the desired
   called party is obtained and looked up, the call is completed to the
   requested destination.

   First, in step (1), the caller sends an INVITE to the controller. The
   controller then creates a brand new call to the IVR application
   server (2), using the SDP from the INVITE in (1). The IVR accepts the
   call (3), and the SDP from that acceptance is returned in a 183
   response to the caller (4). The call to the IVR is acked (5), and now
   a media stream exists between the caller and the IVR server. The IVR
   server, in step (6), fetches the initial VoiceXML script to execute,
   which is returned by the controller (7). The prompts are played to
   the caller, and the identity of the called party is collected. This
   is passed to the controller through another POST (8), which returns
   an empty VoiceXML script (9)[1] complete, the controller hangs up
   with it (10 and 11). The information the controller got in the POST
   (8) is used to determine the next hop SIP server, and the initial
   INVITE is proxied there (12).

   Its important to observe the all call control related to executing
   the service lives within the controlling application server. The IVR
  [1] Note that it is unusual for an empty script to be
returned;  this  is  because we want the AS to maintain
control of the call signaling

Rosenberg/Mataga/Schulzrinne                                 [Page 22]

Internet Draft               AS Components             November 15, 2000

   application server deals strictly with the media component. This
   division of work, as we have discussed above, allows for independent
   evolution of the call control and media components of services. For
   example, if the desired called party did not have a reachable SIP
   address, but they did have an email address, the call could be
   redirected to a mailto URL. To support this twist, only the
   controlling application server code need change. The media component
   remains completely and totally unchanged.

   Readers familiar with VoiceXML will observe that VoiceXML almost
   achieves this perfect separation. It lacks any call control excepting
   a two - for call transfer and call termination. These tags are
   clearly not sufficient for many services. Our architecture would
   argue that instead of adding call control to VoiceXML, all control
   should be removed, so that call control can be left to other server

   The separation of the control from the media component also allows
   the media component to change without affecting the control
   component. In fact, because of the http interface between the two,
   the media server can be completely removed and replaced with a normal
   web browser, with only a small effect on the call control component.
   As an example, if the calling party was coming from a web enabled SIP
   client (known by the presence of the Accept header with text/html as
   a value in the INVITE request), the controller could return an HTTP
   URL in the 183 with an actual web form that gets filled out by the
   caller. This would be instead of using an IVR server to collect the
   data. Interestingly, the representation of the collected data is
   identical in both cases. Both use an HTTP POST operation to send the
   data to the controller. This allows the data collection code in the
   controller to be unified across both voice access and web access.

6.2 Conferencing Servers

   Conferencing servers today vary in type and complexity. Some are
   dialup only, supporting IVR access. Others support ad-hoc
   conferencing with web interfaces. Others still support three way
   calling as part of a PBX system.

   We observe once more that all of these conferencing "servers" are
   really conferencing applications that are just bundled as a server.
   These conferencing applications can be decomposed into components in
   exactly the way we have described above. At the core of each of these
   conferencing applications is a mixing service. This service is
   responsible for taking N audio or video streams, mixing them
   according to some matrix, and returning the mixed stream to each
   participant. Issues such as conference policy, provisioning of
   conferences, and authentication are all completely separate and

Rosenberg/Mataga/Schulzrinne                                 [Page 23]

Internet Draft               AS Components             November 15, 2000

     |      INVITE (1)         |                          |
     |------------------------>|                          |
     |                         |        INVITE (2)        |
     |                         |------------------------->|
     |                         |       200 OK (3)         |
     |                         |<-------------------------|
     |      183 (4)            |                          |
     |<------------------------|                          |
     |                         |       ACK (5)            |
     |                         |------------------------->|
     |           MEDIA         |                          |
     |                         |                          |
     |                         |     HTTP GET (6)         |
     |                         |<-------------------------|
     |                         |     HTTP 200 OK (7)      |
     |                         |------------------------->|
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |     HTTP GET (8)         |
     |                         |<-------------------------|
     |                         |                          |
     |                         |     HTTP 200 OK (9)      |
     |                         |------------------------->|
     |                         |                          |
     |                         |      BYE (10)            |
     |                         |------------------------->|
     |                         |       200 OK (11)        |
     |                         |<-------------------------|
     |                         |                          |
     |                         |       INVITE (12)        |
     |                         |--------------------------------------->
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |                          |
     |                         |                          |

  Caller                    Controller               IVR Server

   Figure 5: Interaction of App Server and IVR Component

Rosenberg/Mataga/Schulzrinne                                 [Page 24]

Internet Draft               AS Components             November 15, 2000

   outside of this basic mixing component.

   For this reason, we argue that a large variety of conferencing
   applications can be easily constructed by having the mixing service
   as separate application server component.

   What does the interface to such a mixing server look like? For the
   call control interface, users would join a conference by calling the
   server. The server would answer the call, thus appearing as a SIP
   UAS. The media sent from the user is mixed with other users in the
   conference, and the media sent back to the user is the mixed stream.
   The user can leave the conference by sending a BYE to the server, and
   the server can kick a user out of the conference by sending the user
   a BYE.

   Since the primary resource being accessed is a conference, it is no
   surprise that we would argue that the request URI of an incoming call
   defines the conference a user is mixed in to. In other words, all
   users that call the server with the same request URI, are all mixed
   together. The conferences are not defined by Call-ID or other SIP
   header fields. Using the request URI has tremendous advtanges from a
   routing and naming perspective, as we have discussed more generally

   It is not neccesary (in fact, not even advisable), for the
   conferencing server to require that the URIs that define the
   conference be set up ahead of time. Conference lifecycles in the
   mixing server are very simple. Conference state is created when the
   first call arrives for a particular URI, and ends when the last user
   with a call to that URI hangs up. This model allows the same mixing
   server to support both ad-hoc conferences, and pre-arranged
   conferences too. Pre-arranged conferences are handled through policy
   and control in a coordinating server external to the mixing server.
   This server lives entirely in the call control and signaling plane,
   not in the media plane.

   SIP (and RTP, of course) alone is not sufficient for complete usage
   of a conferencing server. Media mixing policies (effectively, the
   matrix indicating which users hear which other users, and with what
   relative volumes) need to be set. Information on the status of the
   conference, such as the identity of the current speaker, number of
   users currently being mixed, etc., may need to be reported back to
   some control entity. These represent the requirements for the side
   channel. In IVR servers, the side channel used HTTP. We argue that to
   unify these concepts, HTTP is ideally suited here as well. Updates to
   the mixing policy can be made through HTTP POST requests against the
   mixing server, using well defined interfaces (possibly SOAP).
   Similarly, information about the status of the conference can be

Rosenberg/Mataga/Schulzrinne                                 [Page 25]

Internet Draft               AS Components             November 15, 2000

   obtained through HTTP GET operations against the mixing server. The
   side channel here meets the requirements outlined in Section 5.5; it
   is not real time in nature, does not reuqire transactional support,
   and passes relatively infrequent data and control. In fact, such a
   side channel will often not be needed at all. In 90 default mixing
   policy (the so-called N-1 matrix, where each user hears everyone but
   themselves, all at equal volume, with no floor control) will suffice.

   Fans of the INFO method [13] will argue that instead of using HTTP
   for the control, why not INFO? This would eliminate the need for an
   additional protocol, after all. The answer is the same as to why SIP
   should not simply replace HTTP - the two have different strengths and
   weakenesses. SIP is a poor data transfer protocol. It has insufficent
   support for transfer of medium to large data sets, which is important
   here. Furthermore, we may want to allow an entity separate from the
   one that initiated the session to control the session. Usage of INFO
   would only work from the same device (because of the sequence

   In the next few sections, we show how this basic application server
   component can be used, along with a controller and other components,
   to build more complex conferencing applications.

6.2.1 Web Scheduled Conference Services

   In this application, we'd like a conferencing service where all
   conferences must be pre-scheduled. The pre-scheduling is done through
   a web page. At the page, the user will enter the start time (but not
   mandatory stop time) of the conference, the maximum number of
   attendees, and the identities of the attendees (if known). Once
   entered in a form, the server returns a SIP URL representing the

   To implement this, we use an coordinating application server that has
   a SIP and HTTP interface, along with the mixing application server
   just described.

   Figure 6 shows a call flow for this service. A web client is first
   used to submit the information. Let us suppose a simple case where
   the conference can have up to two participants, and the conference
   starts immediately. The HTTP POST representing the form data is sent
   to the controller (1). It stores the information for the conference
   in a local data store, and chooses a SIP URL for the conference. This
   URL can be anything, so long as it is different from any URLs handed
   out so far by the controller. The URL is returned to the web client
   in step (2). As an additional convenience feature, the URL could be
   emailed to the participants. This would require the controller to

Rosenberg/Mataga/Schulzrinne                                 [Page 26]

Internet Draft               AS Components             November 15, 2000

   have an SMTP interface, in addition to HTTP and SIP. Note that this
   SIP URL points to the controller, NOT the mixing server.

   A few moments later, the first participant calls in using a SIP
   INVITE (3). The call is routed to the controller. It checks the
   conference ID. It finds that the policy permits up to two
   participants (not a practical example, but simplifies the call flow).
   It stores data indicating that one participant has now joined, and
   the proxies the INVITE request in step (4) to the mixer. The request
   URI in this request will have the same user part as (3), but the host
   part now represents the mixer. The mixer receives the INVITE, creates
   the initial conference state (as this is the first call for that
   URL), and returns a 200 OK (5), which is forward to the caller (6),
   and then ACKed (7 and 8).

   In step (9), the second caller calls in. The controller sees that
   only one participant is on the call so far, so the second call is
   accepted. The controller stores the fact that there are now 2
   participants, and proxies the INVITE (10). The INVITE is accepted by
   the mixer (11), and the response forwarded to the second caller (12),
   and then ACKed (13 and 14). The two participants A and B can now hear
   each other.

   A third caller then calls in (15). The controller checks its records,
   and notices that this conference is now full. So, it rejects the
   INVITE (16), which is acknowleged (17).

   The astute reader will observe that, strictly speaking, the HTTP
   server does not really need to be co-resident with the SIP server in
   the controller. The initial conference setup can be stored in a
   database by a web server, and the controller can simply read this
   database. However, in more complex cases, we may wish to have web
   access to learn dynamic information about the conference as it
   progresses (for example, which users are in the conference). For this
   kind of dynamic session state, using a shared database between
   components is cumbersome. Rather, an integrated HTTP/SIP server is
   much better suited, where integrated implies only that it has built
   in mechanisms for session state sharing between the SIP and HTTP

   For this simple conferencing service, it was sufficient for the
   controller to act as a proxy. Thats because it does not need to
   forcibly kick anyone out of the conference once they are in. To
   support that kind of functionality, third party call control is
   needed. Let us examine a more complex service in the next section.

6.2.2 Web Scheduled, IVR supported, Time Limited Conference

Rosenberg/Mataga/Schulzrinne                                 [Page 27]

Internet Draft               AS Components             November 15, 2000

   |   |   |   | (1) HTTP POST  |                      |
   |--------------------------->|                      |
   |   |   |   | (2) 200 OK     |                      |
   |<---------------------------|                      |
   |   |   |   |                |                      |
   |   |   |   | (3) INVITE     |                      |
   |   |----------------------->|  (4) INVITE          |
   |   |   |   |                |--------------------->|
   |   |   |   |                |   (5) 200 OK         |
   |   |   |   |  (6) 200 OK    |<---------------------|
   |   |<-----------------------|                      |
   |   |   |   | (7) ACK        |                      |
   |   |----------------------->|   (8) ACK            |
   |   |   |   |                |--------------------->|
   |   |   |   |                |                      |
   |   |   |   | (9) INVITE     |                      |
   |   |   |------------------->|   (10) INVITE        |
   |   |   |   |                |--------------------->|
   |   |   |   |                |   (11) 200 OK        |
   |   |   |   | (12) 200 OK    |<---------------------|
   |   |   |<-------------------|                      |
   |   |   |   |  (13) ACK      |                      |
   |   |   |------------------->| (14) ACK             |
   |   |   |   |                |--------------------->|
   |   |   |   |                |                      |
   |   |   |   | (15) INVITE    |                      |
   |   |   |   |--------------->|                      |
   |   |   |   |(16) 500 Full   |                      |
   |   |   |   |<---------------|                      |
   |   |   |   |(17) ACK        |                      |
   |   |   |   |--------------->|                      |
   |   |   |   |                |                      |
   |   |   |   |                |                      |
   |   |   |   |                |                      |

  Web  A   B   C              Controller            Mixer

   Figure 6: Web Scheduled Conference Services

Rosenberg/Mataga/Schulzrinne                                 [Page 28]

Internet Draft               AS Components             November 15, 2000

   In this more complex example, we once again wish to use a web
   interface to set up the conferences. However, we wish to add a stop
   time. If there are participants in the conference when the stop time
   arrives, a warning announcement is played 10 minutes prior, and then
   they are kicked off. In addition, when a user joins the conference,
   before they are added, they hear an announcement that states the name
   of the person that set up the conference, and what the start and stop
   times are. They are then asked to speak their name. Then, they are
   dropped in. The conference server then speaks their name, so that
   everyone knows who just joined.

   This seemingly complex service is very easily constructed by adding
   an IVR server as described above. Now, we have a controller, a mixing
   server, and an IVR server, all working together to build the service.
   Each provides a specific component towards the overall solution, yet
   each is an application server in its own right, with both signaling
   and media interfaces.

   We assume that the web setup is done as above. This time, the stop
   time is provided, along with the name of the person setting up the

   The call flow for the initial participant is shown in Figure 7.

   The initial participant sends an INVITE, which is forwarded to the
   controller. The controller matches the request URI against the
   conference that the user wishes to join. The controller recognizes
   that it needs to play an announcement. So, in step (2), it initiates
   a call to an IVR server. This call is accepted in step (3), and the
   resulting SDP is passed back to the UAC in step (4) in a provisional
   response. After ACKing the call with the IVR in step (5), the
   controller receives an HTTP GET to fetch the root VoiceXML script in
   step (6). The controller dynamically generates the VoiceXML script,
   whose content will cause the server to read out "Welcome to the
   conference, Bob. The call will start at 10 am, and end at 11am.". The
   name of the caller, Bob, is extracted from the INVITE (1).

   Once the prompt has been played, the IVR server prompts the caller
   for their name, and the result is recorded into a file. Then, the
   VoiceXML server attempts to fetch the next VoiceXML script from the
   controller (8). Before responding, the controller reconnects the
   media stream from the media server into the conference bridge. To do
   this, it first sends an INVITE to the conferencing server, using SDP
   indicating send only (9). The server accepts (10), and the controller
   ACKs (11). The SDP from the acceptance (10) is passed in a re-INVITE
   (12) to the IVR server. The IVR server then accepts (13) and the
   controller ACKs (14). Now, a unidirectional media stream from the IVR

Rosenberg/Mataga/Schulzrinne                                 [Page 29]

Internet Draft               AS Components             November 15, 2000

   server into the conference bridge is set up. The controller returns
   the next VoiceXML script (15), which tells the IVR server to play the
   previously recorded file into the conference, announcing the joining
   user. Once this is done, the IVR server fetches the next script (16),
   and gets back an empty response (17). The controller then disconnects
   from the IVR server (18,19). Finally, the controller re-INVITEs the
   conference server (20), updating the SDP to be that from the initial
   INVITE (1).  The SDP from the acceptance (21) is passed on to the
   caller (22). Now, the caller is connected to the mixer as the first
   user in the conference.

   The second user would join in much the same way.

   Approximately 10 minutes before the end of the conference, a timer
   fires inside of the controller. It is time to play a warning
   announcement into the conference. The call flow for this is shown in
   Figure 8.

   The basic idea is to initiate a call to the IVR server and mixer,
   connect them using third party call control, and then have the IVR
   server play the announcement into the conference. The controller then
   hangs up.

   In step (1), the controller sends an INVITE to the mixer with a
   single audio stream on hold (i.e., "empty"). The request URI of the
   request is that of the conference. The mixer returns a 200 OK in step
   (2), and an ACK is sent in (3). The SDP from (2) is then used in step
   (4) to call the IVR server, which answers with its SDP in step (5).
   This is used in a re-invite (7,8,9) to the mixer to update the IP
   address and port as that of the IVR server. The IVR server then
   fetches the root VoiceXML document from the controller (11). This
   document instructs the server to read out some kind of conference
   warning - "Warning, your conference will end in 10 minutes". Once
   this is done, the IVR server fetches the next document (13), which is
   empty. The controller then hangs up with both the mixer (17) and the
   IVR server (19), disconnecting the IVR server from the conference.

   These examples demonstrate the component model we are proposing. The
   mixing component does not have application level intelligence. It has
   a call control interface, allowing it to exist anywhere (and be
   provided by any ASP service) and yet be a callable resource by other
   application server components. By combining a controller with an IVR
   server and the mixing server, complex and useful applications can be
   constructed in a distributed fashion.

6.3 Continuous Text-to-Speech

Rosenberg/Mataga/Schulzrinne                                 [Page 30]

Internet Draft               AS Components             November 15, 2000

  Caller          Controller         IVR Server          Mixing Server
    |               |                  |                   |
    | (1) INVITE    |                  |                   |
    |-------------->| (2) INVITE       |                   |
    |               |----------------->|                   |
    |               | (3) 200 OK       |                   |
    | (4) 183       |<-----------------|                   |
    |<--------------|                  |                   |
    |               | (5) ACK          |                   |
    |               |----------------->|                   |
    |               | (6) HTTP GET     |                   |
    |               |<.................|                   |
    |               | (7) 200 OK       |                   |
    |               |.................>|                   |
    |               |                  |                   |
    |               | (8) HTTP GET     |                   |
    |               |<.................|                   |
    |               | (9) INVITE       |                   |
    |               |------------------------------------->|
    |               | (10) 200 OK      |                   |
    |               |<-------------------------------------|
    |               | (11) ACK         |                   |
    |               |------------------------------------->|
    |               | (12) INVITE      |                   |
    |               |----------------->|                   |
    |               | (13) 200 OK      |                   |
    |               |<-----------------|                   |
    |               | (14) ACK         |                   |
    |               |----------------->|                   |
    |               |                  |                   |
    |               | (15) 200 OK      |                   |
    |               |.................>|                   |
    |               | (16) HTTP GET    |                   |
    |               |<.................|                   |
    |               | (17) 200 OK      |                   |
    |               |.................>|                   |
    |               | (18) BYE         |                   |
    |               |----------------->|                   |
    |               | (19) 200 OK      |                   |
    |               |<-----------------|                   |
    |               | (20) INVITE      |                   |
    |               |------------------------------------->|
    |               | (21) 200 OK      |                   |
    | (22) 200 OK   |<-------------------------------------|
    |<--------------|                  |                   |
    | (23) ACK      |                  |                   |
    |-------------->| (24) ACK         |                   |
    |               |------------------------------------->|
    |               |                  |                   |
    |               |                  |                   |
    |               |                  |                   |

  Caller          Controller         IVR Server          Mixing Server

      | (1) INVITE empty SDP  |                          |
      |---------------------->|                          |
      | (2) 200 OK SDP A      |                          |
      |<----------------------|                          |
      | (3) ACK               |                          |
      |---------------------->|                          |
      |                       |   (4) INV SDP A          |
      | (5) 200 OK SDP B      |                          |
      |                       |   (6) ACK                |
      | (7) INV SDP B         |                          |
      |---------------------->|                          |
      | (8) 200 OK SDP A      |                          |
      |<----------------------|                          |
      | (9) ACK               |                          |
      |---------------------->|                          |
      |                       |  (11) HTTP GET           |
      |                       |  (12) 200 OK             |
      |                       |                          |
      |                       |                          |
      |                       |  (13) HTTP GET           |
      |                       |  (14) 200 OK             |
      |                       |                          |
      | (15) BYE              |                          |
      |                       |  (16) 200 OK             |
      | (17) BYE              |                          |
      |---------------------->|                          |
      | (18) 200 OK           |                          |
      |<----------------------|                          |
      |                       |                          |
      |                       |                          |

   Controller               Mixer                      IVR Server

   Figure  8:  Advanced  Web  Scheduled  Conference   Service:   Warning

Rosenberg/Mataga/Schulzrinne                                 [Page 32]

Internet Draft               AS Components             November 15, 2000


   Another example of an application server component is a continuous
   Text-to-Speech (TTS) converter. This kind of service allows a real
   time text stream (encapsulated in RTP using the RTP payload format
   for text [14] to be received, which is then converted to speech and
   returned as an audio stream encoded using a traditional speech codec,
   be it G.723.1, G.711, or what have you.

   Like the IVR server and mixing server, the TTS server acts as a user
   agent server. It answers incoming calls, and basically mirrors
   incoming text back as speech. It continutes to do so until the call
   is hung up by the initiating client.

   A TTS service can be done using VoiceXML with an IVR server, as in
   the examples above. However, the difference is that here, the text
   stream to be converted is in the data path, not the control path. The
   stream is likely to be generated by other entities in the system, not
   the controller.

6.3.1 Service Interface

   It is likely that the text-to-speech conversation process differs
   significantly depending on the language. As such, separate URIs
   SHOULD be used for language specific TTS services. Specifically, the
   convention sip:<server-specific-name>-<language-tag>@<domain> is
   RECOMMENDED. The language tags SHOULD be selected from the set
   defined in RFC1766 [15].

   One of the unfortunate limitations of SDP is that it is not currently
   possible for a single media stream to be composed of separate media
   formats in each direction. The text over RTP stream is, in fact,
   based on the top level text MIME type (text/t140). As a result, two
   media streams are needed for this service - a unidirectional audio
   stream and a unidirectional text stream.

   First, the client INVITEs the server. The SDP MUST indicate a two
   media streams. One stream MUST be of type audio. It SHOULD contain
   the set of audio codecs acceptable to the client. The stream MUST be
   marked as recv-only. The other stream MUST be of type text. It MUST
   contain a single codec, which is a dynamic payload number bound to
   text/t140. The stream MUST be marked as send-only. The 200 OK
   response from the TTS server that accepts the call has SDP with a two
   media lines, one of type audio, and one of type text, in the same
   order the streams appeared in the INVITE, as mandated by RFC2543. The
   audio stream SHOULD contain a subset of the codecs listed in the
   audio stream in the INVITE. The audio stream MUST be marked as send-

Rosenberg/Mataga/Schulzrinne                                 [Page 33]

Internet Draft               AS Components             November 15, 2000

   only. The text stream MUST contain a single codec, which is a dynamic
   payload type number bound to text/t140. The stream MUST be marked as

   The client then ACKs the request. The TTS server SHOULD attempt to
   convert all text received on the incoming text stream to speech, and
   return the resulting speech on the outgoing audio stream.

6.3.2 Hearing Impaired Service

   The TTS server is extremely useful in supporting hearing impaired
   services. Examples of such services are described in describes a
   service where a controller accesses a TTS service.

6.4 Messaging Servers

   Another type of application server component is a messaging server.
   Messaging servers allow for callers to record audio messages for
   users on the system. Users can also call into the server to retrieve
   these messages, delete them, and file them. The system operates
   through the use of voice prompts combined with DTMF detection and/or
   speech recognition. The prompts that are played are context
   dependent. A messaging server can be viewed as a specialized version
   of an IVR server with an application specific controller associated
   with it. In fact, a messaging server can be implemented in this way
   exactly. However, the combination is also usefully viewed as a
   component in its own right, due to the frequent need for messaging
   components in more complex applications.

6.4.1 Service Interface

   The service interface for communicating with a messaging server is
   described in detail in [7]. The interface provides well known URIs
   for the most common resources within a messaging server - user
   specific message drops with a variety of drop conditions (called
   party busy, called party not there, etc.), message retrievals using a
   variety of authentication mechanisms (PIN, SIP level authentication),
   and message drops that are not user specific, so that the target user
   is queried for as part of the interface.

6.4.2 Web Enabled Message Drops

   An example usage of this application component is a web front end
   that allows users to leave voicemail for company employees through
   the company web page. The page has a URL for each company employee.
   If some user A clicks on a URL for employee B, A's phone rings. When
   A picks up, they hear a greeting to record a message for employee B.

Rosenberg/Mataga/Schulzrinne                                 [Page 34]

Internet Draft               AS Components             November 15, 2000

   The call flow for this application is the combination of third party
   call control combined with access to the service. It is shown in
   Figure 9.

   The caller, from a web page, clicks on the URL for the user they wish
   to leave a message for. The result is an HTTP request (1) to the
   controller. The URI in this request would be some controller-specific
   identifier that tells the controller what it needs to do. The
   controller then calls the user (3) using an SDP with a single media
   stream on hold initially. This is accepted (4), and the resulting SDP
   is used in an INVITE to the messaging server (6). The URI of this
   INVITE is that for message drop with standard greeting (sip:sub-
   jdrosen-deposit@voiceserver.com). The call is accepted (7) and the
   200 OK is used in a re-INVITE to the caller (9) to set the address of
   the media stream to that of the voicemail server. After the call is
   accepted (10) and ACKed (11), the caller hears the voice drop prompt
   for the messaging server, and can record their message.

7 Security Considerations

   In many cases, authorization may need to be made to allow a caller
   access to a session level resource. Traditional SIP level
   authentication mechanisms can be used to accomplish this. Note,
   however, that in many cases the caller is the controller, which is
   acting as a third party call controller. In these cases, a two level
   trust model is really needed. The trust relationship in such
   situations is really between the session level resource and the
   controller (perhaps through an explicit business arrangement), and
   then between the controller and the caller. Thus, controllers should
   authenticate themselves to session resources they contact, rather
   than trying to proxy credentials from the caller.

8 Conclusion

   In this paper, we have argued that rapid deployment of complex
   communications applications will require a distributed model where
   application components are spread across the network. These
   components could be offered by separate providers, for example,
   enabling an ASP component model to evolve. We have observed that many
   of the components can be described as having some kind of session
   level resource that can be communicated with, usually in an automated
   fashion. Access to these resources is typically parameterized. As a
   result, SIP access, using the request URI as a service indicator, is
   an ideal way to communicate across these components.

   To validate this model, we examined the specific service interfaces
   that would be defined by IVR servers, conferencing servers, text-to-

Rosenberg/Mataga/Schulzrinne                                 [Page 35]

Internet Draft               AS Components             November 15, 2000

     |      |              |                      |
     |      | (1) HTTP GET |                      |
     |-------------------->|                      |
     |      | (2) 200 OK   |                      |
     |<--------------------|                      |
     |      | (3) INV      |                      |
     |      |<-------------|                      |
     |      | (4) 200 OK   |                      |
     |      |------------->|                      |
     |      | (5) ACK      |                      |
     |      |<-------------|                      |
     |      |              |  (6) INV             |
     |      |              |--------------------->|
     |      |              |  (7) 200 OK          |
     |      |              |<---------------------|
     |      |              |  (8) ACK             |
     |      |              |--------------------->|
     |      | (9) INV      |                      |
     |      |<-------------|                      |
     |      | (10) 200 OK  |                      |
     |      |------------->|                      |
     |      | (11) ACK     |                      |
     |      |<-------------|                      |
     |      |              |                      |
     |      |              |                      |
     |      |              |                      |

    Web    SIP           Controller             Messaging
      Caller                                     Server

   Figure 9: Web Enabled Message Drops

Rosenberg/Mataga/Schulzrinne                                 [Page 36]

Internet Draft               AS Components             November 15, 2000

   speech servers and messaging servers. We gave call flows of complex
   applications built up from these components using the specified

9 Author's Addresses

   Jonathan Rosenberg
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com

   Peter Mataga
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com

   Henning Schulzrinne
   Columbia University
   M/S 0401
   1214 Amsterdam Ave.
   New York, NY 10027-7003
   email: schulzrinne@cs.columbia.edu

10 Bibliography

   [1] N. Greene, M. Ramalho, and B. Rosen, "Media gateway control
   protocol architecture and requirements," Request for Comments 2805,
   Internet Engineering Task Force, Apr. 2000.

   [2] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett,
   "Media gateway control protocol (MGCP) version 1.0," Request for
   Comments 2705, Internet Engineering Task Force, Oct. 1999.

   [3] F. Cuervo, N. Greene, C. Huitema, A. Rayhan, B. Rosen, and J.
   Segers, "Megaco protocol 0.8," Request for Comments 2885, Internet
   Engineering Task Force, Aug. 2000.

   [4] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:

Rosenberg/Mataga/Schulzrinne                                 [Page 37]

Internet Draft               AS Components             November 15, 2000

   session initiation protocol," Request for Comments 2543, Internet
   Engineering Task Force, Mar. 1999.

   [5] J. Rosenberg, H. Schulzrinne, and J. Peterson, "Third party call
   control in SIP," Internet Draft, Internet Engineering Task Force,
   Mar. 2000.  Work in progress.

   [6] M. Handley and V. Jacobson, "SDP: session description protocol,"
   Request for Comments 2327, Internet Engineering Task Force, Apr.

   [7] B. Campbell and R. Sparks, "Control of service context using SIP
   Request-URI," Internet Draft, Internet Engineering Task Force, Oct.
   2000.  Work in progress.

   [8] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
   telephony tones and telephony signals," Request for Comments 2833,
   Internet Engineering Task Force, May 2000.

   [9] V. Bharatia, E. Cave, and B. Culpepper, "SIP INFO method for
   event reporting," Internet Draft, Internet Engineering Task Force,
   Apr. 2000.  Work in progress.

   [10] T. Choudhuri, C. Haun, P. Sollee, S. Orton, and S. Whynot, "SIP
   INFO method for DTMF digit transport and collection," Internet Draft,
   Internet Engineering Task Force, Apr. 2000.  Work in progress.

   [11] VoiceXML Forum, "Voice extensible markup language (voicexml)
   version 1.00," voicexml forum specification, VoiceXML Forum, Mar.

   [12] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:
   Session initiation protocol," Internet Draft, Internet Engineering
   Task Force, Aug. 2000.  Work in progress.

   [13] S. Donovan, "The SIP INFO method," Request for Comments 2976,
   Internet Engineering Task Force, Oct. 2000.

   [14] G. Hellstrom, "RTP payload for text conversation," Request for
   Comments 2793, Internet Engineering Task Force, May 2000.

   [15] H. Alvestrand, "Tags for the identification of languages,"
   Request for Comments 1766, Internet Engineering Task Force, Mar.

Rosenberg/Mataga/Schulzrinne                                 [Page 38]

Internet Draft               AS Components             November 15, 2000

                           Table of Contents

   1          Introduction ........................................    2
   2          Why Decompose .......................................    2
   3          Tightly Coupled Decomposition .......................    4
   4          The Decoupled Model .................................    6
   4.1        Architecture ........................................    7
   4.2        Benefits of the Decoupling ..........................   10
   5          Architecture for the Interfaces .....................   11
   5.1        Naming ..............................................   12
   5.2        Additional Message Content ..........................   14
   5.3        Session Duration ....................................   14
   5.4        Third Party Call Control ............................   15
   5.5        Side Channels .......................................   18
   6          Patterns for Accessing Components ...................   19
   6.1        Interactive Voice Response Services .................   19
   6.2        Conferencing Servers ................................   23
   6.2.1      Web Scheduled Conference Services ...................   26
   6.2.2      Web Scheduled, IVR supported, Time Limited
   Conference .....................................................   27
   6.3        Continuous Text-to-Speech ...........................   30
   6.3.1      Service Interface ...................................   33
   6.3.2      Hearing Impaired Service ............................   34
   6.4        Messaging Servers ...................................   34
   6.4.1      Service Interface ...................................   34
   6.4.2      Web Enabled Message Drops ...........................   34
   7          Security Considerations .............................   35
   8          Conclusion ..........................................   35
   9          Author's Addresses ..................................   37
   10         Bibliography ........................................   37

Rosenberg/Mataga/Schulzrinne                                 [Page 39]