Internet Draft                                                 B. Wyld
 Document: draft-ietf-speechsc-protocol-eval-02                  Editor
 Expires: October 2003                                         Eloquant
 Version 02                                                   June 2003
 
                        SPEECHSC Protocol Evaluation
 
 Status of this Memo
 
    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.
 
    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note that
    other groups may also distribute working documents as Internet-
    Drafts.
 
    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use Internet-Drafts
    as reference material or to cite them other than as "work in
    progress."
 
    The list of current Internet-Drafts can be accessed at
         http://www.ietf.org/ietf/1id-abstracts.txt
    The list of Internet-Draft Shadow Directories can be accessed at
         http://www.ietf.org/shadow.html.
 
 Abstract
 
    This document is the Protocol Evaluation Document for the SPEECHSC
    Working Group.  Section 3 summarizes the individual protocol
    comparisons (in Sections 4 through 9) against the SPEECHSC
    requirements [1].
 
 Table of Contents
 
    1.   Overview...................................................2
    2.   Protocol Proposals.........................................3
    3.   Protocol Evaluation Summary and Conclusion.................3
    4.   Session Control : "Beep" Compliance Evaluation (Jerry Carter)
         3
       4.1.   General notes:........................................3
       4.2.   Analysis of General Requirements......................4
       4.3.   Analysis of Duplexing and Parallel Operation
       Requirements..................................................4
       4.4.   Analysis of additional considerations (non-normative).5
       4.5.   Analysis of Security considerations...................5
       4.6.   Interaction Model.....................................5
    5.   Session Control : "SIP" Compliance Evaluation (Rajiv
    Dharmadhikari)...................................................5
       5.1.   Introduction..........................................5
       5.2.   Analysis of General Requirements......................6
       5.3.   Analysis of Duplexing and Parallel Operation
       Requirements..................................................7
       5.4.   Analysis of additional considerations (non-normative).7
       5.5.   Analysis of Security considerations...................7
       5.6.   Other Criteria........................................7
       5.7.   Interaction Model.....................................8
    6.   Session Control : "RTSP" Compliance Evaluation (Brian Wyld)8
       6.1.   General Introduction..................................8
 
 
 Wyld                    Expires - October 2003               [Page 1]


                 SPEECHSC Protocol Evaluation Template      June 2003
 
 
       6.2.   Analysis of General Requirements......................9
       6.3.   Analysis of Duplexing and Parallel Operation
       Requirements..................................................9
       6.4.   Analysis of additional considerations (non-normative).9
       6.5.   Analysis of Security considerations..................10
       6.6.   Interaction Model....................................10
    7.   Session Control : "Web Services" Compliance Evaluation
    (Stephane H. Maes)..............................................10
       7.1.   General Notes:.......................................10
       7.2.   Analysis of General Requirements.....................11
       7.3.   Analysis of Duplexing and Parallel Operation
       Requirements.................................................13
       7.4.   Analysis of additional considerations (non-normative)13
       7.5.   Analysis of Security considerations..................14
       7.6.   Interaction Model....................................14
    8.   Resource Control : "MRCP" Compliance Evaluation (Sarvi
    Shanmugham).....................................................14
       8.1.   General..............................................14
       8.2.   Analysis of TTS requirements.........................15
       8.3.   Analysis of ASR requirements.........................16
       8.4.   Analysis of Speaker Identification and Verification
       Requirements.................................................17
    9.   Resource Control : "RTSP" Compliance Evaluation (Brian Wyld)
         18
       9.1.   General Introduction.................................18
       9.2.   Analysis of TTS requirements.........................18
       9.3.   Analysis of ASR requirements.........................18
       9.4.   Analysis of Speaker Identification and Verification
       Requirements.................................................19
    10.  Security Considerations...................................19
    11.  References................................................19
 
      1. Overview
 
    This document provides the template for the content for the
    SPEECHSC Protocol Evaluation document.
 
    This section will contain an overview of the process.
 
    Section 2 contains a list of the proposed protocols submitted to
    the WG. These protocols are of three different kinds:
       - a generic service/resource access system (Web Services)
       - 'classic' session control protocols for media channels (BEEP,
          SIP, RTSP), of which RTSP also provides resource control
       - a specific ASR/TTS resource control message set designed to
          be tunneled in a session control protocol (MRCP)
 
    Section 3 provides a conclusion of the evaluations of the proposed
    protocols against the Requirements and framework, and recommends a
    direction for the creation of the speechsc protocol.
 
    Sections 4-7 provide the individual protocol evaluations for the
    protocols providing session control against the SPEECHSC
    requirements [1] that relate to these needs.
 
    Sections 8 and 9 provide individual protocol evaluations for
    protocols providing resource control against the SPEECHSC resource
    control requirements.
 
 
 
 
      2. Protocol Proposals
 
    This section contains a list of the existing protocols submitted to
    the SPEECHSC WG for consideration by the deadline.
          1. BEEP
          2. SIP
          3. RTSP
          4. MRCP (initial submission)
          5. Web Services
 
 
    Each protocol section contains a review of the protocol's level of
    compliance with each of the SPEECHSC Requirements [1] as derived
    from the proposed protocol documents. The following key is used to
    identify the level of compliance of each of the individual
    protocols:

    T = Total Compliance.  Meets the requirement fully.
    P = Partial Compliance.  Meets some aspect of the requirement.
    P+ = Compliance possible. Could meet the requirement with "natural"
    evolution of the protocol.
    F = Failed Compliance.  Does not meet the requirement.
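
    For readers who want to tabulate the ratings in this document
    mechanically, the key above can be encoded as a small lookup. The
    following Python sketch is illustrative only; the names Compliance
    and parse_rating are hypothetical and do not come from the SPEECHSC
    documents.

```python
from enum import Enum


class Compliance(Enum):
    """Compliance levels from the evaluation key (names are illustrative)."""
    TOTAL = "T"
    PARTIAL = "P"
    POSSIBLE = "P+"
    FAILED = "F"


def parse_rating(entry: str) -> Compliance:
    """Extract the rating from an evaluation entry such as 'P+: text...'."""
    code = entry.split(":", 1)[0].strip()
    return Compliance(code)


# Entries abbreviated from the BEEP evaluation in Section 4.2.
print(parse_rating("T: Though it is primarily a peer-to-peer protocol"))
print(parse_rating("P+: BEEP imposes a small overhead"))
```

    An entry whose prefix is not one of the four codes raises
    ValueError, which makes unrated sections easy to spot.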
 
      3. Protocol Evaluation Summary and Conclusion
 
    In summary, decomposing the problem into session control and
    resource control leads to the following findings:

     - SIP is a good fit for session control for the speechsc
    requirements

     - MRCP is a good start for resource control for the speechsc
    requirements

    The conclusion of the protocol evaluation is therefore to:
       - create a new speechsc-specific, resource-control-only
          protocol, based on MRCP
       - use sessions that have been established by SIP (and define a
          means of referring to these sessions)
 
 
      4. Session Control : "Beep" Compliance Evaluation (Jerry Carter)
 
         4.1. General notes:
 
    The BEEP protocol provides a general framework for establishing
    connections, defining new channels, negotiating security, and
    performing user authentication. Protocols built on BEEP must define
    a profile detailing how connections are established and must define
    a set of messages which will be delivered using BEEP. The protocol
    is peer-to-peer, although client-server style requests could be
    easily handled.
 
    The following sub-sections compare each individual requirement
    against the protocol.
 
 
 
 
 
 
         4.2. Analysis of General Requirements
 
           4.2.1. Reuse existing protocols [5.1]
 
    T: BEEP is a published protocol, specified in RFC 3080
    (http://www.ietf.org/rfc/rfc3080.txt).
 
           4.2.2. Maintain Existing Protocol Integrity [5.2]
 
    P: BEEP assumes that protocols, such as SpeechSC, will add
    messages.
 
    Supporting multiple clients using TCP may require some effort.
 
           4.2.3. Avoid Duplicating Existing Protocols [5.3]
 
    T: Building SpeechSC over BEEP would allow the specification to
    focus on managing the ASR, media server, and SI/SV resources and
    the possible interactions between them. The operations for
    establishing connections and defining new channels would be handled
    by BEEP.
 
           4.2.4. Protocol efficiency [5.4]
 
    P+: BEEP imposes a small overhead (roughly 40 bytes per message).
    It provides a mechanism for supporting multiple communication
    channels over a single port. If grouping of requests is desired,
    this would need to be handled by grouping the SpeechSC messages.
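
    To make the per-message overhead concrete, the following Python
    sketch builds a single BEEP MSG frame using the header and trailer
    layout of RFC 3080 and measures the bytes added around the payload.
    The function name beep_frame and the sample payload are
    hypothetical; the payload is treated as opaque bytes that already
    include any MIME headers.

```python
def beep_frame(channel: int, msgno: int, seqno: int, payload: bytes,
               more: bool = False) -> bytes:
    """Build one BEEP MSG frame: header line, payload, 'END' trailer."""
    continuation = "*" if more else "."   # '*' means more frames follow
    header = f"MSG {channel} {msgno} {continuation} {seqno} {len(payload)}\r\n"
    return header.encode("ascii") + payload + b"END\r\n"


# Hypothetical payload: MIME header, blank line, then the message body.
payload = b"Content-Type: text/plain\r\n\r\nhello"
frame = beep_frame(channel=1, msgno=0, seqno=0, payload=payload)

# Framing overhead is everything that is not payload.
overhead = len(frame) - len(payload)
print(overhead)
```

    The exact count depends on the number of digits in the channel,
    message, and sequence numbers and on whether the MIME headers are
    counted as overhead, so this sketch does not attempt to reproduce
    the estimate quoted above.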
 
           4.2.5. Explicit invocation of services [5.5]
 
    T: Though it is primarily a peer-to-peer protocol, BEEP may act as
    a traditional client-server protocol.
 
           4.2.6. Server Location and Load Balancing [5.6]
 
    P+: This functionality is not provided by BEEP. This would need to
    be added as an extension.
 
           4.2.7. Simultaneous services [5.7]
 
    T: Multiple channels providing different services are possible. Each
    service is simply a message type which is passed to the server
    using BEEP.
 
           4.2.8. Multiple media sessions [5.8]
 
    F: BEEP assumes a 1:1 connection using TCP/IP.
 
         4.3. Analysis of Duplexing and Parallel Operation Requirements
 
           4.3.1. Duplexing and Parallel Operation Requirements [9]
 
    P+: Parallel operations may be obtained using multiple channels. A
    message on one channel could potentially interrupt activity
    happening on the second. BEEP is very flexible allowing the server
    to implement whatever behavior is desired.
 
 
           4.3.2. Full Duplex operation [9.1.1]
 
    T: BEEP is a peer-to-peer protocol allowing full duplex
    communication on a single channel or parallel communication on
    multiple channels.
 
           4.3.3. Multiple services in parallel [9.1.2]
 
    P+: Multiple services may be run on separate channels. Merging or
    T-ing of RTP must be implemented by the server.
 
           4.3.4. Combination of services
    TBD
 
         4.4. Analysis of additional considerations (non-normative)
    TBD
 
         4.5. Analysis of Security considerations
 
           4.5.1. Security Considerations [11]
 
    P+: BEEP offers a mechanism for managing security and user
    authentication.
    SpeechSC requires managing multiple data streams and some form of
    unified authentication / security might be a goal. If so, BEEP
    security should be revisited with this in mind.
 
 
         4.6. Interaction Model
    TBC : TO BE COMPLETED : Analysis of the interaction model of the
    protocol during the 'data' phase (i.e. after session establishment)
    and its suitability for speechsc.
 
 
      5. Session Control : "SIP" Compliance Evaluation (Rajiv
         Dharmadhikari)
 
         5.1. Introduction
    SIP is a protocol for initiating, modifying, and terminating
    multimedia sessions. The protocol is considered an IETF standard
    and its specifications can be found in [2]. The following sections
    provide a general statement with regards to the applicability of
    SIP as the control protocol for SPEECHSC.
 
 
           5.1.1. SIP General Applicability
    SIP is a mature, well understood, and frequently used session
    establishment protocol. It has gone through multiple revisions in
    the IETF standards process. A number of commercial and public
    domain implementations of SIP are available. Because it closely
    resembles HTTP and is a text-based protocol, a large number of SIP
    application developers are available.
 
 
 
 
 
 
           5.1.2. SIP Use in VOIP environment
    SIP is already being used to establish and redirect RTP streams
    from various end points. SPEECHSC requires a protocol for
    controlling ASR, TTS and SV resources. When these resources are
    deployed in a VOIP network that requires them to process media
    carried in RTP, SIP is already used in many deployments. Rather
    than inventing a new control protocol and introducing the
    operational aspects of that new protocol, SIP can be reused for
    controlling SPEECHSC resources.
 
 
         5.2. Analysis of General Requirements
 
           5.2.1. Reuse existing protocols [5.1]
    T: SIP is an existing, widely used, and mature protocol defined in
    [2].
 
           5.2.2. Maintain Existing Protocol Integrity [5.2]
    T: Existing SIP methods and header fields will not be changed when
    SIP is used to control SPEECHSC resources. If extensions are
    required, SIP allows a custom payload to be carried in the body.
    This payload is understood only by UAs and does not impact
    protocol integrity.
 
 
           5.2.3. Avoid Duplicating Existing Protocols [5.3]
    T: Many of the requirements for SPEECHSC operation can easily be
    satisfied by SIP, e.g. establishing RTP streams or redirecting
    them. Without SIP, a new SPEECHSC protocol would have to duplicate
    much of this session management functionality.
 
 
           5.2.4. Protocol efficiency [5.4]
    T: SIP is a very lightweight protocol when run over TCP or UDP. It
    leverages the efficiency of the TCP and UDP protocols, which have
    been in use for over 20 years.
 
           5.2.5. Explicit invocation of services [5.5]
    T: The SIP URI mechanism allows invocation of different services.
 
           5.2.6. Server Location and Load Balancing [5.6]
    P+: SIP employs standard DNS name resolution for locating
    resources. SIP itself does not provide load balancing features.
    Application level load balancers can be used to load balance SIP
    requests.
 
 
           5.2.7. Simultaneous services [5.7]
    T: SIP allows simultaneous invocation of different services. SIP
    allows forking or splitting the same media stream to different end
    points as defined in [2].
 
 
           5.2.8. Multiple media sessions [5.8]
    T: SIP uses SDP to describe RTP stream characteristics. This allows
    control of the direction of an RTP stream, such as bi-directional
    or uni-directional. SIP allows a UA to establish sessions with
    multiple UAs for the same session.
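
    As an illustration of stream direction control, the following
    Python sketch emits an SDP media section whose direction is set
    with the standard sendrecv, sendonly, or recvonly attribute. The
    function name sdp_audio_media, the port number, and the choice of
    the PCMU codec are assumptions made for the example only.

```python
def sdp_audio_media(port: int, direction: str = "sendrecv") -> str:
    """Return an SDP media section for one RTP audio stream."""
    allowed = {"sendrecv", "sendonly", "recvonly"}
    if direction not in allowed:
        raise ValueError(f"unknown direction: {direction}")
    lines = [
        f"m=audio {port} RTP/AVP 0",   # static payload type 0 is PCMU
        "a=rtpmap:0 PCMU/8000",
        f"a={direction}",              # direction of the RTP stream
    ]
    return "\r\n".join(lines) + "\r\n"


# A synthesizer that only sends audio, and a recognizer that only
# receives it (port 49170 is a placeholder).
print(sdp_audio_media(49170, "sendonly"))
print(sdp_audio_media(49170, "recvonly"))
```

    In a complete offer this media section would follow the usual
    session-level lines (v=, o=, s=, c=, t=), which are omitted here.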
 
 
         5.3. Analysis of Duplexing and Parallel Operation Requirements
 
           5.3.1. Duplexing and Parallel Operation Requirements [9]
    T: A SPEECHSC resource is a SIP UA that can handle session requests
    from different UAs.
 
           5.3.2. Full Duplex operation [9.1.1]
    T: Each SIP UA consists of a UAC and a UAS. This allows for full
    duplex operation.
 
           5.3.3. Multiple services in parallel [9.1.2]
    T: SIP allows simultaneous invocation of different services. SIP
    allows forking or splitting the same media stream to different end
    points as defined in [2].
 
 
           5.3.4. Combination of services
    T: See 5.6.3. A SIP UA can invoke different services and combine
    the results.
 
         5.4. Analysis of additional considerations (non-normative)
    TBD
 
         5.5. Analysis of Security considerations
 
           5.5.1. Security Considerations [11]
    T: SIP employs different authentication schemes that are widely
    used in IP based protocols.
 
         5.6. Other Criteria
    The following criteria were also defined by the evaluator of SIP.
 
           5.6.1. Ability to establish session between SPEECHSC client
               and SPEECHSC resource
    T: A SIP User Agent can establish a session with another SIP User
    Agent.
 
           5.6.2. Ability to terminate session by either SPEECHSC
               client or SPEECHSC resource
    T: A SIP User Agent can terminate a session with another SIP User
    Agent.
 
           5.6.3. Support reliable sequencing and delivery between
               SPEECHSC client and SPEECHSC resource
    P: SIP can be run over TCP or UDP. When run over TCP, this
    requirement is easily satisfied. When run over UDP, the SIP User
    Agent is required to implement logic to ensure reliable sequencing
    and delivery.
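
    As one illustration of such logic, the following Python sketch
    computes the retransmission schedule for an INVITE request sent
    over UDP. It assumes the timer defaults of RFC 3261 (T1 = 500 ms,
    an interval that doubles after each retransmission, and a
    transaction timeout of 64*T1); the function name is hypothetical.

```python
from typing import List


def invite_retransmit_times(t1: float = 0.5,
                            timeout_factor: int = 64) -> List[float]:
    """Times (seconds after the first send) at which an INVITE is resent."""
    deadline = timeout_factor * t1   # transaction gives up at this point
    times = []
    elapsed = 0.0
    interval = t1
    while elapsed + interval < deadline:
        elapsed += interval
        times.append(elapsed)
        interval *= 2                # back off exponentially
    return times


print(invite_retransmit_times())   # [0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
```

    With the defaults, the request is sent seven times in total (the
    initial transmission plus six retransmissions) before the
    transaction times out at 32 seconds.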
 
 
 
 
 
 
 
           5.6.4. Ability for SPEECHSC client to coordinate SPEECHSC
               resources on different machines for a single session
    T: A SPEECHSC client can use SIP to establish SIP sessions with
    different machines.
 
           5.6.5. Ability for SPEECHSC resource to handle multiple
               SPEECHSC clients
    T: A SPEECHSC resource is a SIP UA that can handle session requests
    from different UAs.
 
           5.6.6. The SPEECHSC resource should be able to generate
               asynchronous events or unsolicited messages
    T: SIP allows asynchronous events or unsolicited messages to be
    generated using the SUBSCRIBE/NOTIFY mechanism.
 
           5.6.7. The SPEECHSC client and resource should have ability
               for authenticating each other
    T: SIP employs different authentication schemes that are widely
    used in IP based protocols.
 
           5.6.8. Ability to determine success or failure from both
               SPEECHSC client and SPEECHSC resource side
    T: The protocol has the following response codes: 200 for success;
    3xx, 4xx, and 5xx for failure.
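
    A client or resource can derive the outcome of a request from the
    status code alone. The following Python sketch follows the
    classification given above and additionally treats 1xx responses
    as provisional and 6xx responses as failures, as SIP defines those
    classes too. The function name classify_sip_status is
    hypothetical.

```python
def classify_sip_status(code: int) -> str:
    """Map a SIP status code to 'provisional', 'success', or 'failure'."""
    if not 100 <= code <= 699:
        raise ValueError(f"not a SIP status code: {code}")
    if code < 200:
        return "provisional"   # request still in progress
    if code < 300:
        return "success"       # 2xx
    return "failure"           # 3xx, 4xx, 5xx, 6xx


for status in (180, 200, 302, 404, 503):
    print(status, classify_sip_status(status))
```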
 
           5.6.9. Support for versioning between SPEECHSC client and
               SPEECHSC resource
    P+: This will require an additional header or element in the
    body of the SIP message for versioning. The current version field
    is intended for the SIP protocol version.
 
 
         5.7. Interaction Model
 
    Speechsc has certain needs related to the interaction model of the
    protocol during the 'data' phase (i.e. after session
    establishment). Specifically, speechsc will require that the
    resource server can send unsolicited messages/transactions to the
    resource client to return results and indicate events.
 
    SIP messages in the data phase can flow in both directions (client
    to server as well as server to client). The SIP INFO message can
    be used for this purpose. SIP INFO is intended for mid-call
    message semantics. With this message, transactions can be
    initiated/defined by both ends.
 
    SIP therefore has an interaction model suited to the speechsc
    model, which supports peer-peer messaging with a basic
    transactional symmetrical request/response model.
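
    The following Python sketch shows what a server-initiated INFO
    request inside an established session might look like on the wire.
    It is a minimal illustration, not a complete SIP implementation:
    the function name build_sip_info, all addresses, tags, and
    identifiers, and the text/plain body are placeholders, since the
    payload format for speechsc is not defined here.

```python
def build_sip_info(request_uri: str, call_id: str, cseq: int,
                   from_hdr: str, to_hdr: str, via: str,
                   body: str, content_type: str) -> bytes:
    """Assemble a SIP INFO request carrying the given body."""
    payload = body.encode("utf-8")
    headers = [
        f"INFO {request_uri} SIP/2.0",
        f"Via: {via}",
        "Max-Forwards: 70",
        f"From: {from_hdr}",
        f"To: {to_hdr}",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INFO",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",   # length in bytes, not characters
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload


# A resource server reporting an event to the client. Every value
# below is a made-up placeholder.
message = build_sip_info(
    request_uri="sip:client@client.example.com",
    call_id="a84b4c76e66710@asr.example.com",
    cseq=2,
    from_hdr="<sip:asr@asr.example.com>;tag=314159",
    to_hdr="<sip:client@client.example.com>;tag=271828",
    via="SIP/2.0/TCP asr.example.com;branch=z9hG4bK776asdhds",
    body="recognition-complete",
    content_type="text/plain",
)
print(message.decode("utf-8"))
```

    The same function serves for the client-to-server direction by
    swapping the From and To values, which reflects the symmetry
    described above.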
 
      6. Session Control : "RTSP" Compliance Evaluation (Brian Wyld)
 
         6.1. General Introduction
    RTSP is an existing protocol, oriented towards audio playback and
    recording. As such, it has support for RTP session control, with
    SDP used for session description, and a message set allowing
    operation as a player/recorder with audio "VCR" controls.
 
 
 
    Only the session control is evaluated here (see Section 9 for the
    evaluation of the resource control elements).
 
    The following sub-sections compare each individual requirement
    against the protocol.
 
         6.2. Analysis of General Requirements
 
           6.2.1. Reuse existing protocols [5.1]
    T: RTSP/RTP/SDP would be reused.
 
           6.2.2. Maintain Existing Protocol Integrity [5.2]
    T: The extensions to RTSP to allow speechsc use would be in the
    spirit of the protocol, and would not break existing servers or
    clients.
 
           6.2.3. Avoid Duplicating Existing Protocols [5.3]
    T: Using RTSP would not recreate it.
 
           6.2.4. Protocol efficiency [5.4]
    T: RTSP is a text based protocol, but is relatively succinct as
    messages are specific to their operation.
 
           6.2.5. Explicit invocation of services [5.5]
    T: RTSP service invocation is sufficient.
 
           6.2.6. Server Location and Load Balancing [5.6]
    F: RTSP does not address this topic; however it can be used with
    other protocols such as SLP or UDDI to do so.
 
           6.2.7. Simultaneous services [5.7]
    T: RTSP allows simultaneous invocation of services on the same or
    different control channel.
 
           6.2.8. Multiple media sessions [5.8]
    T: RTSP allows multiple media sessions.
 
         6.3. Analysis of Duplexing and Parallel Operation Requirements
 
           6.3.1. Duplexing and Parallel Operation Requirements [9]
    T: RTSP allows session setup that should fulfill these
    requirements.
 
           6.3.2. Full Duplex operation [9.1.1]
    T: RTSP can create a full duplex session.
 
           6.3.3. Multiple services in parallel [9.1.2]
    T: RTSP can request multiple operations of the same type on the
    same session.
 
           6.3.4. Combination of services
    T: RTSP can request multiple operations of different types on the
    same session.
 
         6.4. Analysis of additional considerations (non-normative)
    TBD
 
 
         6.5. Analysis of Security considerations
 
           6.5.1. Security Considerations [11]
    F: RTSP provides no specific security functionality at all, but
    depends on other IETF security protocols (as it uses TCP) to pre-
    validate and protect the sessions.
 
         6.6. Interaction Model
    Speechsc has certain needs related to the interaction model of the
    protocol during the 'data' phase (i.e. after session
    establishment). Specifically, speechsc will require that the
    resource server can send unsolicited messages/transactions to the
    resource client to return results and indicate events.
 
    RTSP messages in the data phase can flow in both directions (client
    to server as well as server to client). Transactions can be
    initiated/defined by both ends. Currently most of the defined
    transactions are C-S; however there already exists an ANNOUNCE
    message transaction that is used to transit general content in both
    directions (and is in fact used by MRCP to transport its resource
    control messages).
 
    RTSP has therefore an interaction model suited to the speechsc
    model, which supports peer-peer messaging with a basic
    transactional symmetrical request/response model.
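
    The following Python sketch shows how general content could be
    wrapped in an RTSP ANNOUNCE request, in either direction, on an
    existing session. It is illustrative only: the function name
    build_rtsp_announce, the URL, the session identifier, and the
    text/plain bodies are placeholders and do not reproduce the MRCP
    message format.

```python
def build_rtsp_announce(url: str, cseq: int, session: str,
                        body: str, content_type: str) -> bytes:
    """Assemble an RTSP/1.0 ANNOUNCE request carrying the given body."""
    payload = body.encode("utf-8")
    lines = [
        f"ANNOUNCE {url} RTSP/1.0",
        f"CSeq: {cseq}",
        f"Session: {session}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",   # length in bytes
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + payload


# Client-to-server command, then a server-to-client event on the same
# session. URL, session id, and bodies are made-up placeholders.
url = "rtsp://media.example.com/recognizer"
command = build_rtsp_announce(url, 4, "12345678", "start", "text/plain")
event = build_rtsp_announce(url, 1, "12345678", "done", "text/plain")
print(command.decode("utf-8"))
print(event.decode("utf-8"))
```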
 
      7. Session Control : "Web Services" Compliance Evaluation
         (Stephane H. Maes)
 
         7.1. General Notes:
    Speech engines (speech recognition, speaker recognition, speech
    synthesis, recorders and playback, NL parsers, and any other speech
    processing engines, e.g. speech detection, barge-in detection,
    etc.) as well as audio sub-systems (audio input and output
    sub-systems) can be considered as web services. They can be
    described and asynchronously programmed via WSDL (on top of SOAP),
    combined in a flow described via WSFL, discovered via UDDI, and
    asynchronously controlled via SOAP, which also enables
    asynchronous exchanges between the engines.

    This solution has the advantage of providing flexibility,
    scalability and extensibility while reusing an existing framework
    that fits the evolution of the web: web services and XML protocols
    [WS1].
 
    According to the web services framework, speech engines (audio
    sub-systems, engines, speech processors) can be defined as web
    services that are characterized by an interface that consists of
    some of the following ports:
        - "control in" port(s): It sets the engine context, i.e. all
        the settings required for a speech engine to run. It may
        include addresses where to get or send the streamed audio or
        results.
        - "control out" port(s): It produces the non-audio engine
        output (i.e. results and events). It may also involve some
        session control exchanges.
        - "audio in" port(s): It receives streamed input data.
        - "audio out" port(s): It produces streamed output data.

    Audio sub-systems can also be treated as web services that can
    produce streamed data or play incoming streamed data as specified
    by the control parameters.
 
    The "control in" or "control out" messages can be out-of-band or
    sent or received interleaved with "audio in or out" data. This can
    be determined in the context (setup) of the web services.
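
    The port model described above can be pictured as a simple data
    structure. The following Python sketch is only an illustration of
    that model; the class name EngineInterface, its field names, and
    the example addresses are hypothetical and are not taken from any
    web services specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EngineInterface:
    """Ports of a speech engine exposed as a web service (illustrative)."""
    name: str
    control_in: List[str] = field(default_factory=list)   # engine context
    control_out: List[str] = field(default_factory=list)  # results, events
    audio_in: List[str] = field(default_factory=list)     # streamed input
    audio_out: List[str] = field(default_factory=list)    # streamed output
    interleaved_control: bool = False   # True if control shares audio path

    def port_count(self) -> int:
        """Total number of ports declared on this engine."""
        return (len(self.control_in) + len(self.control_out)
                + len(self.audio_in) + len(self.audio_out))


# A recognizer consumes audio and emits results; a synthesizer is
# programmed through control and produces audio. Addresses are made up.
recognizer = EngineInterface(
    name="asr",
    control_in=["http://asr.example.com/control"],
    control_out=["http://app.example.com/results"],
    audio_in=["rtp://asr.example.com:49170"],
)
synthesizer = EngineInterface(
    name="tts",
    control_in=["http://tts.example.com/control"],
    audio_out=["rtp://client.example.com:49172"],
)
print(recognizer.port_count(), synthesizer.port_count())
```

    An engine declares only the ports it needs, which matches the
    statement that an interface consists of some of the listed ports.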
 
    Speech engines and audio sub-systems are pre-programmed as web
    services and composed into more advanced services. Once programmed
    by the application / controller, audio sub-systems and engines
    await an incoming event (established audio session, etc.) to
    execute the speech processing that they have been programmed to do
    and send the results as programmed.

    Speech engines as web services are typically programmed to handle
    a particular speech processing task completely, including handling
    of possible errors. For example, a speech engine is programmed to
    perform recognition of the next incoming utterance with a
    particular grammar, to send the result to an NL parser, and to
    contact a particular error recovery process if particular errors
    occur.
 
    The following sub-sections compare each individual requirement
    against the protocol.
 
         7.2. Analysis of General Requirements
 
           7.2.1. Reuse existing protocols [5.1]
    T: Web services are a class of protocols (framework) widely
    studied and developed across numerous standards bodies like W3C,
    OASIS, WS-I, Liberty, Parlay and adapted to numerous deployment
    environment issues at IETF, OMA, 3GPP, 3GPP2, JCP, etc. As an
    entry point, we recommend consulting the work at W3C [WS1].
 
 
           7.2.2. Maintain Existing Protocol Integrity [5.2]
    T: Web services is an XML-based framework that is by definition
    extensible to support appropriate syntax and semantics.

    Web services are bound to underlying transport protocols. Numerous
    such bindings have been specified. Others are in development. By
    handling SPEECHSC at the level of the web services framework,
    integrity is maintained for:
    - the underlying transport protocols to which the web services are
    bound (e.g. SOAP)
    - the web services framework
    - web service framework
 
    This does not prevent introducing bindings to new protocols if
    needed. For example, binding to SIP or BEEP could be advantageous
    for mobile deployments.
 
 
 
 
 
           7.2.3. Avoid Duplicating Existing Protocols [5.3]
    T: By definition, the web service framework can be specified to
    remote control any web service. Specified syntax can be limited to
    avoid duplicating remote control functionalities offered by other
    protocols.
 
    At the same time, the extensibility inherent to the framework
    guarantees that it is possible to specify (standard) or define
    (application specific) remote control for other entities beyond the
    current scope of SPEECHSC.
 
    In that context and in view of unifying the remote control
    framework exposed to an application developer or a system
    integrator, it may be of interest to provide remote control syntax
    for special entities like a prompt player, etc.
 
 
           7.2.4. Protocol efficiency [5.4]
    P+ to P: Web services are by definition more verbose protocols.
    Hence, at this stage this does not qualify for a T mark.
 
    However work is in progress (e.g. OMA, JCP) to optimize the
    exchanges to handle:
    - Client with limited resources
    - Constrained bandwidth
    These rely on protocol compression and optimization, caching and
    gateways.
 
    As such the protocols qualify as P+.
 
    In addition, based on the qualification of efficiency provided in
    [WS8], the web service framework proposed for SPEECHSC and
    described in [WS1] relies on known efficient techniques:
    - Asynchronous pre-programming of the engines as web services to
    reduce exchanges and avoid race conditions
    - Possibility to piggyback on response messages if transported on
    optimized protocols like SIP or BEEP
    - State caching in the engines that are considered as stand-alone,
    pre-packaged and pre-programmed engines
    - etc.
 
 
           7.2.5. Explicit invocation of services [5.5]
    T: Web services are typically used in a client-server environment.
    Solutions also exist for peer-to-peer (service-to-service) use,
    etc.

    Web services have been designed to support clients and servers at
    least one of which is operating directly on behalf of the user
    requesting the service.

    In addition, ongoing work at OMA and JCP addresses some of these
    issues in mobile environments with the introduction of possible
    web service gateways.
 
 
           7.2.6. Server Location and Load Balancing [5.6]
    T: Web services are widely developed for e-business applications.
    Numerous tools and mechanisms have been provided for service
    discovery and advertisement. In addition, numerous offerings
    provide routing and load balancing capabilities as part of the web
    application server used to deploy the web service.
 
    Note that web services do not specify server location or load
    balancing, but they are deployed on systems that provide such
    functionalities. As web services are expected to be widely used in
    the future and central to most e-business offerings, it is to be
    expected that such tools will become even more pervasive and
    efficient.
 
 
           7.2.7. Simultaneous services [5.7]
    Web services allow control (interface) and composition of web
    services at will (e.g. WSFL).
 
           7.2.8. Multiple media sessions [5.8]
    T: The framework proposed does not presuppose how many ports or
    streams are associated with the engine. Different inbound and
    outbound streams can be used at will.
 
 
         7.3. Analysis of Duplexing and Parallel Operation Requirements
 
           7.3.1. Duplexing and Parallel Operation Requirements [9]
    T: As explained, web services allow control (interface) and
    composition of web services at will (e.g. WSFL).  Also, the
    framework does not presuppose how many ports or streams are
    associated with the engine. Different inbound and outbound streams
    can be used at will, in full duplex or even between engines, as
    supported by WSFL [WS4] and WSXL [WS7].
 
           7.3.2. Full Duplex operation [9.1.1]
    T:
 
           7.3.3. Multiple services in parallel [9.1.2]
    T:
 
           7.3.4. Combination of services
    T: As explained, web services allow control (interface) and
    composition of web services at will (e.g. WSFL) into complex
    parallel, serial or coordinated combinations, as supported by WSFL
    [WS4] and WSXL [WS7].
 
         7.4. Analysis of additional considerations (non-normative)
    The framework proposed supports:
    - Use of SDP to describe sessions and streams for the streamed
    channels.
    - Time stamps, which could be transmitted as part of the control
    messages at the web service level or in band (e.g. with dynamic
    payload switch or within the payload).
    - Any encoding scheme. This is illustrated by the work on SRF
    (Speech Recognition Framework) driven at 3GPP, which supports
    conventional and DSR optimized codecs and the possible exchange of
    speech meta-information, i.e. data that may be required to
    facilitate and enhance the server-side processing of the input
    speech and to facilitate the dialog management in an automated
    voice service. Such data may include keypad events over-riding
    spoken input, notification that the UE is in hands-free mode,
    client-side collected information (speech/no-speech, barge-in),
    etc.
    - VCR controls, which can also be supported by the SOAP over SIP
    or BEEP transport used for the framework described in section 1.
    - Real-time messaging between engine and control (e.g. via SOAP or
    XML events). The framework also supports exchanges between engines
    (same process; see also WSXL [WS7]).
 
    Although non-normative, the web service framework described
    probably deserves marks of P+ to T.
 
 
         7.5. Analysis of Security considerations
 
           7.5.1. Security Considerations [11]
    Web services are evolving to provide security, authentication,
    encryption, trust management and privacy. Details can be found,
    for example, in [WS9] and are explained in [WS10]. This is now an
    OASIS activity [WS11].
 
    This framework would enable SPEECHSC to employ the security
    mechanisms provided by WS-Security for the remote control aspects.
    Exchanged media can rely on security mechanisms at the transport /
    streaming level.
 
    The web service framework described probably deserves marks of P+
    to T.
 
         7.6. Interaction Model
    TBC : TO BE COMPLETED : Analysis of the interaction model of the
    protocol during the 'data' phase (i.e. after session
    establishment) and its suitability for speechsc.
 
 
      8. Resource Control : "MRCP" Compliance Evaluation (Sarvi
         Shanmugham)
 
         8.1. General
 
           8.1.1. MRCP Framework and General Applicability
 
    The overall MRCP framework, the components involved and their
    distribution and relationship to each other meet the framework
    specified by SPEECHSC. The primary advantage of MRCP is that it is
    a text-based protocol designed to meet most of the requirements of
    SPEECHSC pertaining to speech recognition (ASR) and text-to-speech
    (TTS). Though Speaker Recognition (SR) and Speaker Verification
    (SV) are not supported in its current form, MRCP was explicitly
    designed to be extensible for such needs. The core MRCP definition
    deals only with the control of the ASR or TTS resource and the
    commands and responses needed to achieve it.
 
    There are multiple interoperable implementations of MRCP, and
    hence it is a proven technology. It leverages existing W3C XML
    standards for the exchange of data between the client and the
    server resource. For example, it uses the W3C XML grammar format
    (GRXML) along with W3C semantic attachments and the Natural
    Language Semantic Markup Language to exchange data with the speech
    recognition resource. The W3C Speech Synthesis Markup Language
    (SSML) is used when dealing with text-to-speech engines.
 
    MRCP was designed to work as a tunneled protocol, over RTSP or
    SIP. It therefore depends on the carrier protocol to establish a
    control path and a media path between the client and the ASR or
    TTS server resource, and it gets most of the security and media
    pipe management operations for free. Once these paths are
    established, MRCP commands and responses are tunneled over the
    control path to control the ASR or TTS resource on the server.
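 
    The tunneling described above can be sketched as follows. This is
    an illustration only, not the normative MRCP encoding: the helper
    names (build_mrcp_request, tunnel_in_rtsp), the example URL and
    the session value are hypothetical, and the assumption that the
    MRCP message travels as the body of an RTSP ANNOUNCE request with
    an "application/mrcp" content type should be checked against the
    MRCP specification.

```python
CRLF = "\r\n"


def build_mrcp_request(method, request_id, headers, body=""):
    """Serialize a text-based, MRCP-style request (illustrative layout)."""
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    if body:
        # Length is counted in bytes, not characters.
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    return CRLF.join(lines) + CRLF + CRLF + body


def tunnel_in_rtsp(mrcp_message, url, cseq, session):
    """Wrap an MRCP message as the body of an RTSP ANNOUNCE request.

    Assumption: the carrier is RTSP and the body is labeled
    application/mrcp; a SIP carrier would look different.
    """
    payload = mrcp_message.encode("utf-8")
    head = CRLF.join([
        f"ANNOUNCE {url} RTSP/1.0",
        f"CSeq: {cseq}",
        f"Session: {session}",
        "Content-Type: application/mrcp",
        f"Content-Length: {len(payload)}",
    ])
    return head.encode("ascii") + (CRLF + CRLF).encode("ascii") + payload


if __name__ == "__main__":
    inner = build_mrcp_request(
        "SPEAK", 1, {"Content-Type": "text/plain"}, "Hello")
    outer = tunnel_in_rtsp(
        inner, "rtsp://media.example.com/synthesizer", 4, "12345678")
    print(outer.decode("utf-8"))
```

    The sketch makes the cost of tunneling visible: every MRCP command
    carries two sets of start lines and headers, one for the carrier
    and one for MRCP itself.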
 
           8.1.2. MRCP can be evolved
 
    MRCP directly meets many of the needs of SPEECHSC. However, the
    fact that it is a tunneled protocol prevents it from operating
    independently. Furthermore, tunneling is a less efficient protocol
    design.
 
    These issues can be addressed, and the core MRCP messages can be
    evolved either into a standalone protocol or into extensions to an
    existing protocol such as SIP or RTSP.  To make MRCP a standalone
    protocol, new session and media management messages need to be
    defined so that it can operate independently. Evolving MRCP as
    extensions to SIP or RTSP would also be relatively simple, since
    it is a text-based protocol with a message format and headers very
    similar to theirs.  This protocol evaluation assesses the
    compliance of MRCP from the perspective of evolution in one of
    these forms.
 
    The following sub-sections compare each individual requirement
    relating to resource management against the protocol.
 
         8.2. Analysis of TTS requirements
 
           8.2.1. Requesting Text Playback [6.1]
    T: MRCP has the SPEAK method, with which the client requests the
    TTS resource to play back text as an audio stream.
 
           8.2.2. Text Formats [6.2]
    T: When the client requests the TTS resource to play back a text
    stream, it can provide the content in the following formats and
    through the following mechanisms.
 
       1. Plain text.
       2. W3C XML-based Speech Synthesis Markup Language (SSML).
       3. The content to be spoken can be provided by value, directly
          through the control path.
       4. The content can also be passed by reference. This is
          achieved by placing an audio tag inside the SSML markup
          text. The referenced URL is then fetched and played on the
          RTP stream in sequence with the rest of the text, according
          to the SSML specification.
    When the client sends plain text, SSML or another format of speech
    text, the content is labeled with its MIME type. Hence the server
    knows what format the speech content is coded in and does not have
    to infer it from the content.
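 
    As an illustration of the formats and mechanisms listed above, the
    following sketch builds SPEAK requests for plain text and for
    SSML, the latter passing part of its content by reference through
    an audio tag. The helper name speak_request, the example URL and
    the MIME type chosen for SSML are assumptions made for this
    sketch; the message layout is simplified and is not the normative
    MRCP syntax.

```python
# Assumed MIME types; check the MRCP specification for the exact values.
CONTENT_TYPES = {
    "plain": "text/plain",
    "ssml": "application/synthesis+ssml",
}

# SSML whose audio element refers to content by URL (pass by reference).
SSML_BY_REFERENCE = (
    '<?xml version="1.0"?>'
    '<speak version="1.0" xml:lang="en-US">'
    "Your message follows."
    '<audio src="http://www.example.com/prompts/msg1.wav"/>'
    "</speak>"
)


def speak_request(request_id, fmt, content):
    """Build a SPEAK request carrying content by value on the control path.

    The Content-Type header tells the server how the body is coded, so
    the server never has to infer the format from the content.
    """
    body = content.encode("utf-8")
    head = (
        f"SPEAK {request_id} MRCP/1.0\r\n"
        f"Content-Type: {CONTENT_TYPES[fmt]}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


if __name__ == "__main__":
    print(speak_request(7, "plain", "Hello").decode("utf-8"))
    print(speak_request(8, "ssml", SSML_BY_REFERENCE).decode("utf-8"))
```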
 
 
 
 
 
 
 
           8.2.3. Plain text [6.2.1]
    T: see above
 
           8.2.4. SSML [6.2.2]
    T: see above
 
           8.2.5. Text in Control Channel [6.2.3]
    T: see above
 
           8.2.6. Document Type Indication [6.2.4]
    T: see above
 
           8.2.7. Control Channel [6.3]
    T: In MRCP, the Reset-Audio-Channel header defined for the ASR
    resource allows the recognizer to re-initialize the audio
    characteristics that it has learned so far. This allows a
    recognizer resource to be used for multiple recognition sessions.
    It can also be used for short, single-utterance recognitions by
    applying the Reset-Audio-Channel header to every recognition.
    Performance may suffer in that case, because the learned line
    characteristics are discarded each time, but this is a recognizer
    issue.
 
           8.2.8. Playback Controls [6.4]
    T: MRCP supports the CONTROL method, which can be used with the
    Jump-Target header to jump in time or to an exact or relative
    location. It supports jumping by paragraphs, sentences and words,
    and to specific markers that may be embedded in the speech
    content. The CONTROL method can also be used with the Voice and
    Prosody parameters, derived from SSML, to change the speed of
    speech or to increase or decrease the volume. MRCP also supports
    the PAUSE and RESUME methods to pause or resume a current SPEAK
    request.
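 
    The playback controls above imply a small state machine on the
    TTS resource. The sketch below is one possible reading of it: the
    state names and the "complete" pseudo-event (standing for the end
    of synthesis) are hypothetical and are not taken from the MRCP
    specification.

```python
# (current state, method or event) -> next state
TRANSITIONS = {
    ("idle", "SPEAK"): "speaking",
    ("speaking", "CONTROL"): "speaking",  # jump, or change voice/prosody
    ("speaking", "PAUSE"): "paused",
    ("paused", "RESUME"): "speaking",
    ("speaking", "complete"): "idle",     # hypothetical end-of-speech event
}


def next_state(state, method):
    """Return the state after applying a method, or raise ValueError."""
    try:
        return TRANSITIONS[(state, method)]
    except KeyError:
        raise ValueError(f"{method} is not allowed in state {state}") from None


if __name__ == "__main__":
    state = "idle"
    for method in ["SPEAK", "CONTROL", "PAUSE", "RESUME", "complete"]:
        state = next_state(state, method)
        print(method, "->", state)
```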
 
           8.2.9. Session Parameters [6.5]
    T: As mentioned in the previous section, MRCP supports voice and
    prosody parameters that are derived directly from the W3C SSML
    specification. These headers can be sent using the SET-PARAMS
    method and applied as defaults for the entire session. They can
    also be sent in a SPEAK request to apply to that request only, or
    in a CONTROL message to change the parameters of an active SPEAK
    request.
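 
    The three scopes described above (session defaults, per-request
    values, and changes to an active request) can be modeled as
    follows. The class SynthSession and the parameter names
    voice_gender and prosody_rate are hypothetical stand-ins for the
    MRCP voice and prosody headers; the sketch only shows the assumed
    order of precedence.

```python
class SynthSession:
    """Toy model of parameter scoping on a TTS session."""

    def __init__(self):
        self.defaults = {}   # set by SET-PARAMS, valid for the session
        self.active = None   # parameters of the SPEAK request in progress

    def set_params(self, **params):
        """SET-PARAMS: change the session-wide defaults."""
        self.defaults.update(params)

    def speak(self, **overrides):
        """SPEAK: per-request values take precedence over the defaults."""
        self.active = {**self.defaults, **overrides}
        return dict(self.active)

    def control(self, **changes):
        """CONTROL: modify the active request only, not the defaults."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        self.active.update(changes)
        return dict(self.active)


if __name__ == "__main__":
    session = SynthSession()
    session.set_params(voice_gender="female", prosody_rate="medium")
    print(session.speak(prosody_rate="fast"))
    print(session.control(prosody_rate="slow"))
    print(session.speak())  # the next request sees the defaults again
```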
 
           8.2.10. Speech Markers [6.6]
    T: Specifying speech markers in the content is supported through
    SSML. The CONTROL message can then be used to jump to specific
    marker points in the text. Also, when the TTS resource reaches a
    marker in the text, the server sends a SPEECH-MARKER event to the
    client.
 
         8.3. Analysis of ASR requirements
 
           8.3.1. Requesting Automatic Speech Recognition [7.1]
    T: The client uses the RECOGNIZE method in MRCP to request the
    recognition resource to process the audio stream in the pipe. The
    RECOGNIZE method also specifies parameters and grammars the
    recognizer should match against.
 
 
 
 
 
 
           8.3.2. XML [7.2]
    T: Like the TTS resource in MRCP, the ASR resource uses XML data
    to exchange information between the client and the recognition
    resource. It supports the W3C XML grammar format (GRXML) to pass
    grammars from the client to the server. When the server has
    finished recognizing, it uses the W3C Natural Language Semantic
    Markup Language (NLSML) to pass the results back to the client.
    Other grammar formats are supported as well, as long as the server
    allows them. This is possible because MIME types are used to
    package the data, so the format type is always specified.
 
           8.3.3. Grammar Specification [7.3.1]
    P+: MRCP supports specifying the grammar both by value and by
    reference. The RECOGNIZE method can carry grammar content and/or a
    URI referring to the grammar content. Since MRCP supports
    referring to a grammar by URI, the referenced grammar could be
    located on the server itself. With respect to sharing of grammars,
    the grammars defined or compiled through the DEFINE-GRAMMAR
    primitive are not sharable across sessions on the same server.
    This needs to be addressed to meet this set of requirements in
    full.
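 
    The two ways of specifying a grammar can be sketched as follows.
    For simplicity the hypothetical helper recognize_request carries
    either inline grammar content or a single URI, although the text
    above allows both in one request. The MIME types used
    ("application/grammar+xml" for an inline XML grammar and
    "text/uri-list" for a reference) and the message layout are
    assumptions for illustration, not normative MRCP syntax.

```python
def recognize_request(request_id, grammar_xml=None, grammar_uri=None):
    """Build a RECOGNIZE request with a grammar by value or by reference."""
    if (grammar_xml is None) == (grammar_uri is None):
        raise ValueError("give exactly one of grammar_xml or grammar_uri")
    if grammar_xml is not None:
        # By value: the grammar document itself is the message body.
        content_type = "application/grammar+xml"
        body = grammar_xml.encode("utf-8")
    else:
        # By reference: the body only names where the grammar lives,
        # which may be a location on the server itself.
        content_type = "text/uri-list"
        body = (grammar_uri + "\r\n").encode("utf-8")
    head = (
        f"RECOGNIZE {request_id} MRCP/1.0\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


if __name__ == "__main__":
    inline = '<?xml version="1.0"?><grammar root="yesno"></grammar>'
    print(recognize_request(1, grammar_xml=inline).decode("utf-8"))
    print(recognize_request(
        2, grammar_uri="http://www.example.com/grammars/yesno.grxml"
    ).decode("utf-8"))
```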
 
           8.3.4. Explicit Indication of Grammar Format [7.3.2]
    P+: see above
 
           8.3.5. Grammar Sharing [7.3.3]
    TBD
 
           8.3.6. Session Parameters [7.4]
    T: This requirement as defined is already fully met since MRCP is
    the referred standard for compliance.
 
           8.3.7. Input Capture [7.5]
    T: This is achieved by setting the Waveform-url header in the
    RECOGNIZE method. The server then records the audio of the
    recognition and returns a URI to the client in the completion
    event, which can be used to retrieve or play back the audio.
 
         8.4. Analysis of Speaker Identification and Verification
            Requirements
 
           8.4.1. Requesting SI/SV [8.1]
    F: not supported
 
           8.4.2. Identifiers for SI/SV [8.2]
    F: not supported
 
           8.4.3. State for multiple utterances [8.3]
    F: not supported
 
           8.4.4. Input Capture [8.4]
    F: not supported
 
           8.4.5. SI/SV functional extensibility [8.5]
    F: not supported
 
 
 
 
 
 
 
      9. Resource Control : "RTSP" Compliance Evaluation (Brian Wyld)
 
         9.1. General Introduction
    RTSP is an existing protocol, oriented towards audio playback and
    recording. As such, it has support for RTP session control, with
    SDP used for session description, and a message set allowing
    operation as a player/recorder with audio "VCR" controls. This
    comparison addresses only the existing resource control elements
    and their applicability to the speechsc requirements.
 
    The current PLAY state machine is exactly as required for TTS
    operation. Although by analogy RECORD could initiate an ASR
    session, with headers giving the grammar source or references, its
    state machine is not nearly as compatible with ASR, and is not
    compatible at all with SV/SI.
 
         9.2. Analysis of TTS requirements
 
           9.2.1. Requesting Text Playback [6.1]
    P+: the RTSP PLAY message semantics would require minor extensions
 
           9.2.2. Text Formats [6.2]
    P+: Text can be defined as all text types.
 
           9.2.3. Plain text [6.2.1]
    T: Plain text may be carried directly in the message payload.
 
           9.2.4. SSML [6.2.2]
    T: Text may be in any format.
 
           9.2.5. Text in Control Channel [6.2.3]
    T: Text may be attached to the control messages.
 
           9.2.6. Document Type Indication [6.2.4]
    T: Via the Content-Type header
 
           9.2.7. Control Channel [6.3]
    T: RTSP sessions may use a private or shared TCP connection.
 
           9.2.8. Playback Controls [6.4]
    T: RTSP defines playback control messages and a state machine.
 
           9.2.9. Session Parameters [6.5]
    T: RTSP defines operations for session parameter control.
 
           9.2.10. Speech Markers [6.6]
    P+: Markers may be inserted in the text, but providing the
    required asynchronous events when a marker is synthesized will
    require the use of specific ANNOUNCE-type messages for
    server-to-client notification.
 
         9.3. Analysis of ASR requirements
 
           9.3.1. Requesting Automatic Speech Recognition [7.1]
    P: The RECORD message and semantics could be used but would
    require extensions (stretching the current semantics quite a lot).
 
 
 
 
           9.3.2. XML [7.2]
    P+: Text can be defined as all text types.
 
           9.3.3. Grammar Specification [7.3.1]
    P+: Text can be defined as all text types.
 
           9.3.4. Explicit Indication of Grammar Format [7.3.2]
    T: Via the Content-Type headers
 
           9.3.5. Grammar Sharing [7.3.3]
    F: TBD
 
           9.3.6. Session Parameters [7.4]
    T: RTSP defines operations for session parameter control.
 
           9.3.7. Input Capture [7.5]
    P+: would require addition of a header to the initiation message.
 
         9.4. Analysis of Speaker Identification and Verification
            Requirements
 
           9.4.1. Requesting SI/SV [8.1]
    F: not supported
 
           9.4.2. Identifiers for SI/SV [8.2]
    F: not supported
 
           9.4.3. State for multiple utterances [8.3]
    F: not supported
 
           9.4.4. Input Capture [8.4]
    F: not supported
 
           9.4.5. SI/SV functional extensibility [8.5]
    F: not supported
 
 
      10.  Security Considerations
 
    Security considerations for the SPEECHSC protocol are covered by
    the comparison against the specific Security requirements in the
    SPEECHSC requirements document [1].
 
      11.  References
    [1] Oran, D., "Requirements for Distributed Control of ASR, SI/SV
    and TTS Resources", draft-ietf-speechsc-reqts-04, June 6, 2003,
    work in progress.
 
    [2] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
    Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP:
    Session Initiation Protocol", RFC 3261, June 2002. (Obsoletes
    RFC 2543)
 
    [WS1] W3C Web Services, http://www.w3c.org/2002/ws/
 
    [WS2] Simple Object Access Protocol (SOAP),
    http://www.w3c.org/2002/ws/
 
    [WS3] Web Services Description Language (WSDL 1.1), W3C Note 15
    March 2001, http://www.w3.org/TR/wsdl.
 
    [WS4] Leymann, F., Web Service Flow Language, WSFL 1.0, May 2001,
    http://www-4.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
 
    [WS5] UDDI, http://www.uddi.org/specification.html
 
    [WS6] W3C Voice Activity, http://www.w3c.org/Voice/
 
    [WS7] WSXL - Web Service eXperience Language, submitted to OASIS
    WSIA and WSRP.
 
    [WS8] Requirements for Distributed Control of ASR, SI/SV and TTS
    Resources, draft-ietf-speechsc-reqts-01.txt
 
    [WS9] Security in a Web Services World: A Proposed Architecture
    and Roadmap, April 7, 2002, Version 1.0,
    http://www.verisign.com/wss/wss.pdf
 
    [WS10] Kapil Apshankar, WS-Security, Security for Web Services,
    http://www.webservicesarchitect.com/content/articles/apshankar04.asp
 
    [WS11] OASIS Web Services Security TC,
    http://www.oasis-open.org/committees/wss/
 
 Author's Address
 
    Brian Wyld
    Eloquant SA
    ZA Malvaisin                   Phone:  +33 476 77 46 92
    Le Versoud, France             Email:  brian.wyld@eloquant.com
 
 
 Full Copyright Statement
 
    Copyright (C) The Internet Society (2002).  All Rights Reserved.
 
    This document and translations of it may be copied and furnished to
    others, and derivative works that comment on or otherwise explain
    it or assist in its implementation may be prepared, copied,
    published and distributed, in whole or in part, without
    restriction of any kind, provided that the above copyright notice
    and this paragraph are included on all such copies and derivative
    works.  However, this document itself may not be modified in any
    way, such as by removing the copyright notice or references to the
    Internet Society or other Internet organizations, except as needed
    for the purpose of developing Internet standards in which case the
    procedures for copyrights defined in the Internet Standards
    process must be followed, or as required to translate it into
    languages other than English.  The limited permissions granted
    above are perpetual and will not be revoked by the Internet
    Society or its successors or assigns.  This document and the
    information contained herein is provided on an "AS IS" basis and
    THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE
    DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
    LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
    WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
 
 
 
 