Internet Draft B. Wyld
Document: draft-ietf-speechsc-protocol-eval-02 Editor
Expires: October 2003 Eloquant
Version 02 June 2003
SPEECHSC Protocol Evaluation
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document is the Protocol Evaluation Document for the SPEECHSC
Working Group. Section 3 summarizes the individual protocol
comparisons (Sections 4-9) against the SPEECHSC requirements [1].
Table of Contents
1. Overview
2. Protocol Proposals
3. Protocol Evaluation Summary and Conclusion
4. Session Control : "BEEP" Compliance Evaluation (Jerry Carter)
4.1. General notes:
4.2. Analysis of General Requirements
4.3. Analysis of Duplexing and Parallel Operation Requirements
4.4. Analysis of additional considerations (non-normative)
4.5. Analysis of Security considerations
4.6. Interaction Model
5. Session Control : "SIP" Compliance Evaluation (Rajiv
Dharmadhikari)
5.1. Introduction
5.2. Analysis of General Requirements
5.3. Analysis of Duplexing and Parallel Operation Requirements
5.4. Analysis of additional considerations (non-normative)
5.5. Analysis of Security considerations
5.6. Other Criteria
5.7. Interaction Model
6. Session Control : "RTSP" Compliance Evaluation (Brian Wyld)
6.1. General Introduction
Wyld Expires - October 2003 [Page 1]
SPEECHSC Protocol Evaluation Template June 2003
6.2. Analysis of General Requirements
6.3. Analysis of Duplexing and Parallel Operation Requirements
6.4. Analysis of additional considerations (non-normative)
6.5. Analysis of Security considerations
6.6. Interaction Model
7. Session Control : "Web Services" Compliance Evaluation
(Stephane H. Maes)
7.1. General Notes:
7.2. Analysis of General Requirements
7.3. Analysis of Duplexing and Parallel Operation Requirements
7.4. Analysis of additional considerations (non-normative)
7.5. Analysis of Security considerations
7.6. Interaction Model
8. Resource Control : "MRCP" Compliance Evaluation (Sarvi
Shanmugham)
8.1. General
8.2. Analysis of TTS requirements
8.3. Analysis of ASR requirements
8.4. Analysis of Speaker Identification and Verification
Requirements
9. Resource Control : "RTSP" Compliance Evaluation (Brian Wyld)
9.1. General Introduction
9.2. Analysis of TTS requirements
9.3. Analysis of ASR requirements
9.4. Analysis of Speaker Identification and Verification
Requirements
10. Security Considerations
11. References
1. Overview
This document provides the template for the content for the
SPEECHSC Protocol Evaluation document.
This section will contain an overview of the process.
Section 2 contains a list of the proposed protocols submitted to
the WG. These protocols fall into three different categories:
- a generic service/resource access system (Web Services)
- "classic" session control protocols for media channels (BEEP,
SIP, RTSP), of which RTSP also provides resource control.
- a specific ASR/TTS resource control message set designed to
be tunneled in a session control protocol (MRCP)
Section 3 provides a conclusion of the evaluations of the proposed
protocols against the Requirements and framework, and recommends a
direction for the creation of the speechsc protocol.
Sections 4-7 provide the individual protocol evaluations for the
protocols providing session control against the SPEECHSC
requirements [1] that relate to these needs.
Sections 8 and 9 provide individual protocol evaluations for
protocols providing resource control against the SPEECHSC resource
control requirements.
2. Protocol Proposals
This section contains a list of the existing protocols submitted to
the SPEECHSC WG for consideration by the deadline.
1. BEEP
2. SIP
3. RTSP
4. MRCP (initial submission)
5. Web Services
Each protocol section contains a review of the protocol's level of
compliance with each of the SPEECHSC Requirements [1] as derived
from the proposed protocol documents. The following key is used to
identify the level of compliance of each of the individual
protocols:
T = Total Compliance. Meets the requirement fully.
P = Partial Compliance. Meets some aspect of the requirement.
P+ = Compliance possible. Could meet the requirement with "natural"
evolution of the protocol.
F = Failed Compliance. Does not meet the requirement.
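As an illustration of how this key can be used to compare
protocols, the following sketch tallies the grades given to the
General Requirements [5.1] to [5.8] in sections 4.2, 5.2 and 6.2 of
this document. The grade lists are copied from those sections; the
names Grade, GENERAL_GRADES and summarize are hypothetical and exist
only for this example.

```python
from collections import Counter
from enum import Enum


class Grade(Enum):
    """Compliance key used throughout this document."""
    TOTAL = "T"
    PARTIAL = "P"
    POSSIBLE = "P+"
    FAILED = "F"


# Grades for General Requirements [5.1]..[5.8], in order, as recorded
# in sections 4.2 (BEEP), 5.2 (SIP) and 6.2 (RTSP).
GENERAL_GRADES = {
    "BEEP": ["T", "P", "T", "P+", "T", "P+", "T", "F"],
    "SIP": ["T", "T", "T", "T", "T", "P+", "T", "T"],
    "RTSP": ["T", "T", "T", "T", "T", "F", "T", "T"],
}


def summarize(table):
    """Return, per protocol, how many requirements received each grade."""
    summary = {}
    for protocol, grades in table.items():
        # Grade(g) raises ValueError for a symbol outside the key.
        counts = Counter(Grade(g) for g in grades)
        summary[protocol] = {grade.value: n for grade, n in counts.items()}
    return summary


if __name__ == "__main__":
    for protocol, counts in summarize(GENERAL_GRADES).items():
        print(protocol, counts)
```

A simple count such as this ignores the relative weight of each
requirement, so it is only a rough aid to reading the per-protocol
sections.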
3. Protocol Evaluation Summary and Conclusion
In summary, it appears that the decomposition of the problem into
session control and resource control gives rise to:
- SIP is a good fit for session control for the speechsc
requirements
- MRCP is a good start for resource control for the speechsc
requirements
The conclusion for the protocol evaluation is therefore to:
- create a new speechsc-specific, resource-control-only protocol,
based on MRCP
- use sessions that have been established by SIP (and define a
means of referring to these sessions)
4. Session Control : "BEEP" Compliance Evaluation (Jerry Carter)
4.1. General notes:
The BEEP protocol provides a general framework for establishing
connections, defining new channels, negotiating security, and
performing user authentication. Protocols built on BEEP must define
a profile detailing how connections are established and must define
a set of messages which will be delivered using BEEP. The protocol
is peer-to-peer, although client-server style requests could be
easily handled.
The following sub-sections compare each individual requirement
against the protocol.
4.2. Analysis of General Requirements
4.2.1. Reuse existing protocols [5.1]
T: BEEP is a published protocol, specified in RFC 3080
(http://www.ietf.org/rfc/rfc3080.txt).
4.2.2. Maintain Existing Protocol Integrity [5.2]
P: BEEP assumes that protocols, such as SpeechSC, will add
messages.
Supporting multiple clients using TCP may require some effort.
4.2.3. Avoid Duplicating Existing Protocols [5.3]
T: Building SpeechSC over BEEP would allow the specification to
focus on managing the ASR, media server, and SI/SV resources and
the possible interactions between them. The operations for
establishing connections and defining new channels would be handled
by BEEP.
4.2.4. Protocol efficiency [5.4]
P+: BEEP imposes a small overhead (roughly 40 bytes per message).
It provides a mechanism for supporting multiple communication
channels over a single port. If grouping of requests is desired,
this would need to be handled by grouping the SpeechSC messages.
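To make the framing cost concrete, the following sketch builds a
single BEEP data frame using the header and trailer layout of RFC
3080 (type, channel, message number, continuation indicator,
sequence number and payload size, followed by the payload and an
"END" trailer). The payload shown is a hypothetical SpeechSC
message, and the sketch does not model MIME entity headers, which
would add to the per-message cost.

```python
def beep_frame(frame_type, channel, msgno, seqno, payload, more=False):
    """Wrap payload bytes in a BEEP frame header and trailer."""
    # "." marks the last frame of a message, "*" an intermediate one.
    continuation = "*" if more else "."
    header = "%s %d %d %s %d %d\r\n" % (
        frame_type, channel, msgno, continuation, seqno, len(payload))
    return header.encode("ascii") + payload + b"END\r\n"


def framing_overhead(frame, payload):
    """Number of octets added around the payload by the framing."""
    return len(frame) - len(payload)


if __name__ == "__main__":
    # Hypothetical SpeechSC request body; not defined by any draft.
    body = b"<recognize/>"
    frame = beep_frame("MSG", 1, 0, 0, body)
    print(frame)
    print("framing overhead in octets:", framing_overhead(frame, body))
```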
4.2.5. Explicit invocation of services [5.5]
T: Though it is primarily a peer-to-peer protocol, BEEP may act as
a traditional client server protocol.
4.2.6. Server Location and Load Balancing [5.6]
P+: This functionality is not provided by BEEP. This would need to
be added as an extension.
4.2.7. Simultaneous services [5.7]
T: Multiple channels providing different services are possible.
Each service is simply a message type which is passed to the server
using BEEP.
4.2.8. Multiple media sessions [5.8]
F: BEEP assumes a 1:1 connection using TCP/IP.
4.3. Analysis of Duplexing and Parallel Operation Requirements
4.3.1. Duplexing and Parallel Operation Requirements [9]
P+: Parallel operations may be obtained using multiple channels. A
message on one channel could potentially interrupt activity
happening on the second. BEEP is very flexible allowing the server
to implement whatever behavior is desired.
4.3.2. Full Duplex operation [9.1.1]
T: BEEP is a peer-to-peer protocol allowing full duplex
communication on a single channel or parallel communication on
multiple channels.
4.3.3. Multiple services in parallel [9.1.2]
P+: Multiple services may be run on separate channels. Merging or
T-ing of RTP must be implemented by the server.
4.3.4. Combination of services
TBD
4.4. Analysis of additional considerations (non-normative)
TBD
4.5. Analysis of Security considerations
4.5.1. Security Considerations [11]
P+: BEEP offers a mechanism for managing security and user
authentication.
SpeechSC requires managing multiple data streams and some form of
unified authentication / security might be a goal. If so, BEEP
security should be revisited with this in mind.
4.6. Interaction Model
TBC : TO BE COMPLETED : Analysis of the interaction model of the
protocol during the "data" phase (i.e. after session establishment)
and its suitability for speechsc.
5. Session Control : "SIP" Compliance Evaluation (Rajiv
Dharmadhikari)
5.1. Introduction
SIP is a protocol for initiating, modifying, and terminating
multimedia sessions. The protocol is considered an IETF standard
and its specification can be found in [2]. The following sections
provide a general statement regarding the applicability of SIP as
the control protocol for SPEECHSC.
5.1.1. SIP General Applicability
SIP is a mature, well understood, and frequently used session
establishment protocol. It has gone through multiple revisions in
the IETF standards process. A number of commercial and public
domain implementations of SIP are available. Because SIP closely
resembles HTTP and is a text based protocol, a large number of SIP
application developers are available.
5.1.2. SIP Use in VOIP environment
SIP is already being used to establish and redirect RTP streams
from various end points. SPEECHSC requires a protocol for
controlling ASR, TTS and SV resources. When these resources are
deployed in a VOIP network that requires them to process media
carried in RTP, SIP is used in many deployments. Rather than
inventing a new control protocol and introducing the operational
aspects of that new protocol, SIP can be reused for controlling
SPEECHSC resources.
5.2. Analysis of General Requirements
5.2.1. Reuse existing protocols [5.1]
T: SIP is an existing, widely used, and mature protocol defined in
[2].
5.2.2. Maintain Existing Protocol Integrity [5.2]
T: Existing SIP methods and header fields will not be changed when
SIP is used to control SPEECHSC resources. If extensions are
required, SIP allows a custom payload to be carried in the body.
This payload is understood only by the UAs and does not impact
protocol integrity.
5.2.3. Avoid Duplicating Existing Protocols [5.3]
T: Many of the requirements for SPEECHSC operation can easily be
satisfied by SIP, e.g. establishing RTP streams or redirecting
them. Without SIP, a new SPEECHSC protocol would have to duplicate
much of this session management functionality.
5.2.4. Protocol efficiency [5.4]
T: SIP is a very lightweight protocol when run over TCP or UDP. It
leverages efficiency available in TCP and UDP protocols that have
been around for over 20 years.
5.2.5. Explicit invocation of services [5.5]
T: The SIP URI mechanism allows invocation of different services.
5.2.6. Server Location and Load Balancing [5.6]
P+: SIP employs standard DNS name resolution for locating
resources. SIP itself does not provide load balancing features.
Application level load balancers can be used to load balance SIP
requests.
5.2.7. Simultaneous services [5.7]
T: SIP allows simultaneous invocation of different services. SIP
allows forking or splitting the same media stream to different end
points as defined in [2].
5.2.8. Multiple media sessions [5.8]
T: SIP uses SDP to describe RTP stream characteristics. This allows
the control of direction of RTP stream such as bi-directional or
uni-directional. SIP allows a UA to establish sessions with
multiple UAs for the same session.
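As an illustration of how SDP expresses stream direction, the
following sketch builds a minimal SDP description for one audio
stream. The direction attributes (sendrecv, sendonly, recvonly,
inactive) and the static payload type 0 for PCMU are standard SDP
and RTP/AVP usage; the function name, the address and the port are
assumptions made for this example only.

```python
VALID_DIRECTIONS = ("sendrecv", "sendonly", "recvonly", "inactive")


def sdp_audio_description(host_ip, rtp_port, direction="sendrecv",
                          session_id=0):
    """Return a minimal SDP body describing one G.711 mu-law stream."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError("unknown direction: %r" % (direction,))
    lines = [
        "v=0",
        "o=- %d %d IN IP4 %s" % (session_id, session_id, host_ip),
        "s=-",
        "c=IN IP4 %s" % host_ip,
        "t=0 0",
        "m=audio %d RTP/AVP 0" % rtp_port,
        "a=rtpmap:0 PCMU/8000",
        # The direction attribute makes the stream bi-directional or
        # uni-directional from the point of view of the sender of
        # this description.
        "a=%s" % direction,
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # A recognizer only needs to receive audio (hypothetical values).
    print(sdp_audio_description("192.0.2.10", 49170, "recvonly"))
```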
5.3. Analysis of Duplexing and Parallel Operation Requirements
5.3.1. Duplexing and Parallel Operation Requirements [9]
T: A SPEECHSC resource is a SIP UA that can handle session requests
from different UAs.
5.3.2. Full Duplex operation [9.1.1]
T: Each SIP UA consists of a UAC and a UAS. This allows for full
duplex operation.
5.3.3. Multiple services in parallel [9.1.2]
T: SIP allows simultaneous invocation of different services. SIP
allows forking or splitting the same media stream to different end
points as defined in [2].
5.3.4. Combination of services
T: See 5.6.3. SIP UA can invoke different services and combine the
results.
5.4. Analysis of additional considerations (non-normative)
TBD
5.5. Analysis of Security considerations
5.5.1. Security Considerations [11]
T: SIP employs different authentication schemes that are widely
used in IP based protocols.
5.6. Other Criteria
The following criteria were also defined by the evaluator of SIP.
5.6.1. Ability to establish session between SPEECHSC client
and SPEECHSC resource
T: A SIP User Agent can establish a session with another SIP User
Agent.
5.6.2. Ability to terminate session by either SPEECHSC
client or SPEECHSC resource
T: A SIP User Agent can terminate a session with another SIP User
Agent.
5.6.3. Support reliable sequencing and delivery between
SPEECHSC client and SPEECHSC resource
P: SIP can be run over TCP or UDP. When run over TCP, this
requirement is easily satisfied. When run over UDP, the SIP User
Agent is required to implement logic to ensure reliable sequencing
and delivery.
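The kind of logic required over UDP can be illustrated with the
INVITE client transaction timers of RFC 3261: the request is
retransmitted at intervals that start at T1 (500 ms by default) and
double each time, and the transaction is abandoned after 64*T1. The
sketch below only computes that schedule; the function name is an
assumption for this example, and a real user agent would also stop
retransmitting as soon as a response arrives.

```python
T1_SECONDS = 0.5  # default round-trip time estimate in RFC 3261


def invite_retransmit_schedule(t1=T1_SECONDS):
    """Return (retransmit_times, timeout), in seconds after first send.

    Models Timer A (doubling retransmit interval) and Timer B
    (transaction timeout at 64*T1) for an unanswered INVITE over an
    unreliable transport.
    """
    timeout = 64 * t1
    times = []
    interval = t1
    elapsed = interval
    while elapsed < timeout:
        times.append(elapsed)
        interval *= 2          # Timer A doubles after every firing
        elapsed += interval
    return times, timeout


if __name__ == "__main__":
    times, timeout = invite_retransmit_schedule()
    print("retransmit at:", times)
    print("give up at:", timeout)
```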
5.6.4. Ability for SPEECHSC client to coordinate SPEECHSC
resources on different machines for a single session
T: A SPEECHSC client can use SIP to establish SIP sessions with
different machines.
5.6.5. Ability for SPEECHSC resource to handle multiple
SPEECHSC clients
T: A SPEECHSC resource is a SIP UA that can handle session requests
from different UAs.
5.6.6. The SPEECHSC resource should be able to generate
asynchronous events or unsolicited messages
T: SIP allows asynchronous events or unsolicited messages to be
generated using SUBSCRIBE/NOTIFY mechanism.
5.6.7. The SPEECHSC client and resource should have ability
for authenticating each other
T: SIP employs different authentication schemes that are widely
used in IP based protocols.
5.6.8. Ability to determine success or failure from both
SPEECHSC client and SPEECHSC resource side
T: The protocol defines response classes that indicate the outcome
of a request: 2xx (e.g. 200) for success, 3xx for redirection, and
4xx, 5xx, and 6xx for failure.
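A client or resource can derive the outcome of a request from the
first digit of the status code. The sketch below follows the
response classes defined for SIP in RFC 3261 (1xx provisional, 2xx
success, 3xx redirection, 4xx client error, 5xx server error, 6xx
global failure); the function names are assumptions made for this
example.

```python
SIP_RESPONSE_CLASSES = {
    1: "provisional",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
    6: "global failure",
}


def sip_response_class(status_code):
    """Map a SIP status code (100-699) to its response class name."""
    if not 100 <= status_code <= 699:
        raise ValueError("not a SIP status code: %r" % (status_code,))
    return SIP_RESPONSE_CLASSES[status_code // 100]


def request_succeeded(status_code):
    """True for 2xx, False for 3xx-6xx, None while only 1xx was seen."""
    response_class = sip_response_class(status_code)
    if response_class == "provisional":
        return None  # no final response yet
    return response_class == "success"


if __name__ == "__main__":
    for code in (180, 200, 302, 486, 503, 603):
        print(code, sip_response_class(code), request_succeeded(code))
```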
5.6.9. Support for versioning between SPEECHSC client and
SPEECHSC resource
P+: This will require an additional header or element in the
body of the SIP message for versioning. The current version field
is intended for the SIP protocol version.
5.7. Interaction Model
Speechsc has certain needs related to the interaction model of the
protocol during the "data" phase (i.e. after session
establishment). Specifically, speechsc will require that the
resource server can send unsolicited messages/transactions to the
resource client to return results and indicate events.
SIP messages in the data phase can flow in both directions (client
to server as well as server to client). The SIP INFO message can be
used for this purpose, as it is intended for mid-call message
semantics. With this message, transactions can be
initiated/defined by both ends.
SIP therefore has an interaction model suited to the speechsc
model, which supports peer-peer messaging with a basic
transactional symmetrical request/response model.
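As a rough illustration of a server-initiated mid-session message,
the following sketch assembles a SIP INFO request using the usual
SIP request layout (request line, header fields, blank line, body).
The addresses, tags, Call-ID, branch value and the body text are all
hypothetical; no speechsc payload format is defined by this
document.

```python
def sip_info_request(request_uri, from_header, to_header, call_id,
                     cseq, via, body, content_type="text/plain"):
    """Build the bytes of an in-dialog SIP INFO request."""
    payload = body.encode("utf-8")
    lines = [
        "INFO %s SIP/2.0" % request_uri,
        "Via: %s" % via,
        "Max-Forwards: 70",
        "From: %s" % from_header,
        "To: %s" % to_header,
        "Call-ID: %s" % call_id,
        "CSeq: %d INFO" % cseq,
        "Content-Type: %s" % content_type,
        # Content-Length counts the octets of the body only.
        "Content-Length: %d" % len(payload),
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8") + payload


if __name__ == "__main__":
    # The resource (server side) notifies the client of a result.
    message = sip_info_request(
        request_uri="sip:client@192.0.2.1",
        from_header="<sip:asr@192.0.2.2>;tag=res1",
        to_header="<sip:client@192.0.2.1>;tag=cli1",
        call_id="a84b4c76e66710",
        cseq=2,
        via="SIP/2.0/TCP 192.0.2.2;branch=z9hG4bK776asdhds",
        body="RECOGNITION-COMPLETE",  # hypothetical event text
    )
    print(message.decode("utf-8"))
```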
6. Session Control : "RTSP" Compliance Evaluation (Brian Wyld)
6.1. General Introduction
RTSP is an existing protocol, orientated towards audio playback and
recording. As such, it has support for RTP session control, with
SDP used for session description, and a message set allowing
operation as a player/recorder with audio "VCR" controls.
Only session control is evaluated here (see Section 9 for the
evaluation of the resource control elements).
The following sub-sections compare each individual requirement
against the protocol.
6.2. Analysis of General Requirements
6.2.1. Reuse existing protocols [5.1]
T: RTSP/RTP/SDP would be reused.
6.2.2. Maintain Existing Protocol Integrity [5.2]
T: The extensions to RTSP to allow speechsc use would be in the
spirit of the protocol, and would not break existing servers or
clients.
6.2.3. Avoid Duplicating Existing Protocols [5.3]
T: Using RTSP would not recreate it.
6.2.4. Protocol efficiency [5.4]
T: RTSP is a text based protocol, but is relatively succinct as
messages are specific to their operation.
6.2.5. Explicit invocation of services [5.5]
T: RTSP service invocation is sufficient.
6.2.6. Server Location and Load Balancing [5.6]
F: RTSP does not address this topic; however it can be used with
other protocols such as SLP (IETF) or UDDI (OASIS) to do so.
6.2.7. Simultaneous services [5.7]
T: RTSP allows simultaneous invocation of services on the same or
different control channel.
6.2.8. Multiple media sessions [5.8]
T: RTSP allows multiple media sessions.
6.3. Analysis of Duplexing and Parallel Operation Requirements
6.3.1. Duplexing and Parallel Operation Requirements [9]
T: RTSP allows session setup that should fulfill these
requirements.
6.3.2. Full Duplex operation [9.1.1]
T: RTSP can create a full duplex session.
6.3.3. Multiple services in parallel [9.1.2]
T: RTSP can request multiple operations of the same type on the
same session.
6.3.4. Combination of services
T: RTSP can request multiple operations of different types on the
same session.
6.4. Analysis of additional considerations (non-normative)
TBD
6.5. Analysis of Security considerations
6.5.1. Security Considerations [11]
F: RTSP provides no specific security functionality at all, but
depends on other IETF security protocols (as it uses TCP) to pre-
validate and protect the sessions.
6.6. Interaction Model
Speechsc has certain needs related to the interaction model of the
protocol during the "data" phase (i.e. after session establishment).
Specifically, speechsc will require that the resource server can
send unsolicited messages/transactions to the resource client to
return results and indicate events.
RTSP messages in the data phase can flow in both directions (client
to server as well as server to client). Transactions can be
initiated/defined by both ends. Currently most of the defined
transactions are client-to-server; however there already exists an
ANNOUNCE message transaction that is used to carry general content
in both directions (and is in fact used by MRCP to transport its
resource control messages).
RTSP has therefore an interaction model suited to the speechsc
model, which supports peer-peer messaging with a basic
transactional symmetrical request/response model.
7. Session Control : "Web Services" Compliance Evaluation
(Stephane H. Maes)
7.1. General Notes:
Speech engines (speech recognition, speaker recognition, speech
synthesis, recorders and playback, NL parsers, and any other speech
processing engines such as speech detection or barge-in detection),
as well as audio sub-systems (audio input and output sub-systems),
can be considered as web services. They can be described and
asynchronously programmed via WSDL (on top of SOAP), combined in a
flow described via WSFL, discovered via UDDI, and asynchronously
controlled via SOAP, which also enables asynchronous exchanges
between the engines.
This solution has the advantage of providing flexibility,
scalability and extensibility while reusing an existing framework
that fits the evolution of the web: web services and XML protocols
[WS1].
According to the web services framework, speech engines (audio
sub-systems, engines, speech processors) can be defined as web
services that are characterized by an interface that consists of
some of the following ports:
- "control in" port(s): It sets the engine context, i.e. all the
settings required for a speech engine to run. It may include
addresses where to get or send the streamed audio or results.
- "control out" port(s): It produces the non-audio engine output
(i.e. results and events). It may also involve some session
control exchanges.
- "audio in" port(s): It receives streamed input data.
- "audio out" port(s): It produces streamed output data.
Audio sub-systems can also be treated as web services that can
produce streamed data or play incoming streamed data as specified
by the control parameters.
The "control in" or "control out" messages can be out-of-band or
sent or received interleaved with "audio in or out" data. This can
be determined in the context (setup) of the web services.
Speech engines and audio sub-systems are pre-programmed as web
services and composed into more advanced services. Once programmed
by the application / controller, audio sub-systems and engines
await an incoming event (established audio session, etc.) to
execute the speech processing that they have been programmed to do
and send the results as programmed.
Speech engines as web services are typically programmed to handle a
particular speech processing task completely, including handling of
possible errors. For example, a speech engine is programmed to
perform recognition of the next incoming utterance with a
particular grammar, to send the result to an NL parser and to
contact a particular error recovery process if particular errors
occur.
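The pre-programming idea in the example above can be sketched as a
small data structure that is handed to the engine once and then
consulted when the awaited event arrives. Every name, field and URI
below is hypothetical; in the framework described here the
programming would be expressed through WSDL/SOAP exchanges rather
than through a Python object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineProgram:
    """What a recognition engine has been told to do in advance."""
    grammar_uri: str    # grammar to use for the next utterance
    audio_in: str       # where the streamed audio will arrive
    result_target: str  # service that receives the result
    error_target: str   # error recovery process to contact

    def destination(self, outcome):
        """Pick where the engine sends its output for a given outcome."""
        if outcome == "result":
            return self.result_target
        if outcome == "error":
            return self.error_target
        raise ValueError("unknown outcome: %r" % (outcome,))


if __name__ == "__main__":
    program = EngineProgram(
        grammar_uri="http://example.com/grammars/cities.grxml",
        audio_in="rtp://192.0.2.20:49170",
        result_target="http://example.com/services/nl-parser",
        error_target="http://example.com/services/error-recovery",
    )
    print(program.destination("result"))
    print(program.destination("error"))
```

Because the routing is fixed before any audio arrives, no further
control exchange is needed at recognition time, which is the
reduction in exchanges that the pre-programming approach aims for.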
The following sub-sections compare each individual requirement
against the protocol.
7.2. Analysis of General Requirements
7.2.1. Reuse existing protocols [5.1]
T: Web services are a class of protocols (a framework) widely
studied and developed across numerous standards bodies such as W3C,
OASIS, WS-I, Liberty and Parlay, and adapted to numerous deployment
environment issues at IETF, OMA, 3GPP, 3GPP2, JCP, etc. As an
entry point, we recommend consulting the work at W3C [WS1].
7.2.2. Maintain Existing Protocol Integrity [5.2]
T: Web services is an XML-based framework that is by definition
extensible to support appropriate syntax and semantics.
Web services are bound to underlying transport protocols. Numerous
such bindings have been specified; others are in development. By
handling SPEECHSC at the level of the web services framework,
integrity is maintained for:
- the underlying transport protocols to which the web services are
bound (e.g. SOAP)
- the web service framework
This does not prevent introducing bindings to new protocols if
needed. For example, binding to SIP or BEEP could be advantageous
for mobile deployments.
7.2.3. Avoid Duplicating Existing Protocols [5.3]
T: By definition, the web service framework can be specified to
remotely control any web service. The specified syntax can be
limited to avoid duplicating remote control functionality offered
by other protocols.
At the same time, the extensibility inherent to the framework
guarantees that it is possible to specify (standard) or define
(application specific) remote control for other entities beyond the
current scope of SPEECHSC.
In that context, and with a view to unifying the remote control
framework exposed to an application developer or a system
integrator, it may be of interest to provide remote control syntax
for special entities such as a prompt player.
7.2.4. Protocol efficiency [5.4]
P+ to P: Web services are by definition more verbose protocols.
Hence, at this stage they do not qualify for a T mark.
However, work is in progress (e.g. at OMA, JCP) to optimize the
exchanges to handle:
- Clients with limited resources
- Constrained bandwidth
These rely on protocol compression and optimization, caching and
gateways.
As such the protocols qualify as P+.
In addition, based on the qualification of efficiency provided in
[WS8], the web service framework proposed for SPEECHSC and
described in [WS1] relies on known efficient techniques:
- Asynchronous pre-programming of the engines as web services to
reduce exchanges and avoid race conditions
- The possibility of piggybacking on response messages if
transported on optimized protocols like SIP or BEEP
- State caching in the engines, which are considered as
stand-alone, pre-packaged and pre-programmed engines
- etc.
7.2.5. Explicit invocation of services [5.5]
T: Web services are typically used in a client-server environment.
Solutions also exist for peer-to-peer (service-to-service) use.
Web services have been designed to support clients and servers, at
least one of which is operating directly on behalf of the user
requesting the service.
In addition, ongoing work at OMA and JCP addresses some of these
issues in mobile environments with the introduction of possible web
service gateways.
7.2.6. Server Location and Load Balancing [5.6]
T: Web services are widely developed for e-business applications.
Numerous tools and mechanisms have been provided for service
discovery and advertisement. In addition, numerous offerings provide
routing and load balancing capabilities as part of the web
application server used to deploy the web service.
Note that web services do not specify server location or load
balancing, but they are deployed on systems that provide such
functionality. As web services are expected to be widely used in
the future and central to most e-business offerings, it is to be
expected that such tools will become even more pervasive and
efficient.
7.2.7. Simultaneous services [5.7]
Web services allow control (interface) and composition of web
services at will (e.g. WSFL).
7.2.8. Multiple media sessions [5.8]
T: The proposed framework does not presuppose how many ports or
streams are associated with the engine. Different inbound and
outbound ports or streams can be used at will.
7.3. Analysis of Duplexing and Parallel Operation Requirements
7.3.1. Duplexing and Parallel Operation Requirements [9]
T: As explained, web services allow control (interface) and
composition of web services at will (e.g. WSFL). Also, the
framework does not presuppose how many ports or streams are
associated with the engine. Different inbound and outbound ports or
streams can be used at will, in full duplex or even between
engines, as supported by WSFL [WS4] and WSXL [WS7].
7.3.2. Full Duplex operation [9.1.1]
T:
7.3.3. Multiple services in parallel [9.1.2]
T:
7.3.4. Combination of services
T: As explained, web services allow control (interface) and
composition of web services at will (e.g. WSFL) into complex
parallel, serial or coordinated combinations as supported by WSFL
[WS4] and WSXL [WS7].
7.4. Analysis of additional considerations (non-normative)
The framework proposed supports:
- Use of SDP to describe sessions and streams for the streamed
channels
- Time stamps could be transmitted as part of the control messages
at the web service level or in band (e.g. with dynamic payload
switch or within the payload).
- The framework is compatible with any encoding scheme. This is
illustrated by the work on SRF (Speech Recognition Framework)
driven at 3GPP that supports conventional and DSR optimized codecs
and possible exchange of speech meta-information (e.g. data that
may be required to facilitate and enhance the server-side
processing of the input speech and facilitate the dialog management
in an automated voice service. These may include keypad events
over-riding spoken input, notification that the UE is in hands-free
mode, client-side collected information (speech/no-speech,
barge-in), etc.).
- SOAP over SIP or BEEP to support the framework described in
section 1 can also support VCR controls.
- Real-time messaging between engine and control is supported
within the framework (e.g. via SOAP or XML events). The framework
also supports exchanges between engines (same process; see also
WSXL [WS7]).
Although non-normative, the web service framework described
probably deserves marks of P+ to T.
7.5. Analysis of Security considerations
7.5.1. Security Considerations [11]
Web services are evolving to provide security, authentication,
encryption, trust management and privacy. Details can be found for
example in [WS9] and are explained in [WS10]. This is now an OASIS
activity [WS11].
This framework would enable SPEECHSC to employ the security
mechanisms provided by WS-Security for the remote control aspects.
Exchanged media can rely on security mechanisms at the transport /
streaming level.
The web service framework described probably deserves marks of P+
to T.
7.6. Interaction Model
TBC : TO BE COMPLETED : Analysis of the interaction model of the
protocol during the "data" phase (i.e. after session establishment)
and its suitability for speechsc.
8. Resource Control : "MRCP" Compliance Evaluation (Sarvi
Shanmugham)
8.1. General
8.1.1. MRCP Framework and General Applicability
The overall MRCP framework, the components involved and their
distribution and relationship to each other meet the framework
specified by SPEECHSC. The primary advantage of MRCP is that it is
a text based protocol designed to meet most of the requirements of
SPEECHSC pertaining to speech recognition and text-to-speech.
Though Speaker Recognition (SR) and Speaker Verification (SV) are
not supported in its current form, MRCP was explicitly designed to
be extensible for such needs. The core MRCP definition only deals
with the control of the ASR or TTS resource and the commands and
responses needed to achieve it.
There are multiple interoperable implementations of MRCP, and hence
it is a proven technology. It leverages existing W3C XML standards
for the exchange of data between the client and the server
resource. For example, it uses the W3C XML grammar format (GRXML)
along with W3C semantic attachments and Natural Language Semantic
Markup Language
to exchange data with the speech recognition resource. The W3C
Speech Synthesis Markup Language (SSML) is used when dealing with
text-to-speech engines.
MRCP was designed to work as a tunneled protocol over RTSP or SIP.
It therefore depends on the carrier protocol to establish a control
path and a media path between the client and the ASR or TTS server
resource, and it inherits most of the security and media pipe
management operations from that carrier. Once these paths are
established, MRCP commands and responses are tunneled over the
control path to control the ASR or TTS resource on the server.
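The tunneling arrangement described above can be illustrated with a
minimal sketch. It assumes RTSP as the carrier, the ANNOUNCE method
as the container and "application/mrcp" as the content type. These
details, the request-line layout "METHOD request-id MRCP/1.0", the
function names and the example URL are assumptions made for
illustration and are not taken from this evaluation.

```python
def build_mrcp_request(method, request_id, headers, body=""):
    """Serialize a text-based MRCP request (assumed layout)."""
    payload = body.encode("utf-8")
    lines = [f"{method} {request_id} MRCP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    if payload:
        lines.append(f"Content-Length: {len(payload)}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + payload


def tunnel_over_rtsp(mrcp_message, url, cseq, session_id):
    """Wrap an MRCP message in the body of an RTSP ANNOUNCE (assumed)."""
    head = (
        f"ANNOUNCE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"Session: {session_id}\r\n"
        "Content-Type: application/mrcp\r\n"
        f"Content-Length: {len(mrcp_message)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + mrcp_message


# Hypothetical usage: a SPEAK request carried by the RTSP control path.
speak = build_mrcp_request(
    "SPEAK", 1, {"Content-Type": "text/plain"}, "Hello world"
)
packet = tunnel_over_rtsp(
    speak, "rtsp://media.example.com/synthesizer", 2, "12345678"
)
```

The carrier owns the session (CSeq, Session) while the MRCP message
travels as an opaque body, which is why MRCP in this form cannot
operate without the carrier.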
8.1.2. MRCP can be evolved
Though MRCP directly meets many of the needs of SPEECHSC, the fact
that it is a tunneled protocol prevents its independent operation.
Furthermore, tunneling is a less efficient protocol design.
These issues can be addressed, and the core MRCP messages can be
evolved either into a standalone protocol or into extensions to an
existing protocol such as SIP or RTSP. To make MRCP a standalone
protocol, new session and media management messages need to be
defined so that it can operate independently. Evolving MRCP as
extensions to SIP or RTSP would also be relatively simple, since it
is a text-based protocol with a message format and headers very
similar to theirs. This protocol evaluation assesses the compliance
of MRCP from the perspective of evolution in one of these forms.
The following sub-sections compare each individual requirement
relating to resource management against the protocol.
8.2. Analysis of TTS requirements
8.2.1. Requesting Text Playback [6.1]
T: MRCP has the SPEAK method for the client to request the TTS
resource to play back text as an audio stream.
8.2.2. Text Formats [6.2]
T: When the client requests the TTS resource to play back text, it
can provide the content in the following formats:
1. Plain text
2. W3C XML-based Speech Synthesis Markup Language (SSML)
The content can be delivered through the following mechanisms:
1. By value: the content to be spoken is provided directly
through the control path.
2. By reference: this is achieved by having an audio tag inside
the SSML markup text. The URL is then fetched and played on the
RTP stream in sequence with the rest of the text according to
the SSML specification.
When the client sends plain text, SSML or another format of speech
text, the content is labelled with a MIME type. Hence the server
knows what format the speech content is coded in, and does not have
to infer it from the content.
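As a minimal sketch of the two points above, the following shows a
server choosing a renderer from the declared type alone and listing
the audio references found in an SSML document. The media type
strings, the RENDERERS table and the function names are assumptions
made for illustration; only the use of a MIME type and of the SSML
audio tag comes from the text.

```python
import xml.etree.ElementTree as ET

# Media types are illustrative assumptions; the evaluation only says
# that plain text and SSML are labelled with a MIME type.
RENDERERS = {
    "text/plain": "plain",
    "application/synthesis+ssml": "ssml",
}


def select_renderer(content_type):
    """Pick a renderer from the declared type, never from the body."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return RENDERERS[media_type]
    except KeyError:
        raise ValueError(f"unsupported speech content type: {media_type}")


def audio_references(ssml):
    """Return the URLs of <audio src="..."> elements in document order."""
    root = ET.fromstring(ssml)
    return [
        el.get("src")
        for el in root.iter()
        # Strip any XML namespace before comparing the element name.
        if el.tag.rsplit("}", 1)[-1] == "audio" and el.get("src")
    ]
```

Because the type is declared, an unknown format can be rejected
before any of the body is parsed.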
8.2.3. Plain text [6.2.1]
T: see above
8.2.4. SSML [6.2.2]
T: see above
8.2.5. Text in Control Channel [6.2.3]
T: see above
8.2.6. Document Type Indication [6.2.4]
T: see above
8.2.7. Control Channel [6.3]
T: In MRCP, the Reset-Audio-Channel header defined for the ASR
resource allows the recognizer to re-initialize the audio
characteristics that it has learnt so far. This allows a recognizer
resource to be used for multiple recognition sessions. It can be
used for short single-utterance recognitions as well, by applying
the Reset-Audio-Channel header to every recognition. Performance
may suffer in that case, because the learnt line characteristics
are discarded each time, but this is a recognizer issue.
8.2.8. Playback Controls [6.4]
T: MRCP supports the CONTROL method, which can be used with the
Jump-Target header to jump in time or to an exact or relative
location. It supports jumping by paragraphs, sentences and words,
and to specific markers that may be embedded in the speech content.
The CONTROL method can also be used with the voice and prosody
parameters, derived from SSML, to change the speed of speech or to
increase or decrease the volume. MRCP also supports the PAUSE and
RESUME methods to pause or resume a current SPEAK request.
8.2.9. Session Parameters [6.5]
T: As mentioned in the previous section, MRCP supports voice and
prosody parameters which are directly derived from the W3C SSML
specification. These headers can be sent using the SET-PARAMS
method and applied as defaults for the entire session. They can
also be sent in a SPEAK request to apply to that request only, or
in a CONTROL message to change the parameters of an active SPEAK
request.
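The three levels at which parameters can be set imply a lookup
order, which can be sketched as follows. The TtsSession class, its
method names and the header names used in the usage note are
hypothetical; the precedence (CONTROL over SPEAK over SET-PARAMS)
and the snapshot of the defaults taken at SPEAK time are
assumptions of this sketch rather than behaviour stated by MRCP.

```python
from collections import ChainMap


class TtsSession:
    """Toy model of TTS parameter scoping (illustrative only)."""

    def __init__(self):
        self.defaults = {}   # session defaults, as set by SET-PARAMS
        self.active = None   # parameter view of the SPEAK in progress

    def set_params(self, params):
        self.defaults.update(params)

    def speak(self, overrides=None):
        # Lookup order: CONTROL changes, SPEAK headers, session defaults.
        self.active = ChainMap({}, dict(overrides or {}), dict(self.defaults))
        return self.active

    def control(self, changes):
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        # Changes affect the active request only, not the defaults.
        self.active.maps[0].update(changes)
```

For example, after set_params({"Prosody-rate": "medium"}) and
speak({"Prosody-rate": "fast"}), the active request reads "fast"
while the session default remains "medium" for later requests.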
8.2.10. Speech Markers [6.6]
T: Specifying speech markers in the content is supported through
SSML. The CONTROL message can then be used to jump to specific
marker points in the text. Also, when the TTS resource reaches a
marker in the text, the server sends a SPEECH-MARKER event to the
client.
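As a minimal sketch, the markers that a client could jump to, or
expect notifications for, can be listed from the SSML mark elements
in document order. The function name and the sample document are
made up for illustration.

```python
import xml.etree.ElementTree as ET


def marker_names(ssml):
    """Return the names of SSML <mark> elements in document order."""
    root = ET.fromstring(ssml)
    return [
        el.get("name")
        for el in root.iter()
        # Strip any XML namespace before comparing the element name.
        if el.tag.rsplit("}", 1)[-1] == "mark" and el.get("name")
    ]


sample = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">'
    '<mark name="intro"/>Welcome to the service. '
    '<mark name="menu"/>Please choose an option.'
    "</speak>"
)
names = marker_names(sample)
```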
8.3. Analysis of ASR requirements
8.3.1. Requesting Automatic Speech Recognition [7.1]
T: The client uses the RECOGNIZE method in MRCP to request the
recognition resource to process the audio stream in the pipe. The
RECOGNIZE method also specifies parameters and grammars the
recognizer should match against.
8.3.2. XML [7.2]
T: As with the TTS resource, the MRCP ASR resource uses XML data to
exchange information between the client and the recognition
resource. It supports the W3C GRXML format to pass grammars from
the client to the server. When the server has finished recognizing,
it uses the W3C Natural Language Semantic Markup Language (NLSML)
to pass the results back to the client. Other grammar formats are
supported as well, as long as the server allows them. This is
possible because MIME types are used to package the data, and hence
the format type is always specified.
8.3.3. Grammar Specification [7.3.1]
P+: MRCP supports specifying the grammar both by value and by
reference. The RECOGNIZE method can carry grammar content and/or a
URI referring to the grammar content. Since MRCP supports referring
to a grammar by URI, the referenced grammar could be located on the
server itself. With respect to sharing, grammars defined/compiled
through the DEFINE-GRAMMAR primitive are not sharable across
sessions on the same server. This needs to be addressed to meet
this set of requirements in full.
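The two ways of supplying a grammar, and the session scoping that
leads to the P+ mark, can be sketched as follows. The media type
strings, the GrammarStore class and all function names are
assumptions made for illustration and do not describe an actual
MRCP implementation.

```python
def grammar_entities(inline_grxml=None, grammar_uri=None):
    """Return (content type, body) pairs for a RECOGNIZE request."""
    entities = []
    if inline_grxml is not None:
        # By value: the grammar text itself is carried.
        entities.append(("application/grammar+xml", inline_grxml))
    if grammar_uri is not None:
        # By reference: only a URI is carried.
        entities.append(("text/uri-list", grammar_uri + "\r\n"))
    if not entities:
        raise ValueError("this sketch requires at least one grammar")
    return entities


class GrammarStore:
    """Grammars registered via DEFINE-GRAMMAR, scoped per session."""

    def __init__(self):
        self._by_session = {}

    def define(self, session_id, grammar_id, content):
        self._by_session.setdefault(session_id, {})[grammar_id] = content

    def resolve(self, session_id, grammar_id):
        # Raises KeyError for any session other than the defining one.
        return self._by_session[session_id][grammar_id]
```

Sharing across sessions would require a store keyed by grammar
identifier alone, which is the gap noted in the analysis above.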
8.3.4. Explicit Indication of Grammar Format [7.3.2]
P+: see above
8.3.5. Grammar Sharing [7.3.3]
TBD
8.3.6. Session Parameters [7.4]
T: This requirement as defined is already fully met, since MRCP is
the standard that the requirement refers to for compliance.
8.3.7. Input Capture [7.5]
T: This is achieved by setting the Waveform-url header in the
RECOGNIZE method. This tells the server to record the audio of the
recognition and to return a URI to the client in the completion
event, which can be used to retrieve or play back the audio.
8.4. Analysis of Speaker Identification and Verification
Requirements
8.4.1. Requesting SI/SV [8.1]
F: not supported
8.4.2. Identifiers for SI/SV [8.2]
F: not supported
8.4.3. State for multiple utterances [8.3]
F: not supported
8.4.4. Input Capture [8.4]
F: not supported
8.4.5. SI/SV functional extensibility [8.5]
F: not supported
9. Resource Control : "RTSP" Compliance Evaluation (Brian Wyld)
9.1. General Introduction
RTSP is an existing protocol, oriented towards audio playback and
recording. As such, it has support for RTP session control, with
SDP used for session description, and a message set allowing
operation as a player/recorder with audio "VCR" controls. This
comparison only addresses the existing resource control elements
and their applicability to the SPEECHSC requirements.
The current PLAY state machine is exactly as required for TTS
operation. Although by analogy RECORD could initiate an ASR
session, with headers giving the grammar source or references, its
state machine is not nearly as compatible for ASR, and not at all
for SV/SI.
9.2. Analysis of TTS requirements
9.2.1. Requesting Text Playback [6.1]
P+: The RTSP PLAY message semantics would require minor extensions.
9.2.2. Text Formats [6.2]
P+: Text of any type can be carried.
9.2.3. Plain text [6.2.1]
T: Plain text may be carried directly in the message payload.
9.2.4. SSML [6.2.2]
T: Text may be in any format.
9.2.5. Text in Control Channel [6.2.3]
T: Text may be attached to the control messages.
9.2.6. Document Type Indication [6.2.4]
T: Via the Content-Type header.
9.2.7. Control Channel [6.3]
T: RTSP sessions may use a private or shared TCP connection.
9.2.8. Playback Controls [6.4]
T: RTSP defines playback control messages and a state machine.
9.2.9. Session Parameters [6.5]
T: RTSP defines operations for session parameter control.
9.2.10. Speech Markers [6.6]
P+: Markers may be inserted in the text, but providing the required
asynchronous events when a marker is synthesized will require the
use of specific ANNOUNCE-type messages for server->client
notification.
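The TTS mapping above can be sketched as an RTSP PLAY request that
carries the text to be spoken in its body, with the Content-Type
header indicating the document type. Carrying a body in PLAY is the
kind of minor extension the evaluation refers to and is not existing
RTSP behaviour; the function name, URL and session identifier are
hypothetical.

```python
def rtsp_play_with_text(url, cseq, session_id, text,
                        content_type="text/plain"):
    """Build a PLAY request with the text to speak as its body."""
    body = text.encode("utf-8")
    head = (
        f"PLAY {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        f"Session: {session_id}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


request = rtsp_play_with_text(
    "rtsp://tts.example.com/voice", 3, "4711", "Hello"
)
```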
9.3. Analysis of ASR requirements
9.3.1. Requesting Automatic Speech Recognition [7.1]
P: The RECORD message and semantics could be used but would require
extensions (stretching the current semantics considerably).
9.3.2. XML [7.2]
P+: Text of any type can be carried.
9.3.3. Grammar Specification [7.3.1]
P+: Text of any type can be carried.
9.3.4. Explicit Indication of Grammar Format [7.3.2]
T: Via the Content-Type header.
9.3.5. Grammar Sharing [7.3.3]
F: TBD
9.3.6. Session Parameters [7.4]
T: RTSP defines operations for session parameter control.
9.3.7. Input Capture [7.5]
P+: Would require the addition of a header to the initiation
message.
9.4. Analysis of Speaker Identification and Verification
Requirements
9.4.1. Requesting SI/SV [8.1]
F: not supported
9.4.2. Identifiers for SI/SV [8.2]
F: not supported
9.4.3. State for multiple utterances [8.3]
F: not supported
9.4.4. Input Capture [8.4]
F: not supported
9.4.5. SI/SV functional extensibility [8.5]
F: not supported
10. Security Considerations
Security considerations for the SPEECHSC protocol are covered by
the comparison against the specific Security requirements in the
SPEECHSC requirements document [1].
11. References
[1] Oran, D., "Requirements for Distributed Control of ASR, SI/SV
and TTS Resources", draft-ietf-speechsc-reqts-04, June 6, 2003,
work in progress.
[2] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP:
Session Initiation Protocol", RFC 3261, June 2002. (Obsoletes
RFC 2543)
[WS1] W3C Web Services, http://www.w3c.org/2002/ws/
[WS2] Simple Object Access Protocol (SOAP),
http://www.w3c.org/2002/ws/
[WS3] Web Services Description Language (WSDL 1.1), W3C Note, 15
March 2001, http://www.w3.org/TR/wsdl.
[WS4] Leymann, F., Web Service Flow Language, WSFL 1.0, May 2001,
http://www-4.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[WS5] UDDI, http://www.uddi.org/specification.html
[WS6] W3C Voice Activity, http://www.w3c.org/Voice/
[WS7] WSXL - Web Service eXperience Language, submitted to OASIS
WSIA and WSRP
[WS8] Requirements for Distributed Control of ASR, SI/SV and TTS
Resources, draft-ietf-speechsc-reqts-01.txt
[WS9] Security in a Web Services World: A Proposed Architecture and
Roadmap, April 7, 2002, Version 1.0,
http://www.verisign.com/wss/wss.pdf
[WS10] Kapil Apshankar, WS-Security, Security for Web Services,
http://www.webservicesarchitect.com/content/articles/apshankar04.asp
[WS11] OASIS Web Services Security TC,
http://www.oasis-open.org/committees/wss/
Author's Address
Brian Wyld
Eloquant SA
ZA Malvaisin Phone: +33 476 77 46 92
Le Versoud, France Email: brian.wyld@eloquant.com
Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English. The limited permissions granted above are perpetual and
will not be revoked by the Internet Society or its successors or
assigns. This document and the information contained herein is
provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE
INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.