Internet Engineering Task Force                        Q. Xie, Motorola
Audio Video Transport WG                            D. Pearce, Motorola
INTERNET-DRAFT                                  S. Balasuriya, Motorola
                                                      Y. Kim, VerbalTek
                                                        S. H. Maes, IBM
                                               Hari Garudadri, Qualcomm


Expires in six months                                      July 6, 2001


         RTP Payload Format for Distributed Speech Recognition
                    <draft-xie-avt-dsr-00.txt>


Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of [RFC2026].

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts. Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


1. Abstract

This document specifies an RTP payload format for encapsulating
front-end signal processing feature streams for distributed speech
recognition (DSR) systems, with the ETSI Standard ES 201 108 front-end
being the default codec.


2. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in [RFC2119].


3. Introduction

Motivated by technology advances in the field of speech recognition,
voice interfaces to a variety of services (such as airline
information systems, unified messaging, and the like) are becoming
more and more prevalent. In parallel, the popularity of mobile
computing and communications devices has also increased
dramatically. However, the voice codecs typically employed in mobile
systems were designed to optimize audible voice quality and not
speech recognition accuracy, and using these codecs with speech
recognizers can result in poor recognition performance. For systems
that can be accessed from multiple networks using multiple speech
codecs, recognition system designers are further challenged to
accommodate the characteristics of these differences in a robust
manner. Channel errors and lost data packets in these networks result
in further degradation of the speech signal.

In traditional systems as described above, the entire speech
recognizer lies on the server appliance. It is forced to use
incoming speech in whatever condition it arrives after the network
decodes the vocoded speech. A scheme called "distributed speech
recognition" (DSR) addresses this problem. In a DSR system, the
remote device acts as a thin client in communication with a speech
recognition server, also called a speech engine (SE). The remote
device processes the speech, then compresses and error-protects the
resulting bitstream in a manner optimal for speech recognition. The
speech engine then uses this representation directly, minimizing the
signal processing necessary and benefiting from enhanced error
concealment.

To achieve interoperability with different client devices and speech
engines, a common format is needed. Within the "Aurora" DSR working
group of the European Telecommunications Standards Institute (ETSI), a
payload has been defined and was published as a standard in February
2000 [ES201108].

For interactive voice user interface dialogues between a caller and a
voice service, low latency is also a high priority along with accurate
speech recognition. While jitter in the speech recognizer input is not
particularly important, many issues related to speech interaction over
an IP-based connection are still relevant.  Therefore, it will be
desirable to use the DSR payload in an RTP-based session.


3.1 Typical Scenarios for Using DSR Payload Format

The following diagrams show some typical use scenarios of the DSR RTP
payload format.


  +--------+                     +----------+
  |IP USER |  IP/UDP/RTP/DSR     |IP SPEECH |
  |TERMINAL|-------------------->|  ENGINE  |
  |        |                     |          |
  +--------+                     +----------+


  +--------+  DSR over      +-------+                +----------+
  | Non-IP |  Circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
  |  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
  |TERMINAL|  ETSI payload  |       |                |          |
  +--------+  format        +-------+                +----------+


  +--------+                  +-------+  DSR over       +----------+
  |IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
  |TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
  |        |                  |       |  ETSI payload   |  ENGINE  |
  +--------+                  +-------+  format         +----------+

    Figure 1: Typical Scenarios for Using DSR Payload Format.

For the different scenarios in Figure 1, the speech recognizer resides
in the speech engine, while a DSR front-end encoder inside the User
Terminal performs front-end speech processing and sends the resultant
data to the speech engine in the form of "frame-pairs" (FPs). Each
frame-pair normally contains two sets of encoded speech vectors
representing 20ms of original speech.


4. DSR RTP Payload Format

4.1 Payload Header

Each DSR payload MUST begin with the following one-octet payload
header:

    0
    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |  FPC  |E|R|R|R|
   +-+-+-+-+-+-+-+-+

   Figure 2: Payload header.

 FPC - Frame-Pair Count, indicating the number of frame-pairs (FPs)
       included in this payload packet.

 E - End of speech segment flag. When set to 1, it indicates that the
     last frame-pair in this payload packet is the end of the current
     speech segment.

 R - reserved bits. Must be set to 0 by the sender of the payload
     and ignored by the receiver.
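
As an illustration only, the header octet could be built and parsed as
in the following Python sketch. It assumes the usual IETF diagram
convention that bit 0 is the most significant bit of the octet; the
function names are hypothetical and not part of this specification.

```python
from typing import Tuple


def pack_dsr_header(fpc: int, end_of_segment: bool) -> int:
    """Build the one-octet header: FPC in the top 4 bits, then E."""
    if not 0 <= fpc <= 15:
        raise ValueError("FPC is a 4-bit field (0..15)")
    # The three reserved bits (lowest bits) are always sent as zero.
    return (fpc << 4) | (int(end_of_segment) << 3)


def parse_dsr_header(octet: int) -> Tuple[int, bool]:
    """Return (fpc, end_of_segment); reserved bits are ignored."""
    fpc = (octet >> 4) & 0x0F
    end_of_segment = bool((octet >> 3) & 0x01)
    return fpc, end_of_segment
```

Note that a receiver written this way tolerates non-zero reserved
bits, as required above.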


4.2 Payload Body

The DSR payload is formed by concatenating the above payload header
and FPC number of frame-pairs.

Each DSR payload MUST be octet-aligned at the end, i.e., if a DSR
payload does not end on an octet boundary, it then MUST be padded at
the end with zeros to the next octet boundary.

The following example shows a DSR payload carrying 3 frame pairs:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | FPC=3 |E|0|0|0|                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                         FP #1                                 |
   +                                                       +-+-+-+-+
   |                                                       |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+       +
   |                                                               |
   +                                                               +
   |                         FP #2                                 |
   +                                               +-+-+-+-+-+-+-+-+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                         FP #3                                 +
   |                                                               |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |0|0|0|0|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


In this example, the payload is shown with 4 zeros padded at the end
to make it octet-aligned.
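
The concatenation and padding rules above can be sketched as follows.
This is a minimal, non-normative Python example; it assumes 92-bit
frame-pairs (the ES 201 108 size given in Section 5.1), each passed in
as an integer, and a most-significant-bit-first layout as drawn in the
diagram. The function name is hypothetical.

```python
FRAME_PAIR_BITS = 92  # ES 201 108 frame-pair size (see Section 5.1)


def build_dsr_payload(frame_pairs, end_of_segment=False):
    """Concatenate header and frame-pairs, zero-padded to an octet."""
    fpc = len(frame_pairs)
    if fpc > 15:
        raise ValueError("FPC is a 4-bit field, so at most 15 FPs fit")
    # Header octet: FPC in the top four bits, E next, reserved bits 0.
    value = (fpc << 4) | (int(end_of_segment) << 3)
    nbits = 8
    for fp in frame_pairs:
        if not 0 <= fp < (1 << FRAME_PAIR_BITS):
            raise ValueError("frame-pair must fit in 92 bits")
        value = (value << FRAME_PAIR_BITS) | fp
        nbits += FRAME_PAIR_BITS
    padding = (-nbits) % 8  # zero bits needed to reach an octet boundary
    value <<= padding
    return value.to_bytes((nbits + padding) // 8, "big")
```

With three frame-pairs the payload holds 8 + 3 * 92 = 284 bits, so
four padding bits bring it to 36 octets, matching the example above.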

The number of FPs per payload packet should be determined by the
latency and bandwidth requirements of the DSR application.

Decreasing the number of FPs per payload packet reduces bandwidth
efficiency because of the RTP header overhead, while increasing the
number of FPs per packet causes longer end-to-end delay and hence
greater recognition latency.

Furthermore, increasing the number of FPs per packet raises the risk
of losing a large number of consecutive frame-pairs, a situation that
most speech recognizers have difficulty dealing with.

Therefore, it is RECOMMENDED that the number of FPs per DSR
payload packet be minimized, subject to meeting the application's
requirements on network bandwidth efficiency.

RTP header compression [RFC2508] SHOULD be considered to improve
network bandwidth efficiency.
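
To make the trade-off concrete, the following Python sketch estimates
packet size, packetization delay, and bit rate for a given number of
FPs per packet. It is illustrative only and assumes uncompressed
IPv4 (20 octets), UDP (8 octets), and RTP (12 octets, no CSRC list)
headers, 92-bit ES 201 108 frame-pairs, and 20ms of speech per FP.
The function name is hypothetical.

```python
IP_UDP_RTP_HEADER_OCTETS = 20 + 8 + 12  # assumed: IPv4, no options/CSRC


def packet_stats(fpc):
    """Return (payload octets, packet octets, delay ms, bits/second)."""
    if not 1 <= fpc <= 15:
        raise ValueError("expected between 1 and 15 frame-pairs")
    payload_bits = 8 + 92 * fpc            # header octet plus FPs
    payload_octets = (payload_bits + 7) // 8  # round up to whole octets
    packet_octets = IP_UDP_RTP_HEADER_OCTETS + payload_octets
    delay_ms = 20 * fpc                     # speech carried per packet
    bits_per_second = packet_octets * 8 * 1000 / delay_ms
    return payload_octets, packet_octets, delay_ms, bits_per_second
```

Under these assumptions, one FP per packet yields 53-octet packets
(21200 bit/s) with 20ms of packetization delay, while five FPs per
packet yield 99-octet packets (7920 bit/s) with 100ms of delay.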


5. Frame-pair Format

Depending on the type of the DSR front-end encoder to be used in the
present DSR RTP session, the frame-pair format may be different.

When setting up a DSR RTP session, the user terminal will inform the
speech engine of the type of the front-end encoder, using the
front-end-type MIME parameter as defined in Section 6.

In this memo, we only define the frame-pair format that MUST be used
when the ETSI ES 201 108 Front-end Codec [ES201108] is used.
Frame-pair formats for future DSR front-end codecs may be defined in
separate IETF documents.


5.1 Frame-Pair Format for ETSI ES 201 108 Front-end Codec

The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal
processing front-end and compression scheme for speech input to a
speech recognition system. Some relevant characteristics of this ETSI
DSR front-end codec are summarized below.

The coding algorithm, a standard mel-cepstral technique common to many
speech recognition systems, supports three raw sampling rates: 8 kHz,
11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based
scheme that produces an output vector every 10 ms.

After calculation of the mel-cepstral representation, the
representation is quantized via split-vector quantization to reduce
the data rate of the encoded stream. This is a lossy compression, with
the output being a frame containing an integer representation of the
encoded speech.

For ES 201 108 Front-end Codec, the following mel-cepstral frame MUST
be used, as defined in [ES201108]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  idx(0,1) |  idx(2,3) |  idx(4,5) |  idx(6,7) |  idx(8,9) |idx
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   (10,11) |   idx(12,13)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+

The length of a frame is 44 bits representing 10ms of voice.
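
As a rough illustration, the seven codebook indices could be packed
into a 44-bit integer as in the Python sketch below. The field widths
(six 6-bit indices followed by one 8-bit index) are read directly off
the diagram above; placing each field most-significant-bit first in
the order drawn is an assumption of this sketch, and [ES201108]
remains the normative reference for bit ordering. The function names
are hypothetical.

```python
FIELD_WIDTHS = (6, 6, 6, 6, 6, 6, 8)  # idx(0,1) .. idx(12,13)


def pack_frame(indices):
    """Pack seven codebook indices into one 44-bit integer."""
    if len(indices) != len(FIELD_WIDTHS):
        raise ValueError("expected seven indices")
    frame = 0
    for idx, width in zip(indices, FIELD_WIDTHS):
        if not 0 <= idx < (1 << width):
            raise ValueError("index does not fit its field")
        frame = (frame << width) | idx
    return frame


def unpack_frame(frame):
    """Recover the seven indices from a 44-bit integer."""
    indices = []
    for width in reversed(FIELD_WIDTHS):
        indices.append(frame & ((1 << width) - 1))
        frame >>= width
    return indices[::-1]
```

The widths sum to 6 * 6 + 8 = 44 bits, consistent with the frame
length stated above.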

As defined in [ES201108], pairs of the quantized 10ms mel-cepstral
frames MUST be grouped together and protected with a 4-bit CRC,
forming a 92-bit long frame-pair:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Frame #1  (44 bits)                      |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |          Frame #2 (44 bits)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
   |                                               | CRC   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Therefore, each frame-pair represents 20ms of original speech.

The 4-bit CRC MUST be calculated using the formula defined in 6.2.4 in
[ES201108].
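
The frame-pair layout can be sketched in Python as follows. This is
illustrative only: the CRC value is taken as an input rather than
computed, because its formula is defined in [ES201108] and is not
reproduced here. Frames are assumed to be 44-bit integers laid out
most-significant-bit first, and the function names are hypothetical.

```python
FRAME_BITS = 44
CRC_BITS = 4


def pack_frame_pair(frame1, frame2, crc):
    """Combine two 44-bit frames and a 4-bit CRC into 92 bits."""
    for frame in (frame1, frame2):
        if not 0 <= frame < (1 << FRAME_BITS):
            raise ValueError("frame must fit in 44 bits")
    if not 0 <= crc < (1 << CRC_BITS):
        raise ValueError("CRC must fit in 4 bits")
    return (frame1 << (FRAME_BITS + CRC_BITS)) | (frame2 << CRC_BITS) | crc


def unpack_frame_pair(frame_pair):
    """Split a 92-bit frame-pair into (frame1, frame2, crc)."""
    crc = frame_pair & ((1 << CRC_BITS) - 1)
    frame2 = (frame_pair >> CRC_BITS) & ((1 << FRAME_BITS) - 1)
    frame1 = frame_pair >> (FRAME_BITS + CRC_BITS)
    return frame1, frame2, crc
```

A receiver would recompute the CRC over the two received frames
using the [ES201108] formula and compare it with the unpacked value.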


6. DSR MIME Type Registration

Media Type name:     audio

Media subtype name:  DSR

Required parameters: none

Optional parameters for RTP mode:

 sample-rate: Indicating the sample rate of the speech. Valid values
              include: 8k, 11k, and 16k.

              If this parameter is not present, 8k sample rate is
              assumed.

 front-end-type: Indicating the type of the front-end codec to be used
                 for this DSR session. Valid values are:

                 etsi_mfcc - indicates that ETSI ES 201 108 Front-end
                 Codec as defined in [ES201108] will be used.

                 unspecified - indicates that another front-end codec
                 will be used.

                 If this parameter is absent, ETSI ES 201 108
                 Front-end will be assumed.

 maxptime:  The maximum amount of media which can be encapsulated in
            each packet, expressed as time in milliseconds. The time
            shall be calculated as the sum of the time the media
            present in the packet represents. The time SHOULD be a
            multiple of the frame pair size (i.e., one FP <-> 20ms).

            If this parameter is not present, maxptime is assumed to be
            60ms.

Encoding considerations : <TBD>

Security considerations : <TBD>

Interoperability considerations : <TBD>

Person & email address to contact for further information: <TBD>

Intended usage: COMMON. It is expected that many VoIP applications
(as well as mobile applications) will use this type.

Author/Change controller:
  <TBD>
  IETF Audio/Video transport working group
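
The defaults for the optional parameters above could be applied as in
the following Python sketch. This memo does not define how the
parameters are carried, so the "name=value; name=value" string
syntax used here is purely an assumption for illustration, as are the
function names. The cap of 15 frame-pairs reflects the 4-bit FPC
field of Section 4.1.

```python
DEFAULTS = {
    "sample-rate": "8k",            # default when absent
    "front-end-type": "etsi_mfcc",  # ES 201 108 assumed when absent
    "maxptime": "60",               # milliseconds
}


def parse_dsr_params(text):
    """Parse 'name=value; name=value' and fill in the defaults."""
    params = dict(DEFAULTS)
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        params[name.strip().lower()] = value.strip()
    return params


def max_frame_pairs(params):
    """Largest FPC allowed by maxptime (one FP represents 20ms)."""
    return min(int(params["maxptime"]) // 20, 15)
```

With no parameters present, the default maxptime of 60ms allows at
most three frame-pairs per packet.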


7. Security Considerations

Implementations using the payload defined in this specification are
subject to the security considerations discussed in the RTP
specification [RFC1889] and the RTP profile [RFC1890]. This payload
does not specify any different security services.


8. References


[ES201108] European Telecommunications Standards Institute (ETSI)
   Standard ES 201 108, "Speech Processing, Transmission and Quality
   Aspects (STQ); Distributed Speech Recognition; Front-end Feature
   Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April
   11, 2000. http://webapp.etsi.org/pda/home.asp?wki_id=9948

[RFC1889] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
   "RTP: A transport protocol for real-time applications," Internet
   Draft, Internet Engineering Task Force, Feb. 1999 Work in progress,
   revision to RFC 1889.

[RFC1890] H. Schulzrinne and S. Casner, "RTP Profile for Audio and
   Video Conferences with Minimal Control," Internet Draft
   draft-ietf-avt-profile-new-08.txt, Work in Progress January 14,
   2000, revision to RFC 1890.

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3",
   BCP 9, RFC 2026, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
   Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2508] S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers
   for Low-Speed Serial Links," RFC 2508, February 1999.



9. Acknowledgments

The design presented here benefits greatly from an earlier work on DSR
RTP payload design by Jeff Meunier.


10. Authors' Addresses

Qiaobing Xie                        Tel:   +1-847-632-3028
Motorola, Inc.                      EMail: qxie1@email.mot.com
1501 W. Shure Drive, 2-F9
Arlington Heights, IL 60004, USA

David Pearce                        Tel: +44 (0)1256 484 436
Motorola Labs                       EMail: bdp003@motorola.com
UK Research Laboratory
Jays Close
Viables Industrial Estate
Basingstoke, HANTS, RG22 4PD

Senaka Balasuriya                   Tel:   +1-630-353-8347
Motorola, Inc.              EMail: Senaka.Balasuriya@motorola.com
1411 Opus Place, Suite 350
Downers Grove, IL 60515, USA

Yoon Kim                            Tel: +1-408-768-4974
VerbalTek, Inc.                     EMail: yoonie@verbaltek.com
2921 Copper Rd.
Santa Clara, CA 95051

Stephane H. Maes                    Tel: +1-914-945-2908
IBM                                 EMail: smaes@us.ibm.com
TJ Watson Research Center
P.O. Box 218,
Yorktown Heights, NY 10598, USA.


Hari Garudadri                      Tel:
Qualcomm                            EMail: hgarudad@qualcomm.com







      This Internet Draft expires in 6 months from July 2001.