Internet Engineering Task Force                         J. van der Meer
Internet Draft                                      Philips Electronics
                                                              D. Mackie
                                                     Cisco Systems Inc.
                                                         V. Swaminathan
                                                  Sun Microsystems Inc.
                                                              D. Singer
                                                         Apple Computer
                                                             P. Gentric
                                                    Philips Electronics

                                                             April 2002
                                                   Expires October 2002

   Document: draft-ietf-avt-mpeg4-simple-02.txt


   Transport of MPEG-4 Elementary Streams

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This specification is a product of the Audio/Video Transport working
   group within the Internet Engineering Task Force. Comments are
   solicited and should be addressed to the working group's mailing
   list at avt@ietf.org and/or the authors.

   << Note for the RFC editor: xxxx should be replaced with the RFC
   number that will be assigned. >>

Abstract

   The MPEG Committee (ISO/IEC JTC1/SC29 WG11) is a working group in
   ISO that produced the MPEG-4 standard. MPEG defines tools to
   compress content such as audio-visual information into elementary
   streams. This specification defines a simple but generic RTP
   payload format for transport of any non-multiplexed MPEG-4
   elementary stream.

van der Meer et al.        Expires October 2002                [Page 1]


RFC xxxx        Transport of MPEG-4 Elementary Streams       April 2002


Table of Contents

   1.  Introduction
   2.  Carriage of MPEG-4 elementary streams over RTP
   2.1.  Introduction
   2.2.  MPEG Access Units
   2.3.  Concatenation of Access Units
   2.4.  Fragmentation of Access Units
   2.5.  Interleaving
   2.6.  Time stamp information
   2.7.  Carriage of auxiliary information
   2.8.  MIME format parameters and configuring conditional fields
   2.9.  Global structure of payload format
   2.10. Modes to transport MPEG-4 streams
   2.11. Alignment with RFC 3016
   3.  Payload format
   3.1.  RTP header field usage
   3.2.  RTP payload structure
   3.2.1.  The AU Header Section
   3.2.1.1.  The AU-header
   3.2.2.  The Auxiliary Section
   3.2.3.  The Access Unit Data Section
   3.2.3.1.  Fragmentation
   3.2.3.2.  Interleaving
   3.2.3.3.  Constraints for interleaving
   3.3.  Usage of this specification
   3.3.1.  General
   3.3.2.  The generic mode
   3.3.3.  Constant bit rate CELP
   3.3.4.  Variable bit rate CELP
   3.3.5.  Low bit rate AAC
   3.3.6.  High bit rate AAC
   3.3.7.  Additional modes
   4.  IANA considerations
   4.1.  MIME type registration
   4.2.  Concatenation of parameters
   4.3.  Usage of SDP
   4.3.1.  The a=fmtp keyword
   5.  Security considerations
   6.  Acknowledgements
   7.  References
   8.  Author addresses
       APPENDIX: Usage of this payload format
       A.1. Examples of delay analysis with interleave
       A.1.1 Group interleave
       A.1.2 Continuous interleave

1. Introduction

   The MPEG Committee is Working Group 11 (WG11) in ISO/IEC JTC1 SC29
   that specified the MPEG-1, MPEG-2 and, more recently, the MPEG-4
   standards [1]. The MPEG-4 standard specifies compression of
   audio-visual data into, for example, an audio or video elementary
   stream. In the MPEG-4 standard, these streams take the form of
   audio-visual objects that may be arranged into an audio-visual scene
   by means of a scene description. Each MPEG-4 elementary stream
   consists of a sequence of Access Units; examples of an Access Unit
   (AU) are an audio frame and a video picture.

   The MPEG-4 system specification is a rather abstract specification
   in the sense that no transport format for MPEG-4 elementary streams
   is defined. Instead, a conceptual synchronization layer (SL) has
   been specified to store transport-specific information such as time
   stamps and random access point information. When transporting an
   MPEG-4 elementary stream, transport information from the SL is
   typically mapped to the actual transport layer. Note that the SL is
   conceptual and may not exist in practice.

   This specification defines a general and configurable payload
   structure to transport MPEG-4 elementary streams such as audio,
   speech, video and BIFS streams. The RTP payload defined in this
   document is simple to implement and reasonably efficient. It allows
   for optional interleaving of Access Units (such as audio frames) to
   increase resiliency to packet loss.

   Configuration of the payload is provided to accommodate transport
   of any MPEG-4 stream at any possible bit rate. However, for a
   specific MPEG-4 elementary stream typically only very few
   configurations are needed. To allow for the design of simplified
   but dedicated receivers, this specification requires that specific
   modes be defined for transport of MPEG-4 streams. This document
   defines modes for MPEG-4 CELP and AAC streams, as well as a generic
   mode that can be used to transport any MPEG-4 stream. In the
   future, new RFCs are expected to specify additional modes for
   transport of MPEG-4 streams.

   The RTP payload format defined in this document specifies carriage
   of system-related information that is often equivalent to the
   information that may be contained in the MPEG-4 SL. This
   document does not prescribe how to transcode or map information
   from the SL to fields defined in the RTP payload format. Such
   processing, if any, is left to the discretion of the application.
   However, to anticipate the need to transport additional
   system-related information in the future, an auxiliary field can be
   configured to carry any such data.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [3].

2. Carriage of MPEG-4 elementary streams over RTP

2.1 Introduction

   With this payload format a single MPEG-4 elementary stream can be
   transported. Information on the type of MPEG-4 stream carried in
   the payload is conveyed by MIME format parameters, for example in
   an SDP [6] message or by other means. These MIME format parameters
   specify the configuration of the payload. To allow for simplified
   and dedicated receivers, a MIME format parameter is available to
   signal a specific mode of using this payload. A mode definition MAY
   include the type of MPEG-4 elementary stream as well as the applied
   configuration, so that receivers need not parse all MIME format
   parameters. The applied mode MUST be signalled.

2.2 MPEG Access Units

   For carriage of compressed audio-visual data MPEG defines Access
   Units. An MPEG Access Unit (AU) is the smallest data entity to
   which timing information is attributed. For audio, an Access Unit
   may represent an audio frame; for video, a picture. MPEG Access
   Units are by definition byte aligned. If, for example, an audio
   frame is not byte aligned, up to 7 zero-padding bits MUST be
   inserted at the end of the frame to achieve a byte-aligned Access
   Unit. MPEG-4 decoders MUST be able to decode AUs in which such
   padding is applied.
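
   The padding rule above can be sketched in Python as follows. The
   function name and the choice to hold a frame as an integer plus a
   bit count are illustrative assumptions, not part of this
   specification; the sketch only shows how the 0 to 7 zero bits are
   computed and appended.

```python
def pad_access_unit(frame_bits: int, nbits: int) -> bytes:
    """Return an nbits-long frame as a byte-aligned Access Unit.

    Hypothetical helper: the frame is the nbits least significant bits of
    frame_bits, most significant bit first. Between 0 and 7 zero bits are
    appended so that the result is a whole number of octets.
    """
    if nbits < 0 or frame_bits >> nbits:
        raise ValueError("frame_bits does not fit in nbits")
    pad = (-nbits) % 8  # number of zero-padding bits, 0..7
    return (frame_bits << pad).to_bytes((nbits + pad) // 8, "big")


# A 3-bit frame 101 becomes the single octet 1010 0000.
print(pad_access_unit(0b101, 3).hex())  # a0
```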

   Consistent with the MPEG-4 specification, this document requires
   that each MPEG-4 part 2 video Access Unit includes all the coded
   data of a picture, any video stream headers that may precede the
   coded picture data, and any video stream stuffing that may follow
   it, up to, but not including, the start code indicating the start
   of a new video stream or the next Access Unit.

2.3 Concatenation of Access Units

   Frequently it is possible to carry multiple Access Units in one RTP
   packet. This is particularly useful for audio; for example, when
   AAC is used for encoding of a stereo signal at 64 kbits/sec, AAC
   frames contain on average approximately 200 octets. On a LAN with a
   1500 octet MTU this would allow on average 7 complete AAC frames to
   be carried per RTP packet.
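
   The estimate of 7 frames per packet can be checked with a short
   calculation. This is a sketch under stated assumptions: IPv4, UDP
   and RTP headers without options or CSRCs (20 + 8 + 12 octets), a
   2-octet AU-headers-length field, and a hypothetical 16-bit AU-header
   per AU; the actual AU-header size depends on the configuration.

```python
def max_aus_per_packet(mtu: int, au_size: int, au_header_bits: int = 16,
                       headers: int = 20 + 8 + 12) -> int:
    """Estimate how many AUs of au_size octets fit in one RTP packet.

    Assumptions: `headers` covers IPv4 (20), UDP (8) and RTP (12) octets;
    each AU-header is au_header_bits long; when AU-headers are present the
    section also holds a 2-octet AU-headers-length field and is padded to
    a whole octet.
    """
    if au_size <= 0:
        raise ValueError("au_size must be positive")
    # Octets left for AU-headers and AU data after the fixed headers.
    budget = mtu - headers - (2 if au_header_bits else 0)
    n = 0
    while True:
        section = ((n + 1) * au_header_bits + 7) // 8  # padded AU-headers
        if section + (n + 1) * au_size > budget:
            return n
        n += 1


print(max_aus_per_packet(1500, 200))  # 7
```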

   Access Units may have a fixed size in octets, but a variable size
   is also possible. To facilitate parsing when multiple AUs are
   concatenated in one RTP packet, the size of each AU is made known
   to the receiver. If the concatenated AUs have a constant size, this
   size is communicated "out of band" through a MIME format parameter.
   If they have variable sizes, the RTP payload carries "in band" an
   AU-size field for each contained AU. In combination with the RTP
   payload length, this size information allows the receiver to split
   the RTP payload back into the individual AUs.
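
   The splitting step can be sketched as follows for a packet carrying
   complete AUs; the function name is hypothetical. Either the in-band
   AU-size values or the constant size signalled out of band is
   supplied. Fragments are not handled here, because for a fragment
   the AU-size field gives the size of the entire AU.

```python
def split_au_data(data: bytes, sizes=None, constant_size=None) -> list:
    """Split an Access Unit Data Section holding complete AUs.

    Pass either `sizes` (one in-band AU-size per AU-header) or
    `constant_size` (the AU size signalled out of band), not both.
    """
    if (sizes is None) == (constant_size is None):
        raise ValueError("give exactly one of sizes or constant_size")
    if constant_size is not None:
        if constant_size <= 0 or len(data) % constant_size:
            raise ValueError("payload is not a whole number of AUs")
        sizes = [constant_size] * (len(data) // constant_size)
    if sum(sizes) != len(data):
        raise ValueError("AU sizes do not match the payload length")
    aus, offset = [], 0
    for size in sizes:
        aus.append(data[offset:offset + size])
        offset += size
    return aus
```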

   To simplify the implementation of RTP receivers, when multiple AUs
   are carried in an RTP packet, each AU MUST be complete, i.e. the
   number of AUs in an RTP packet MUST be integral.

2.4 Fragmentation of Access Units

   MPEG allows for very large Access Units. Since most IP networks
   have significantly smaller MTU sizes, this payload format allows
   for the fragmentation of an Access Unit over multiple RTP packets
   so as to avoid IP layer fragmentation. To simplify the
   implementation of RTP receivers, an RTP packet SHALL either carry
   one or more complete Access Units or a single fragment of one
   Access Unit.
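
   A sender-side sketch of this rule is given below. It is one possible
   packetizer, not a normative algorithm: `max_payload` is an assumed
   budget for AU data only (header overhead is ignored), and complete
   AUs are grouped greedily in order until the next one no longer fits.

```python
def packetize(aus, max_payload):
    """Split a sequence of AUs into RTP payloads (sketch).

    Each payload holds either one or more complete AUs or a single
    fragment of one AU. Yields (pieces, is_fragment) pairs.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    batch, used = [], 0
    for au in aus:
        if len(au) > max_payload:
            if batch:  # complete AUs and fragments never share a packet
                yield batch, False
                batch, used = [], 0
            for start in range(0, len(au), max_payload):
                yield [au[start:start + max_payload]], True
            continue
        if used + len(au) > max_payload:
            yield batch, False
            batch, used = [], 0
        batch.append(au)
        used += len(au)
    if batch:
        yield batch, False
```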

2.5 Interleaving

   When an RTP packet carries a contiguous sequence of Access Units,
   the loss of such a packet can result in a "decoding gap" for the
   user. One method to alleviate this problem is to allow for the
   Access Units to be interleaved in the RTP packets. For a modest
   cost in latency and implementation complexity, significant error
   resiliency to packet loss can be achieved.

   To support optional interleaving of Access Units, this payload
   format allows for index information to be sent for each Access Unit.
   The RTP sender is free to choose the interleaving pattern without
   propagating this information to the receiver(s). Indeed the sender
   could dynamically adjust the interleaving pattern based on the
   Access Unit size, error rates, etc. The RTP receiver does not need
   to know the interleaving pattern used; it only needs to extract the
   index information of the Access Unit and insert the Access Unit
   into the appropriate position in the rendering queue. An example of
   interleaving is given below.

   Assume that an RTP packet contains 3 AUs, and that the AUs are
   numbered 1, 2, 3, 4, etc. If an interleaving group length of 9 is
   chosen, then RTP packet(i) contains the following AU(n):

   RTP packet(1):  AU(1),  AU(4),  AU(7)
   RTP packet(2):  AU(2),  AU(5),  AU(8)
   RTP packet(3):  AU(3),  AU(6),  AU(9)
   RTP packet(4):  AU(10), AU(13), AU(16)
   RTP packet(5):  AU(11), AU(14), AU(17)
   Etc.
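
   The interleaving pattern in this example, and the receiver behaviour
   described above, can be sketched as follows. Both names are
   hypothetical: group_interleave() reproduces the packet contents
   listed above, and DeinterleaveBuffer releases AUs in serial-number
   order without knowing the pattern. Loss handling is omitted.

```python
import heapq


def group_interleave(num_groups, aus_per_packet=3, group_len=9):
    """Return packets as lists of 1-based AU numbers (group interleave)."""
    if group_len % aus_per_packet:
        raise ValueError("group_len must be a multiple of aus_per_packet")
    packets_per_group = group_len // aus_per_packet
    packets = []
    for g in range(num_groups):
        base = g * group_len
        for p in range(packets_per_group):
            packets.append([base + p + k * packets_per_group + 1
                            for k in range(aus_per_packet)])
    return packets


class DeinterleaveBuffer:
    """Receiver queue that releases AUs in serial-number order.

    A real receiver would also need a timeout that skips serial numbers
    lost with a packet; that is omitted here.
    """

    def __init__(self, first_index=1):
        self.next_index = first_index
        self.pending = []  # min-heap of (serial number, AU)

    def push(self, index, au):
        heapq.heappush(self.pending, (index, au))
        ready = []
        while self.pending and self.pending[0][0] == self.next_index:
            ready.append(heapq.heappop(self.pending)[1])
            self.next_index += 1
        return ready


for packet in group_interleave(2)[:5]:
    print(packet)
```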

2.6 Time stamp information

   The RTP time stamp MUST carry the sampling instant of the first AU
   (fragment) in the RTP packet. When multiple AUs are carried within
   an RTP packet, the time stamps of subsequent AUs can be calculated
   if the frame period of each AU is known. For audio and video this
   is possible if the frame rate is constant. However, in some cases
   it is not possible to make such a calculation, for example for
   variable frame rate video and for MPEG-4 BIFS streams carrying
   composition information. To support such cases, this payload format
   can be configured to carry a time stamp in the RTP payload for each
   contained Access Unit. A time stamp MAY be conveyed in the RTP
   payload only for non-first AUs in the RTP packet, and SHALL NOT be
   conveyed for the first AU (fragment), as the time stamp for the
   latter is carried by the RTP time stamp.
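
   For the constant frame rate case, the calculation can be sketched as
   below. It assumes the AUs in the packet are consecutive (no
   interleaving) and that the frame period is expressed in RTP clock
   ticks; the function name is hypothetical. The 1024-tick period in
   the usage line assumes 1024-sample AAC frames with the RTP clock
   running at the audio sampling rate.

```python
RTP_TS_MODULUS = 1 << 32  # RTP time stamps are 32 bits wide and wrap


def au_timestamps(rtp_ts: int, count: int, frame_period: int) -> list:
    """Time stamps of `count` consecutive AUs, the first one at rtp_ts."""
    return [(rtp_ts + k * frame_period) % RTP_TS_MODULUS
            for k in range(count)]


print(au_timestamps(1000, 3, 1024))  # [1000, 2024, 3048]
```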

   MPEG-4 defines two types of time stamps, the composition time stamp
   (CTS) and the decoding time stamp (DTS). The CTS represents the
   sampling instant of an AU, and hence the CTS is equivalent to the
   RTP time stamp. The DTS may be used only in MPEG-4 video streams
   that use bi-directional coding, i.e. when pictures are predicted in
   both forward and backward direction by using either a reference
   picture in the past, or a reference picture in the future. The DTS
   cannot be carried in the RTP header. In some cases the DTS can be
   derived from the RTP time stamp using frame rate information; this
   requires deep parsing of the video stream, which may be considered
   objectionable. Moreover, if the video frame rate is variable, the
   required information may not even be present in the video stream.
   For both reasons, this payload format can optionally carry the DTS
   in the RTP payload for each contained Access Unit.

   Since RTP time stamps may be re-stamped by RTP devices, each time
   stamp contained in the RTP payload is coded differentially from the
   RTP time stamp, so that a device that re-stamps the RTP header does
   not need to parse or modify the payload.

2.7 Carriage of auxiliary information

   This payload format defines a specific field to carry auxiliary
   data. The auxiliary data field is preceded by a field that specifies
   the length of the auxiliary data, so as to facilitate skipping of
   the data without parsing it. The coding of the auxiliary data is not
   defined in this document, but is left to the discretion of
   applications. Receivers that have knowledge of the auxiliary data
   MAY decode the auxiliary data, but receivers without knowledge of
   such data MUST skip the auxiliary data field.

2.8 MIME format parameters and configuring conditional fields

   To support the features described in the previous sections several
   fields are defined for carriage in the RTP payload. However, their
   use strongly depends on the type of MPEG-4 elementary stream that
   is carried. Sometimes a specific field is needed with a certain
   length, while in other cases such a field is not needed at all. To be
   efficient in either case, the fields to support these features are
   configurable by means of MIME format parameters. In general, a MIME
   format parameter defines the presence and length of the associated
   field. A length of zero indicates absence of the field. As a
   consequence, parsing of the payload requires knowledge of MIME
   format parameters. The MIME format parameters are conveyed to the
   receiver via SDP [6] messages or through other means.

2.9 Global structure of payload format

   The RTP payload, which follows the RTP header, contains three
   byte-aligned data sections, of which the first two MAY be empty.
   See figure 1.

          +---------+-----------+-----------+---------------+
          | RTP     | AU Header | Auxiliary | Access Unit   |
          | Header  | Section   | Section   | Data Section  |
          +---------+-----------+-----------+---------------+

                    <----------RTP Packet Payload----------->

   Figure 1: Data sections within an RTP packet

   The first data section is the AU (Access Unit) Header Section, which
   contains one or more AU-headers; however, each AU-header MAY be
   empty, in which case the entire AU Header Section is empty. The
   second section is the Auxiliary Section, containing auxiliary data;
   this section MAY also be configured empty. The third section is the
   Access Unit Data Section, containing either a single fragment of
   one Access Unit or one or more complete Access Units. The Access
   Unit Data Section is never empty.

2.10 Modes to transport MPEG-4 streams

   While it is possible to build fully configurable receivers capable
   of receiving any MPEG-4 stream, this specification also allows for
   the design of simplified but dedicated receivers that are capable,
   for example, of receiving only one type of MPEG-4 stream. This is
   achieved by requiring that specific modes be defined for using this
   specification. Each mode may define constraints for transport of
   one or more types of MPEG-4 streams, for instance on the payload
   configuration.

   The applied mode MUST be signalled. Signalling the mode is
   particularly important for receivers that are only capable of
   decoding one or more specific modes. Such receivers need to
   determine whether the applied mode is supported, so as to avoid
   problems with processing of payloads that are beyond the
   capabilities of the receiver.

   In this document several modes are defined for transport of MPEG-4
   CELP and AAC streams, as well as a generic mode that can be used
   for any MPEG-4 stream. In the future, new RFCs are expected to specify
   additional modes of using this specification. New modes can be
   defined as deemed appropriate, typically by specifications that are
   hierarchically higher than this payload format. However, each mode
   MUST be in full compliance with this specification.

2.11 Alignment with RFC 3016

   This payload can be configured to be nearly identical to the
   payload format defined in RFC 3016 [5] for the MPEG-4 video
   configurations recommended in RFC 3016. Hence, receivers that
   comply with RFC 3016 can decode such an RTP payload, provided that
   additional packets containing video decoder configuration (VO,
   VOL, VOSH) are inserted in the stream, as required by RFC 3016.
   Conversely, receivers that comply with the specification in this
   document SHOULD be able to decode payloads, names and parameters
   defined for MPEG-4 video in RFC 3016. In this respect it is
   strongly recommended to implement the ability to ignore "in band"
   video decoder configuration packets in the RFC 3016 payload.

   For interoperability reasons, applications that transport MPEG-4
   video part 2 over RTP SHOULD use the payload format and associated
   names and parameters defined in RFC 3016 if the functionality
   provided by RFC 3016 can meet the requirements of that application.
   On the other hand, if applications wish to use a single RTP payload
   format for transport of all types of MPEG-4 streams, then the RTP
   payload defined in this document provides a suitable solution,
   including for transport of MPEG-4 video part 2 streams.

   Note that since the "out of band" availability of the video decoder
   configuration as a MIME parameter is optional in RFC 3016, it is
   recommended, for interoperability with this specification, that
   this optional feature always be implemented.


3. Payload Format

3.1 RTP Header Fields Usage

   Payload Type (PT): The assignment of an RTP payload type for this
   RTP packet format is outside the scope of this document, and will
   not be specified here. It is expected that the RTP profile for a
   particular class of applications will assign a payload type for
   this encoding, or if that is not done, then a payload type in the
   dynamic range shall be chosen.

   Marker (M) bit: The M bit is set to 1 to indicate that the RTP
   packet payload includes the end of each Access Unit of which data
   is contained in this RTP packet. As the payload either carries one
   or more complete Access Units or a single fragment of an Access
   Unit, the M bit is always set to 1, except when the packet carries
   a fragment of an Access Unit that is not the last fragment of that
   Access Unit.
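
   A receiver can use the M bit, the RTP time stamp and the AU-size
   field together to reassemble fragmented AUs. The sketch below is one
   possible approach, not a normative procedure; it assumes each packet
   carries exactly one AU or one AU fragment, represented by a
   hypothetical (rtp_timestamp, marker, au_size, data) tuple. Fragments
   of one AU share an RTP time stamp, and an AU whose reassembled
   length differs from AU-size (e.g. after a lost middle fragment) is
   dropped.

```python
def reassemble(packets):
    """Rebuild AUs from (rtp_timestamp, marker, au_size, data) tuples."""
    aus, parts, current_ts = [], [], None
    for ts, marker, au_size, data in packets:
        if ts != current_ts:  # a new AU starts; drop any incomplete one
            parts, current_ts = [], ts
        parts.append(data)
        if marker:  # M = 1: this packet ends the AU
            au = b"".join(parts)
            if len(au) == au_size:  # AU-size is the size of the whole AU
                aus.append(au)
            parts, current_ts = [], None
    return aus
```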

   Extension (X) bit: Defined by the RTP profile used.

   Sequence Number: The RTP sequence number SHOULD be generated by
   the sender with a constant random offset.

   Timestamp: Indicates the sampling instant of the first AU
   contained in the RTP payload. This sampling instant is equivalent
   to the CTS in the MPEG-4 time domain. When using SDP the clock rate
   of the RTP time stamp MUST be expressed using the "rtpmap"
   attribute. If an MPEG-4 audio stream is transported, the rate SHOULD
   be set to the same value as the sampling rate of the audio stream.
   If an MPEG-4 video stream is transported, it is RECOMMENDED to set
   the rate to 90 kHz.
   In all cases, the sender SHALL make sure that RTP time stamps
   are identical only if the RTP time stamp refers to fragments of the
   same Access Unit.
   According to RFC 1889 [2] (section 5.1), RTP time stamps are
   recommended to start at a random value for security reasons. This
   is not an issue for synchronization of multiple RTP streams, since
   RTCP sender reports relate the RTP time stamps of each stream to a
   common wallclock. However, in applications where streams from
   multiple sources are to be synchronized (for example one stream
   from local storage, another from an RTP streaming server),
   synchronization may become impossible. To enable synchronization
   in such cases as well, it may be necessary to provide the
   relationship between the time stamps of the streams by out of band
   means. The format of such information as well as methods to convey
   such information are beyond the scope of this specification.
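
   The random starting values recommended by RFC 1889 for both the
   sequence number and the time stamp can be sketched on the sender
   side as follows. The class RtpClock and its methods are hypothetical;
   the sketch only shows the 16-bit and 32-bit wraparound and the
   conversion of a sampling instant in seconds to RTP clock ticks (for
   example at the recommended 90 kHz for video).

```python
import secrets


class RtpClock:
    """Sender-side RTP sequence numbers and time stamps (sketch).

    Both start at random values unless explicit offsets are given, and
    wrap at 16 and 32 bits respectively.
    """

    def __init__(self, clock_rate: int, seq0=None, ts0=None):
        self.clock_rate = clock_rate
        self.seq = secrets.randbits(16) if seq0 is None else seq0
        self.ts0 = secrets.randbits(32) if ts0 is None else ts0

    def next_seq(self) -> int:
        seq = self.seq
        self.seq = (self.seq + 1) & 0xFFFF
        return seq

    def timestamp(self, media_time_s: float) -> int:
        """RTP time stamp for a sampling instant given in seconds."""
        ticks = round(media_time_s * self.clock_rate)
        return (self.ts0 + ticks) & 0xFFFFFFFF


video = RtpClock(90000)  # random offsets for a 90 kHz video stream
print(video.next_seq(), video.timestamp(0.04))
```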

   SSRC: set as described in RFC 1889 [2].

   CC and CSRC fields are used as described in RFC 1889 [2].

   RTCP SHOULD be used as defined in RFC 1889 [2].

3.2 RTP Payload Structure

3.2.1 The AU Header Section

   When present, the AU Header Section consists of the AU-headers-length
   field, followed by a number of AU-headers. See figure 2.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+
   |AU-headers-length|AU-header|AU-header|      |AU-header|padding|
   |                 |   (1)   |   (2)   |      |   (n)   | bits  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+

   Figure 2: The AU Header Section

   The AU-headers are configured using MIME format parameters and MAY
   be empty. If the AU-header is configured empty, the
   AU-headers-length field SHALL NOT be present and consequently the
   AU Header Section is empty. If the AU-header is not configured
   empty, then the AU-headers-length is a two octet field that
   specifies the length in bits of the immediately following
   AU-headers, excluding the padding bits.

   Each AU-header is associated with a single Access Unit (fragment)
   contained in the Access Unit Data Section in the same RTP packet.
   For each contained Access Unit (fragment) there is exactly one
   AU-header. Within the AU Header Section, the AU-headers are
   bit-wise concatenated in the order in which the Access Units are
   contained in the Access Unit Data Section. Hence, the n-th
   AU-header refers to the n-th AU (fragment). If the concatenated
   AU-headers consume a non-integer number of octets, up to 7
   zero-padding bits MUST be inserted at the end in order to achieve
   byte-alignment of the AU Header Section.
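
   The construction of the AU Header Section can be sketched as below.
   This is an illustration under assumptions: each AU-header is passed
   as a hypothetical list of (value, length in bits) pairs already in
   the order of figure 3, with the field lengths taken from the MIME
   configuration; choosing AU-Index versus AU-Index-delta and setting
   the flags is left to the caller.

```python
def build_au_header_section(au_headers) -> bytes:
    """Serialize AU-headers-length, the AU-headers and 0-7 padding bits.

    au_headers: list of AU-headers, each a list of (value, nbits) fields.
    """
    acc, nbits = 0, 0
    for header in au_headers:
        for value, width in header:
            if value < 0 or value >> width:
                raise ValueError("field value does not fit its width")
            acc = (acc << width) | value  # bit-wise concatenation
            nbits += width
    if nbits == 0:
        return b""  # AU-header configured empty: no AU Header Section
    if nbits > 0xFFFF:
        raise ValueError("AU-headers too long for AU-headers-length")
    pad = (-nbits) % 8
    # AU-headers-length counts the AU-headers in bits, padding excluded.
    body = (acc << pad).to_bytes((nbits + pad) // 8, "big")
    return nbits.to_bytes(2, "big") + body


# Two AU-headers, each with a 13-bit AU-size and a 3-bit index field.
print(build_au_header_section([[(200, 13), (0, 3)],
                               [(180, 13), (0, 3)]]).hex())  # 0020064005a0
```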

3.2.1.1 The AU-header

   The AU-header contains the fields given in figure 3. The length in
   bits of these fields, with the exception of the one-bit CTS-flag and
   DTS-flag fields, is defined by MIME format parameters; see section
   4.1. If a MIME format parameter has the default value of zero, then
   the associated field is not present.

   +---------------------------------------+
   |     AU-size                           |
   +---------------------------------------+
   |     AU-Index / AU-Index-delta         |
   +---------------------------------------+
   |     CTS-flag                          |
   +---------------------------------------+
   |     CTS-delta                         |
   +---------------------------------------+
   |     DTS-flag                          |
   +---------------------------------------+
   |     DTS-delta                         |
   +---------------------------------------+

   Figure 3: The fields in the AU-header. If used, the AU-Index field
             only occurs in the first AU-header within an AU Header
             Section; in any other AU-header the AU-Index-delta field
             occurs instead.

   AU-size: indicates the size in octets of the associated Access Unit
         in the Access Unit Data Section in the same RTP packet. When
         the AU-size is associated with an AU fragment, the AU-size
         indicates the size of the entire AU and not the size of the
         fragment. This can be exploited to determine whether a packet
         contains an entire AU or a fragment, which is particularly
         useful after losing a packet carrying the last fragment of an
         AU.

   AU-Index: indicates the serial number of the associated Access Unit
         (fragment). For each (in decoding order) consecutive AU or AU
         fragment, the serial number is incremented by 1. When
         present, the AU-Index field occurs in the first AU-header in
         the AU Header Section, but MUST NOT occur in any subsequent
         (non-first) AU-header in that Section. To encode the serial
         number in any such non-first AU-header, the AU-Index-delta
         field is used. If each AU-Index field is coded with the value
         0, the serial number of the AU (fragment) is not specified,
         and in that case receivers MAY ignore the AU-Index field.

   AU-Index-delta: The AU-Index-delta field is an unsigned integer
         that specifies the serial number of the associated AU as the
         difference with respect to the serial number of the previous
         Access Unit. Hence, for the n-th (n>1) AU the serial number
         is found from:
         AU-Index(n) = AU-Index(n-1) + AU-Index-delta(n) + 1
         If the AU-Index field is present in the first AU-header in
         the AU Header Section, then the AU-Index-delta field MUST be
         present in any subsequent (non-first) AU-header. When the
         AU-Index-delta is coded with the value 0, it indicates that
         the Access Units are consecutive in decoding order. An
         AU-Index-delta value larger than 0 signals that interleaving
         is applied.
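
   The relation AU-Index(n) = AU-Index(n-1) + AU-Index-delta(n) + 1 can
   be sketched in both directions as follows; the function names are
   hypothetical and wraparound of the serial number to the configured
   field length is ignored. For packet 1 of the interleaving example in
   section 2.5, AUs 1, 4 and 7 give AU-Index 1 and AU-Index-delta 2 for
   the two following AU-headers.

```python
def encode_au_indices(serials):
    """Return (AU-Index, [AU-Index-delta, ...]) for the AUs of a packet.

    serials are the AU serial numbers in the order of the AUs in the
    Access Unit Data Section.
    """
    deltas = []
    for prev, cur in zip(serials, serials[1:]):
        delta = cur - prev - 1
        if delta < 0:
            raise ValueError("AU-Index-delta is unsigned")
        deltas.append(delta)
    return serials[0], deltas


def decode_au_indices(first, deltas):
    """Recover the serial numbers from AU-Index and AU-Index-delta."""
    serials = [first]
    for delta in deltas:
        serials.append(serials[-1] + delta + 1)
    return serials


print(encode_au_indices([1, 4, 7]))  # (1, [2, 2])
```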

   CTS-flag: Indicates whether the CTS-delta field is present.
         A value of 1 indicates that the field is present, a value
         of 0 that it is not present.
         The CTS-flag field MUST be present in each AU-header if the
         length of the CTS-delta field is signalled to be larger than
         zero. In that case, the CTS-flag field MUST have the value 0
         in the first AU-header and MAY have the value 1 in all
         non-first AU-headers. The CTS-flag field SHOULD be 0 for
         any non-first fragment of an Access Unit.

   CTS-delta: Encodes the CTS by specifying the value of CTS as a 2's
         complement offset (delta) from the time stamp in the RTP
         header of this RTP packet. The CTS MUST use the same clock
         rate as the time stamp in the RTP header.

   DTS-flag: Indicates whether the DTS-delta field is present. A value
         of 1 indicates that DTS-delta is present, a value of 0 that
         it is not present.
         The DTS-flag field MUST be present in each AU-header if the
         length of the DTS-delta field is signalled to be larger than
         zero. The DTS-flag field SHOULD be 0 for any non-first
         fragment of an Access Unit.

   DTS-delta: specifies the value of the DTS as a 2's complement
         offset (delta) from the CTS. The DTS MUST use the
         same clock rate as the time stamp in the RTP header.
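
   The 2's complement coding of the CTS-delta and DTS-delta fields can
   be sketched as follows. The sketch reads "offset from" as CTS = RTP
   time stamp + CTS-delta and DTS = CTS + DTS-delta, both modulo 2^32;
   the helper names are hypothetical and the field lengths stand in for
   the configured MIME parameters. It applies to an AU-header whose
   CTS-flag and DTS-flag are both 1.

```python
def to_twos_complement(value: int, bits: int) -> int:
    """Encode a signed delta in a bits-wide 2's complement field."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("delta does not fit in the field")
    return value & ((1 << bits) - 1)


def from_twos_complement(field: int, bits: int) -> int:
    """Decode a bits-wide 2's complement field into a signed delta."""
    sign = 1 << (bits - 1)
    return (field & (sign - 1)) - (field & sign)


def au_times(rtp_ts, cts_field, dts_field, cts_bits, dts_bits):
    """CTS and DTS of a non-first AU from its delta fields."""
    cts = (rtp_ts + from_twos_complement(cts_field, cts_bits)) % (1 << 32)
    dts = (cts + from_twos_complement(dts_field, dts_bits)) % (1 << 32)
    return cts, dts


# 90 kHz clock, a picture 3003 ticks after the first AU, decoded earlier.
print(au_times(1000, to_twos_complement(3003, 16),
               to_twos_complement(-3003, 16), 16, 16))  # (4003, 1000)
```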

   If present, the fields MUST occur in the order given in figure 3.
   In the general case a receiver can only discover the size of an
   AU-header by parsing it, since the presence of the CTS-delta and
   DTS-delta fields is signalled by the value of the CTS-flag and
   DTS-flag, respectively.
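
   Such parsing can be sketched as below. This is an illustrative
   receiver, not a reference implementation: the *_len arguments are
   hypothetical stand-ins for the MIME format parameters of section
   4.1 (0 meaning the field is absent), and constraints such as the
   first CTS-flag being 0 are not checked.

```python
class BitReader:
    """Reads unsigned integers MSB-first from a bytes object."""

    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            if self.pos >= 8 * len(self.data):
                raise ValueError("truncated AU Header Section")
            bit = (self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_au_header_section(payload: bytes, size_len=0, index_len=0,
                            index_delta_len=0, cts_delta_len=0,
                            dts_delta_len=0):
    """Return (list of AU-header dicts, octet offset of next section)."""
    if not any((size_len, index_len, index_delta_len,
                cts_delta_len, dts_delta_len)):
        return [], 0  # AU-header configured empty: no AU Header Section
    reader = BitReader(payload)
    total = reader.read(16)  # AU-headers-length: bits, padding excluded
    start, headers = reader.pos, []
    while reader.pos - start < total:
        before, header = reader.pos, {}
        if size_len:
            header["size"] = reader.read(size_len)
        if not headers and index_len:  # AU-Index only in the first one
            header["index"] = reader.read(index_len)
        elif headers and index_delta_len:
            header["index_delta"] = reader.read(index_delta_len)
        if cts_delta_len and reader.read(1):  # CTS-flag
            header["cts_delta"] = reader.read(cts_delta_len)
        if dts_delta_len and reader.read(1):  # DTS-flag
            header["dts_delta"] = reader.read(dts_delta_len)
        if reader.pos == before:
            raise ValueError("configuration yields an empty AU-header")
        headers.append(header)
    if reader.pos - start != total:
        raise ValueError("AU-headers-length does not match the AU-headers")
    return headers, 2 + (total + 7) // 8
```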

3.2.2 The Auxiliary Section

   The Auxiliary Section consists of the auxiliary-data-size field
   followed by the auxiliary-data field. Receivers MAY (but are not
   required to) parse the auxiliary-data field; to facilitate skipping
   of the auxiliary-data field by receivers, the auxiliary-data-size
   field indicates the length in bits of the auxiliary-data. If the
   concatenation of the auxiliary-data-size and the auxiliary-data
   fields consumes a non-integer number of octets, up to 7 zero padding
   bits MUST be inserted immediately after the auxiliary data in order
   to achieve byte-alignment. See figure 4.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+
   | auxiliary-data-size   | auxiliary-data       |padding bits |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+

   Figure 4: The fields in the Auxiliary Section



van der Meer et al.        Expires October 2002               [Page 12]


RFC xxxx        Transport of MPEG-4 Elementary Streams       April 2002

   The length in bits of the auxiliary-data-size field is configurable
   by a MIME format parameter; see section 4.1. The default length of
   zero indicates that the entire Auxiliary Section is absent.

   auxiliary-data-size: Specifies the length in bits of the immediately
         following auxiliary-data field.

   auxiliary-data: The auxiliary-data field contains data of a format
         not defined by this specification.
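   The skipping rule above can be sketched as follows. This is a
   minimal illustration that assumes the Auxiliary Section starts on an
   octet boundary; the helper names read_bits() and
   skip_auxiliary_section() are hypothetical and not defined by this
   specification.

```python
def read_bits(data: bytes, bit_offset: int, nbits: int) -> int:
    """Read nbits starting at bit_offset, most significant bit first."""
    value = 0
    for pos in range(bit_offset, bit_offset + nbits):
        value = (value << 1) | ((data[pos // 8] >> (7 - pos % 8)) & 1)
    return value


def skip_auxiliary_section(data: bytes, bit_offset: int, size_length: int) -> int:
    """Return the bit offset just past the Auxiliary Section.

    size_length is the AuxiliaryDataSizeLength parameter. Assumption: the
    section starts on an octet boundary at bit_offset.
    """
    if size_length == 0:
        return bit_offset  # default length zero: Auxiliary Section absent
    aux_size = read_bits(data, bit_offset, size_length)  # auxiliary-data-size
    end = bit_offset + size_length + aux_size            # end of auxiliary-data
    return (end + 7) // 8 * 8                            # skip 0..7 padding bits
```

   The returned offset divided by 8 gives the first octet of the Access
   Unit Data Section.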

3.2.3 The Access Unit Data Section

   The Access Unit Data Section contains an integer number of complete
   Access Units or a single fragment of one AU. The Access Unit Data
   Section is never empty. If data of more than one Access Unit is
   present, then the AUs are concatenated into a contiguous string
   of octets. See figure 5. The AUs inside the Access Unit Data
   Section MUST be in decoding order.

   The size and number of Access Units SHOULD be adjusted such that
   the resulting RTP packet is not larger than the path MTU. To handle
   larger packets, this payload format relies on lower layers for
   fragmentation, which may not be desirable.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |AU(1)                                                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-                                |
   |                                                                   |
   |     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |AU(2)                                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                                   |
   |                                                                   |
   |                            -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               | AU(n)                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   |-+-+-+-+-+-+-+-+

   Figure 5: Access Unit Data Section; each AU is byte aligned.

   When multiple Access Units are carried, the size of each AU MUST be
   made available to the receiver. If the AU size is variable then the
   size of each AU MUST be indicated in the AU-size field of the
   corresponding AU-header. However, if the AU size is constant for a
   stream, this mechanism SHOULD NOT be used, but instead the fixed
   size SHOULD be signalled by the MIME format parameter
   "ConstantSize", see section 4.1.

   The absence of both AU-size in the AU-header and the ConstantSize
   MIME format parameter indicates carriage of a single AU (fragment),
   i.e. that a single Access Unit (fragment) is transported in each
   RTP packet for that stream.

3.2.3.1 Fragmentation

   A packet SHALL carry either one or more Access Units, or a single
   fragment of an Access Unit.  Fragments of the same Access Unit have
   the same time stamp but different RTP sequence numbers. The marker
   bit in the RTP header is 1 on the last fragment of an Access Unit,
   and 0 on all other fragments.
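   A receiver could reassemble fragmented Access Units from the time
   stamp and marker bit as sketched below. The sketch assumes packets
   are already in RTP sequence-number order and that each carries one
   Access Unit or one fragment; Fragment and reassemble() are
   hypothetical names, and dropping any AU touched by a sequence gap is
   one conservative choice, not a requirement of this specification.

```python
from __future__ import annotations

from typing import Iterable, Iterator, NamedTuple, Optional


class Fragment(NamedTuple):
    seq: int        # RTP sequence number (16 bits, wraps around)
    timestamp: int  # RTP time stamp, identical for all fragments of one AU
    marker: bool    # RTP marker bit: set on the last fragment of an AU
    data: bytes     # Access Unit Data Section carried by this packet


def reassemble(packets: Iterable[Fragment]) -> Iterator[tuple[int, bytes]]:
    """Yield (time stamp, Access Unit) for every AU received completely."""
    parts: list[bytes] = []
    ts: Optional[int] = None        # time stamp of the AU being collected
    expected: Optional[int] = None  # next expected sequence number
    broken = False
    for p in packets:
        gap = expected is not None and p.seq != expected
        if ts is None or p.timestamp != ts:
            # A new AU starts and any unfinished AU is discarded. A gap just
            # before it may have hidden its first fragment, so drop it too.
            parts, ts, broken = [], p.timestamp, gap
        elif gap:
            broken = True  # a fragment in the middle of this AU was lost
        parts.append(p.data)
        expected = (p.seq + 1) & 0xFFFF
        if p.marker:
            if not broken:
                yield ts, b"".join(parts)
            parts, ts, broken = [], None, False
```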

3.2.3.2 Interleaving

   Access Units MAY be interleaved. Senders MAY perform interleaving.
   Receivers MUST support interleaving. When interleaving of Access
   Units is used it SHALL be implemented using the AU-Index and
   AU-Index-delta fields in the AU-header.

   Based on the RTP sequence number, the RTP time stamp, the AU-Index
   and the AU-Index-delta, a receiver can unambiguously reconstruct
   the original order even in case of out-of-order packets, packet
   loss or duplication. Note that for this purpose the AU-Index is
   redundant when the RTP time stamp and the AU-Index-delta values are
   sufficient for placing the AUs correctly in time. In such cases
   receivers MAY ignore the AU-Index value and senders MAY code the
   AU-Index field with the value 0, but only if they code each AU-Index
   field with that value.

   When interleaving is applied, a de-interleave buffer is needed in
   receivers to put the Access Units in their correct logical
   consecutive decoding order. This requires the computation of the
   time stamp for each Access Unit. In case of a fixed time duration
   per Access Unit, the time stamp of the i-th access unit in an RTP
   packet with RTP time stamp T is calculated as follows:

   Timestamp[0] = T
   Timestamp[i, i > 0] = T +(Sum(for k=1 to i of (AU-Index-delta[k]
                         + 1))) * access-unit-duration

   When AU-Index-delta is always 0, this reduces to T + i *
   access-unit-duration. This is the non-interleaved case, where the
   frames are consecutive in decoding order. Note that the AU-Index
   field (present for the first Access Unit) is not needed in this
   calculation. Hence, when the access-unit-duration has a fixed and
   known value, the AU-Index does not need to provide index
   information and can be coded with the value 0. See also the
   semantics of the AU-Index field in 3.2.1.1.
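   The time stamp calculation above can be sketched as follows.
   au_timestamps() is a hypothetical helper; it assumes a fixed
   access-unit-duration in RTP clock ticks and applies the recurrence
   Timestamp[i] = Timestamp[i-1] + (AU-Index-delta[i] + 1) *
   access-unit-duration, which is equivalent to the sum given above.

```python
from __future__ import annotations


def au_timestamps(rtp_ts: int, index_deltas: list[int], au_duration: int) -> list[int]:
    """Time stamps of all AUs in one packet.

    rtp_ts is T from the RTP header (the time stamp of the first AU);
    index_deltas[i-1] is the AU-Index-delta of the AU with index i >= 1.
    """
    stamps = [rtp_ts]
    for delta in index_deltas:
        # Each AU lies (delta + 1) durations after the previous one,
        # kept within the 32-bit RTP time stamp range.
        stamps.append((stamps[-1] + (delta + 1) * au_duration) & 0xFFFFFFFF)
    return stamps
```

   For instance, with AU-Index-delta values of 2 (an illustrative
   interleave in which one packet carries AUs 0, 3 and 6), a packet
   with T = 0 and a duration of 1024 ticks yields time stamps 0, 3072
   and 6144.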

   When an RTP packet arrives (after any reordering has been done),
   receivers may 'flush' all Access Units from the de-interleave buffer
   which have a time stamp strictly less than the time stamp of the
   arriving packet. Similarly, the first Access Unit of every arriving
   packet can always be flushed (as no following packet can provide
   an earlier Access Unit), together with any already-received Access
   Units that are consecutive with it. Access Units should also be
   flushed in time to be played; this can be important if there is
   loss before end-of-stream, before a silence interval, or before a
   large drop-out.
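   The flushing rules above might be implemented along the following
   lines. This is a sketch only: DeinterleaveBuffer is a hypothetical
   class, and it assumes a fixed access-unit duration, ignores 32-bit
   time stamp wrap-around, and does not model play-out deadlines.

```python
from __future__ import annotations


class DeinterleaveBuffer:
    """Sketch of the de-interleave buffer flushing rules.

    Assumes a fixed, known access-unit duration (in RTP ticks); time stamp
    wrap-around and play-out deadlines are not handled.
    """

    def __init__(self, au_duration: int) -> None:
        self.au_duration = au_duration
        self.pending: dict[int, bytes] = {}  # time stamp -> Access Unit

    def on_packet(self, aus: list[tuple[int, bytes]]) -> list[tuple[int, bytes]]:
        """Store one packet's (time stamp, AU) pairs, first AU first, and
        return the AUs that can be flushed, in decoding order."""
        first_ts = aus[0][0]  # equals the RTP time stamp of the packet
        for ts, au in aus:
            self.pending.setdefault(ts, au)  # keep the first copy of duplicates
        # AUs strictly earlier than the arriving packet can be flushed.
        flushed = [(t, self.pending.pop(t))
                   for t in sorted(t for t in self.pending if t < first_ts)]
        # The packet's first AU, plus already-received AUs consecutive with it.
        ts = first_ts
        while ts in self.pending:
            flushed.append((ts, self.pending.pop(ts)))
            ts += self.au_duration
        return flushed
```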

3.2.3.3 Constraints for interleaving

   The packet size should be chosen with regard to both the path MTU
   and the duration and capacity of the receiver's de-interleave
   buffer. The maximum packet size for a session should not exceed the
   path MTU.

   In order to control receiver latency and mitigate the effects of
   loss, there are profile-based limits on the size of the packet.
   This is expressed as a duration: it is calculated from the duration
   of the Access Units contained within a packet. Note that this
   duration is NOT the difference between the time stamps of the first
   and last Access Unit in a packet.

   No matter what interleaving scheme is used, the scheme must be
   analyzed to calculate the minimum number of frames a receiver has
   to buffer in order to de-interleave.

   Three profiles are defined to constrain the latency when
   interleaving. The applied profile is signalled by the MIME format
   parameter "Profile", indicating the decimal number of the profile.
   The maximum de-interleave buffer required at the receiver can be
   determined if the maximum packet duration is known. The maximum
   packet duration for the three profiles shall not exceed:

   Profile 0 --  200 milliseconds
   Profile 1 --  500 milliseconds
   Profile 2 -- 1500 milliseconds

   When interleaving is applied, the applied latency profile MUST be
   signalled by the MIME format parameter "Profile"; see section 4.1.

   Note that for low bit-rate material, this duration limit may make
   packets shorter than the MTU size.
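   Assuming all Access Units have the same duration, and taking the
   packet duration as the sum of the AU durations (not the time stamp
   difference), the profile limit translates into a maximum number of
   Access Units per packet, as sketched below. max_aus_per_packet() is
   a hypothetical helper. For example, assuming AAC frames of 1024
   samples at a 44100 Hz RTP clock, Profile 0 allows at most 8 frames
   per packet.

```python
# Maximum packet duration per latency profile, in milliseconds.
PROFILE_MAX_MS = {0: 200, 1: 500, 2: 1500}


def max_aus_per_packet(profile: int, au_duration: int, clock_rate: int) -> int:
    """Largest AU count whose summed duration fits the profile's limit.

    au_duration is in RTP clock ticks and assumed equal for every AU.
    Uses integer arithmetic for n * au_duration / clock_rate <= limit_ms / 1000.
    """
    return (PROFILE_MAX_MS[profile] * clock_rate) // (au_duration * 1000)
```

   The path MTU bounds the packet size independently of this limit.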

3.3 Usage of this specification

3.3.1 General

   Usage of this specification requires definition of a mode. A mode
   defines how to use this specification, as deemed appropriate.
   Senders MUST signal the applied mode via the MIME format parameter
   "Mode". This specification defines a generic mode that can be used
   for any MPEG-4 stream, as well as specific modes for transport of
   MPEG-4 CELP and MPEG-4 AAC streams.

   In any mode compliant with this specification the same requirements
   apply for the rtpmap attributes. The general form of an rtpmap
   attribute is:
   a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding
             parameters>]
   For audio streams, <encoding parameters> specifies the number of
   audio channels. This parameter may be omitted if the number of
   channels is one, provided no additional parameters are needed.
   In any mode, the following attributes are REQUIRED:
   a) The encoding name MUST be specified.
   b) The RTP clock rate MUST be expressed.
   c) For audio streams with more than one channel, the number of
      audio channels MUST be specified, for example as 2 for stereo
      material (see RFC 2327); for mono material it MAY be specified
      as 1, which is the default.
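   A receiver might extract these values from an rtpmap attribute as
   sketched below; parse_rtpmap() is a hypothetical helper, and it
   treats the optional <encoding parameters> as the audio channel
   count, defaulting to 1 when absent.

```python
from __future__ import annotations

import re

# a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]
RTPMAP = re.compile(r"a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)(?:/(\d+))?")


def parse_rtpmap(line: str) -> tuple[int, str, int, int]:
    """Return (payload type, encoding name, clock rate, audio channels)."""
    m = RTPMAP.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"not an rtpmap attribute: {line!r}")
    pt, name, rate, channels = m.groups()
    n_channels = int(channels) if channels else 1  # one channel by default
    return int(pt), name, int(rate), n_channels
```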

3.3.2 The generic mode

   The generic mode can be used for any MPEG-4 stream. In this mode
   no mode-specific constraints are applied; hence, the generic mode
   exploits the full flexibility of this specification. The generic
   mode is signalled by mode=generic.

   An example is given below for transport of a BIFS stream. In this
   example carriage of multiple BIFS Access Units is allowed in one
   RTP packet. The AU-header section contains the AU-size field, the
   CTS-flag and, if the CTS-flag is set to 1, the CTS-delta field.
   The AU-size and CTS-delta fields are 15 and 16 bits long,
   respectively, which results in an AU-header of two octets (without
   CTS-delta) or four octets (with CTS-delta) per BIFS AU. The RTP
   time stamp uses a 1 kHz clock.
   In detail:

   m=video 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/1000
   a=fmtp:96 streamtype=3; profile-level-id=257; mode=generic;
   ObjectType=2; config=BIFSConfiguration(); SizeLength=15;
   CTSDeltaLength=16
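   For the BIFS example above, a receiver could parse the AU-headers
   as sketched below. The sketch is illustrative only: it hard-codes
   SizeLength=15 and CTSDeltaLength=16, assumes the AU Header Section
   begins with the 16-bit AU-headers-length field giving the length of
   the AU-headers in bits (see 3.2.1 for the normative definition), and
   the function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def read_bits(buf: bytes, pos: int, n: int) -> int:
    """Read n bits starting at bit position pos, most significant bit first."""
    value = 0
    for i in range(pos, pos + n):
        value = (value << 1) | ((buf[i // 8] >> (7 - i % 8)) & 1)
    return value


def twos_complement(value: int, n: int) -> int:
    """Interpret an n-bit field as a 2's complement integer."""
    return value - (1 << n) if value & (1 << (n - 1)) else value


def parse_bifs_au_headers(payload: bytes, rtp_ts: int) -> list[tuple[int, Optional[int]]]:
    """Return (AU-size, CTS) per AU-header for SizeLength=15, CTSDeltaLength=16."""
    end = 16 + read_bits(payload, 0, 16)  # AU-headers-length, assumed in bits
    pos = 16
    headers = []
    while pos < end:
        au_size = read_bits(payload, pos, 15)
        cts_flag = read_bits(payload, pos + 15, 1)
        pos += 16
        # The first AU's CTS equals the RTP time stamp (its CTS-flag is 0).
        cts = rtp_ts if not headers else None
        if cts_flag:
            delta = twos_complement(read_bits(payload, pos, 16), 16)
            cts = (rtp_ts + delta) & 0xFFFFFFFF
            pos += 16
        headers.append((au_size, cts))
    return headers
```

   A non-first AU whose CTS-flag is 0 is returned with CTS None, since
   its CTS is not signalled in the AU-header.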

3.3.3 Constant bit-rate CELP

   This mode is signalled by mode=CELP-cbr. In this mode one or more
   fixed size CELP frames can be transported in one RTP packet; there
   is no support for interleaving. The RTP payload consists of one or
   more concatenated CELP frames, each of the same size. Both the AU
   Header Section and the Auxiliary Section are empty.

   The MIME format parameter ConstantSize MUST be provided to specify
   the length of each CELP frame.




   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=CELP-cbr; config=
   AudioSpecificConfig(); ConstantSize=xxx;

   The AudioSpecificConfig() specifies that the audio stream type is
   CELP.
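   Since both the AU Header Section and the Auxiliary Section are empty
   in this mode, splitting a payload into CELP frames needs only the
   ConstantSize value. A minimal sketch, with a hypothetical helper
   name:

```python
from __future__ import annotations


def split_constant_size(payload: bytes, constant_size: int) -> list[bytes]:
    """Split a CELP-cbr payload into frames of ConstantSize octets each."""
    if constant_size <= 0 or not payload or len(payload) % constant_size:
        raise ValueError("payload is not a non-zero multiple of ConstantSize")
    return [payload[i:i + constant_size]
            for i in range(0, len(payload), constant_size)]
```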

3.3.4 Variable bit-rate CELP

   This mode is signalled by mode=CELP-vbr. With this mode one or
   more variable size CELP frames can be transported in one RTP packet
   with optional interleaving. As the largest frame size that can be
   signalled in this mode (63 octets, given the 6-bit AU-size field)
   is greater than the maximum CELP frame size, there is no support
   for fragmentation of CELP frames.

   In this mode the RTP payload consists of the AU Header Section,
   followed by one or more concatenated CELP frames. The Auxiliary
   Section is empty. For each CELP frame contained in the payload
   there is a one octet AU-header in the AU Header Section to
   provide:
   (a) the size of each CELP frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each CELP frame.
   Transport of CELP frames requires that the AU-size field is coded
   with 6 bits. In this mode therefore 6 bits are allocated to the
   AU-size field, and 2 bits to the AU-Index(-delta) field. Each
   AU-Index field MUST be coded with the value 0. In the AU Header
   Section, the concatenated AU-headers are preceded by the 16-bit
   AU-headers-length field, as specified in 3.2.1.
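   The payload layout described above can be parsed as sketched below.
   parse_vbr_payload() is a hypothetical helper; it assumes that
   AU-headers-length gives the length of the AU-headers in bits (see
   3.2.1 for the normative definition) and does not perform the time
   stamp computation of 3.2.3.2.

```python
from __future__ import annotations


def parse_vbr_payload(payload: bytes) -> list[tuple[int, bytes]]:
    """Split a CELP-vbr payload into (AU-Index or AU-Index-delta, frame) pairs."""
    headers_bits = int.from_bytes(payload[:2], "big")  # AU-headers-length
    if headers_bits == 0 or headers_bits % 8:
        raise ValueError("AU-headers-length must be a non-zero multiple of 8")
    count = headers_bits // 8  # one-octet AU-headers
    offset = 2 + count         # first octet of the Access Unit Data Section
    frames = []
    for header in payload[2:2 + count]:
        size, index = header >> 2, header & 0x03  # 6-bit AU-size, 2-bit index
        frames.append((index, payload[offset:offset + size]))
        offset += size
    if len(frames) != count or offset != len(payload):
        raise ValueError("AU-headers do not match the payload length")
    return frames
```

   The first pair carries the AU-Index, which this mode requires to be
   0; each later pair carries an AU-Index-delta.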

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=CELP-vbr; config=
   AudioSpecificConfig(); SizeLength=6; IndexLength=2;
   IndexDeltaLength=2; Profile=1

   The AudioSpecificConfig() specifies that the audio stream type is
   CELP.




3.3.5 Low bit-rate AAC

   This mode is signalled by mode=AAC-lbr. This mode supports transport
   of one or more variable size AAC frames with optional support for
   interleaving and fragmenting. The maximum size of an AAC frame
   (fragment) in this mode is 63 octets.

   The payload configuration in this mode is the same as in the
   variable bit-rate CELP mode as defined in 3.3.4. The RTP payload
   consists of the AU Header Section, followed by concatenated AAC
   frames. The Auxiliary Section is empty. For each AAC frame contained
   in the payload a one-octet AU-header provides:
   (a) the size of each AAC frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each AAC frame.
   In the AU-header, the AU-size is coded with 6 bits and the
   AU-Index(-delta) with 2 bits; each AU-Index field MUST be coded
   with the value 0.
   In the AU Header Section, the concatenated AU-headers are preceded
   by the 16-bit AU-headers-length field, as specified in 3.2.1.
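   A sender could assemble an AAC-lbr payload as sketched below.
   build_lbr_payload() is a hypothetical helper; it assumes that
   AU-headers-length counts bits, and it rejects frames larger than the
   63 octets that a 6-bit AU-size can express.

```python
from __future__ import annotations

from typing import Optional


def build_lbr_payload(frames: list[bytes],
                      index_deltas: Optional[list[int]] = None) -> bytes:
    """Build an AAC-lbr payload: AU-headers-length, AU-headers, AAC frames.

    The first AU-header carries AU-Index = 0, as this mode requires;
    index_deltas gives AU-Index-delta for each later AU (all 0 if None).
    """
    if index_deltas is None:
        index_deltas = [0] * (len(frames) - 1)
    if not frames or len(index_deltas) != len(frames) - 1:
        raise ValueError("need one AU-Index-delta per non-first frame")
    headers = bytearray()
    for frame, index in zip(frames, [0] + index_deltas):
        if not 0 < len(frame) <= 63 or not 0 <= index <= 3:
            raise ValueError("frame must be 1..63 octets, index 0..3")
        headers.append((len(frame) << 2) | index)  # 6-bit size, 2-bit index
    length_bits = 8 * len(headers)  # AU-headers-length, in bits
    return length_bits.to_bytes(2, "big") + bytes(headers) + b"".join(frames)
```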

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=AAC-lbr; config=
   AudioSpecificConfig(); SizeLength=6; IndexLength=2;
   IndexDeltaLength=2; Profile=1

   The AudioSpecificConfig() specifies that the audio stream type is
   AAC.

3.3.6 High bit-rate AAC

   This mode is signalled by mode=AAC-hbr. This mode supports transport
   of one or more large variable size AAC frames in one RTP packet with
   optional support for interleaving and fragmenting. The maximum size
   of an AAC frame (fragment) in this mode is 8191 octets.

   In this mode the RTP payload consists of the AU Header Section,
   followed by one or more concatenated AAC frames. The Auxiliary
   Section is empty. For each AAC frame contained in the payload there
   is an AU-header in the AU Header Section to provide:
   (a) the size of each AAC frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each AAC frame.

   Coding the maximum size of an AAC frame (8191 octets) requires 13
   bits. Therefore, in this mode 13 bits are allocated to the AU-size
   field and 3 bits to the AU-Index(-delta) field, so each AU-header
   has a size of 2 octets. Each AU-Index field MUST be coded with the
   value 0. In the AU Header Section, the concatenated AU-headers are
   preceded by the 16-bit AU-headers-length field, as specified in
   3.2.1.
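   The 2-octet AU-header of this mode can be packed and unpacked with
   Python's struct module, as sketched below; the function names are
   hypothetical. With n AU-headers, the preceding AU-headers-length
   field would then hold 16 * n, assuming that field counts bits.

```python
from __future__ import annotations

import struct


def pack_hbr_header(au_size: int, index: int) -> bytes:
    """Pack a 2-octet AAC-hbr AU-header: 13-bit AU-size, 3-bit AU-Index(-delta)."""
    if not 0 <= au_size <= 8191 or not 0 <= index <= 7:
        raise ValueError("value does not fit the 13 + 3 bit AU-header")
    return struct.pack(">H", (au_size << 3) | index)  # network byte order


def unpack_hbr_header(header: bytes) -> tuple[int, int]:
    """Inverse of pack_hbr_header: return (AU-size, AU-Index(-delta))."""
    (word,) = struct.unpack(">H", header)
    return word >> 3, word & 0x07
```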

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=AAC-hbr;
   config=AudioSpecificConfig(); SizeLength=13; IndexLength=3;
   IndexDeltaLength=3; Profile=1

   The AudioSpecificConfig() specifies that the audio stream type is
   AAC.

3.3.7 Additional modes

   This specification only defines the modes specified in sections
   3.3.2 through 3.3.6. Additional modes are expected to be defined in
   future RFCs. Each additional mode MUST be in full compliance with
   this specification.

   When defining a new mode, care MUST be taken that an implementation
   of all features of this specification can decode the payload format
   corresponding to this new mode. For this reason a mode MUST NOT
   specify new default values for MIME parameters. In particular, MIME
   parameters that configure the RTP payload MUST be present (unless
   they have the default value), even if their presence is redundant
   when the mode assigns a fixed value to a parameter. A mode may
   additionally define that some MIME parameters are required instead
   of optional, that some MIME parameters have fixed values (or
   ranges), and that there are rules restricting their usage.


4. IANA considerations

   This section describes the MIME types and names associated with
   this payload format. Section 4.1 registers the MIME types, as per
   RFC 2048.

   This format may require additional information about the mapping to
   be made available to the receiver. This is done using parameters
   also described in the next section.

4.1 MIME type registration

   MIME media type name: "video" or "audio" or "application"

   "video" MUST be used for MPEG-4 Visual streams (ISO/IEC 14496-2)
   or MPEG-4 Systems streams (ISO/IEC 14496-1) that convey information
   needed for an audio/visual presentation.

   "audio" MUST be used for MPEG-4 Audio streams (ISO/IEC 14496-3)
   or MPEG-4 Systems streams that convey information needed for an
   audio only presentation.

   "application" MUST be used for MPEG-4 Systems streams (ISO/IEC
   14496-1) that serve purposes other than audio/visual presentation,
   e.g. in some cases when MPEG-J streams are transmitted.

   Depending on the required payload configuration, MIME format
   parameters need to be available to the receiver. This is done using
   the parameters described in the next section. There are required
   and optional parameters.

   Optional parameters are of two types: general parameters and
   configuration parameters. The configuration parameters are used to
   configure the fields in the AU Header section and in the auxiliary
   section. The absence of a configuration parameter is equivalent to
   that parameter having its default value, which is always zero.
   The absence of all configuration parameters resolves into a default
   "basic" configuration with an empty AU-header section and an empty
   auxiliary section in each RTP packet.

   MIME subtype name: mpeg4-generic

   Required parameters:

   MIME format parameters are not case dependent; however for clarity
   both upper and lower case are used in the names of the parameters
   described in this specification.

      StreamType:
      The integer value that indicates the type of MPEG-4 stream that
      is carried; its coding corresponds to the values of the
      streamType as defined for the DecoderConfigDescriptor in
      ISO/IEC 14496-1. The value 6, indicating an MPEG-7 stream, MUST
      NOT be used, as this payload format is not intended for transport
      of MPEG-7 streams.

      Profile-level-id:
      A decimal representation of the MPEG-4 Profile Level indication.
      This parameter MUST be used in the capability exchange or
      session set-up procedure to indicate the MPEG-4 Profile and Level
      combination of which the relevant MPEG-4 media codec is capable.

      For MPEG-4 Audio streams, this parameter is the decimal value
         from Table 5 (audioProfileLevelIndication Values) in ISO/IEC
         14496-1, indicating which MPEG-4 Audio tool subsets are
         required to decode the audio stream.
      For MPEG-4 Visual streams, this parameter is the decimal value
         from Table G-1 (FLC table for profile and level indication of
         ISO/IEC 14496-2), indicating which MPEG-4 Visual tool subsets
         are required to decode the visual stream.
      For BIFS streams, this parameter is the decimal value that is
         obtained from (SPLI + 256*GPLI), where:
         SPLI is the decimal value from Table 4 in ISO/IEC 14496-1 with
            the applied sceneProfileLevelIndication;
         GPLI is the decimal value from Table 7 in ISO/IEC 14496-1 with
            the applied graphicsProfileLevelIndication.
      For MPEG-J streams, this parameter is the decimal value from
         Table 13 (MPEGJProfileLevelIndication) in ISO/IEC 14496-1,
         indicating the profile and level of the MPEG-J stream.
      For OD streams, this parameter is the decimal value from Table 3
         (ODProfileLevelIndication) in ISO/IEC 14496-1, indicating the
         profile and level of the OD stream.
      For IPMP streams, this parameter has either the decimal value 0,
         indicating an unspecified profile and level, or a value larger
         than zero, indicating an MPEG-4 IPMP profile and level as
         defined in a future MPEG-4 specification.
      For Clock Reference streams and Object Content Info streams, this
         parameter has the decimal value zero, indicating that profile
         and level information is conveyed through the OD framework.
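   The BIFS encoding (SPLI + 256*GPLI) can be computed and reversed as
   sketched below, assuming both indications are 8-bit values (0 to
   255); the helper names are hypothetical. For instance, the value 257
   used in the generic-mode example of 3.3.2 decomposes into SPLI = 1
   and GPLI = 1.

```python
from __future__ import annotations


def bifs_profile_level_id(spli: int, gpli: int) -> int:
    """profile-level-id for BIFS streams: SPLI + 256 * GPLI."""
    if not (0 <= spli <= 255 and 0 <= gpli <= 255):
        raise ValueError("indications are assumed to be 8-bit values")
    return spli + 256 * gpli


def split_bifs_profile_level_id(value: int) -> tuple[int, int]:
    """Recover (SPLI, GPLI) from a BIFS profile-level-id."""
    return value % 256, value // 256
```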

      Config:
      A hexadecimal representation of an octet string that expresses
      the media payload configuration. Configuration data is mapped
      onto the octet string on an MSB-first basis. The first bit of
      the configuration data SHALL be located at the MSB of the first
      octet. In the last octet, if necessary to achieve byte alignment,
      up to 7 zero-valued padding bits shall follow the configuration
      data.
      For MPEG-4 Audio streams, config is the audio object type
         specific decoder configuration data AudioSpecificConfig() as
         defined in ISO/IEC 14496-3.
      For MPEG-4 Visual streams, config is the MPEG-4 Visual
         configuration information as defined in subclause 6.2.1 Start
         codes of ISO/IEC 14496-2. The configuration information
         indicated by this parameter SHALL be the same as the
         configuration information in the corresponding MPEG-4 Visual
         stream, except for first-half-vbv-occupancy and
         latter-half-vbv-occupancy, if present, which may vary in
         the repeated configuration information inside an MPEG-4
         Visual stream (See 6.2.1 Start codes of ISO/IEC 14496-2).
      For BIFS streams, this is the BIFSConfig() information as defined
         in ISO/IEC 14496-1. For version 1, BIFSConfig is defined in
         section 9.3.2.4, and for version 2 in section 9.3.5. The MIME
         format parameter ObjectType signals the version of BIFSConfig.

      For IPMP streams, this is either the decimal value 0, indicating
         the absence of any decoder configuration information, or the
         decimal value 1, followed by IPMPConfiguration() as defined
         in a future MPEG-4 IPMP specification.
      For Object Content Info (OCI) streams, this is the
         OCIDecoderConfiguration() information of the OCI stream, as
         defined in section 8.4.2.4 in ISO/IEC 14496-1.
      For OD streams, Clock Reference streams and MPEG-J streams, this
         is the decimal value 0, indicating that no information on the
         decoder configuration is required.
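   The MSB-first mapping and zero padding of the Config parameter can
   be sketched as below. config_to_hex() is a hypothetical helper that
   takes the configuration data as a string of '0'/'1' characters (an
   illustrative input format) and emits lower-case hexadecimal digits.

```python
from __future__ import annotations


def config_to_hex(bits: str) -> str:
    """Map configuration bits MSB-first onto octets and return them as hex.

    Up to 7 zero-valued padding bits complete the last octet.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of '0'/'1' characters")
    padded = bits + "0" * (-len(bits) % 8)  # pad to a whole number of octets
    octets = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return octets.hex()
```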

      Mode:
      The mode in which this specification is used. The following modes
      can be signalled:
      mode=generic,
      mode=CELP-cbr,
      mode=CELP-vbr,
      mode=AAC-lbr and
      mode=AAC-hbr.
      Other modes are expected to be defined in future RFCs. See also
      section 3.3.7.


   Optional general parameters:

      ObjectType:
      The decimal value from Table 8 in ISO/IEC 14496-1, indicating
      the value of the objectTypeIndication of the transported stream.
      For BIFS streams this parameter MUST be present to signal the
      type of BIFSConfiguration(). The ObjectType SHALL NOT signal a
      non-MPEG-4 stream.

      ConstantSize:
      The constant size in octets of each Access Unit for this stream.
      Simultaneous presence of ConstantSize and the SizeLength
      parameters is not permitted.

      Profile:
      The decimal representation of the applied profile to constrain
      the latency when interleaving; see section 3.2.3.3. Absence of
      this parameter signals that the profile is not specified.

   Optional configuration parameters:

      SizeLength:
      The number of bits on which the AU-size field is encoded in the
      AU-header. Simultaneous presence of SizeLength and the
      ConstantSize parameter is not permitted.





      IndexLength:
      The number of bits on which the AU-Index is encoded in the first
      AU-header. The default value of zero indicates the absence of
      the AU-Index and AU-Index-delta fields in each AU-header.

      IndexDeltaLength:
      The number of bits on which the AU-Index-delta field is encoded
      in any non-first AU-header.

      CTSDeltaLength:
      The number of bits on which the CTS-delta field is encoded in
      the AU-header.

      DTSDeltaLength:
      The number of bits on which the DTS-delta field is encoded in
      the AU-header.

      AuxiliaryDataSizeLength:
      The number of bits that are used to encode the auxiliary-data-size
      field.

   Applications MAY use more parameters, in addition to those defined
   above. Receivers MUST tolerate the presence of such additional
   parameters, but these parameters SHALL NOT affect the decoding by
   receivers that comply with this specification.

   Encoding considerations:
   Systems bitstreams MUST be generated according to MPEG-4 Systems
   specifications (ISO/IEC 14496-1). Video bitstreams MUST be generated
   according to MPEG-4 Visual specifications (ISO/IEC 14496-2). Audio
   bitstreams MUST be generated according to MPEG-4 Audio
   specifications (ISO/IEC 14496-3). The RTP packets MUST be packetized
   according to the RTP payload format defined in RFC xxxx.

   Security considerations:
   As defined in section 5 of RFC xxxx.

   Interoperability considerations:
   MPEG-4 provides a large and rich set of tools for the coding of
   audio-visual objects. For effective implementation of the standard,
   subsets of the MPEG-4 tool sets have been provided for use in
   specific applications. These subsets, called 'Profiles', limit the
   size of the tool set a decoder is required to implement. In order to
   restrict computational complexity, one or more 'Levels' are set for
   each Profile. A Profile@Level combination allows:
   . a codec builder to implement only the subset of the standard that
     is needed, while maintaining interworking with other MPEG-4
     devices that implement the same combination, and
   . checking whether MPEG-4 devices comply with the standard
     ('conformance testing').

   A stream SHALL be compliant with the MPEG-4 Profile@Level specified
   by the parameter "profile-level-id". Interoperability between a
   sender and a receiver is achieved by specifying the parameter
   "profile-level-id" in MIME content. In the capability exchange or
   announcement procedure, sender and receiver may agree on a common
   value for this parameter.

   Published specification:
   The specifications for MPEG-4 streams are presented in ISO/IEC
   14496-1, 14496-2, and 14496-3. The RTP payload format is described
   in RFC xxxx.

   Applications which use this media type:
   Multimedia streaming and conferencing tools, Internet messaging and
   Email applications.

   Additional information: none

   Magic number(s): none

   File extension(s):
   None. A file format with the extension .mp4 has been defined for
   MPEG-4 content but is not directly correlated with this MIME type
   for which the sole purpose is RTP transport.

   Macintosh File Type Code(s): none

   Person & email address to contact for further information:
   Authors of RFC xxxx, IETF Audio/Video Transport working group.

   Intended usage: COMMON

   Author/Change controller:
   Authors of RFC xxxx, IETF Audio/Video Transport working group.

4.2 Concatenation of parameters

   Multiple parameters SHOULD be expressed as a MIME media type string,
   in the form of a semicolon-separated list of parameter=value pairs
   (for parameter usage examples see sections 3.3.2 through 3.3.6).

4.3 Usage of SDP

4.3.1 The a=fmtp keyword

   It is assumed that one typical way to transport the above-described
   parameters associated with this payload format is via an SDP message
   [6], for example transported to the client in reply to an RTSP
   DESCRIBE or via SAP. In that case the (a=fmtp) keyword MUST be used
   as described in RFC 2327 [7], section 6, with the following syntax:

   a=fmtp:<format> <parameter name>=<value>[; <parameter name>=<value>]
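   A receiver could split such an a=fmtp line into parameters as
   sketched below. parse_fmtp() is a hypothetical helper: it lower-cases
   parameter names because MIME format parameters are not case
   dependent, tolerates a trailing semicolon, and checks only the
   required parameters of 4.1 and the rule that ConstantSize and
   SizeLength must not both be present. Mode-specific checks are left
   out.

```python
from __future__ import annotations

REQUIRED = {"streamtype", "profile-level-id", "config", "mode"}  # section 4.1


def parse_fmtp(line: str) -> tuple[int, dict[str, str]]:
    """Parse 'a=fmtp:<format> name=value[; name=value]' into (format, params)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError(f"not an fmtp attribute: {line!r}")
    fmt, _, rest = line[len(prefix):].partition(" ")
    params: dict[str, str] = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item!r}")
        params[name.strip().lower()] = value.strip()  # names are case-insensitive
    missing = REQUIRED - set(params)
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    if "constantsize" in params and "sizelength" in params:
        raise ValueError("ConstantSize and SizeLength must not both be present")
    return int(fmt), params
```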

5. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [5]. This implies that confidentiality of the media
   streams is achieved by encryption. Because the data compression used
   with this payload format is applied end-to-end, encryption may be
   performed on the compressed data so there is no conflict between the
   two operations. The packet processing complexity of this payload
   type (i.e. excluding media data processing) does not exhibit any
   significant non-uniformity on the receiver side that could cause a
   denial-of-service threat.

   However, it is possible to inject non-compliant MPEG streams (Audio,
   Video, and Systems) to overload the receiver/decoder's buffers,
   which might compromise the functionality of the receiver or even
   crash it. This is especially true for end-to-end systems like MPEG
   where the buffer models are precisely defined.

   MPEG-4 Systems supports stream types including commands that are
   executed on the terminal like OD commands, BIFS commands, etc. and
   programmatic content like MPEG-J (Java(TM) Byte Code) and
   ECMAScript. It is possible to use one or more of the above in a
   manner non-compliant to MPEG to crash or temporarily make the
   receiver unavailable.

   Senders SHOULD ensure that packet loss does not cause severe
   problems in application execution when the packet carries OD
   commands, BIFS commands, or programmatic content such as MPEG-J and
   ECMAScript. When such measures cannot be taken, applications SHOULD
   use more reliable means than this payload format to transport the
   information.

   Authentication mechanisms can be used to validate the sender and
   the data, preventing security problems caused by malicious,
   non-compliant MPEG-4 streams.

   ISO/IEC 14496-1 defines a security model for MPEG-4 Systems
   streams carrying MPEG-J access units, which comprise Java(TM) classes
   and objects. MPEG-J defines a set of Java APIs and a secure
   execution model. MPEG-J content can call this set of APIs and
   Java(TM) methods from a set of Java packages supported in the
   receiver within the defined security model. According to this
   security model, downloaded byte code is forbidden to load libraries,
   define native methods, start programs, read or write files, or read
   system properties.

   Receivers can implement intelligent filters to validate the buffer
   requirements or parametric (OD, BIFS, etc.) or programmatic (MPEG-J,
   ECMAScript) commands in the streams. However, this can increase the
   complexity significantly.
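
   One inexpensive form of such a filter is sketched below in Python.
   It assumes a simplified buffer model in which each access unit
   enters the buffer whole at its arrival time and leaves it whole at
   its decoding time, and it checks that occupancy never exceeds a
   declared capacity. The AccessUnit type and buffer_ok function are
   hypothetical; a real receiver would take the capacity and timing
   from the stream's own configuration and time stamps, and the MPEG-4
   buffer model is more detailed than this.

```python
from dataclasses import dataclass

# Hypothetical, simplified buffer check; not the normative MPEG-4 model.


@dataclass
class AccessUnit:
    size: int        # bytes
    arrival: float   # time at which the AU is completely received
    decode: float    # time at which the decoder removes the AU


def buffer_ok(aus, capacity):
    """Return True if the simplified buffer never holds more than `capacity` bytes."""
    events = []
    for au in aus:
        if au.size > capacity or au.decode < au.arrival:
            return False  # AU can never fit, or is needed before it arrives
        # At equal times, removals (kind 0) sort before arrivals (kind 1).
        events.append((au.arrival, 1, au.size))
        events.append((au.decode, 0, -au.size))
    occupancy = 0
    for _, _, delta in sorted(events):
        occupancy += delta
        if occupancy > capacity:
            return False
    return True
```

   Three 100-byte units arriving at t=0, 1, 2 and decoded at t=1.5,
   2.5, 3.5 peak at 200 bytes, so they pass with a capacity of 250
   bytes and are rejected with a capacity of 150 bytes.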


6. Acknowledgements

   This document evolved through several revisions thanks to
   contributions by people from the ISMA forum, from the IETF AVT
   Working Group and from the 4-on-IP ad-hoc group within MPEG. The
   authors wish to thank all involved people, and in particular Colin
   Perkins, Stephan Wenger and Dorairaj V for their valuable comments
   and support.


7. References

   [1] ISO/IEC International Standard 14496 (MPEG-4), "Information
   technology - Coding of audio-visual objects", January 2000.

   [2] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A
   Transport Protocol for Real-Time Applications", RFC 1889, January
   1996.

   [3] S. Bradner, "Key words for use in RFCs to Indicate Requirement
   Levels", RFC 2119, March 1997.

   [4] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar, "RTP Payload
   Format for MPEG1/MPEG2 Video", RFC 2250, January 1998.

   [5] Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, H. Kimata, "RTP
   Payload Format for MPEG-4 Audio/Visual Streams", RFC 3016, November
   2000.

   [6] M. Handley, V. Jacobson, "SDP: Session Description Protocol",
   RFC 2327, April 1998.


8. Authors' Addresses

   Jan van der Meer
   Philips Digital Networks
   Cederlaan 4
   5600 JB Eindhoven
   Netherlands
   Email: jan.vandermeer@philips.com

   David Mackie
   Cisco Systems Inc.
   170 West Tasman Dr.
   San Jose, CA 95034
   Email: dmackie@cisco.com

   Viswanathan Swaminathan
   Sun Microsystems Inc.
   901 San Antonio Road, M/S UMPK15-214
   Palo Alto, CA 94303
   Email: viswanathan.swaminathan@sun.com

   David Singer
   Apple Computer, Inc.
   One Infinite Loop, MS:302-3MT
   Cupertino, CA 95014
   Email: singer@apple.com

   Philippe Gentric
   Philips Digital Networks, MP4Net
   51 rue Carnot
   92156 Suresnes
   France
   Email: philippe.gentric@philips.com


   Full Copyright Statement

   Copyright (C) The Internet Society (date). All Rights Reserved.
   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   MUST be followed, or as required to translate it into languages
   other than English.




APPENDIX: Usage of this payload format

Appendix A. Examples

A.1 Examples of delay analysis with interleave

A.1.1 Group interleave

   An example of regular interleave is when packets are formed into
   groups. If the number of packets in a group is N, packet 0 contains
   frame 0, frame N, frame 2N, and so on; packet 1 contains frame 1,
   frame 1+N, frame 1+2N, and so on. The position of a packet within
   the group (which is the position of its first frame) is conveyed by
   its RTP time stamp; as required for the modes defined in this
   document, the AU-Index field is coded as 0, and all the
   AU-Index-delta fields contain N-1.

   Receivers can tell when a new interleave group is starting by
   noting that the computed time stamp of the first frame in a packet
   is later than any previously computed time stamp. This works because
   no following packet can contain an earlier RTP time stamp (RTP
   rules), and the second and subsequent frames in a packet have larger
   time stamps (the frames in a packet are also in time order).
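
   The rule above can be sketched in Python as follows. For
   readability the time stamps are given in frame-duration units, so
   they equal the frame numbers used in the example below; in practice
   they would be the time stamps computed from the RTP time stamp and
   the AU-Index-delta fields. The function name is hypothetical.

```python
# Hypothetical helper; a sketch of the group-start rule, not normative.

def new_group_flags(packets):
    """For each packet (a list of frame time stamps in increasing order),
    report whether it starts a new interleave group."""
    flags = []
    latest = None  # latest time stamp computed so far, over all frames
    for times in packets:
        flags.append(latest is None or times[0] > latest)
        # times[-1] is the packet's largest stamp because frames are in time order.
        latest = times[-1] if latest is None else max(latest, times[-1])
    return flags
```

   Applied to the four packets of the group-size-3 example, this gives
   [True, False, False, True]: packet 3 starts a new group because time
   stamp 9 is later than the latest stamp seen so far, 8.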

   If the group size is 3, then packets are formed as follows:

   Packet   Time stamp   Frame Numbers       AU-Index, AU-Index-delta
   0        T[0]         0, 3, 6             0, 2, 2
   1        T[1]         1, 4, 7             0, 2, 2
   2        T[2]         2, 5, 8             0, 2, 2
   3        T[9]         9, 12, 15           0, 2, 2
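
   A sketch that generates this packetization for any group size is
   given below in Python. For each packet it lists the frames carried
   and the AU-Index-delta values, which, consistent with the table,
   code the distance between consecutive frame indices minus one; the
   AU-Index is always 0 here and is not returned. The function name and
   its frames_per_packet parameter are hypothetical.

```python
# Hypothetical generator for the group interleave pattern; a sketch only.

def group_interleave(num_frames, group_size, frames_per_packet):
    """Return [(frames, au_index_deltas), ...] in transmission order."""
    packets = []
    span = group_size * frames_per_packet  # frames covered by one group
    for start in range(0, num_frames, span):
        for i in range(group_size):
            # Packet i of the group carries frames start+i, start+i+N, ...
            frames = [start + i + j * group_size
                      for j in range(frames_per_packet)
                      if start + i + j * group_size < num_frames]
            if not frames:
                continue  # short final group
            deltas = [b - a - 1 for a, b in zip(frames, frames[1:])]
            packets.append((frames, deltas))
    return packets
```

   With 18 frames, a group size of 3 and 3 frames per packet, the first
   four packets carry frames 0, 3, 6 / 1, 4, 7 / 2, 5, 8 / 9, 12, 15,
   each with AU-Index-delta values 2, 2, matching the table.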


   In this case, the receiver has to buffer at least 4 frames from
   packets 0 and 1, and can flush all frames when packet 2 arrives.
   (Frame 0 can be flushed as soon as packet 0 arrives, since it is the
   earliest frame we hold and no later packet can carry an earlier time
   stamp; likewise frame 1 from packet 1. We are therefore holding
   frames 3, 4, 6 and 7 until packet 2 arrives.)

   If there is loss, then the receiver may wait longer than is strictly
   necessary before it emits frames.  For example, say packet 1 is lost
   from the above example.  Packet 0 allows frame 0 to be emitted, and
   then packet 2 arrives, allowing us to notice the loss of frame 1,
   and emit frames 2 and 3. Then it is not until the arrival of packet 3
   (which has a time-stamp beyond the times of all the frames seen so
   far), that we can finish dealing with the loss, even though the
   first group has, in fact, ended. (This is in contrast to schemes
   which signal the group size explicitly;  if the receiver knows that
   this is packet 3 of 3, then even if 2 of 3 is missing, it can
   de-interleave this group without waiting for the next one to start).

   In the above example the AU-Index is coded with the value 0, as
   required for the modes defined in this document. To reconstruct the
   original order, the RTP time stamp and the AU-Index-delta are used.
   See also 3.2.3.2.
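
   The reconstruction can be sketched as follows: the first frame of a
   packet takes the RTP time stamp, and each following frame lies
   (AU-Index-delta + 1) frame durations after its predecessor. This
   Python sketch assumes a constant frame duration, known to the
   receiver and expressed in RTP clock ticks; the function name and the
   1024-tick duration in the usage note are illustrative assumptions.
   RTP time stamps are 32-bit values, so results are taken modulo 2^32.

```python
# Hypothetical helper; assumes a constant frame duration in RTP ticks.

def frame_timestamps(rtp_timestamp, au_index_deltas, frame_duration):
    """Time stamps of all frames in one packet, in RTP clock ticks."""
    stamps = [rtp_timestamp % 2**32]
    for delta in au_index_deltas:
        # The index distance to the previous frame is delta + 1.
        stamps.append((stamps[-1] + (delta + 1) * frame_duration) % 2**32)
    return stamps
```

   For packet 1 of the group example with a frame duration of 1024
   ticks, T[1] = 1024 and the deltas are 2, 2, so the frames are stamped
   1024, 4096 and 7168, i.e. frames 1, 4 and 7.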


A.1.2 Continuous interleave

   In continuous interleave, once the scheme is 'primed', the number of
   frames in a packet exceeds the 'stride' (the distance between them).
   This shortens the buffering needed, smooths the data-flow, and gives
   slightly larger packets -- and thus lower overhead -- for the same
   interleave. For example, here is a continuous interleave, again
   with a stride of 3 frames but with 4 frames per packet, for a run of
   21 frames (frames 0 to 20). This shows both how the scheme 'starts
   up' and how it finishes.

   Packet   Time stamp   Frame Numbers       AU-Index, AU-Index-delta
   0        T[0]                     0       0
   1        T[1]                 1   4       0  2
   2        T[2]             2   5   8       0  2  2
   3        T[3]          3   6   9  12      0  2  2  2
   4        T[7]          7  10  13  16      0  2  2  2
   5        T[11]        11  14  17  20      0  2  2  2
   6        T[15]        15  18              0  2
   7        T[19]        19                  0
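
   The table can be reproduced with the Python sketch below. It is a
   reconstruction from the table rather than a normative definition:
   packet p carries those frames k*p - s*m (m = k-1, ..., 0) that lie
   within the run, where s is the stride and k the number of frames per
   packet. Every frame then lands in exactly one packet provided s and
   k have no common factor, which the sketch checks. The function name
   is hypothetical.

```python
import math

# Hypothetical generator reconstructed from the example table; a sketch only.


def continuous_interleave(num_frames, stride, frames_per_packet):
    """Return [(frames, au_index_deltas), ...] in transmission order."""
    s, k = stride, frames_per_packet
    if math.gcd(s, k) != 1:
        raise ValueError("sketch assumes gcd(stride, frames_per_packet) == 1")
    # The last packet is the one whose earliest candidate frame is still in range.
    last_packet = (num_frames - 1 + s * (k - 1)) // k
    packets = []
    for p in range(last_packet + 1):
        # Candidates in increasing order; dropping those outside the run
        # produces the 'start-up' and 'finish' packets.
        frames = [k * p - s * m for m in range(k - 1, -1, -1)]
        frames = [f for f in frames if 0 <= f < num_frames]
        deltas = [b - a - 1 for a, b in zip(frames, frames[1:])]
        packets.append((frames, deltas))
    return packets
```

   continuous_interleave(21, 3, 4) yields the eight packets of the
   table, from [0] through [3, 6, 9, 12] to [19], with AU-Index-delta
   values of 2 between neighbouring frames.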

   In this case, the receiver has to buffer only 3 frames, not 4. Say
   we are waiting for packet 4: we have flushed frames 0 to 6 and are
   therefore holding 8, 9 and 12. Packet 4 arrives, allowing us to emit
   7, 8, 9 and 10, after which we are holding 12, 13 and 16. Once the
   scheme is primed, each arriving packet contains 4 frames and allows
   4 frames to be flushed.

   In the above example the AU-Index is coded with the value 0, as
   required for the modes defined in this document. To reconstruct the
   original order, the RTP time stamp and the AU-Index-delta are used.
   See also 3.2.3.2.

   If there is loss, again the receiver has to wait to emit the erasure
   frames. In this case, say packet 3 is lost. We were holding frames
   4, 5 and 8. On the arrival of packet 4 (time stamp of frame 7), we
   know that frame 3 was lost; we can emit frames 4 and 5, we know that
   frame 6 must be lost, and we emit 7, which is in the packet that
   arrived, followed by 8, which we were already holding. Then on the
   arrival of packet 5 (time stamp 11) we can indicate the loss of 9
   and emit 10 and 11. Finally, the arrival of packet 6 (time stamp 15)
   indicates that 12 must be lost; we have now detected all the lost
   frames.
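
   The receiver behaviour described in this appendix can be sketched in
   Python as below. Frame numbers stand in for the time stamps computed
   from the RTP time stamp and AU-Index-delta fields, and a lost packet
   is simply absent from the input. The sketch relies only on the rule
   that no later packet can carry a frame earlier than the first frame
   of the packet just received, and it emits held frames as soon as
   they are next in order, so in the loss case above it emits frame 8
   as soon as packet 4 arrives. The function name is hypothetical.

```python
# Hypothetical de-interleaver; a sketch, not a normative algorithm.

def deinterleave(packets):
    """For each arriving packet, return the ("emit", n) / ("lost", n) events it allows."""
    held = set()
    next_frame = 0  # earliest frame not yet emitted or declared lost
    per_packet = []
    for frames in packets:
        events = []
        held.update(frames)
        # Frames before this packet's first frame cannot arrive later:
        # emit the ones we hold and declare the others lost.
        while next_frame < frames[0]:
            if next_frame in held:
                held.remove(next_frame)
                events.append(("emit", next_frame))
            else:
                events.append(("lost", next_frame))
            next_frame += 1
        # Then emit any run of consecutive frames that is already held.
        while next_frame in held:
            held.remove(next_frame)
            events.append(("emit", next_frame))
            next_frame += 1
        per_packet.append(events)
    return per_packet
```

   Fed the continuous example without packet 3, the events for packet 4
   are: lost 3, emit 4, emit 5, lost 6, emit 7, emit 8; packet 5 then
   yields lost 9, emit 10, emit 11. The same function reproduces the
   group interleave loss case of A.1.1.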
