Internet Engineering Task Force            Kretschmer-AT&T/Basso-AT&T
INTERNET DRAFT                             Civanlar-AT&T/Quackenbush-AT&T
File:draft-ietf-avt-rtp-mpeg2aac-00.txt    Snyder-AT&T
                                           June 25, 1999
                                           Expires: December 25, 1999


                RTP Payload Format for MPEG-2 AAC Streams

                         STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

                                 Abstract

This document describes a payload format for transporting MPEG-2 AAC
encoded data using RTP. MPEG-2 AAC is a recent standard from ISO/IEC
for the coding of multi-channel audio data. Several services provided
by RTP are beneficial for MPEG-2 AAC encoded data transport over the
Internet. Additionally, the use of RTP makes it possible to
synchronize MPEG-2 AAC data with other real-time data types.

Kretschmer/Basso/Civanlar/Quackenbush/Snyder                     [Page 1]


INTERNET-DRAFT    RTP Payload Format for MPEG-2 AAC Streams     June 1999

1. Introduction

The ISO/IEC MPEG-2 Advanced Audio Coding (AAC) [1] technology delivers
unsurpassed audio quality at rates at or below 64 kbps/channel.  It
has a very flexible bitstream syntax that supports from 1 to 48 audio
channels, up to 16 subwoofer channels and up to 16 embedded data
channels.  AAC supports a wide range of sampling frequencies (from 16
kHz to 96 kHz) which enables it to have an extremely wide range of
bitrates.  This permits it to support applications ranging from
professional or home theater sound systems to Internet music broadcast
systems.

The benefits of using RTP for MPEG-2 AAC data stream transport include:

    i. Ability to synchronize MPEG-2 AAC streams with other RTP payloads

    ii. Monitoring MPEG-2 AAC delivery performance through RTCP

    iii. Combining MPEG-2 AAC and other real-time data streams received
    from multiple end-systems into a set of consolidated streams
    through RTP mixers

    iv. Converting data types, etc. through the use of RTP translators.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].

1.1 Overview of MPEG-2 AAC

AAC combines the coding efficiencies of a high resolution filter bank,
a powerful model of audio perception, backward-adaptive prediction,
joint channel coding, and Huffman coding to deliver excellent signal
compression. In 1998 the MPEG Audio subgroup tested the family of MPEG
audio coders (see
http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf).
The test results indicate that for a stereo signal, AAC at 96 kb/s has
audio quality comparable to MPEG-1 Layer 3 ("mp3") at 128 kb/s.
Therefore, at equivalent quality levels, AAC offers approximately 1/3
greater compression than Layer 3 (128/96 = 1.33).

AAC is a block oriented, variable rate coding algorithm, which means
that the AAC encoder reads 1024 samples of the input signal file and
writes a variable number of compressed output bits that represent that
block of input data. A sample can be one or more channels. Rate
control can be used in the encoder such that the output bit rate is
averaged to a predetermined rate, as would be required for
constant-rate communication channels. Each block of AAC compressed
bits is called a "raw data block", and it has the nice property that
it can be decoded "stand-alone", that is, without knowledge of
information in prior bitstream blocks. This is ideal for packet
communication channels, in that if the payload of a packet is a single
raw data block, packet framing facilitates encoder and decoder
synchronization and, most importantly, loss of a single packet does
not impair the decodability of adjacent packets.
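
Because each raw data block covers 1024 samples, the block duration
and the resulting packet rate follow directly from the sampling
frequency. The following is a minimal sketch of that arithmetic,
assuming one raw data block per packet. The function names are
illustrative only, and 44.1 kHz and 48 kHz are example rates chosen
from within the 16 kHz to 96 kHz range given in Section 1.

```python
SAMPLES_PER_BLOCK = 1024  # samples per raw data block (Section 1.1)


def block_duration_ms(sampling_rate_hz):
    """Duration of one raw data block in milliseconds."""
    return 1000.0 * SAMPLES_PER_BLOCK / sampling_rate_hz


def blocks_per_second(sampling_rate_hz):
    """Raw data blocks (and, at one block per packet, packets) per second."""
    return sampling_rate_hz / SAMPLES_PER_BLOCK


for rate in (16000, 44100, 48000, 96000):
    print(f"{rate} Hz: {block_duration_ms(rate):.2f} ms per block, "
          f"{blocks_per_second(rate):.2f} blocks/s")
```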

1.2 Bitstream Syntax

As already stated, a raw data block represents audio data for a time
period of 1024 samples and may also contain related information and
other data. The syntax of an AAC bitstream is as follows:

<bitstream>        => <raw_data_block><bitstream>
<raw_data_block>   => [<element>]<END><PAD>

where <bitstream> indicates the AAC bitstream, <lowercase> indicates
intermediate tokens, <UPPERCASE> indicates terminal tokens and []
indicates one or more occurrences. <END> is a token that indicates the
end of a raw_data_block and <PAD> is a variable length token that
forces the total length of a raw_data_block to be an integral number
of bytes. In general, intermediate tokens are not an integral number of
bytes in length.

The <element> tokens are a string of bits of varying length, and can
be any of the following:

<single_channel_element>     represents a single audio channel
<channel_pair_element>       represents a stereo presentation (2 channels)
<coupling_channel_element>   a mechanism for multi-channel compression
<lfe_channel_element>        represents a special effects channel
<data_stream_element>        represents "user data"
<program_config_element>     a mechanism for describing the bitstream
                             content
<fill_element>               a mechanism to use bits (for constant rate
                             channels)

The <element> tokens above can occur several times in a single
raw_data_block. For example, the raw_data_block for a 5.1 surround
sound signal would be:

<single_channel_element><channel_pair_element>...
                     .
                     .
                     .
...<channel_pair_element><lfe_channel_element><END>

corresponding to the center, left and right, left surround and right
surround, and effects channels. Multiple occurrences of the
<channel_pair_element> are disambiguated by means of a unique 4-bit
id inside the <channel_pair_element>.
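
The element sequence of the 5.1 example can be written down as a small
data structure. The sketch below is illustrative only: the numeric
identifier values are not given in this document and should be
verified against [1], and the instance tag values 0 and 1 are
assumptions chosen merely to show how two <channel_pair_element>
occurrences are told apart.

```python
from enum import IntEnum


class ElementId(IntEnum):
    """Syntactic element identifiers (values to be verified against [1])."""
    SCE = 0  # single_channel_element
    CPE = 1  # channel_pair_element
    CCE = 2  # coupling_channel_element
    LFE = 3  # lfe_channel_element
    DSE = 4  # data_stream_element
    PCE = 5  # program_config_element
    FIL = 6  # fill_element
    END = 7  # terminates the raw_data_block


# (element id, 4-bit instance tag) pairs for the 5.1 example, before <END>.
BLOCK_5_1 = [
    (ElementId.SCE, 0),  # center
    (ElementId.CPE, 0),  # left and right
    (ElementId.CPE, 1),  # left surround and right surround
    (ElementId.LFE, 0),  # effects channel
]

CHANNELS_PER_ELEMENT = {ElementId.SCE: 1, ElementId.CPE: 2, ElementId.LFE: 1}


def channel_count(elements):
    """Number of audio channels carried by a list of elements."""
    return sum(CHANNELS_PER_ELEMENT.get(eid, 0) for eid, _tag in elements)


def tags_are_unique(elements):
    """True if no (element id, instance tag) pair occurs twice."""
    return len(set(elements)) == len(elements)


print(channel_count(BLOCK_5_1), tags_are_unique(BLOCK_5_1))
```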

2. Issues covered by this Payload Format

2.1 Repair Information to reconstruct lost AAC Frames

Typically, a smart AAC decoder can mitigate the effects of lost
packets using techniques such as interpolation in the spectral domain.
However, if the raw_data_block in a packet is perceptually very
significant and also highly unpredictable (e.g. the onset of a cymbal
crash) then the encoder may choose to send RepairData associated with
that raw_data_block. The RepairData in a given packet is typically
associated with a raw_data_block in the FUTURE, such that the decoder
has the RepairData when faced with the loss of the corresponding
packet. The association is indicated by the RSEQ field, which is equal
to the SEQ field of the corresponding raw_data_block.

The syntax of the RepairData bits is exactly that of the AAC
raw_data_block. However, in practical use, the RepairData would be a
highly compressed monophonic version of the signal being transmitted.
For example, an AAC stereo signal coded to an average rate of 96 kb/s
corresponds to a raw_data_block size of 279 bytes. A RepairData
version of that block, compressed to 16 kb/s would be 46 bytes. Given
that perceptually critical blocks might occur only once per 100 or
more blocks, the average rate imposed by the RepairData is very low.
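
The block sizes quoted above can be reproduced with a short
calculation. The document does not state the sampling rate; the sketch
below assumes 44.1 kHz, which yields the 279 and 46 byte figures. The
function names are illustrative only.

```python
SAMPLES_PER_BLOCK = 1024  # samples per raw_data_block (Section 1.1)
ASSUMED_RATE_HZ = 44100   # assumption: not stated in the text


def block_size_bytes(bit_rate_bps, sampling_rate_hz=ASSUMED_RATE_HZ):
    """Average raw_data_block size in bytes at a given average bit rate."""
    return bit_rate_bps * SAMPLES_PER_BLOCK / sampling_rate_hz / 8


def repair_overhead_bps(repair_bytes, blocks_per_repair,
                        sampling_rate_hz=ASSUMED_RATE_HZ):
    """Average bit rate added by one RepairData block per N audio blocks."""
    seconds = blocks_per_repair * SAMPLES_PER_BLOCK / sampling_rate_hz
    return repair_bytes * 8 / seconds


print(round(block_size_bytes(96000)))       # stereo signal at 96 kb/s
print(round(block_size_bytes(16000)))       # RepairData at 16 kb/s
print(round(repair_overhead_bps(46, 100)))  # one repair block per 100 blocks
```

Under these assumptions, one 46 byte RepairData block per 100 audio
blocks adds on the order of 0.16 kb/s to the 96 kb/s stream.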

RepairData MAY be provided for every frame but, in general, its
provision is OPTIONAL.

2.2 Fragmentation of AAC Frames

For many reasons the packet size on a communications channel may have
a practical maximum (e.g. Ethernet packet size limits). Since it is
advantageous to put one AAC raw_data_block per packet, it is desirable
to try to limit the size of the AAC raw_data_block. If this is not
possible, the raw_data_block can be fragmented across several packets.
In this case, the raw_data_block can be fragmented at <element>
boundaries, with the LEN field indicating the length of the <element>
to within a byte and the UBITS field refining that length to the bit.
The LEN and UBITS information permits re-assembly of the
raw_data_block without knowledge of the syntax of the bits within each
<element> in the raw_data_block.
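
The relationship between a fragment's bit length and its LEN and UBITS
values, and the bit-exact re-assembly this enables, can be sketched as
follows. This is an illustration rather than a normative procedure: the
function names are hypothetical, and it is assumed that bits are packed
most-significant bit first, so that the unused bits are the low-order
bits of the last byte.

```python
def len_and_ubits(fragment_bits):
    """Return (LEN in bytes, UBITS) for a fragment of fragment_bits bits."""
    length_bytes = (fragment_bits + 7) // 8
    unused_bits = length_bytes * 8 - fragment_bits
    return length_bytes, unused_bits


def reassemble(fragments):
    """Join (payload_bytes, ubits) fragments into (bytes, total_bits).

    No knowledge of the <element> syntax is needed; only LEN (implied by
    the payload length) and UBITS are used.
    """
    accumulator = 0
    total_bits = 0
    for payload, ubits in fragments:
        nbits = len(payload) * 8 - ubits
        value = int.from_bytes(payload, "big") >> ubits  # drop unused bits
        accumulator = (accumulator << nbits) | value
        total_bits += nbits
    pad = -total_bits % 8  # zero padding up to the next byte boundary
    data = (accumulator << pad).to_bytes((total_bits + pad) // 8, "big")
    return data, total_bits


print(len_and_ubits(13))
print(reassemble([(b"\xa8", 3), (b"\xc0", 6)]))
```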

2.3 Priority of AAC Frames

Depending on the signal's characteristics, AAC uses different encoding
strategies. Stationary signals are processed using a 1024-sample
transform; for transient signals a 128-sample transform is used. Lost
AAC frames containing stationary signals can be reconstructed
relatively easily, hence they are less important to the decoder than
frames containing transient signals, which cannot be reconstructed or
can only be reconstructed roughly.

This priority information is very important for AAC streaming over
lossy channels, since it allows the reconstruction or retransmission
behavior of the streaming application, or the forwarding strategies
inside the network (DiffServ), to be adapted. In order to respond
flexibly to packet loss and/or given bandwidth constraints, four
priority levels are defined: 'low', 'lower', 'higher', 'high'. 'Low'
priority denotes frames with low perceptual entropy while 'high'
priority denotes frames with high perceptual entropy. 'Lower' and
'higher' priority levels MUST be assigned to frames whose perceptual
entropy is between 'low' and 'high', accordingly.
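
The four levels map naturally onto the 2-bit values 0 (low) to 3
(high) used elsewhere in this document. The sketch below shows one
possible way an encoder might assign them. The thresholds and the
normalization of perceptual entropy to the range 0 to 1 are purely
hypothetical; this document does not define how the levels are
derived.

```python
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    LOWER = 1
    HIGHER = 2
    HIGH = 3


def classify(perceptual_entropy, thresholds=(0.25, 0.5, 0.75)):
    """Map a normalized perceptual entropy value to a priority level.

    The thresholds are hypothetical placeholders; an encoder would
    choose them to suit its own perceptual model.
    """
    level = sum(perceptual_entropy >= t for t in thresholds)
    return Priority(level)


print([classify(pe).name for pe in (0.1, 0.3, 0.6, 0.9)])
```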

2.4 Interleaving of AAC Frames

Instead of using a static interleaving scheme (e.g. 7x7), frames are
grouped by priority: frames grouped into the same packet MUST have the
same priority.  The sequence numbers SEQ of the AAC frames and RSEQ of
REPAIRDATA are used to restore the actual order on the receiver side.
Hence, the interleaving scheme does not have to be defined rigidly.
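
On the receiver side, restoring the order amounts to sorting the
received frames by SEQ. Since SEQ is only 8 bits wide it wraps around,
so the comparison has to be made modulo 256. The following is a
minimal sketch, assuming the receiver knows the oldest sequence number
it still expects (base_seq, a hypothetical parameter) and that fewer
than 256 frames are outstanding at any time.

```python
def restore_order(frames, base_seq):
    """Sort (seq, data) pairs into playout order.

    Sequence numbers are compared modulo 256 relative to base_seq, the
    oldest sequence number the receiver still expects.
    """
    return sorted(frames, key=lambda frame: (frame[0] - base_seq) & 0xFF)


received = [(254, "a"), (1, "d"), (255, "b"), (0, "c")]
print(restore_order(received, base_seq=254))
```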

2.5 Example RTP Packet Sequence

The example below shows how a sequence of AAC packets (a...p) with
assigned priorities (0=low, 3=high) MAY be grouped. RepairData is not
provided for low priority packets:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 0 | 0 | 0 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 3 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Proposed interleaving/grouping of AAC frames, with the RepairData R(x)
carried in each RTP packet shown beneath it:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|a g j|b h k|c i l|  d  |  e  |  f  | m q |  n  |  o  |  p  |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|     |R(d) |R(e) |R(f) |     |R(n) |R(o) |R(p) |     |     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
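
The two figures above can be transcribed as data in order to check the
properties the text describes. The sketch below is only a reading aid:
the variable and function names are hypothetical, and frame "q", which
appears in the second figure but has no priority listed in the first,
is ignored when comparing priorities.

```python
# Priorities of frames a...p, copied from the first figure.
PRIORITY = dict(zip("abcdefghijklmnop",
                    [0, 0, 0, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3]))

# One (frames, repair_targets) pair per RTP packet, from the second figure.
PACKETS = [
    ("agj", ""), ("bhk", "d"), ("cil", "e"), ("d", "f"), ("e", ""),
    ("f", "n"), ("mq", "o"), ("n", "p"), ("o", ""), ("p", ""),
]


def repair_lead(packets):
    """For each protected frame x, how many packets earlier R(x) is sent."""
    sent_in = {frame: index
               for index, (frames, _repairs) in enumerate(packets)
               for frame in frames}
    return {target: sent_in[target] - index
            for index, (_frames, repairs) in enumerate(packets)
            for target in repairs}


def uniform_priority(frames, priority):
    """True if all frames with a known priority share one priority."""
    return len({priority[f] for f in frames if f in priority}) <= 1


leads = repair_lead(PACKETS)
protected = {f for f, p in PRIORITY.items() if p > 0}
print(leads)
print(set(leads) == protected)
print(all(uniform_priority(frames, PRIORITY) for frames, _ in PACKETS))
```

In this transcription every R(x) is carried two packets ahead of frame
x, and exactly the frames with a non-zero priority are protected.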

3. Payload Format

The RTP payload consists of a 32- or 64-bit header, a variable number
of RepairData blocks containing information needed to reconstruct lost
AAC frames, and a variable number of AAC frames. The header contains a
vector of Priority Quantifiers (PQ) that tells the decoder the priority
of the current and previous packets, which helps it reconstruct the
original signal. The X bit specifies whether the header contains 12 or
28 PQs. REPAIRLEN specifies the total number of 32-bit words containing
RepairData. REPAIRLEN MUST be set to 0 if there is no RepairData.
Every REPAIRDATA or AAC FRAME is preceded by a sequence number (R)SEQ
and a length specifier (R)LEN. For fragmented AAC frames, UBITS
specifies the number of unused bits in the last byte, since frame
fragments may not be byte aligned. UBITS MUST be set to 0 if the
corresponding frame is not fragmented.


0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|X|REPAIRLEN    |PRI VECTOR                                     | Header
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|PRI VECTOR (continued), if X==1                                |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|RSEQ           |RLEN           |REPAIRDATA 1                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                               .                               | Repair
|                               .                               | Data
|                               .                               |
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |RSEQ           |RLEN           |REPAIRDATA N   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                                                               |
|                                                               |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|SEQ            |LEN                    |UBITS  |AAC FRAME 1    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                               .                               |
|                               .                               |
|                               .                               |  AAC
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  Frames
|               |SEQ            |LEN                    |UBITS  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|AAC FRAME N                                                    |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
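
Reading the header word(s) from the diagram above can be sketched as
follows. The field widths are taken from the diagram (1 bit X, 7 bits
REPAIRLEN, 24 or 56 bits of priority vector). It is assumed that bit 0
in the diagram is the most significant bit of the first byte and that
PQ(t) occupies the first two bits of the vector; the function name is
hypothetical.

```python
def parse_payload_header(payload):
    """Return (x, repairlen, pqs, header_size) for an RTP payload.

    pqs lists the 2-bit priority values in vector order, i.e.
    [PQ(t), PQ(t-1), ...]; header_size is 4 or 8 bytes.
    """
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit header")
    x = payload[0] >> 7             # vector extension flag
    repairlen = payload[0] & 0x7F   # RepairData length in 32-bit words
    header_size = 8 if x else 4
    if len(payload) < header_size:
        raise ValueError("X is set but the second header word is missing")
    pqs = []
    for byte in payload[1:header_size]:
        for shift in (6, 4, 2, 0):  # four 2-bit PQs per byte, MSB first
            pqs.append((byte >> shift) & 0x3)
    return x, repairlen, pqs, header_size


x, repairlen, pqs, size = parse_payload_header(bytes([0x02, 0xE4, 0x00, 0x00]))
print(x, repairlen, pqs[:4], len(pqs), size)
```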

PRI VECTOR: The actual priority vector. It contains either 12 or 28
            Priority Quantifiers (PQ). Each PQ describes the priority
            of one RTP packet. The size of a PQ is 2 bits. Hence, four
            different priority levels can be assigned to an RTP packet.
            0 means low and 3 means high priority.
            The first PQ refers to the current packet. The following
            PQs refer to the most recent previous packets.
            So, the vector looks like this: {PQ(t), PQ(t-1), PQ(t-2)...}

X:          Vector Extension. If set, the priority vector uses 56
            instead of 24 bits. Hence, another 32-bit word is required.

REPAIRLEN:  The total number of 32-bit words containing RepairData
            for previous/future frames. If REPAIRLEN==0 then
            there is no repair information.

RSEQ:       The SEQ number of the AAC frame REPAIRDATA belongs to.

RLEN:       The length in bytes of REPAIRDATA.

REPAIRDATA: An 8-bit aligned data array containing RepairData.
            This information is not mandatory and can be ignored.
            The syntax of the RepairData bits is exactly that of the AAC
            raw_data_block. However, it SHOULD be a highly compressed
            monophonic version of the signal being transmitted.

SEQ:        8 bit. The sequence number of the AAC frame.
            The application has to make sure that the sequence numbers
            of interleaved frames do not overlap.

LEN:        12 bit. The length of the actual AAC frame in bytes.

UBITS:      4 bit. The number of unused bits in the last byte of the AAC
            frame if the frame is fragmented. The RTP M-Bit is used as
            a 'fragmented' tag. UBITS MUST be set to 0, if the frame is
            not fragmented.
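
The per-frame and per-repair entry headers defined above can be packed
and unpacked as sketched below. The widths follow the field list and
the diagram (SEQ 8 bits, LEN 12 bits, UBITS 4 bits; RSEQ and RLEN 8
bits each). Network byte order is assumed, and the function names are
hypothetical.

```python
def pack_frame_header(seq, length, ubits):
    """Pack SEQ (8 bits), LEN (12 bits) and UBITS (4 bits) into 3 bytes."""
    # At most 7 bits of the last byte can be unused.
    if not (0 <= seq <= 0xFF and 0 <= length <= 0xFFF and 0 <= ubits <= 7):
        raise ValueError("field value out of range")
    word = (seq << 16) | (length << 4) | ubits
    return word.to_bytes(3, "big")


def unpack_frame_header(data):
    """Return (SEQ, LEN, UBITS) from the first 3 bytes of data."""
    if len(data) < 3:
        raise ValueError("need at least 3 bytes")
    word = int.from_bytes(data[:3], "big")
    return word >> 16, (word >> 4) & 0xFFF, word & 0xF


def pack_repair_header(rseq, rlen):
    """Pack RSEQ and RLEN (8 bits each, per the diagram) into 2 bytes."""
    if not (0 <= rseq <= 0xFF and 0 <= rlen <= 0xFF):
        raise ValueError("field value out of range")
    return bytes([rseq, rlen])


header = pack_frame_header(seq=5, length=279, ubits=3)
print(header.hex(), unpack_frame_header(header))
```

With these widths a single entry can describe at most 4095 bytes of
AAC frame data and 255 bytes of RepairData.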

4. References

  [1] ISO/IEC 13818-7, "Advanced Audio Coding (AAC)".

  [2] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
  "RTP: A Transport Protocol for Real-Time Applications", RFC 1889,
  Internet Engineering Task Force, January 1996.

  [3] Bradner, S., "Key words for use in RFCs to Indicate
  Requirement Levels", RFC 2119, March 1997.

5. Authors' Addresses

Mathias Kretschmer
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: mathias@research.att.com

Andrea Basso
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: basso@research.att.com

M. Reha Civanlar
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: civanlar@research.att.com

Schuyler R. Quackenbush
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: srq@research.att.com

James H. Snyder
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: jhs@research.att.com
