Internet Engineering Task Force               Kretschmer-AT&T/Basso-AT&T
INTERNET DRAFT                            Civanlar-AT&T/Quackenbush-AT&T
File:draft-ietf-avt-rtp-mpeg2aac-00.txt                      Snyder-AT&T
                                                           June 25, 1999
                                              Expires: December 25, 1999

               RTP Payload Format for MPEG-2 AAC Streams

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a payload format for transporting MPEG-2 AAC
encoded data using RTP. MPEG-2 AAC is a recent standard from ISO/IEC
for the coding of multi-channel audio data. Several services provided
by RTP are beneficial for MPEG-2 AAC encoded data transport over the
Internet. Additionally, the use of RTP makes it possible to synchronize
MPEG-2 AAC data with other real-time data types.

Kretschmer/Basso/Civanlar/Quackenbush/Snyder                    [Page 1]
INTERNET-DRAFT    RTP Payload Format for MPEG-2 AAC Streams    June 1999

1. Introduction

The ISO/IEC MPEG-2 Advanced Audio Coding (AAC) [1] technology delivers
unsurpassed audio quality at rates at or below 64 kb/s per channel. It
has a very flexible bitstream syntax that supports from 1 to 48 audio
channels, up to 16 subwoofer channels and up to 16 embedded data
channels. AAC supports a wide range of sampling frequencies (from
16 kHz to 96 kHz), which enables it to have an extremely wide range of
bitrates. This permits it to support applications ranging from
professional or home theater sound systems to Internet music broadcast
systems.

The benefits of using RTP for MPEG-2 AAC data stream transport include:

  i.   Ability to synchronize MPEG-2 AAC streams with other RTP
       payloads

  ii.  Monitoring MPEG-2 AAC delivery performance through RTCP

  iii. Combining MPEG-2 AAC and other real-time data streams received
       from multiple end-systems into a set of consolidated streams
       through RTP mixers

  iv.  Converting data types, etc. through the use of RTP translators.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].

1.1 Overview of MPEG-2 AAC

AAC combines the coding efficiencies of a high resolution filter bank,
a powerful model of audio perception, backward-adaptive prediction,
joint channel coding, and Huffman coding to deliver excellent signal
compression. In 1998 the MPEG Audio subgroup tested the family of MPEG
audio coders (see
http://www.tnt.uni-hannover.de/project/mpeg/audio/public/w2006.pdf).
The test results indicate that for a stereo signal, AAC at 96 kb/s has
audio quality comparable to MPEG-1 Layer 3 ("mp3") at 128 kb/s.
Therefore at equivalent quality levels, AAC offers approximately 1/3
greater compression than Layer 3.

AAC is a block oriented, variable rate coding algorithm, which means
that the AAC encoder reads 1024 samples of the input signal and writes
a variable number of compressed output bits that represent that block
of input data. A sample here comprises one value per audio channel, so
it may cover one or more channels. Rate control can be used in the
encoder such that the output bit rate is averaged to a predetermined
rate, as would be required for constant-rate communication channels.
Each block of AAC compressed bits is called a "raw data block", and it
has the useful property that it can be decoded "stand-alone", that is,
without knowledge of information in prior bitstream blocks. This is
ideal for packet communication channels, in that if the payload of a
packet is a single
raw data block, packet framing facilitates encoder and decoder
synchronization and, most importantly, loss of a single packet does not
impair the decodability of adjacent packets.

1.2 Bitstream Syntax

As already stated, a raw data block represents audio data for a time
period of 1024 samples and may also contain related information and
other data. The syntax of an AAC bitstream is as follows:

   <bitstream>      => <raw_data_block><bitstream>
   <raw_data_block> => [<element>]<END><PAD>

where <bitstream> indicates the AAC bitstream, <lowercase> indicates
intermediate tokens, <UPPERCASE> indicates terminal tokens and []
indicates one or more occurrences. <END> is a token that indicates the
end of a raw_data_block and <PAD> is a variable length token that
forces the total length of a raw_data_block to be an integral number of
bytes. In general, intermediate tokens are not an integral number of
bytes in length.

The <element> tokens are strings of bits of varying length, and can be
any of the following:

   <single_channel_element>    represents a single audio channel
   <channel_pair_element>      represents a stereo presentation
                               (2 channels)
   <coupling_channel_element>  a mechanism for multi-channel
                               compression
   <lfe_channel_element>       represents a special effects channel
   <data_stream_element>       represents "user data"
   <program_config_element>    a mechanism for describing the
                               bitstream content
   <fill_element>              a mechanism to use bits (for constant
                               rate channels)

The <element> tokens above can occur several times in a single
raw_data_block. For example, the raw_data_block for a 5.1 surround
sound signal would be:

   <single_channel_element><channel_pair_element>...
      ...<channel_pair_element><lfe_channel_element><END>

corresponding to the center, left and right, left surround and right
surround, and effects channels. Multiple occurrences of the
<channel_pair_element> are disambiguated by means of a unique 4-bit id
inside the <channel_pair_element>.
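
As a rough illustration of the grammar above, the following Python
sketch models a raw_data_block as a list of element names and counts
the audio channels it carries. The names CHANNELS_PER_ELEMENT and
count_channels are hypothetical, and the sketch works on token names
only; it does not parse real AAC bits. Counting a coupling channel
element as zero output channels is an assumption made for this example.

```python
# Illustrative model of the <raw_data_block> grammar; not a bitstream parser.
# Audio channels carried by each <element> type, following the list above.
CHANNELS_PER_ELEMENT = {
    "single_channel_element": 1,
    "channel_pair_element": 2,
    "lfe_channel_element": 1,
    "coupling_channel_element": 0,  # assumption: not counted as an output channel
    "data_stream_element": 0,
    "program_config_element": 0,
    "fill_element": 0,
}


def count_channels(raw_data_block):
    """Sum the channels of all elements that precede the END token."""
    if not raw_data_block or raw_data_block[-1] != "END":
        raise ValueError("a raw_data_block must be terminated by END")
    elements = raw_data_block[:-1]
    if not elements:
        raise ValueError("a raw_data_block needs at least one element")
    return sum(CHANNELS_PER_ELEMENT[name] for name in elements)


# The 5.1 surround example: center, left/right, surround pair, effects.
surround_5_1 = [
    "single_channel_element",
    "channel_pair_element",
    "channel_pair_element",
    "lfe_channel_element",
    "END",
]
print(count_channels(surround_5_1))  # 6
```

The result of 6 corresponds to the five main channels plus the effects
channel of the 5.1 example.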

2. Issues covered by this Payload Format

2.1 Repair Information to reconstruct lost AAC Frames

Typically, a smart AAC decoder can mitigate the effects of lost packets
using techniques such as interpolation in the spectral domain. However,
if the raw_data_block in a packet is perceptually very significant and
also highly unpredictable (e.g. the onset of a cymbal crash), then the
encoder may choose to send RepairData associated with that
raw_data_block. The RepairData in a given packet is typically
associated with a raw_data_block in the FUTURE, such that the decoder
has the RepairData when faced with the loss of the corresponding
packet. The association is indicated by the RSEQ field, which is equal
to the SEQ field of the corresponding raw_data_block.

The syntax of the RepairData bits is exactly that of the AAC
raw_data_block. However, in practical use, the RepairData would be a
highly compressed monophonic version of the signal being transmitted.
For example, an AAC stereo signal coded to an average rate of 96 kb/s
corresponds to a raw_data_block size of 279 bytes. A RepairData version
of that block, compressed to 16 kb/s, would be 46 bytes. Given that
perceptually critical blocks might occur only once per 100 or more
blocks, the average rate imposed by the RepairData is very low.

RepairData MAY be provided for every frame but, in general, its
provision is OPTIONAL.

2.2 Fragmentation of AAC Frames

For many reasons the packet size on a communications channel may have a
practical maximum (e.g. Ethernet packet size limits). Since it is
advantageous to put one AAC raw_data_block per packet, it is desirable
to try to limit the size of the AAC raw_data_block. If this is not
possible, the raw_data_block can be fragmented across several packets.
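
The block sizes quoted in Section 2.1 can be checked with a short
calculation. The sketch below is only an illustration: the 44.1 kHz
sampling rate is an assumption (the text does not state the rate, but
this value reproduces its figures), and the function names are
hypothetical.

```python
SAMPLES_PER_BLOCK = 1024  # block length given in Section 1.1


def block_bytes(bitrate_bps, sample_rate_hz=44100):
    """Average raw_data_block size in bytes at a given average bit rate.

    sample_rate_hz defaults to 44.1 kHz, which is an assumption.
    """
    block_seconds = SAMPLES_PER_BLOCK / sample_rate_hz
    return bitrate_bps * block_seconds / 8


def average_repair_rate(repair_bps, blocks_per_repair):
    """Average bit rate added when only 1 in blocks_per_repair blocks
    carries RepairData coded at repair_bps."""
    return repair_bps / blocks_per_repair


print(round(block_bytes(96_000)))        # 279
print(round(block_bytes(16_000)))        # 46
print(average_repair_rate(16_000, 100))  # 160.0
```

Under these assumptions, protecting one block in 100 adds about
160 bit/s to a 96 kb/s stream.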
When a raw_data_block is fragmented, it can be split at <element>
boundaries. The LEN field then indicates the length of the <element> to
within a byte, and the UBITS field refines that length to the bit. The
LEN and UBITS information permits re-assembly of the raw_data_block
without knowledge of the syntax of the bits within each <element> in
the raw_data_block.

2.3 Priority of AAC Frames

Depending on the signal's characteristics, AAC uses different encoding
strategies. Stationary signals are processed using a 1024 sample FFT.
For transient signals a 128 sample FFT is used. Lost AAC frames
containing stationary signals can be reconstructed relatively easily,
hence they are less important to the decoder than frames containing
transient signals, which cannot be reconstructed or can be
reconstructed only roughly.

This priority information is very important for AAC streaming over
lossy channels, since it allows the streaming application to adapt its
reconstruction or retransmission behavior, and the network to adapt its
forwarding strategy (DiffServ).

In order to respond flexibly to packet loss and/or given bandwidth
constraints, four priority levels are defined: 'low', 'lower',
'higher', 'high'. 'Low' priority denotes frames with low perceptual
entropy, while 'high' priority denotes frames with high perceptual
entropy. The 'lower' and 'higher' priority levels MUST be assigned, in
that order, to frames whose perceptual entropy lies between 'low' and
'high'.

2.4 Interleaving of AAC Frames

No static interleaving scheme (e.g. 7x7) is used. Instead, frames are
grouped by priority, and frames grouped together MUST have the same
priority. The sequence numbers SEQ of the AAC frames and RSEQ of
REPAIRDATA are used to restore the actual order on the receiver side.
Hence, the interleaving scheme does not have to be defined rigidly.

2.5 Example RTP Packet Sequence

The example below shows how a sequence of AAC packets (a...p) with
assigned priorities (0=low, 3=high) MAY be grouped.
RepairData is not provided for low priority packets:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 0 | 0 | 0 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 3 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Proposed interleaving/grouping of AAC frames and assigned RepairData
R(x) being sent within the following RTP packet:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|a g j|b h k|c i l|  d  |  e  |  f  | m q |  n  |  o  |  p  |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|     |R(d) |R(e) |R(f) |     |R(n) |R(o) |R(p) |     |     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
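
A receiver can undo such a grouping by sorting frames on SEQ, as stated
in Section 2.4. The following Python sketch shows one possible way to
do this. It is not part of the payload format: the function names are
hypothetical, the numbering of frames a...q as SEQ 0...16 is assumed
for the example, and the handling of 8-bit wrap-around (choosing the
unwrapped value closest to the highest sequence number seen so far) is
one design choice among several.

```python
def unwrap_seq(seq, reference):
    """Return the value congruent to seq (mod 256) closest to reference."""
    delta = (seq - reference + 128) % 256 - 128
    return reference + delta


def restore_order(packets, first_seq=0):
    """Flatten interleaved packets of (seq, frame) pairs into frame order.

    first_seq is the (assumed known) sequence number the stream starts at.
    """
    ordered = {}
    reference = first_seq
    for packet in packets:
        for seq, frame in packet:
            full = unwrap_seq(seq, reference)
            ordered[full] = frame
            reference = max(reference, full)
    return [ordered[key] for key in sorted(ordered)]


# Grouping taken from the example above; SEQ values are assumed.
names = "abcdefghijklmnopq"
seq_of = {name: index for index, name in enumerate(names)}
groups = ["agj", "bhk", "cil", "d", "e", "f", "mq", "n", "o", "p"]
packets = [[(seq_of[name], name) for name in group] for group in groups]

print("".join(restore_order(packets)))  # abcdefghijklmnopq
```

Because ordering relies only on SEQ, the sender remains free to choose
any grouping, which matches the statement that the interleaving scheme
does not have to be defined rigidly.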

3. Payload Format

The RTP payload consists of a 32 or 64 bit header, a variable number of
RepairData blocks containing information needed to reconstruct lost AAC
frames, and a variable number of AAC frames.

The header basically contains a vector of Priority Quantifiers (PQ)
specifying how important the current and previous packets are to the
decoder for reconstructing the original signal. The X bit specifies
whether the header contains 12 or 28 PQs. REPAIRLEN specifies the total
number of 32-bit words containing RepairData. REPAIRLEN MUST be set to
0 if there is no RepairData.

Every REPAIRDATA or AAC FRAME is preceded by a sequence number (R)SEQ
and a length specifier (R)LEN. In the case of fragmented AAC frames,
UBITS specifies the number of unused bits in the last byte, since frame
fragments may not be byte aligned. UBITS MUST be set to 0 if the
corresponding frame is not fragmented.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|X|  REPAIRLEN  |                  PRI VECTOR                   |  Header
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                PRI VECTOR (continued), if X==1                |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|     RSEQ      |     RLEN      |         REPAIRDATA 1          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                               .                               |  Repair
|                               .                               |  Data
|                               .                               |
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |     RSEQ      |     RLEN      | REPAIRDATA N  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                                                               |
|                                                               |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|      SEQ      |          LEN          | UBITS |  AAC FRAME 1  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                               .                               |
|                               .                               |
|                               .                               |  AAC
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  Frames
|               |      SEQ      |          LEN          | UBITS |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          AAC FRAME N                          |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
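
The header layout can be illustrated with a small parsing sketch in
Python. It rests on assumptions that the text does not spell out:
REPAIRLEN is taken to be 7 bits wide (the 32-bit first word minus the X
bit and the 24-bit priority vector), fields are taken to be packed most
significant bit first in the order drawn, and the first PQ is taken to
occupy the two most significant bits of the vector. The function name
parse_payload_header is hypothetical.

```python
def parse_payload_header(payload):
    """Split the payload header into (x, repairlen, pqs, header_len).

    Assumes big-endian packing, a 7-bit REPAIRLEN and PQ(t) stored in
    the top two bits of the priority vector.
    """
    if len(payload) < 4:
        raise ValueError("payload shorter than the 32-bit header")
    first = int.from_bytes(payload[:4], "big")
    x = first >> 31                   # vector extension flag
    repairlen = (first >> 24) & 0x7F  # RepairData size in 32-bit words
    vector = first & 0xFFFFFF         # first 24 bits of the vector
    vector_bits = 24
    header_len = 4
    if x:
        if len(payload) < 8:
            raise ValueError("X is set but the second header word is missing")
        vector = (vector << 32) | int.from_bytes(payload[4:8], "big")
        vector_bits = 56
        header_len = 8
    # Two bits per PQ, most significant first: PQ(t), PQ(t-1), ...
    pqs = [(vector >> shift) & 0x3 for shift in range(vector_bits - 2, -1, -2)]
    return x, repairlen, pqs, header_len


# X=0, REPAIRLEN=3, PQ(t)=3 (high), PQ(t-1)=2, all older PQs 0.
x, repairlen, pqs, header_len = parse_payload_header(bytes([0x03, 0xE0, 0x00, 0x00]))
print(x, repairlen, pqs[:3], len(pqs), header_len)  # 0 3 [3, 2, 0] 12 4
```

With X set, the same function returns 28 PQs and a header length of 8
bytes, matching the 12 or 28 PQs described above.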

PRI VECTOR:
   The actual priority vector. It contains either 12 or 28 Priority
   Quantifiers (PQ). Each PQ element describes the priority of one
   packet. The size of a PQ is 2 bits. Hence, four different priority
   levels can be assigned to an RTP packet. 0 means low and 3 means
   high priority. The first PQ refers to the current packet. The
   following PQs refer to the most recent previous packets. So, the
   vector looks like this: {PQ(t), PQ(t-1), PQ(t-2)...}

X:
   Vector Extension. If set, the priority vector uses 56 instead of 24
   bits. Hence, another 32-bit word is required.

REPAIRLEN:
   The total number of 32-bit words containing RepairData for
   previous/future frames. If REPAIRLEN==0 then there is no repair
   information.

RSEQ:
   The SEQ number of the AAC frame the REPAIRDATA belongs to.

RLEN:
   The length in bytes of REPAIRDATA.

REPAIRDATA:
   An 8-bit aligned data array containing RepairData. This information
   is not mandatory and can be ignored by the receiver. The syntax of
   the RepairData bits is exactly that of the AAC raw_data_block.
   However, it SHOULD be a highly compressed monophonic version of the
   signal being transmitted.

SEQ:
   8 bits. The sequence number of the AAC frame. The application has to
   make sure that the sequence numbers of interleaved frames do not
   overlap.

LEN:
   12 bits. The length of the actual AAC frame.

UBITS:
   4 bits. The number of unused bits in the last byte of the AAC frame
   if the frame is fragmented. The RTP M-Bit is used as a 'fragmented'
   tag. UBITS MUST be set to 0 if the frame is not fragmented.
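
The SEQ, LEN and UBITS fields together occupy 24 bits, and a sketch of
how they might be packed and read is given below in Python. The
function names are hypothetical. The sketch assumes the three fields
are packed most significant bit first in the order listed, and that LEN
counts bytes, as Section 2.2 suggests; the number of valid bits in a
fragment is then LEN * 8 - UBITS.

```python
def pack_frame_entry(seq, length, ubits):
    """Pack SEQ (8 bits), LEN (12 bits) and UBITS (4 bits) into 3 bytes."""
    if not (0 <= seq < 256 and 0 <= length < 4096 and 0 <= ubits < 16):
        raise ValueError("field value out of range")
    word = (seq << 16) | (length << 4) | ubits
    return word.to_bytes(3, "big")


def parse_frame_entry(buf, offset=0):
    """Return (seq, length, ubits, valid_bits) for the entry at offset.

    valid_bits assumes LEN is a byte count and UBITS counts the unused
    bits in the last of those bytes.
    """
    if len(buf) - offset < 3:
        raise ValueError("buffer too short for SEQ/LEN/UBITS")
    word = int.from_bytes(buf[offset:offset + 3], "big")
    seq = word >> 16
    length = (word >> 4) & 0xFFF
    ubits = word & 0xF
    return seq, length, ubits, length * 8 - ubits


entry = pack_frame_entry(seq=42, length=279, ubits=3)
print(entry.hex())               # 2a1173
print(parse_frame_entry(entry))  # (42, 279, 3, 2229)
```

A receiver that re-assembles fragments would append only the first
valid_bits bits of each fragment, which is why it needs no knowledge of
the syntax inside each <element>.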

4. References

[1] ISO/IEC 13818-7, Advanced Audio Coding (AAC).

[2] Schulzrinne, Casner, Frederick, Jacobson, RTP: A Transport
    Protocol for Real-Time Applications, RFC 1889, Internet
    Engineering Task Force, January 1996.

[3] S. Bradner, Key words for use in RFCs to Indicate Requirement
    Levels, RFC 2119, March 1997.

5. Authors' Addresses

Mathias Kretschmer
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: mathias@research.att.com

Andrea Basso
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: basso@research.att.com

M. Reha Civanlar
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: civanlar@research.att.com

Schuyler R. Quackenbush
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: srq@research.att.com

James H. Snyder
AT&T Labs - Research
180 Park Ave.
Florham Park, NJ 07932
USA
e-mail: jhs@research.att.com