Internet Engineering Task Force Audio-Video Transport Working Group
Internet Draft H. Schulzrinne
ietf-avt-profile-05.txt GMD Fokus
July 7, 1995
Expires: 12/1/95
RTP Profile for Audio and Video Conferences with Minimal Control
STATUS OF THIS MEMO
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents at
any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress''.
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).
Distribution of this document is unlimited.
ABSTRACT
This note describes a profile for the use of the real-time
transport protocol (RTP) and the associated control proto-
col, RTCP, within audio and video multiparticipant confer-
ences with minimal control. It provides interpretations of
generic fields within the RTP specification suitable for
audio and video conferences. In particular, this document
defines a set of default mappings from payload type numbers
to encodings.
The document also describes how audio and video data
may be carried within RTP. It defines a set of standard
encodings and their names when used within RTP. However, the
definitions are independent of the particular transport
mechanism used. The descriptions provide pointers to refer-
ence implementations and the detailed standards. This docu-
ment is meant as an aid for implementors of audio, video and
H. Schulzrinne [Page 1]
Internet Draft AV Profile July 7, 1995
other real-time multimedia applications.
Changes
(This section will not become part of the RFC.)
o Video frequency changed to 90 kHz.
o Short reference labels for profile definitions.
o Explain differences between Intel/IMA DVI format and the one
used for this profile; name changed from IDVI to DVI4.
o Minor editorial clarifications.
1. Introduction
This profile defines aspects of RTP left unspecified in the RTP
protocol definition (RFC TBD). This profile is intended for use within
audio and video conferences with minimal session control. In par-
ticular, no support for the negotiation of parameters or membership con-
trol is provided. Other profiles may make different choices for the
items specified here. The profile specifies the use of RTP over unicast
and multicast UDP. (This does not preclude the use of these definitions
when RTP is carried by other lower-layer protocols.) This profile is
invoked by running the appropriate applications; there is no explicit
indication by port number, protocol identifier or the like.
2. RTP and RTCP Packet Forms and Protocol Behavior
This profile follows the default and/or recommended aspects of the
RTP specification for these items:
Header:
The standard format of the fixed RTP data header is used (one
marker bit).
Extension:
No additional fixed fields are appended to the RTP data
header.
RTCP report interval:
The suggested constants are to be used for the RTCP report
interval calculation.
SR/RR extension:
No extension section is defined for the RTCP SR or RR packet.
RTCP packet types:
No additional RTCP packet types are defined by this profile
specification.
Security:
The RTP default security services are also the default under
this profile.
Mapping:
The standard mapping of RTP and RTCP to transport-level
addresses is used.
Encapsulation:
No encapsulation of RTP packets is specified.
RTP header extensions:
No RTP header extensions are defined, but applications operat-
ing under this profile may use such extensions. Thus, applica-
tions should not assume that the RTP header X bit is always
zero and should be prepared to ignore the header extension. If
a header extension is defined in the future, that definition
must specify the contents of the first 16 bits.
SDES use:
Applications may use any of the SDES items described.
New encodings are to be registered with the Internet Assigned
Numbers Authority. When registering a new encoding, the following infor-
mation should be provided:
o name and description of encoding, in particular the RTP times-
tamp clock rate;
o indication of who has change control over the encoding (for
example, CCITT/ITU, other international standardization bodies, a
consortium or a particular company or group of companies);
o any operating parameters;
o a reference to a further description, if available, for example
(in order of preference) an RFC, a published paper, a patent fil-
ing, a technical report or a computer manual;
o for proprietary encodings, contact information (postal and email
address).
o the payload type value for this profile.
3. Audio
3.1. Encoding-independent recommendations
The first packet of a talkspurt is distinguished by a set marker
bit in the RTP data header.
The following recommendations are default operating parameters.
Applications should be prepared to handle other values. The ranges given
are meant to give guidance to application writers, allowing a set of
applications conforming to these guidelines to interoperate without
additional negotiation. These guidelines are not intended to restrict
operating parameters for applications that can negotiate a set of
interoperable parameters, e.g., through a conference control protocol.
For packetized audio, the default packetization interval should
have a duration of 20 ms, unless otherwise noted when describing the
encoding. The packetization interval determines the minimum end-to-end
delay; longer packets introduce less header overhead but higher delay
and make packet loss more noticeable. For non-interactive applications
such as lectures or links with severe bandwidth constraints, a higher
packetization delay may be appropriate.
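As an illustration, the payload size implied by a given packetization
interval can be computed as follows. The function name and interface are
illustrative only and not part of this profile:

```c
#include <assert.h>

/* Octets of audio payload produced by a sample-based encoding for a
 * given packetization interval (ms), sampling rate (Hz), bits per
 * sample and channel count.  Illustrative helper only. */
unsigned payload_octets(unsigned interval_ms, unsigned rate_hz,
                        unsigned bits_per_sample, unsigned channels)
{
    unsigned long bits = (unsigned long)interval_ms * rate_hz / 1000
                         * bits_per_sample * channels;
    return (unsigned)(bits / 8);  /* must come out integral (Section 3.2) */
}
```

For the 20 ms default with 8000 Hz mu-law audio this yields 160 octets
of payload per packet.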
For N-channel encodings, each sampling period (say, 1/8000 of a
second) generates N samples. (This terminology is standard, but somewhat
confusing, as the total number of samples generated per second is then
the sampling rate times the channel count.)
If multiple audio channels are used, channels are numbered left-
to-right, starting at one. In RTP audio packets, information from
lower-numbered channels precedes that from higher-numbered channels. For
more than two channels, the convention followed by the AIFF-C audio
interchange format should be followed [1]. For two-channel stereo, the
numbering sequence is left, right; for three channels, left, right,
center; for quadraphonic systems, front left, front right, rear left,
rear right; for four-channel systems, left, center, right, and surround
sound; for six-channel systems left, left center, center, right, right
center and surround sound. All channels belonging to a single sampling
instance must be within the same packet.
The sampling frequency should be drawn from the set: 8000, 11025,
16000, 22050, 44100 and 48000 Hz. (The Apple Macintosh computers have
native sample rates of 22254.54 and 11127.27, which can be converted to
22050 and 11025 with acceptable quality by dropping 4 or 2 samples in a
20 ms frame.)
A receiver should accept packets representing between 0 and 200 ms
of audio data.[1] Receivers should be prepared to accept multi-channel
audio, but may choose to only play a single channel.
3.2. Guidelines for Sample-Based Audio Encodings
In sample-based encodings, each audio sample is represented by a
fixed number of bits. Within the compressed audio data, codes for indi-
vidual samples may span octet boundaries. An RTP audio packet may con-
tain any number of audio samples, subject to the constraint that the
number of bits per sample times the number of samples per packet yields
an integral octet count. Fractional encodings produce less than one
octet per sample.
For sample-based encodings producing one or more octets per sample,
samples from different channels, but the same sampling instant, are con-
secutive. For example, for a two-channel encoding, the octet sequence is
(left channel, first sample), (right channel, first sample), (left chan-
nel, second sample), (right channel, second sample), .... For multi-
octet encodings, octets are transmitted in network byte order (i.e.,
most significant octet first).
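The ordering rules above can be illustrated for two-channel 16-bit
audio. The following sketch is purely illustrative and is not a
normative interface:

```c
#include <stdint.h>
#include <stddef.h>

/* Serialize n sampling instants of two-channel 16-bit audio into an
 * RTP payload: samples of one sampling instant are consecutive (left
 * before right), and each 16-bit value is emitted most significant
 * octet first (network byte order).  Illustrative sketch only. */
void pack_l16_stereo(const int16_t *left, const int16_t *right,
                     size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t l = (uint16_t)left[i], r = (uint16_t)right[i];
        *out++ = (uint8_t)(l >> 8); *out++ = (uint8_t)(l & 0xff);
        *out++ = (uint8_t)(r >> 8); *out++ = (uint8_t)(r & 0xff);
    }
}
```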
The packing order for fractional encodings is that described for
the IMA Wave types [2]. For audio encodings yielding four bits per sam-
ple, eight such compressed samples from channel 1 are packed into one
32-bit word, followed by eight compressed samples from channel 2, until
all channels have been accommodated and the packing resumes at channel 1.
For audio encodings yielding three bits per sample, 32 such compressed
samples at three bits each from channel 1 are packed into 12 octets,
followed by 32 samples from channel 2, etc.
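For the four-bit case, the per-channel packing can be sketched as
follows. This sketch assumes the IMA convention that the earlier sample
occupies the low-order nibble of each octet; consult [2] for the
normative packing:

```c
#include <stdint.h>
#include <stddef.h>

/* Pack eight 4-bit samples of one channel into one 32-bit group (four
 * octets).  Assumes, per the IMA convention, that the earlier sample
 * goes into the low-order nibble of each octet. */
void pack_nibbles8(const uint8_t samples[8], uint8_t out[4])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = (uint8_t)((samples[2*i] & 0x0f) |
                           ((samples[2*i + 1] & 0x0f) << 4));
}
```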
3.3. Guidelines for Frame-Based Audio Encodings
Frame-based encodings encode a fixed-length block of audio into
another block of compressed data, typically also of fixed length. For
frame-based encodings, the sender may choose to combine several such
frames into a single message. The receiver can tell the number of frames
contained in a message since the frame duration is defined as part of
_________________________
[1] This restriction allows reasonable buffer sizing
for the receiver.
the encoding.
For frame-based codecs, the channel order is defined for the whole
block. That is, for two-channel audio, right and left samples are coded
independently, with the encoded frame for the left channel preceding
that for the right channel.
All frame-oriented audio codecs should be able to encode and decode
several consecutive frames within a single packet. Since the frame size
for the frame-oriented codecs is given, there is no need to use a
separate designation for the same encoding, but with different number of
frames per packet.
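Since the encoded frame size is fixed, the receiver can derive the frame
count directly from the payload length. The sketch below is illustrative
only; the frame size of 33 octets used in the test is a commonly cited
value for packed GSM 06.10 full-rate frames, but should be checked
against the codec specification:

```c
#include <assert.h>

/* Number of codec frames in an RTP payload, given the fixed encoded
 * frame size in octets.  Illustrative helper; the frame size is a
 * property of the particular encoding. */
unsigned frames_in_packet(unsigned payload_octets, unsigned frame_octets)
{
    assert(payload_octets % frame_octets == 0);  /* whole frames only */
    return payload_octets / frame_octets;
}
```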
3.4. Audio Encodings
encoding sample/frame bits/sample ms/frame
__________________________________________________________
1016 frame N/A 30
G721 sample 4
G722 sample 8
G728 frame N/A 2.5
GSM frame N/A 20
DVI4 sample 4
LPC frame N/A 20
L8 sample 8
L16 sample 16
MPA frame N/A
PCMU sample 8
PCMA sample 8
Table 1: Properties of Audio Encodings
1016: Encoding 1016 is a frame based encoding using code-excited
linear prediction (CELP) and is specified in Federal Standard
FED-STD 1016 [3,4,5,6].
The U. S. DoD's Federal-Standard-1016 based 4800 bps code
excited linear prediction voice coder version 3.2 (CELP 3.2) For-
tran and C simulation source codes are available for worldwide dis-
tribution at no charge (on DOS diskettes, but configured to compile
on Sun SPARC stations) from: Bob Fenichel, National Communications
System, Washington, D.C. 20305, phone +1-703-692-2124, fax +1-703-
746-4960, and from
ftp://ftp.super.org/pub/speech/celp_3.2a.tar.Z.
G721: G721 is specified in ITU recommendation G.721. Reference
implementations for G.721 are available as part of the
CCITT/ITU-T Software Tool Library (STL) from the ITU General
Secretariat, Sales Service, Place des Nations, CH-1211 Geneve
20, Switzerland. The library is covered by a license and is
available at
ftp://gaia.cs.umass.edu/pub/hgschulz/ccitt/ccitt_tools.tar.Z
G722: G722 is specified in ITU-T recommendation G.722, "7 kHz
audio-coding within 64 kbit/s".
G728: G728 is specified in ITU-T recommendation G.728, "Coding of
speech at 16 kbit/s using low-delay code excited linear pred-
iction".
GSM: GSM (Groupe Speciale Mobile) denotes the European GSM 06.10
provisional standard for full-rate speech transcoding, prI-ETS
300 036, which is based on RPE/LTP (residual pulse
excitation/long term prediction) coding at a rate of 13 kb/s.
A reference implementation was written by Carsten Bormann and
Jutta Degener (TU Berlin, Germany) and is available at
ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/.
DVI4: DVI4 is specified, with pseudo-code, in [2] as the ADPCM wave
type. However, the encoding defined here as DVI4 differs in
two respects from the IMA recommendation:
- The header contains the predicted value rather than the
first sample value.
- IMA ADPCM blocks contain an odd number of samples, since the
first sample of a block is contained just in the header
(uncompressed), followed by an even number of compressed
samples. DVI4 has an even number of compressed samples only,
using the 'predict' word from the header to decode the first
sample.
Each packet contains a single DVI block. The profile only
defines the 4-bit-per-sample version, while IMA also specifies a
3-bit-per-sample encoding.
The "header" word for each channel has the following struc-
ture:
int16 predict; /* predicted value of first sample
from the previous block (L16 format) */
u_int8 index; /* current index into stepsize table */
u_int8 reserved; /* set to zero by sender, ignored by receiver */
Header words for all channels precede the compressed data.
An implementation is available from Jack Jansen via anonymous
ftp from
ftp://ftp.cwi.nl/local/pub/audio/adpcm.shar.
L8: L8 denotes linear audio data, using 8 bits of precision with
an offset of 128, that is, the most negative signal is encoded
as 0.
L16: L16 denotes uncompressed audio data, using 16-bit signed
representation with 65535 equally divided steps between
minimum and maximum signal level, ranging from -32768 to
32767. The value is represented in two's complement notation
and network byte order.
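A conversion between the two linear formats might look as follows.
Simple truncation of the low eight bits is one choice among several
(dithering is out of scope), and the function names are illustrative:

```c
#include <stdint.h>

/* Convert between the L16 (16-bit two's complement) and L8 (8-bit,
 * offset 128) sample formats described above.  The most negative L16
 * value maps to 0 in L8, matching the L8 definition. */
uint8_t l16_to_l8(int16_t s) { return (uint8_t)((s >> 8) + 128); }
int16_t l8_to_l16(uint8_t s) { return (int16_t)(((int)s - 128) << 8); }
```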
MPA: MPA denotes MPEG-I or MPEG-II audio encapsulated as elementary
streams. The encoding is defined in ISO standards ISO/IEC
11172-3 and 13818-3. The encapsulation is specified in RFC
TBD, Section 3. Sampling rate and channel count are contained
in the payload.
PCMU: PCMU is specified in CCITT/ITU-T recommendation G.711. Audio
data is encoded as eight bits per sample, after companding.
Code to convert between linear and mu-law companded data is
available in [2].
PCMA: PCMA is specified in CCITT/ITU-T recommendation G.711. Audio
data is encoded as eight bits per sample, after companding.
Code to convert between linear and A-law companded data is
available in [2].
LPC: LPC designates an experimental linear predictive encoding
written by Ron Frederick, Xerox PARC, available from
ftp://parcftp.xerox.com/pub/net-research/lpc.tar.Z.
VDVI: VDVI is a variable-rate version of DVI4, yielding speech bit
rates between 10 and 25 kb/s. It is specified for single-
channel operation only. It uses the following encoding:
DVI4 codeword VDVI bit pattern
0 00
1 010
2 1100
3 11100
4 111100
5 1111100
6 11111100
7 11111110
8 10
9 011
10 1101
11 11101
12 111101
13 1111101
14 11111101
15 11111111
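The mapping above is a prefix code, so a decoder can recover DVI4
codewords from the bit stream without explicit length fields. A minimal
sketch, representing bits as a character string purely for illustration:

```c
#include <string.h>

/* VDVI codewords from the table above, indexed by DVI4 codeword 0-15. */
static const char *vdvi_code[16] = {
    "00", "010", "1100", "11100", "111100", "1111100", "11111100",
    "11111110", "10", "011", "1101", "11101", "111101", "1111101",
    "11111101", "11111111"
};

/* Decode one DVI4 codeword from a string of '0'/'1' characters
 * starting at *pos; advances *pos past the codeword and returns the
 * codeword (0-15), or -1 if no codeword matches.  Matching shortest
 * codewords first is correct because the code is prefix-free. */
int vdvi_decode(const char *bits, size_t *pos)
{
    for (size_t len = 2; len <= 8; len++)
        for (int c = 0; c < 16; c++)
            if (strlen(vdvi_code[c]) == len &&
                strncmp(bits + *pos, vdvi_code[c], len) == 0) {
                *pos += len;
                return c;
            }
    return -1;
}
```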
TSP0: TSP0 designates the proprietary variable-rate, frame-based
encoding called True Speech. The encoding is defined for a
sampling rate of 7200 Hz and has an average data rate of 7200
bits per second. Further information is available by contact-
ing VocalTec (see VSC encoding) or the address: DSP Group,
Inc.
email: tsplayer@dsgp.com
VSC: VSC designates the proprietary variable-rate encoding called
Vocaltec Software Compression. The encoding is defined for a
sampling rate of 5500 Hz and has an average data rate of 963
bytes per second. Further information is available by contact-
ing Alon Cohen
VocalTec Ltd.
Maskit 1, Herzliya
Israel
phone: +972-9-5612121
email: alon@vocaltec.com
The standard audio encodings and their payload types are listed in
Table 2.
4. Video
The following video encodings are currently defined, with their
abbreviated names used for identification:
CelB: The CELL-B encoding is a proprietary encoding proposed by Sun
Microsystems. The byte stream format is described in RFC TBD.
CPV: This proprietary encoding, "Compressed Packet Video", is imple-
mented by Concept, Bolter, and ViewPoint Systems video codecs.
For further information, contact: Glenn Norem, President
ViewPoint Systems, Inc.
2247 Wisconsin Street, Suite 110
Dallas, TX 75229-2037
United States
Phone: +1-214-243-0634
JPEG: The encoding is specified in ISO Standards 10918-1 and
10918-2. The RTP payload format is as specified in RFC TBD.
H261: The encoding is specified in CCITT/ITU-T standard H.261. The
packetization and RTP-specific properties are described in RFC
TBD.
HDCC: The HDCC encoding is a proprietary encoding used by Silicon
Graphics. Contact
inperson@sgi.com for further details.
MPV: MPV designates the use of MPEG-I and MPEG-II video encoding ele-
mentary streams as specified in ISO Standards ISO/IEC 11172
and 13818-2, respectively. The RTP payload format is as speci-
fied in RFC TBD, Section 3.
MP2T: MP2T designates the use of MPEG-II transport streams, for
either audio or video. The encapsulation is described in RFC
TBD, Section 2.
nv: The encoding is implemented in the program 'nv' developed at
Xerox PARC by Ron Frederick.
CUSM: The encoding is implemented in the program CU-SeeMe developed
at Cornell University by Dick Cogger, Scott Brim, Tim Dorcey
and John Lynn.
PicW: The encoding is implemented in the program PictureWindow
developed at Bolt, Beranek and Newman (BBN).
RGB8: 8-bit encoding of RGB values, sequenced TBD. Each pixel can
assume values from 0 to 255. Each frame is prefixed by a
header containing TBD.
5. Payload Type Definitions
Table 2 defines this profile's static payload type values for the
PT field of the RTP data header. To assign a new value from the range
marked 'unassigned' in the table, register your RTP Payload Format
specification with the IANA.
In addition, payload type values in the range 96--127 may be
defined dynamically through a conference control protocol, which is
beyond the scope of this document. The payload type range marked
'reserved' has been set aside so that RTCP and RTP packets can be reli-
ably distinguished (see Section "Summary of Protocol Constants" of the
RTP protocol specification).
An RTP source emits a single RTP payload type at any given time;
the interleaving of several RTP payload types in a single RTP session is
not allowed, but multiple RTP sessions may be used in parallel to send
multiple media. The payload types currently defined in this profile
carry either audio or video, but not both. However, it is allowed to
define payload types that combine several media, e.g., audio and video,
with appropriate separation in the payload format. Session participants
agree through mechanisms beyond the scope of this specification on the
set of allowable payload types in a given session. This set may, for
example, be defined by the capabilities of the applications used, nego-
tiated by a conference control protocol or established by agreement
between the human participants.
Audio applications operating under this profile SHOULD at minimum
be able to send and receive payload types 0 (mu-law) and 5 (DVI4). This
allows interoperability without format negotiation and successful nego-
tiation with a conference control protocol.
All current video encodings use a timestamp frequency of 90000 Hz,
the same as the MPEG presentation time stamp frequency. This frequency
yields exact integer timestamp increments for the typical 24, 25, and 30
Hz frame rates and 50 and 60 Hz field rates and only 1 ppm error for the
29.97 Hz NTSC frame rate. While 90 kHz is the recommended rate for
future video encodings used within this profile, other rates are possi-
ble. However, it is not sufficient to use the video frame rate (typi-
cally between 15 and 30 Hz) because that does not provide adequate reso-
lution for typical synchronization requirements when calculating the RTP
timestamp corresponding to the NTP timestamp in an RTCP SR packet [8].
The timestamp resolution must also be sufficient for the jitter estimate
contained in the receiver reports.
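For example, at the 90 kHz clock the per-frame timestamp increments
work out as follows. The arithmetic is purely illustrative; the NTSC
rate is expressed exactly as the ratio 30000/1001:

```c
/* RTP timestamp increment per video frame at the 90 kHz clock.  The
 * frame rate is given as a ratio so that the 30000/1001 Hz NTSC rate
 * is representable exactly. */
unsigned long ts_increment(unsigned long rate_num, unsigned long rate_den)
{
    return 90000UL * rate_den / rate_num;
}
```

This yields exact integer increments of 3000, 3600, and 3750 for 30,
25, and 24 Hz frame rates, and 3003 for the NTSC rate.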
The standard video encodings and their payload types are listed in
Table 2.
PT encoding audio/video clock rate channels
name (A/V) (Hz) (audio)
___________________________________________________________________
0 PCMU A 8000 1
1 1016 A 8000 1
2 G721 A 8000 1
3 GSM A 8000 1
4 unassigned A 8000 1
5 DVI4 A 8000 1
6 DVI4 A 16000 1
7 LPC A 8000 1
8 PCMA A 8000 1
9 G722 A 8000 1
10 L16 A 44100 2
11 L16 A 44100 1
12 TSP0 A 7200 1
13 VSC A 5500 1
14 MPA A 90000 (see text)
15 G728 A 8000 1
16--22 unassigned A
23 RGB8 V 90000 N/A
24 HDCC V 90000 N/A
25 CelB V 90000 N/A
26 JPEG V 90000 N/A
27 CUSM V 90000 N/A
28 nv V 90000 N/A
29 PicW V 90000 N/A
30 CPV V 90000 N/A
31 H261 V 90000 N/A
32 MPV V 90000 N/A
33 MP2T V 90000 N/A
34--71 unassigned V N/A
72--76 reserved N/A N/A N/A
77--95 unassigned ?
96--127 dynamic ? N/A
Table 2: Payload types (PT) for standard audio and video encodings
6. Port Assignment
As specified in the RTP protocol definition, RTP data is to be car-
ried on an even UDP port number and the corresponding RTCP packets are
to be carried on the next higher (odd) port number.
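The pairing rule can be sketched as follows; the helper is illustrative
and not part of the profile:

```c
/* Given an even RTP data port, the companion RTCP port is the next
 * higher (odd) port number.  Returns 0 for an invalid (odd) RTP port;
 * allocation of the even port itself is left to the application or a
 * session management program. */
unsigned rtcp_port(unsigned rtp_port)
{
    return (rtp_port % 2 == 0) ? rtp_port + 1 : 0;
}
```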
Applications operating under this profile may use any such UDP port
pair. For example, the port pair may be allocated randomly by a session
management program. A single fixed port number pair cannot be required
because multiple applications using this profile are likely to run on
the same host, and there are some operating systems that do not allow
multiple processes to use the same UDP port with different multicast
addresses.
However, port numbers 5004 and 5005 have been registered for use
with this profile for those applications that choose to use them as the
default pair. Applications that operate under multiple profiles may use
this port pair as an indication to select this profile if they are not
subject to the constraint of the previous paragraph. Applications need
not have a default and may require that the port pair be explicitly
specified. The particular port numbers were chosen to lie in the range
above 5000 to accommodate port number allocation practice within the Unix
operating system, where port numbers below 1024 can only be used by
privileged processes and port numbers between 1024 and 5000 are automat-
ically assigned by the operating system.
7. Acknowledgements
The comments and careful review of Steve Casner are gratefully ack-
nowledged.
8. Address of Author
Henning Schulzrinne
GMD Fokus
Hardenbergplatz 2
D-10623 Berlin
Germany
electronic mail: schulzrinne@fokus.gmd.de