Internet Engineering Task Force      Audio-Video Transport Working Group
Internet Draft                                            H. Schulzrinne
ietf-avt-profile-05.txt                                        GMD Fokus
                                                            July 7, 1995
                                                        Expires: 12/1/95


    RTP Profile for Audio and Video Conferences with Minimal Control

STATUS OF THIS MEMO

     This document is an  Internet-Draft.  Internet-Drafts  are  working
documents  of the Internet Engineering Task Force (IETF), its areas, and
its working groups.  Note that other groups may also distribute  working
documents as Internet-Drafts.

     Internet-Drafts are draft documents valid  for  a  maximum  of  six
months  and may be updated, replaced, or obsoleted by other documents at
any time.  It is  inappropriate  to  use  Internet-Drafts  as  reference
material or to cite them other than as ``work in progress''.

     To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt''  listing  contained  in the Internet-Drafts Shadow
Directories   on   ftp.is.co.za   (Africa),   nic.nordu.net    (Europe),
munnari.oz.au   (Pacific  Rim),  ds.internic.net  (US  East  Coast),  or
ftp.isi.edu (US West Coast).

     Distribution of this document is unlimited.

                                ABSTRACT


      This note describes a profile for the use of  the  real-time
      transport  protocol  (RTP) and the associated control proto-
      col, RTCP, within audio and video  multiparticipant  confer-
      ences  with  minimal control. It provides interpretations of
      generic fields within the  RTP  specification  suitable  for
      audio  and  video  conferences. In particular, this document
      defines a set of default mappings from payload type  numbers
      to encodings.

           The document also describes how audio  and  video  data
      may  be  carried  within  RTP.  It defines a set of standard
      encodings and their names when used within RTP. However, the
      definitions  are  independent  of  the  particular transport
      mechanism used. The descriptions provide pointers to  refer-
      ence  implementations and the detailed standards. This docu-
      ment is meant as an aid for implementors of audio, video and



H. Schulzrinne                                                [Page 1]


Internet Draft                 AV Profile                   July 7, 1995


      other real-time multimedia applications.


Changes

     (This section will not become part of the RFC.)

     o  Video frequency changed to 90 kHz.

     o  Short reference labels for profile definitions.

     o  Explain differences between Intel/IMA DVI  format  and  the  one
       used for this profile; name changed from IDVI to DVI4.

     o  Minor editorial clarifications.


1.  Introduction


     This profile defines aspects of RTP left unspecified in the RTP
protocol definition (RFC TBD). It is intended for use within audio and
video conferences with minimal session control. In particular, no
support for the negotiation of parameters or membership control is
provided. Other profiles may make different choices for the items
specified here. The profile specifies the use of RTP over unicast and
multicast UDP. (This does not preclude the use of these definitions
when RTP is carried by other lower-layer protocols.) The profile is
selected implicitly by running the appropriate applications; there is
no explicit indication by port number, protocol identifier or the like.

2.  RTP and RTCP Packet Forms and Protocol Behavior


     This profile follows the default and/or recommended aspects of  the
RTP specification for these items:

     Header:
          The standard format of the fixed RTP data header is used  (one
          marker bit).

     Extension:
          No additional fixed  fields  are  appended  to  the  RTP  data
          header.

     RTCP report interval:
          The suggested constants are to be used  for  the  RTCP  report
          interval calculation.





     SR/RR extension:
          No extension section is defined for the RTCP SR or RR packet.

     RTCP packet types:
          No additional RTCP packet types are defined  by  this  profile
          specification.

     Security:
          The RTP default security services are also the  default  under
          this profile.

     Mapping:
          The standard  mapping  of  RTP  and  RTCP  to  transport-level
          addresses is used.

     Encapsulation:
          No encapsulation of RTP packets is specified.

     RTP header extensions:
          No RTP header extensions are defined, but applications operat-
          ing under this profile may use such extensions. Thus, applica-
          tions should not assume that the RTP header X  bit  is  always
          zero and should be prepared to ignore the header extension. If
          a header extension is defined in the future,  that  definition
          must specify the contents of the first 16 bits.

     SDES use:
          Applications may use any of the SDES items described.
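
     The rule that a receiver must be prepared to ignore a header
extension can be sketched in C as follows. This is only an
illustration: it assumes the version 2 fixed header layout of the RTP
specification (12 octets, followed by CC 32-bit CSRC entries, with the
extension length carried in 32-bit words in the second 16 bits of the
extension). The function name rtp_payload_offset is hypothetical and
is not defined by this profile.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the offset of the payload within an RTP packet, or -1 if the
 * packet is too short or is not RTP version 2.
 * Assumed layout: V(2) P(1) X(1) CC(4) | M(1) PT(7) | seq(16) |
 * timestamp(32) | SSRC(32) | CC x CSRC(32) | optional extension. */
long rtp_payload_offset(const uint8_t *pkt, size_t len)
{
    if (len < 12 || (pkt[0] >> 6) != 2)
        return -1;

    size_t off = 12 + 4u * (pkt[0] & 0x0F);  /* fixed header + CSRC list */

    if (pkt[0] & 0x10) {                     /* X bit is set */
        if (len < off + 4)
            return -1;
        /* The first 16 bits are profile-defined and ignored here; the
         * next 16 bits give the extension length in 32-bit words. */
        size_t words = ((size_t)pkt[off + 2] << 8) | pkt[off + 3];
        off += 4 + 4 * words;
    }
    return off <= len ? (long)off : -1;
}
```

     A receiver written this way never inspects the extension contents,
so it keeps working if a sender starts using an extension it does not
know about.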

     New encodings are to  be  registered  with  the  Internet  Assigned
Numbers Authority. When registering a new encoding, the following infor-
mation should be provided:

     o  name and description of encoding, in particular the  RTP  times-
       tamp clock rate;

     o  indication of who has change  control  over  the  encoding  (for
       example, CCITT/ITU, other international standardization bodies, a
       consortium or a particular company or group of companies);

     o  any operating parameters;

     o  a reference to a further description, if available, for  example
       (in order of preference) an RFC, a published paper, a patent fil-
       ing, a technical report or a computer manual;

     o  for proprietary encodings, contact information (postal and email
       address);

     o  the payload type value for this profile.


3.  Audio


3.1.  Encoding-independent recommendations


     The first packet of a talkspurt is distinguished by  a  set  marker
bit in the RTP data header.

     The following recommendations  are  default  operating  parameters.
Applications should be prepared to handle other values. The ranges given
are meant to give guidance to application writers,  allowing  a  set  of
applications  conforming  to  these  guidelines  to interoperate without
additional negotiation. These guidelines are not  intended  to  restrict
operating  parameters  for  applications  that  can  negotiate  a set of
interoperable parameters, e.g., through a conference control protocol.

     For packetized audio, the  default  packetization  interval  should
have  a  duration  of  20 ms, unless otherwise noted when describing the
encoding. The packetization interval determines the  minimum  end-to-end
delay;  longer  packets  introduce less header overhead but higher delay
and make packet loss more noticeable. For  non-interactive  applications
such  as  lectures  or links with severe bandwidth constraints, a higher
packetization delay may be appropriate.
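
     The trade-off between packetization interval and header overhead
can be made concrete with a little arithmetic. The C sketch below is
illustrative only; the function names are hypothetical, and the
12-octet figure used in the note that follows is the size of the fixed
RTP header without CSRC entries (lower-layer headers are not counted).

```c
#include <stdint.h>

/* Samples per channel carried by one packet of the given duration. */
uint32_t samples_per_packet(uint32_t rate_hz, uint32_t interval_ms)
{
    return (uint32_t)(((uint64_t)rate_hz * interval_ms) / 1000);
}

/* Share of each packet, in percent, taken up by a header of
 * hdr_octets octets in front of payload_octets octets of audio. */
double header_overhead_pct(uint32_t payload_octets, uint32_t hdr_octets)
{
    return 100.0 * hdr_octets / (payload_octets + hdr_octets);
}
```

     For 8000 Hz audio at the default 20 ms interval, each packet
carries 160 samples per channel; with an 8-bit encoding and a 12-octet
header, the header accounts for roughly 7 percent of the packet.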

     For N-channel encodings, each sampling period  (say,  1/8000  of  a
second) generates N samples. (This terminology is standard, but somewhat
confusing, as the total number of samples generated per second  is  then
the sampling rate times the channel count.)

     If multiple audio channels are used, channels are numbered left to
right, starting at one. In RTP audio packets, information from
lower-numbered channels precedes that from higher-numbered channels. For
more than two channels, the convention of the AIFF-C audio interchange
format should be followed [1]. For two-channel stereo, the numbering
sequence is left, right; for three channels, left, right, center; for
quadrophonic systems, front left, front right, rear left, rear right;
for four-channel systems, left, center, right and surround sound; for
six-channel systems, left, left center, center, right, right center and
surround sound. All channels belonging to a single sampling instant
must be within the same packet.

     The sampling frequency should be drawn from the set: 8000, 11025,
16000, 22050, 44100 and 48000 Hz. (The Apple Macintosh computers have
native sample rates of 22254.54 and 11127.27 Hz, which can be converted
to 22050 and 11025 Hz with acceptable quality by dropping 4 or 2
samples, respectively, in a 20 ms frame.)

     A receiver should accept packets representing between 0 and 200  ms
of audio data.[1] Receivers should be prepared to  accept  multi-channel
audio, but may choose to only play a single channel.

3.2.  Guidelines for Sample-Based Audio Encodings


     In sample-based encodings, each audio sample is  represented  by  a
fixed  number of bits. Within the compressed audio data, codes for indi-
vidual samples may span octet boundaries. An RTP audio packet  may  con-
tain  any  number  of  audio samples, subject to the constraint that the
number of bits per sample times the number of samples per packet  yields
an  integral  octet  count.  Fractional  encodings produce less than one
octet per sample.
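
     The integral-octet constraint can be checked mechanically. The
following C sketch is an illustration only; the function name is
hypothetical, and the sample count is taken to be the total over all
channels in the packet.

```c
/* Octets occupied by `samples` samples of `bits_per_sample` bits each,
 * or -1 if the total is not a whole number of octets. */
long sample_payload_octets(unsigned bits_per_sample, unsigned long samples)
{
    unsigned long bits = (unsigned long)bits_per_sample * samples;
    return (bits % 8 == 0) ? (long)(bits / 8) : -1;
}
```

     With a 4-bit encoding, 160 samples fill 80 octets, whereas 161
samples would leave half an octet unused and are therefore not a valid
packet size.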

     For sample-based encodings producing one or more octets per sample,
samples from different channels taken at the same sampling instant are
packed in consecutive octets. For example, for a two-channel encoding,
the octet sequence is (left channel, first sample), (right channel,
first sample), (left channel, second sample), (right channel, second
sample), .... For multi-octet encodings, octets are transmitted in
network byte order (i.e., most significant octet first).
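
     As a sketch of this interleaving rule, the C function below packs
two-channel 16-bit linear samples into a payload buffer. The function
name pack_l16_stereo is hypothetical; the caller is assumed to provide
an output buffer of at least 4*n octets.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes n sampling instants of two-channel 16-bit audio into out:
 * left before right for each instant, most significant octet first. */
void pack_l16_stereo(const int16_t *left, const int16_t *right,
                     size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t l = (uint16_t)left[i];
        uint16_t r = (uint16_t)right[i];
        *out++ = (uint8_t)(l >> 8);
        *out++ = (uint8_t)(l & 0xFF);
        *out++ = (uint8_t)(r >> 8);
        *out++ = (uint8_t)(r & 0xFF);
    }
}
```

     Writing the octets explicitly, rather than copying the in-memory
representation, keeps the output independent of the byte order of the
sending host.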

     The packing order for fractional encodings is that described for
the IMA Wave types [2]. For audio encodings yielding four bits per
sample, eight such compressed samples from channel 1 are packed into one
32-bit word, followed by eight compressed samples from channel 2, and so
on until all channels have been accommodated, whereupon the packing
resumes at channel 1. For audio encodings yielding three bits per
sample, 32 such compressed samples from channel 1 are packed into 12
octets, followed by 32 samples from channel 2, etc.
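
     For the four-bit case, the position of a sample in the compressed
data can be computed directly. The C sketch below is illustrative; the
function name is hypothetical, and it uses zero-based channel and
sample indices (the text numbers channels from one). It deliberately
says nothing about the order of the eight samples inside a word, which
is left to the IMA description in [2].

```c
/* Index of the 32-bit word that holds sample s (zero-based) of channel
 * c (zero-based) when nchannels channels of 4-bit samples are packed
 * eight samples per word, one word per channel in turn. */
unsigned long nibble_word_index(unsigned long s, unsigned c,
                                unsigned nchannels)
{
    return (s / 8) * nchannels + c;
}
```

     For two channels, samples 0 to 7 of the first channel fall in word
0, samples 0 to 7 of the second channel in word 1, and samples 8 to 15
of the first channel in word 2.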

3.3.  Guidelines for Frame-Based Audio Encodings



     Frame-based encodings encode a fixed-length block of audio into
another block of compressed data, typically also of fixed length. For
frame-based encodings, the sender may choose to combine several such
frames into a single message. The receiver can tell the number of frames
contained in a message since the frame duration is defined as part of
the encoding.
_________________________

  [1] This restriction allows reasonable buffer  sizing
for the receiver.

     For frame-based codecs, the channel order is defined for the  whole
block.  That is, for two-channel audio, right and left samples are coded
independently, with the encoded frame for  the  left  channel  preceding
that for the right channel.

     All frame-oriented audio codecs should be able to encode and decode
several consecutive frames within a single packet. Since the frame size
for the frame-oriented codecs is given, there is no need to use a
separate designation for the same encoding with a different number of
frames per packet.
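
     A receiver can derive the frame count from the payload length
whenever the encoding has a fixed frame size in octets. The following
C sketch is illustrative only; the function name is hypothetical, and
the frame size must be supplied from the definition of the encoding in
use, since this profile does not tabulate it.

```c
/* Number of whole frames in a payload of payload_octets octets, given
 * a fixed frame size; -1 if the payload is not a whole number of
 * frames or the frame size is zero. */
long frames_in_payload(unsigned long payload_octets, unsigned frame_octets)
{
    if (frame_octets == 0 || payload_octets % frame_octets != 0)
        return -1;
    return (long)(payload_octets / frame_octets);
}
```

     Multiplying the result by the frame duration from Table 1 gives
the amount of audio carried by the packet.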

3.4.  Audio Encodings



       encoding     sample/frame     bits/sample     ms/frame
       __________________________________________________________
       1016         frame            N/A             30
       G721         sample           4
       G722         sample           8
       G728         frame            N/A             2.5
       GSM          frame            N/A             20
       DVI4         sample           4
       LPC          frame            N/A             20
       L8           sample           8
       L16          sample           16
       MPA          frame            N/A
       PCMU         sample           8
       PCMA         sample           8



Table 1: Properties of Audio Encodings

     1016: Encoding 1016 is a frame-based encoding using code-excited
          linear prediction (CELP) and is specified in Federal Standard
          FED-STD 1016 [3,4,5,6].

          Fortran and C simulation source codes for version 3.2 of the
          U.S. DoD's Federal-Standard-1016-based 4800 bps code-excited
          linear prediction voice coder (CELP 3.2) are available for
          worldwide distribution at no charge (on DOS diskettes, but
          configured to compile on Sun SPARC stations) from: Bob
          Fenichel, National Communications System, Washington, D.C.
          20305, phone +1-703-692-2124, fax +1-703-746-4960, and at
               ftp://ftp.super.org/pub/speech/celp_3.2a.tar.Z.





     G721: G721 is specified in ITU recommendation G.721. Reference
          implementations for G.721 are available as part of the
          CCITT/ITU-T Software Tool Library (STL) from the ITU General
          Secretariat, Sales Service, Place des Nations, CH-1211 Geneve
          20, Switzerland. The library is covered by a license and is
          available at
           ftp://gaia.cs.umass.edu/pub/hgschulz/ccitt/ccitt_tools.tar.Z

     G722: G722 is specified  in  ITU-T  recommendation  G.722,  "7  kHz
          audio-coding within 64 kbit/s".

     G728: G728 is specified in ITU-T recommendation G.728,  "Coding  of
          speech  at 16 kbit/s using low-delay code excited linear pred-
          iction".

     GSM: GSM (groupe special mobile) denotes the European GSM 06.10
          provisional standard for full-rate speech transcoding, prI-ETS
          300 036, which is based on RPE/LTP (residual pulse
          excitation/long term prediction) coding at a rate of 13 kb/s.
          A reference implementation was written by Carsten Bormann and
          Jutta Degener (TU Berlin, Germany) and is available at
               ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/.

     DVI4: DVI4 is specified, with pseudo-code, in [2] as the ADPCM wave
          type. However, the encoding defined here as DVI4 differs in
          two respects from the IMA recommendation:

          - The header contains the predicted value rather than the
            first sample value.

          - IMA ADPCM blocks contain an odd number of samples, since the
            first sample of a block is contained just in the header
            (uncompressed), followed by an even number of compressed
            samples. DVI4 has an even number of compressed samples only,
            using the 'predict' word from the header to decode the first
            sample.

          Each packet contains a single DVI block. The profile only
          defines the 4-bit-per-sample version, while IMA also specifies
          a 3-bit-per-sample encoding.

          The "header" word for each channel has the following
          structure:











               int16  predict;  /* predicted value of first sample
                                   from the previous block (L16 format) */
               u_int8 index;    /* current index into stepsize table */
               u_int8 reserved; /* set to zero by sender, ignored by receiver */





          Header words for all channels precede the compressed data.
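
          A receiver might read one such header word as sketched below
          in C. This is an illustration, not a normative definition:
          the names dvi4_header and dvi4_read_header are hypothetical,
          and the sketch assumes that 'predict' is carried in network
          byte order, as for L16.

```c
#include <stdint.h>

struct dvi4_header {
    int16_t predict;   /* predicted value of first sample (L16 format) */
    uint8_t index;     /* current index into stepsize table */
    uint8_t reserved;  /* set to zero by sender, ignored by receiver */
};

/* Reads the 4-octet header word of one channel from p.
 * Assumption: predict is big-endian (network byte order). */
struct dvi4_header dvi4_read_header(const uint8_t *p)
{
    struct dvi4_header h;
    h.predict  = (int16_t)(uint16_t)(((uint16_t)p[0] << 8) | p[1]);
    h.index    = p[2];
    h.reserved = p[3];   /* carried along but not interpreted */
    return h;
}
```

          For an N-channel packet, the function would be called N times
          on consecutive 4-octet words before the compressed data is
          decoded.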

          An implementation is available from Jack Jansen via anonymous
          ftp from
                ftp://ftp.cwi.nl/local/pub/audio/adpcm.shar.

     L8:  L8 denotes linear audio data, using 8 bits of precision with
          an offset of 128, that is, the most negative signal is encoded
          as 0.

     L16: L16 denotes  uncompressed  audio  data,  using  16-bit  signed
          representation   with  65535  equally  divided  steps  between
          minimum and maximum  signal  level,  ranging  from  -32768  to
          32767.  The  value is represented in two's complement notation
          and network byte order.
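
          The two linear formats can be decoded as sketched below in C.
          The function names are hypothetical and serve only to
          illustrate the offset-128 and two's complement conventions
          described above.

```c
#include <stdint.h>

/* L8: offset-128 octet to a signed value in the range -128..127. */
int l8_to_signed(uint8_t v)
{
    return (int)v - 128;
}

/* Signed value in the range -128..127 to an L8 octet. */
uint8_t signed_to_l8(int s)
{
    return (uint8_t)(s + 128);
}

/* L16: reads one two's complement sample in network byte order. */
int16_t l16_read(const uint8_t *p)
{
    return (int16_t)(uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}
```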

     MPA: MPA denotes MPEG-I or MPEG-II audio encapsulated as elementary
          streams.  The  encoding  is  defined  in ISO standards ISO/IEC
          11172-3 and 13818-3. The encapsulation  is  specified  in  RFC
          TBD,  Section 3. Sampling rate and channel count are contained
          in the payload.

     PCMU: PCMU is specified in CCITT/ITU-T recommendation G.711.  Audio
          data  is  encoded  as eight bits per sample, after companding.
          Code to convert between linear and mu-law  companded  data  is
          available in [2].

     PCMA: PCMA is specified in CCITT/ITU-T recommendation G.711.  Audio
          data  is  encoded  as eight bits per sample, after companding.
          Code to convert between linear and  A-law  companded  data  is
          available in [2].

     LPC: LPC designates  an  experimental  linear  predictive  encoding
          written by Ron Frederick, Xerox PARC, available from
               ftp://parcftp.xerox.com/pub/net-research/lpc.tar.Z.

     VDVI: VDVI is a variable-rate version of DVI4, yielding speech bit
          rates of between 10 and 25 kbps. It is specified for
          single-channel operation only. It uses the following encoding:


                        DVI4 codeword     VDVI bit pattern
                                    0     00
                                    1     010
                                    2     1100
                                    3     11100
                                    4     111100
                                    5     1111100
                                    6     11111100
                                    7     11111110
                                    8     10
                                    9     011
                                   10     1101
                                   11     11101
                                   12     111101
                                   13     1111101
                                   14     11111101
                                   15     11111111
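
          The table above is a prefix code, so a bit stream of VDVI
          patterns can be decoded without separators. The C sketch
          below illustrates one way to encode and decode it. The
          function names are hypothetical, and the sketch assumes that
          bits are written most significant bit first within each
          octet, which this document does not state.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit patterns and their lengths, indexed by DVI4 codeword. */
static const uint8_t vdvi_bits[16] = {
    0x00, 0x02, 0x0C, 0x1C, 0x3C, 0x7C, 0xFC, 0xFE,
    0x02, 0x03, 0x0D, 0x1D, 0x3D, 0x7D, 0xFD, 0xFF
};
static const uint8_t vdvi_len[16] = {
    2, 3, 4, 5, 6, 7, 8, 8,
    2, 3, 4, 5, 6, 7, 8, 8
};

/* Encodes n codewords into out (cap octets, zero-filled first).
 * Returns the number of bits written, or 0 if out is too small.
 * Assumption: MSB-first bit order within each octet. */
size_t vdvi_encode(const uint8_t *cw, size_t n, uint8_t *out, size_t cap)
{
    size_t bitpos = 0;
    for (size_t i = 0; i < cap; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t c = cw[i] & 0x0F;
        for (int b = vdvi_len[c] - 1; b >= 0; b--, bitpos++) {
            if (bitpos / 8 >= cap)
                return 0;
            if ((vdvi_bits[c] >> b) & 1)
                out[bitpos / 8] |= (uint8_t)(0x80 >> (bitpos % 8));
        }
    }
    return bitpos;
}

/* Decodes up to max codewords from the first nbits bits of in.
 * Returns the number of codewords decoded. */
size_t vdvi_decode(const uint8_t *in, size_t nbits, uint8_t *cw, size_t max)
{
    size_t pos = 0, count = 0;
    while (count < max && pos < nbits) {
        unsigned acc = 0;
        int len = 0, found = 0;
        while (!found && len < 8 && pos < nbits) {
            acc = (acc << 1) | ((in[pos / 8] >> (7 - pos % 8)) & 1u);
            pos++;
            len++;
            for (int c = 0; c < 16; c++) {
                if (vdvi_len[c] == len && vdvi_bits[c] == acc) {
                    cw[count++] = (uint8_t)c;
                    found = 1;
                    break;
                }
            }
        }
        if (!found)
            break;              /* truncated input */
    }
    return count;
}
```

          Because the short patterns are assigned to the codewords with
          the smallest magnitude (0, 8, 1 and 9), quiet passages
          compress best.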



     TSP0: TSP0 designates the proprietary variable-rate, frame-based
          encoding called True Speech. The encoding is defined for a
          sampling rate of 7200 Hz and has an average data rate of 7200
          bits per second. Further information is available by
          contacting VocalTec (see VSC encoding) or
          DSP Group, Inc.
          email: tsplayer@dsgp.com

     VSC: VSC designates the proprietary variable-rate encoding called
          VocalTec Software Compression. The encoding is defined for a
          sampling rate of 5500 Hz and has an average data rate of 963
          bytes per second. Further information is available by
          contacting
          Alon Cohen
          VocalTec Ltd.
          Maskit 1, Herzliya
          Israel
          phone: +972-9-5612121
          email: alon@vocaltec.com

     The standard audio encodings and their payload types are listed in
Table 2.









4.  Video

     The following video encodings are  currently  defined,  with  their
abbreviated names used for identification:

     CelB: The CELL-B encoding is a proprietary encoding proposed by Sun
          Microsystems. The byte stream format is described in RFC TBD.

     CPV: This proprietary encoding, "Compressed Packet Video", is
          implemented by Concept, Bolter, and ViewPoint Systems video
          codecs. For further information, contact:
          Glenn Norem, President
          ViewPoint Systems, Inc.
          2247 Wisconsin Street, Suite 110
          Dallas, TX 75229-2037
          United States
          Phone: +1-214-243-0634

     JPEG: The encoding  is  specified  in  ISO  Standards  10918-1  and
          10918-2. The RTP payload format is as specified in RFC TBD.


     H261: The encoding is specified in CCITT/ITU-T standard H.261.  The
          packetization and RTP-specific properties are described in RFC
          TBD.

     HDCC: The HDCC encoding is a proprietary encoding used by Silicon
          Graphics. Contact inperson@sgi.com for further details.

     MPV: MPV designates the use of MPEG-I and MPEG-II video encoding
          elementary streams as specified in ISO Standards ISO/IEC 11172
          and 13818-2, respectively. The RTP payload format is as
          specified in RFC TBD, Section 3.

     MP2T: MP2T designates the use of  MPEG-II  transport  streams,  for
          either  audio  or video. The encapsulation is described in RFC
          TBD, Section 2.

     nv:  The encoding is implemented in the program 'nv'  developed  at
          Xerox PARC by Ron Frederick.

     CUSM: The encoding is implemented in the program CU-SeeMe developed
          at  Cornell  University by Dick Cogger, Scott Brim, Tim Dorcey
          and John Lynn.

     PicW: The encoding is  implemented  in  the  program  PictureWindow
          developed at Bolt, Beranek and Newman (BBN).





     RGB8: 8-bit encoding of RGB values, sequenced TBD.  Each pixel  can
          assume  values  from  0  to  255.  Each frame is prefixed by a
          header containing TBD.

5.  Payload Type Definitions


     Table 2 defines this profile's static payload type values for the
PT field of the RTP data header. To assign a new value from the range
marked 'unassigned' in the table, register your RTP Payload Format
specification with the IANA.

     In addition, payload type  values  in  the  range  96--127  may  be
defined  dynamically  through  a  conference  control protocol, which is
beyond the scope  of  this  document.  The  payload  type  range  marked
'reserved'  has been set aside so that RTCP and RTP packets can be reli-
ably distinguished (see Section "Summary of Protocol Constants"  of  the
RTP protocol specification).

     An RTP source emits a single RTP payload type at  any  given  time;
the interleaving of several RTP payload types in a single RTP session is
not allowed, but multiple RTP sessions may be used in parallel  to  send
multiple  media.  The  payload  types  currently defined in this profile
carry either audio or video, but not both. However,  it  is  allowed  to
define  payload types that combine several media, e.g., audio and video,
with appropriate separation in the payload format. Session  participants
agree  through  mechanisms beyond the scope of this specification on the
set of allowable payload types in a given session.  This  set  may,  for
example,  be defined by the capabilities of the applications used, nego-
tiated by a conference control  protocol  or  established  by  agreement
between the human participants.

     Audio applications operating under this profile SHOULD at minimum
be able to send and receive payload types 0 (PCMU, mu-law) and 5 (DVI4).
This allows interoperability without format negotiation and successful
negotiation with a conference control protocol.

     All current video encodings use a timestamp frequency of 90000  Hz,
the  same  as the MPEG presentation time stamp frequency. This frequency
yields exact integer timestamp increments for the typical 24, 25, and 30
Hz frame rates and 50 and 60 Hz field rates and only 1 ppm error for the
29.97 Hz NTSC frame rate. While 90  kHz  is  the  recommended  rate  for
future  video encodings used within this profile, other rates are possi-
ble. However, it is not sufficient to use the video  frame  rate  (typi-
cally between 15 and 30 Hz) because that does not provide adequate reso-
lution for typical synchronization requirements when calculating the RTP
timestamp  corresponding  to the NTP timestamp in an RTCP SR packet [8].
The timestamp resolution must also be sufficient for the jitter estimate
contained in the receiver reports.
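
     The timestamp increments quoted above can be reproduced with the
following C sketch. It is illustrative only; the function name is
hypothetical, and frame rates are passed as a fraction num/den so that
29.97 Hz can be written as 2997/100.

```c
#include <stdint.h>

/* 90 kHz timestamp ticks per frame for a rate of num/den frames per
 * second, rounded to the nearest integer. *exact is set to 1 when the
 * increment is an exact integer and to 0 otherwise. */
uint32_t ticks_per_frame(uint32_t num, uint32_t den, int *exact)
{
    uint64_t t = (uint64_t)90000 * den;
    *exact = (t % num == 0);
    return (uint32_t)((t + num / 2) / num);
}
```

     The results are 3750, 3600 and 3000 ticks for 24, 25 and 30 Hz,
and 1800 and 1500 ticks for 50 and 60 Hz, all exact. For 29.97 Hz the
true increment is about 3003.003 ticks, so using 3003 gives the 1 ppm
error mentioned above.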

     The standard video encodings and their payload types are listed in
Table 2.


  PT         encoding       audio/video     clock rate     channels
             name           (A/V)           (Hz)           (audio)
  ___________________________________________________________________
  0          PCMU           A               8000           1
  1          1016           A               8000           1
  2          G721           A               8000           1
  3          GSM            A               8000           1
  4          unassigned     A               8000           1
  5          DVI4           A               8000           1
  6          DVI4           A               16000          1
  7          LPC            A               8000           1
  8          PCMA           A               8000           1
  9          G722           A               8000           1
  10         L16            A               44100          2
  11         L16            A               44100          1
  12         TSP0           A               7200           1
  13         VSC            A               5500           1
  14         MPA            A               90000          (see text)
  15         G728           A               8000           1
  16--22     unassigned     A
  23         RGB8           V               90000          N/A
  24         HDCC           V               90000          N/A
  25         CelB           V               90000          N/A
  26         JPEG           V               90000          N/A
  27         CUSM           V               90000          N/A
  28         nv             V               90000          N/A
  29         PicW           V               90000          N/A
  30         CPV            V               90000          N/A
  31         H261           V               90000          N/A
  32         MPV            V               90000          N/A
  33         MP2T           V               90000          N/A
  34--71     unassigned     V                              N/A
  72--76     reserved       N/A             N/A            N/A
  77--95     unassigned     ?
  96--127    dynamic        ?                              N/A



Table 2: Payload types (PT) for standard audio and video encodings
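
     The ranges in Table 2 can be turned into a simple classifier, as
sketched below in C. The enumeration and function names are
hypothetical; the boundaries are taken from the table as it stands in
this draft and would need updating whenever a new payload type is
registered.

```c
enum pt_class {
    PT_STATIC,      /* assigned to an encoding in Table 2 */
    PT_UNASSIGNED,  /* available for registration with the IANA */
    PT_RESERVED,    /* set aside to distinguish RTP from RTCP */
    PT_DYNAMIC,     /* defined through a conference control protocol */
    PT_INVALID      /* does not fit the 7-bit PT field */
};

/* Classifies a payload type value according to Table 2. */
enum pt_class classify_pt(unsigned pt)
{
    if (pt > 127)
        return PT_INVALID;
    if (pt >= 96)
        return PT_DYNAMIC;
    if (pt >= 77)
        return PT_UNASSIGNED;
    if (pt >= 72)
        return PT_RESERVED;
    if (pt >= 34)
        return PT_UNASSIGNED;
    if ((pt >= 16 && pt <= 22) || pt == 4)
        return PT_UNASSIGNED;
    return PT_STATIC;
}
```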








6.  Port Assignment


     As specified in the RTP protocol definition, RTP data is to be car-
ried  on  an even UDP port number and the corresponding RTCP packets are
to be carried on the next higher (odd) port number.
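
     The even/odd pairing rule can be expressed in a few lines of C.
The sketch below is illustrative; the function name is hypothetical.

```c
/* Returns the RTCP port that pairs with the given RTP data port, or -1
 * if the RTP port is odd or outside the range that leaves room for the
 * next higher port. */
int rtcp_port_for(int rtp_port)
{
    if (rtp_port < 0 || rtp_port > 65534 || (rtp_port & 1))
        return -1;
    return rtp_port + 1;
}
```

     An application that allocates ports randomly could draw candidates
until this function accepts one and both ports of the pair can be
bound.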

     Applications operating under this profile may use any such UDP port
pair.  For example, the port pair may be allocated randomly by a session
management program. A single fixed port number pair cannot  be  required
because  multiple  applications  using this profile are likely to run on
the same host, and there are some operating systems that  do  not  allow
multiple  processes  to  use  the same UDP port with different multicast
addresses.

     However, port numbers 5004 and 5005 have been  registered  for  use
with  this profile for those applications that choose to use them as the
default pair. Applications that operate under multiple profiles may  use
this  port  pair as an indication to select this profile if they are not
subject to the constraint of the previous paragraph.  Applications  need
not  have  a  default  and  may require that the port pair be explicitly
specified. The particular port numbers were chosen to lie in the range
above 5000 to accommodate port number allocation practice within the
Unix operating system, where port numbers below 1024 can only be used
by privileged processes and port numbers between 1024 and 5000 are
automatically assigned by the operating system.

7.  Acknowledgements

     The comments and careful review of Steve Casner are gratefully ack-
nowledged.

8.  Address of Author

Henning Schulzrinne
GMD Fokus
Hardenbergplatz 2
D-10623 Berlin
Germany
electronic mail: schulzrinne@fokus.gmd.de










