Internet Draft                                                  J. Rey
   draft-ietf-avt-rtp-3gpp-timed-text-02.txt                    Y. Matsui
                                                               Matsushita
   Expires: January 6, 2005                                  July 6, 2004


                  RTP Payload Format for 3GPP Timed Text

   Status of this Memo

   By submitting this Internet-Draft, we certify that any applicable
   patent or other IPR claims of which we are aware have been disclosed,
   and any of which we become aware will be disclosed, in accordance
   with RFC 3668 (BCP 79).

   By submitting this Internet-Draft, we accept the provisions of
   Section 3 of RFC 3667 (BCP 78).

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


   Abstract

   This document specifies an RTP payload format for the transmission of
   3GPP (3rd Generation Partnership Project) timed text.  3GPP timed
   text is a time-lined decorated text media format with defined storage
   in a 3GP file.  Timed Text can be synchronized with audio/video
   contents.  As of today, 3GP files containing timed text contents can
   only be downloaded via HTTP.  No mechanism is available for streaming
   3GPP timed text contents, either out of 3GP files or directly from
   live content.  In the following sections the problems of streaming
   timed text are addressed and a payload format for streaming 3GPP
   timed text over RTP is specified.








                 IETF draft - Expires January 6, 2005         [Page 1]


Table of Contents

   1. Terminology.....................................................2
   2. Introduction....................................................4
   3. RTP Payload Format for 3GPP Timed Text.........................13
   4. Resilient Transport............................................23
   5. Congestion control.............................................24
   6. Scene Description..............................................24
   7. MIME Type usage Registration...................................25
   8. SDP usage......................................................28
   9. IANA Considerations............................................30
   10. Security considerations.......................................30
   11. References....................................................31
   12. Annexes.......................................................32
   13. Acknowledgements..............................................36
   14. Author's Addresses............................................36
   15. IPR Notices...................................................36
   16. Full Copyright Statement......................................37
   17. Acknowledgement...............................................37


   [Note to the RFC Editor: please delete the Change Log section upon
   publication of this document as RFC]
   [Note to the RFC Editor: please replace "RFCXXXX" with the RFC
   designation of this document when published]

   Change Log

   Changes from draft-ietf-avt-rtp-3gpp-timed-text-01

   - editorial nits and clarifications to address comments as per email
   to the AVT mailing list, May 18, 2004:
      http://www1.ietf.org/mail-archive/web/avt/current/msg03729.html


1. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [5].

   Furthermore, the following terms are used and have specific meaning
   within the context of this document:

   text sample or whole text sample:

        this refers to a unit of timed text data as contained in the
        source 3GP file.  Its equivalent in audio/video would be a


   Internet Draft  RTP Payload Format for 3GPP Timed Text  July 6, 2004


        frame.  A text sample contains text strings followed by zero or
        more modifier boxes.


   fragment or text sample fragment:

        a fraction of a text sample.  A fragment may contain either text
        strings or modifier (decoration) contents, but not both at the
        same time.


   sample contents:

        general term to identify timed text data transported when using
        this payload format.


   text strings:

        text strings is the term used to denote the actual text
        characters encoded either as UTF-8 [18] or UTF-16 [19].  This
        text string MUST NOT contain any byte order mark; this is not
        needed as explained in Section 3.1.2.


   decoration/modifiers:

        the terms "decoration" and "modifiers" are used interchangeably
        throughout the document to denote the contents of the text
        sample that modify the default text formatting.  Modifiers may,
        for example, specify different font size for a particular
        sequence of characters or define karaoke timing for the sample.


   sample description:

        this term is used to denote information that applies to a text
        sample as a whole and per default.  Examples of such are
        scrolling direction, text box position, delay value, default
        font, background color, etc.  This information may also apply to
        different text samples.


   units or access units:

        the payload headers specified in this document encapsulate text
        samples, fragments thereof and sample descriptions by prepending
        a specific payload header, thus building what is called an
        (access) unit.


   aggregation / aggregate packet


   Rey & Matsui                                               [Page 3]


        An aggregate RTP packet consists of several units.


   track / stream

        3GP files contain audio/video and text tracks.  This document
        enables the streaming of these tracks using RTP.  Therefore both
        terms are used interchangeably in this document in the context
        of 3GP files.


   Media Header Box / Track Header Box / ...

        the 3GP file format makes use of these structures defined in the
        ISO Base File Format [2].  When referring to these in this
        document, initials are capitalized for clarity.


2. Introduction

   3GPP timed text is a media format for time-lined decorated text
   specified in [1].  3GPP Timed text contents may be stored in 3GP
   files or may be generated in real time.  The 3GP file format itself
   is based on the ISO Base Media File Format recommendation [2].
   Section 12.2 gives some insight into the 3GP file structure.

   The purpose of this draft is to provide a means to stream 3GPP timed
   text contents using RTP.  This includes the streaming of timed text
   being read out of a 3GP file as well as the streaming of timed text
   generated in real time, a.k.a. live streaming.

2.1 General Overview of the 3GPP Timed Text format

   The 3GPP timed text format was developed for use in the services
   specified in the 3GPP Transparent End-to-end Packet-switched
   Streaming Services (3GPP PSS) [16].  Besides plain text, the 3GPP
   timed text format allows the display of decorated text, for example
   for karaoke applications, scrolling text for newscasts or hyperlinked
   text.  Furthermore, these contents may or may not be synchronized
   with other media, like audio or video.

   The scope of the 3GPP PSS includes both downloading and streaming of
   multimedia content over 3G packet-switched networks.  However, due to
   the lack of an appropriate RTP payload format, the current usage of
   the 3GPP timed text file format is limited to downloading via HTTP.

   The 3GPP PSS adopts multimedia codecs (such as MPEG-4 Visual, AMR,
   MPEG-4 AAC, and JPEG) and protocols like SMIL [9] for presentation
   layouts or RTP [3] for streaming.  In general, a multimedia
   presentation might consist of several audio/video/text streams (or
   tracks in ISO file format jargon).  Different streams may have
   different contents.  The media may be spatially synchronized either
   using the information within the streams or a scene description
   language like SMIL.

   An example of this would be a media session with three different
   media streams: 1 audio, 1 video and 1 timed text that reproduces a
   music video with karaoke subtitles.  For each stream some information
   is needed, which defines the regions where each media is displayed,
   what the media looks like and how it is synchronized, among other
   things.  In karaoke, for example, the song lyrics are displayed below
   the music video and the words are highlighted in synchrony with the
   music track.

   Four distinct functional components can be identified in the 3GPP
   timed text media format:

        o initial spatial layout information related to the text track:
          these are the height and width of the text region where text
          is displayed, the position of the text region in the display
          and the layer or proximity of the text to the user.  These
          pieces of information are contained in the Track Header Box.
          Sections 6.1 and 12 provide further details.

        o default settings for formatting and positioning of text:
          style (font, size, colour,...), background colour, horizontal
          and vertical justification, line width, scrolling, etcetera.
          Sample descriptions contain such settings.

        o the actual text: encoded characters using either UTF-8 [18]
          or UTF-16 [19] encoding and,

        o the decoration inside the modifier boxes: if some characters
          have different style, some delay, blink, etcetera... this
          needs to be indicated by appending the modifier boxes to the
          text strings.  Modifier boxes are only present in the text
          samples if they are actually needed.  Otherwise, the default
          settings in the corresponding sample description apply.  At
          the time of writing this payload format the following
          decorations or modifiers are specified in the 3GPP timed text
          media format specification [1]:

            - text highlight,
            - highlight color,
            - blinking text,
            - karaoke feature,
            - hyperlink,
            - text delay,
            - text style,
            - positioning of the text box and,
            - text wrap indication.

   Section 12.3 specifies where to find these values in the 3GP file and
   how these are mapped to the payload format.  For live streaming,
   appropriate values using the same formats and units shall be used.



   For further details on the 3GPP Timed Text media format, refer to
   [1].


2.2 Requirements for a timed text payload format

   In this section a set of requirements is listed.  A justification for
   each of them is also given.  An RTP Payload Format for 3GPP timed
   text SHALL:

        1.  Keep the 3GP text sample structure.  This requirement means
   that it SHALL be possible for an RTP receiver using this payload
   format to rebuild the text samples from the received RTP packets.

        2.  Transmit the text sample size, sample duration and sample
   description index in-band.  In RTP it is important to transmit it in-
   band because this information might change from sample to sample.

        3.  Enable the transmission of the sample descriptions both by
   out-of-band and in-band means.  Typically, a set of default
   formatting settings is transmitted once at the initialization phase.
   If more sample descriptions are needed, these may also be sent out-
   of-band.  However, sending them in-band is easier and more efficient.
   Sample descriptions may become large so that out-of-band transmission
   might not be the most appropriate transport method.  Additionally,
   out-of-band channels might not be always available.  For these
   reasons, the payload format SHALL also enable in-band transmission of
   sample description information.  This is especially useful for live
   streaming, where contents are not known a priori.

        4.  Enable the aggregation of units into an RTP packet.  In a
   mobile communication environment a typical text sample size is around
   100-200 bytes.  Transporting several units in one RTP packet makes
   the transport more efficient.

        5.  Enable the fragmentation and reassembly of a text sample
   into several RTP packets in order to cover a wide range of
   applications and network environments.  In general, fragmentation
   should be a rare event given the low bit rates and text sample sizes.
   However, the 3GPP Timed Text media format does allow for larger text
   samples.  The payload format SHALL take this into account and provide
   a means for coping with fragmentation.

        6.  Enable the use of resilient transport mechanisms, such as
   repetition, retransmissions and FEC.  Additional mechanisms like FEC
   [7] or retransmission [11] can be used to protect the information.
   RFC 2354 [8] discusses available mechanisms for packet loss
   resiliency.


2.3 General Remarks



   Before going into the details of the payload headers, some general
   observations are made in this section.  These should help the reader
   in understanding the design decisions.

   In order to understand the next sections a minimal description of the
   units is needed:

        o a TYPE 1 unit contains one complete text sample,
        o a TYPE 2 unit transports a complete text string or a fragment
          thereof,
        o a TYPE 3 unit contains the complete modifiers or only the
          first fragment thereof,
        o a TYPE 4 unit contains one modifier fragment other than the
          first and,
        o a TYPE 5 unit contains one sample description.

2.3.1 Character Counting

   This payload format does not enable an RTP receiver to find out the
   exact number of text characters lost.

   Note that the fragment size included in the payload headers does not
   help in finding the number of lost characters, since the UTF-8/16
   encodings used for the text strings yield a variable number of bytes
   per character.

   For character counting an additional field in TYPE 2 units would have
   to be defined.  However, this has not been done: it is a design
   choice, since fragmentation should be a rare event.  Thus, the gain
   of including such an additional field does not justify the overhead
   added by making the sender count the characters upon fragmentation.
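
   The following Python sketch illustrates why a byte count cannot
   stand in for a character count.  The sample strings and the helper
   name char_and_byte_counts are illustrative assumptions and are not
   part of this payload format.  Python counts code points here, and
   "utf-16-be" is used so that no byte order mark is emitted.

```python
def char_and_byte_counts(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes)."""
    return len(text), len(text.encode(encoding))


# Hypothetical sample strings: ASCII, Latin with an accent, and CJK.
samples = ["karaoke", "caf\u00e9", "\u65e5\u672c\u8a9e"]
for sample in samples:
    for encoding in ("utf-8", "utf-16-be"):
        chars, size = char_and_byte_counts(sample, encoding)
        print(f"{sample!r} as {encoding}: {chars} characters, {size} bytes")
```

   Two fragments of equal byte length can therefore carry different
   numbers of characters, so a receiver cannot infer the number of lost
   characters from the size of a lost fragment.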

2.3.2 On the length indication in the units

   Usually, RTP applications use the information on packet size from UDP
   or lower layers to find out the length of the RTP payload.  While
   this information can still be used, this payload format includes an
   explicit length indication for each unit in the payload as a fixed
   field in the payload headers.

   This design choice allows easy interoperability with the RTP Payload
   Format for Transport of MPEG-4 Elementary Streams, RFC 3640 [12],
   which does require an explicit length indication for each unit (see
   AU-header in RFC 3640).

2.3.3 Fragmentation of Timed Text Samples

   This section justifies why text samples may have to be fragmented and
   discusses some of the possible approaches to do it.  A solution is
   proposed together with rules and recommendations for fragmenting and
   transporting text samples using this payload format.


   3GPP Timed Text applications are expected to operate at low bit rates.
   This fact added to the small size of timed text samples (typically
   one or two hundred bytes) makes fragmentation of text samples a rare
   event.  Samples should usually fit into the MTU size of the used
   network path.

   Nevertheless, some text strings (e.g. ending roll in a movie) and
   some modifier boxes (e.g. for hyperlinks, for karaoke or for styles)
   might become large.  This may also apply to future modifier boxes.
   In such cases, the first question that has to be answered is whether
   it is possible to adjust the encoding (e.g. the size of the sample)
   in such a way that fragmentation is avoided.  If so, this is
   preferred to fragmentation and SHOULD be done.

   Otherwise, if this is not possible or other constraints prevent
   this, fragmentation MAY be used and the basic guidelines given in
   this document MUST be followed.

   A minimum set of fragmentation rules and recommendations SHALL be
   observed:

   o whenever possible, whole text samples SHOULD be aggregated into
     RTP packets, using the payload headers defined in this document.
     This increases transport efficiency.

   o it is RECOMMENDED that text samples are fragmented as seldom as
     possible, i.e. the least possible number of fragments are created
     out of a text sample.  As an example, if a packet has some free
     space, which would fit only a small part of the next text sample,
     a new RTP packet SHOULD be sent, instead of creating two or more
     fragments out of a sample.  This reduces complexity by minimizing
     the number of fragments and also increases the packet loss
     robustness.  Similarly, this idea can also be applied to the case
     of a single text sample being fragmented into several pieces.  In
     particular, since the modifiers are the least resilient part of
     the text sample (they are useless if a fragment is lost) the best
     option would be to minimize the number of fragments created out of
     the modifiers.

   o if there is some bitrate and space in the payload available, units
     of past or following (if available) text samples MAY be
     aggregated.  As explained further in this document sample
     descriptions (TYPE 5 units) MAY be placed anywhere in an aggregate
     payload, since the sample index (SIDX) is used to associate them
     to their text samples.  This is different for the other unit
     TYPEs, where the timestamp is used to collect the different
     fragments of a text sample.  In this case, strict ordering
     requirements apply and care MUST be taken to guarantee that the
     correct timestamp value is resolved for each unit.

   o text strings MUST be split at character boundaries.  Otherwise, it
     is not possible to display the text contents of a fragment if a
     previous fragment was lost.  As a consequence, text string
     fragmentation requires knowledge of UTF-8/-16 encoding formats to
     determine character boundaries.

   o unlike text strings, the modifier boxes are NOT REQUIRED to be
     split at meaningful boundaries, nor is there a possibility to
     apply partially received modifier contents to text strings.  In
     fact, enabling this would require that: a) senders understand the
     semantics of the modifier boxes and b) specific fragment headers
     for each of the modifier boxes are defined, in addition to the
     payload formats defined below.  However, given the low probability
     of fragmentation and the desire to keep the requirements low, it
     does not seem reasonable to define such additional headers.

   o as a consequence of the above, the modifiers are only useful if
     received complete, i.e. all fragments are received.  In order to
     ensure enhanced resiliency against packet loss it is RECOMMENDED
     that modifier fragments be especially protected using FEC [7],
     retransmission [11], packet repetition or an equivalent technique.
     Similarly, these techniques MAY also be applied to text strings
     and sample descriptions.

   o an additional requirement when fragmenting text samples is that
     the start of the modifiers MUST be indicated using the payload
     header defined for that purpose, i.e. a TYPE 3 unit MUST be used
     (see below).  Otherwise, if packets are lost, a client may be
     unable to identify where the modifiers start and the text ends or
     whether either text strings or modifiers were received completely
     or not.

   o finally, sample descriptions SHALL NOT be fragmented, because they
     contain important information that may affect several text
     samples.
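
   The rule that text strings are split at character boundaries can be
   sketched as follows for UTF-8.  This is a minimal illustration under
   stated assumptions, not a normative algorithm: the function name
   split_utf8 and the max_size parameter are hypothetical, "character"
   is taken to mean one encoded code point, and UTF-16 text would need
   an analogous check for surrogate pairs, which is not shown.

```python
from typing import List


def split_utf8(data: bytes, max_size: int) -> List[bytes]:
    """Split UTF-8 bytes into chunks of at most max_size bytes,
    cutting only where a new character starts."""
    if max_size < 4:
        raise ValueError("max_size must hold a 4-byte UTF-8 character")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        # UTF-8 continuation bytes look like 10xxxxxx.  Back up until
        # the byte at 'end' is the first byte of a character.
        while end > start and end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            raise ValueError("input is not valid UTF-8")
        chunks.append(data[start:end])
        start = end
    return chunks


# Hypothetical text string: eight characters of three bytes each.
text = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8".encode("utf-8")
for chunk in split_utf8(text, 7):
    print(len(chunk), chunk.decode("utf-8"))
```

   Because every chunk decodes on its own, a receiver can still display
   the text of a received fragment when an earlier fragment was lost.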

2.3.4 On aggregate payloads

   As a general recommendation, units SHOULD be aggregated whenever
   possible.  The aggregation of units MUST follow certain guidelines.
   This is important since the different fragments of a text sample are
   associated together using the timestamp, and thus their ordering in
   the aggregate SHALL allow resolving the correct timestamp for the
   different units.  The following rules apply:

     1. In an aggregate payload, the timestamp of a unit MUST be
        obtained by adding the RTP timestamp to the durations of the
        previous units in that aggregate (for the first unit, this is
        the RTP timestamp).  Since the duration field always expresses
        positive duration values, contiguous units in an aggregate
        payload MUST belong to different text samples, i.e. have
        different timestamps.  These text samples MUST, consequently, be
        also contiguous in time, i.e. be displayed directly one after
        the other.


     2. There is an exception to the timestamp calculation mechanism:
        TYPE 5 units MAY be placed anywhere in the aggregate and they
        SHALL NOT be regarded for calculating the timestamp of the
        subsequent units.  This is because they usually do not belong to
        any text sample in particular, but are shared by several.  For
        timestamp calculations, TYPE 5 units MUST simply be ignored,
        i.e. by jumping to the next unit.

     3. The aggregate payloads SHOULD be kept simple.  Figure 2 shows an
        example of how not to aggregate the units of Figure 1.  The
        interleaving shown may, under certain circumstances, provide
        higher packet loss resilience than the aggregate shown in Figure
        1; it is however a more complex option and thus SHOULD be
        avoided.  As a rule of thumb, units SHOULD be aggregated in
        display order (with the exception of TYPE 5 units).

   Some possibilities for aggregate payloads are illustrated in Figure 1
   below.

                                           TS3    TS4
                                      +----------+---------------+
       TS1    TS1   TS2     TS3       |TYPE2     | TYPE1/2       |a)
    +-------+-----+------+-----+      +----------+---------------+
    |TYPE1  |TYPE5|TYPE1 |TYPE2|         sdur3      sdur4
    +-------+-----+------+-----+
      sdur1   N/A  sdur2  sdur3
                                        TS3       TS4
                                      +----------+---------------+
                                      |TYPE3     | TYPE1/2       |b)
                                      +----------+---------------+
                                          sdur3     sdur4
                                               TS3
                                      +--------------------------+
                                      |    TYPE3                 |c)
                                      +--------------------------+
                                            sdur3
                                            TS3      TS4
                                      +----------+---------------+
                                      | TYPE4    |TYPE1/2        |d)
                                      +----------+---------------+
                                          sdur3     sdur4

    |-----------PAYLOAD 1------|      |---PAYLOAD 2--------------|
        RTP Timestamp=rtpts1                 RTP Timestamp=rtpts2

                   Figure 1. Example aggregate payloads.


        TS1    TS1   TS2       TS3                TS3   TS4
     +-------+-----+----------+-----+            +-----+---------------+
     |TYPE1  |TYPE5|TYPE1     |TYPE2|            |TYPE2| TYPE1/2       |
     +-------+-----+----------+-----+            +-----+---------------+
       sdur1  N/A     sdur2     sdur3             sdur3  sdur4
         /     |       `-.._        `.            /          /\
        /      |        \   `--._     `.         /          /  \
      TS1     TS1      TS2      TS2    TS3     TS3      TS4      TS4
   +-------+ +-----+ +-----+ +-----+ +-----+ +-----+ +-------+ +-------+
   |TYPE1  | |TYPE5| |TYPE2| |TYPE2| |TYPE2| |TYPE2| | TYPE2 | | TYPE2 |
   +-------+ +-----+ +-----+ +-----+ +-----+ +-----+ +-------+ +-------+
     sdur1     N/A    sdur2   sdur2  .-/sdur3 sdur3    sdur4     sdur4
       |             .-'         |.-/         .'       .'         .'
       |          .-'        _.-'|          .'      .-'         .'
       |       .-'        _.-'   |        .'      .'          .'
      TS1    TS2    TS3.-'     TS2  TS3 .'  TS4.-'     TS4  .'  TS1
   +-------+-----+-----+     +-----+-----+-------+    +-------+-----+
   |TYPE1  |TYPE2|TYPE2|     |TYPE2|TYPE2| TYPE2 |    | TYPE2 |TYPE5|
   +-------+-----+-----+     +-----+-----+-------+    +-------+-----+
     sdur1 sdur2  sdur3      sdur2  sdur3   sdur4        sdur4  N/A

        Figure 2. Example of how NOT to aggregate units.  We use as
        starting point PAYLOAD 1 and PAYLOAD 2a) in Figure 1.


   Legend: TYPE 1/2 indicates that either a TYPE 1 or TYPE 2 unit
           MAY be used.  TSx indicates the unit belongs to text sample
           x.  sdurx indicates the duration for text sample x.

   Referring to Figure 1, we illustrate how the timestamp for each unit
   is found.  Assuming rtptsy represents the standard RTP timestamp of
   PAYLOAD y, sdurx the duration of unit x and tsx the timestamp for
   unit x, tsx can be found as the sum of rtptsy and the durations of
   the preceding units in the payload:
          1. for the units in the first aggregate payload, PAYLOAD 1:

                        ts1= rtpts1,
                        ts2= rtpts1 + sdur1,
                        ts3= rtpts1 + sdur1 + sdur2,

          (Note that no timestamp calculation is needed for TYPE 5
           units, nor are they taken into account in the calculation.)

          2. for PAYLOAD 2:

                        ts3= rtpts1 + sdur1 + sdur2 = rtpts2,
                        ts4= rtpts2 + sdur3
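
   The timestamp calculation for an aggregate payload can be sketched
   in Python as shown below.  The Unit class, the function name
   unit_timestamps and the numeric durations are assumptions made for
   illustration; only the rule itself (cumulative sum of durations,
   with TYPE 5 units skipped) is taken from the text above.  The result
   is reduced modulo 2^32 because RTP timestamps are 32-bit values.

```python
from dataclasses import dataclass
from typing import List, Optional

RTP_TS_MODULUS = 2 ** 32  # RTP timestamps are 32-bit and wrap around


@dataclass
class Unit:
    unit_type: int  # 1 to 5, as listed in Section 2.3
    sdur: int = 0   # duration in RTP timestamp units; unused for TYPE 5


def unit_timestamps(rtp_timestamp: int,
                    units: List[Unit]) -> List[Optional[int]]:
    """Return one timestamp per unit; TYPE 5 units get None."""
    timestamps: List[Optional[int]] = []
    elapsed = 0
    for unit in units:
        if unit.unit_type == 5:
            # Sample descriptions carry no timestamp and do not
            # advance the running sum of durations.
            timestamps.append(None)
            continue
        timestamps.append((rtp_timestamp + elapsed) % RTP_TS_MODULUS)
        elapsed += unit.sdur
    return timestamps


# PAYLOAD 1 of Figure 1 with hypothetical values: rtpts1 = 10000,
# sdur1 = 3000, sdur2 = 2000, sdur3 = 1000.
payload_1 = [Unit(1, 3000), Unit(5), Unit(1, 2000), Unit(2, 1000)]
print(unit_timestamps(10000, payload_1))
```

   With these example values the units resolve to 10000, no timestamp
   (TYPE 5), 13000 and 15000, matching ts1, ts2 and ts3 above.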


2.3.5 Reassembling text samples at the receiver

   The payload headers defined in this document allow reassembling
   fragmented text samples.  For this purpose the following fields of
   the payload headers are used: the standard RTP timestamp, the
   duration indication (see SDUR) and the total and subtotal counters
   (see TOTAL, THIS).  Fragments of the same text sample MUST resolve to
   the same timestamp value.

   The process for collecting the different fragments of a text sample
   is as follows:

     1. within the received units, search for those having the same
        timestamp value.  The timestamp indicates the time when the unit
        first becomes active.  The duration field indicates how long the
        contents are active and, at the same time, is used to calculate
        the timestamp of subsequent units.  This mechanism is
        illustrated in the section above.

     2. check whether any of the fragments of the text sample is missing.
        This is done using the TOTAL and THIS fields; the TOTAL field
        indicates how many fragments were created out of the text sample
        and the THIS field indicates the position of this fragment in
        the text sample.  As a result of this operation two outcomes
        are possible:
          a. no fragment is missing.  Then the THIS field SHALL be used
             to order the fragments and reassemble the text sample
             before forwarding it to the decoding application.

          b. one or more fragments are missing: check whether the
             missing fragments belong to the text string or to the
             modifiers: TYPE 2 units identify text strings, TYPE 3 and
             4 modifiers.  Thus:
              i. if the fragment or fragments missing belong to the
                  text string and the modifiers were received complete,
                  then the remaining text characters MAY still be
                  displayed as plain text or, in certain cases, applying
                  modifiers.  Modifiers MAY only be applied as long as
                  it is possible to identify the character numbers, e.g.
                  if only the last text string fragment is lost.  This
                  is the case for modifiers defining specific font
                  styles ('styl'), highlighted characters ('hlit'),
                  karaoke feature ('krok') and blinking characters
                  ('blnk').  Other modifiers such as 'dlay' or 'tbox'
                  can be applied without the need of the character
                  number.  This is an application issue.  See [1] for
                  details.

             ii. if the fragment missing belongs to the modifiers and
                  the text strings were received complete, then the
                  modifiers MUST NOT be used.  Since modifiers are split
                  without observing meaningful boundaries, it is not
                  possible to apply partially received modifiers to the
                  text strings.  Therefore, the text string MAY be
                  discarded or it MAY be displayed as plain text.  This
                  is an application choice.

            iii. a third possibility is that it is not possible to
                  discern whether modifiers or text strings were
                  received complete.  E.g. if the TYPE 3 unit plus the
                  following or preceding packet is lost, there is no way
                  for the RTP receiver to know if one of these lost
                  packets belongs to the text strings or to the
                  modifiers.  FEC, retransmission or other protection
                  mechanisms as per section 4 are RECOMMENDED to avoid
                  this situation.

             iv. finally, if it is certain that neither text strings nor
                  modifiers were received complete, then the text
                  strings MAY, similarly to the case above, be displayed
                  partially or MAY be discarded.  This is an application
                  choice.  The modifiers MUST NOT be used.

     3. Sample descriptions SHALL NOT be fragmented, thus they are
        directly associated via the sample description index (SIDX) with
        the reassembled text samples.

2.3.6 Live streaming vs. Streaming from a 3GP file

   This section addresses the differences between streaming live content
   and streaming text tracks from a 3GP file.

   For the purpose of this document, the term live streaming refers to
   those scenarios where the sender application creates the media
   contents on the spot and without necessarily storing them in a 3GP
   file.  The sender application SHALL encapsulate these contents into
   RTP packets following the guidelines given in this document. At the
   receiving side, a buffer is typically used to cancel the network
   delay and delay jitter.  If the receiver buffer is large enough and
   the sender uses packet loss resilience (e.g. retransmission [11],
   packet FEC [7] or other) it may also be possible to recover from
   packet losses.  Note that the form in which the contents are actually
   stored in both sender and receiver and how the buffers are
   dimensioned are implementation design choices.

   Section 12.3 specifies how the 3GP file parameter values are mapped
   to the fields of the payload header.  For live streaming, appropriate
   values complying with the format and units described in [1] shall be
   used.  Where needed, clarifications on appropriate values are given
   in this document.


3. RTP Payload Format for 3GPP Timed Text

   The format of an RTP packet containing 3GPP timed text is shown
   below:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                      RTP payload                              |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Marker bit (M): the marker bit SHALL be set to 1 if the RTP packet
   includes one or more whole text samples or the last fragment of a
   text sample; otherwise set to 0.

   Timestamp: the timestamp MUST indicate the sampling instant of the
   earliest (or only) unit contained in the RTP packet.  The initial
   value MUST be randomly determined.  Units MUST be placed in play-out
   order, i.e. earliest first in the payload.  The timestamp of the
   subsequent units MUST be obtained by adding the timed text sample
   duration of previous samples to the RTP timestamp value.  An example
   of how to calculate the timestamp of units in an aggregate payload
   can be found in Section 2.3.4.
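
   The timestamp rule above can be sketched as follows.  This is a
   minimal illustration, not a normative procedure.  The function name
   and its arguments are hypothetical.  It assumes that each listed
   unit carries an SDUR value, that the units belong to consecutive
   samples in play-out order, and that arithmetic wraps at 32 bits as
   RTP timestamps do.

```python
RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32-bit values and wrap around


def unit_timestamps(rtp_timestamp, sample_durations):
    """Return the timestamp of each unit in an aggregate payload.

    rtp_timestamp    -- timestamp from the RTP header (earliest unit)
    sample_durations -- SDUR of each unit, in play-out order
    """
    timestamps = []
    current = rtp_timestamp % RTP_TS_MODULUS
    for sdur in sample_durations:
        timestamps.append(current)
        # The next unit starts where the previous sample ends.
        current = (current + sdur) % RTP_TS_MODULUS
    return timestamps
```

   For instance, with a clockrate of 1000 Hz, an RTP timestamp of 1000
   and three units lasting 500, 250 and 100 ticks, the units would
   resolve to the timestamps 1000, 1500 and 1750.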

   Note that, unlike in other media such as audio or video, the
   timestamp clockrate does not necessarily match the sampling rate.

   The timestamp clockrate of the samples in each text track is the
   value of the "timescale" parameter in the Media Header Box for that
   text track.  Note that each track in a 3GP file MAY have its own
   clockrate as specified in the Media Header Box.

   For live streaming an appropriate timestamp clockrate SHALL be used.
   A default value of 1000 Hz is RECOMMENDED.  This value should provide
   enough timing resolution for synchronizing text with other media and
   expressing the duration of text samples.  Other clockrates MAY be
   used.  Timestamp clockrates MUST be signaled by out-of-band means at
   session setup, e.g. using the "rate" attribute in SDP.  See Section
   8 for details.

   The 3GPP Timed Text format does not mandate any sampling rate; the
   real-time encoder SHALL choose an appropriate sampling rate such
   that the text samples meet the application needs.  E.g. samples may
   be tailored to match the packet MTU as closely as possible or to
   provide a given redundancy for the available bit rate.  The
   encoding application MUST also take into account the delay
   constraints of the real-time session and assess whether FEC,
   retransmission or other similar techniques are reasonable options for
   repair.

   The following example illustrates how a real-time encoder may
   choose its settings:

        Imagine a news program scenario, where the news is transcribed
        and synchronized with the image of the reporter and the
        headlines in the background.  Assuming that a person can read an
        average of 4-6 words per second, at an average word length of 5
        characters plus one space per word, an available IP MTU of 576
        bytes, characters encoded using 2 bytes each, no modifiers used
        and an available rate of 576*8 bits per second = 4.6 kbps, a
        text sample covering 60 seconds of text would theoretically be
        optimum: IP/UDP/RTP+(text sample)=20+8+18 (12+6, TYPE 1 header)
        + ~250*2= ~546 bytes < 576 bytes.  However, a delay of sixty
        seconds might be too much and just one packet per sample too
        little redundancy.  In practice, the allowed delay for real-time
        communications is typically a few seconds, e.g. 3s.  Thus, the
        encoder could sample text every 1s (yielding RTP payloads of
        ~14-18 bytes), encapsulate the current and last two samples in
        every RTP packet (amounting to an IP packet size of 98 bytes)
        and send the packet six times, thus exhausting the available bit
        rate and increasing packet loss resilience.

        This example illustrates how the encoding application shall
        adapt to the scenario constraints.

   Payload Type (PT): the payload type is set dynamically and sent by
   out-of-band means.

   The usage of the remaining RTP header fields follows the rules of RTP
   [3] and the profile in use.

3.1 Payload Header Definitions

   An RTP packet using the payload headers defined in this document has
   the following format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   | TYPE|                                               :
      +-+-+-+-+-+-+-+-+                                               :
      :        (variable payload header depending on TYPE value)      :
      :                                                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      :                     SAMPLE CONTENTS                           :
      :                                                               :
      :                                                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                        Figure 3 RTP Packet Format.


   The payload headers specified in this document consist of a set of
   common fields followed by specific fields for each header type and
   sample contents.  See Figure 4.

   In this manner, the structure of the payload headers resembles that
   of the 'access units' (AU) in RFC 3640 [12].  This similarity is
   intentional to improve interoperability.  The 'AU header' of that
   document finds an equivalent in the common header fields: U, R, TYPE
   and LEN.  Similarly, the specific fields plus the sample contents
   would be equivalent to the 'AU data section' in RFC 3640.  This is
   illustrated in the figure below.

        unit header =~'AU header'                   | unit payload
                                                       =~'AU data'
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-....+
    |U|   R   |TYPE |             LEN               |   (variable)    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-....+

                        Figure 4 Unit Format.

   As an example, an aggregate RTP payload containing two complete text
   samples and a text sample fragment would schematically look like
   this:

                                        +----------------------+
                                        |                      |
                                        |   RTP Header         |
                                        |                      |
                               --------_+----------------------+
                               |        |                      |
                            _  |        |    Payload Header 1  |
                               |        ........................
                        UNIT 1 -        |                      |
                               |        |    Text Sample 1     |
                               |  _     |                      |
                               |------- ........................
                                ------- |                      |
                               |        |    Payload Header 2  |
                               |        ........................
                        UNIT 2 -        |                      |
                               |        |    Text Sample 2     |
                               |        |                      |
                               |  _     |                      |
                               ---------........................
                               -------- |                      |
                               |        |    Payload Header 3  |
                               |        ........................
                        UNIT 3 -        |                      |
                               |        | Text Sample Fragment |
                               |_       |                      |
                               |     _  |                      |
                               ---------+----------------------+
                      Figure 5 Example RTP packet.

3.1.1 Unit Header Format

   The unit header has the following format:

            0                   1                   2
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |U|   R   |TYPE |             LEN               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                     Figure 6 Unit Header Format.


   Where:

   o U (1 bit) "UTF Transformation flag": indicates whether the text
     characters are encoded using UTF-8 (U=0) or UTF-16 (U=1).  This
     informs RTP receivers which encoding was used for the text string
     and thus enables them to display text string fragments.  The U bit
     is only used in TYPE 1 and TYPE 2 headers; otherwise it MUST be set
     to zero and ignored.

   o R (4 bits) "Reserved bits": for future extensions.  This field
     MUST be set to zero (0x0) and MUST be ignored by receivers.

   o TYPE (3 bits) "Type Field": this field specifies which specific
     header fields follow.  The following TYPE values are defined:

        - TYPE 1, for a whole text sample.
        - TYPE 2, for a text string fragment.
        - TYPE 3, for a whole modifier box or the first fragment of a
          modifier box.
        - TYPE 4, for a modifier fragment other than the first.
        - TYPE 5, for a sample description.  One header per sample
          description.
        - TYPE 0, 6 and 7 are reserved.

        Two TYPEs (1 & 2) are defined for units containing text strings,
        another two (3 & 4) for units not containing text strings and a
        final TYPE 5 for sample descriptions.  See details in
        subsections below.

   o Finally, the LEN (16 bits) "Length Field": indicates the size (in
     bytes) of this header field and all the fields following, i.e. the
     LEN field followed by the unit payload.  For whole text samples of
     content stored in 3GP files, the sample length is given by the
     SLEN value with one exception (see Section 12.3).  The LEN value is
     obtained by adding SLEN to the length of the LEN field itself (2
     bytes).

     For live streaming, both sample length and the LEN value for the
     current fragment MUST be calculated during the sampling process or
     during fragmentation.

     LEN may take the following values:

      - TYPE = 1, LEN >= 8,
      - TYPE = 2, LEN > 9,
      - TYPE = 3, LEN > 6,
      - TYPE = 4, LEN > 6 and,
      - TYPE = 5, LEN > 3.

     In the next subsection the different payload headers for the
     values of TYPE are specified.
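
     The common part of the unit header can be sketched as follows.
     This is only an illustration under stated assumptions: the
     function names and the MIN_LEN table are hypothetical, the fields
     are assumed to be in network byte order, and the U bit is assumed
     to be the most significant bit of the first octet as drawn in
     Figure 6.  The minimum for TYPE 2 follows the TYPE 2 header
     definition, which requires LEN to be greater than nine.

```python
import struct

# Smallest LEN value allowed for each defined TYPE (assumed table).
MIN_LEN = {1: 8, 2: 10, 3: 7, 4: 7, 5: 4}


def pack_unit_header(u, unit_type, length):
    """Build the 3-byte common header: U (1), R (4), TYPE (3), LEN (16)."""
    if unit_type not in MIN_LEN:
        raise ValueError("TYPE values 0, 6 and 7 are reserved")
    if not MIN_LEN[unit_type] <= length <= 0xFFFF:
        raise ValueError("LEN out of range for this TYPE")
    # The U bit is only meaningful for TYPE 1 and TYPE 2; zero otherwise.
    u_bit = (u & 1) if unit_type in (1, 2) else 0
    first = (u_bit << 7) | unit_type  # the four R bits stay zero
    return struct.pack("!BH", first, length)


def parse_unit_header(data):
    """Return (U, TYPE, LEN); reserved bits are ignored, as receivers must."""
    first, length = struct.unpack_from("!BH", data)
    u = first >> 7
    unit_type = first & 0x07
    if unit_type not in MIN_LEN:
        raise ValueError("reserved TYPE value")
    if length < MIN_LEN[unit_type]:
        raise ValueError("LEN too small for this TYPE")
    return u, unit_type, length
```

     A receiver could call parse_unit_header() on the first three
     octets of every unit and then dispatch on the returned TYPE value.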

3.1.2 TYPE 1 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |       LEN  (always >=8)       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |     TLEN      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |      TLEN     |
      +-+-+-+-+-+-+-+-+

   This header type is used to transport whole text samples.  If several
   text samples are sent in an RTP packet, every sample has its own
   header (see Figure 5).

   Note that empty text samples are also considered whole text samples,
   although they do not contain sample contents.  In this particular
   case, TYPE 1 units MUST NOT include any sample contents and the LEN
   field SHALL have a value of 8 (0x0008).  Otherwise, the LEN field
   SHALL always be greater than 8 (0x0008).

   The fields above have the following meaning:

   o U, R and TYPE as defined above.

   o SIDX (8 bits) "Text Sample Entry Index": this is an index used to
     identify the sample descriptions.

     The SIDX field is used to find the sample description
     corresponding to the unit's payload.  There are two types of SIDX
     values: static and dynamic.

     Static SIDX values are used to identify sample descriptions that
     MUST be sent out-of-band and MUST remain active during the whole
     session.  The transport of sample descriptions out-of-band is a
     MANDATORY feature.  A static SIDX value is unequivocally linked to
     one particular sample description during the whole session.
     Carrying many sample descriptions out-of-band SHOULD be avoided,
     since these may become large and, ultimately, transport is not the
     goal of the out-of-band channel.  Thus, this feature MUST be
     limited to those sample descriptions that provide a set of minimum
     default format settings.  Static SIDX values MUST fall in the
     interval [129,254].  The first SIDX value assigned to a static
     sample description MUST be 129.

     Dynamic SIDX values are used for sample descriptions sent in-band.
     Sample descriptions MAY be sent in-band for several reasons:
     because they are generated in real time, for transport resiliency
     or both.  A dynamic SIDX value is unequivocally linked to one
     particular sample description during the period in which this is
     active in the session and it SHALL NOT be modified during that
     period.  This period MAY be shorter than or equal to the session
     duration.  A maximum of 64 dynamic active SIDX values is allowed at
     any moment.  Dynamic SIDX values MUST fall in the interval [0,127].
     This should be enough for both recorded content and live
     streaming applications.  Nevertheless, a wraparound mechanism is
     provided in Section 12 to handle sessions where more than 64 SIDX
     values might be needed.  Clients MUST be able to
     receive and interpret dynamic sample descriptions; whether they
     make use of them or send them themselves is a design choice.

     SIDX values 128 and 255 are reserved for future use.

   o SDUR (24 bits) "Text Sample Duration": indicates the sample
     duration in RTP timestamp units of the text sample.  For this
     field, a length of 3 bytes is preferred to 2 bytes.  This is
     because, for a typical clockrate of 1000 Hz, 16 bits would allow
     for a maximum duration of just 65 seconds, which might be too
     short for some streams.

     Apart from defining the time period during which the text is
     displayed, the duration field is also used to find the timestamp
     of any subsequent units within the RTP packet.

     Text samples generally have a known duration at the time of
     transmission.  However, in some cases like live streaming, the
     time for which a text piece shall be shown might not be known.  In
     order to cover this exception, the value zero (0x000000) is
     reserved to signal unknown duration.  For all other cases SDUR
     MUST be different from 0x000000.  As seen in the next example,
     units of unknown duration MUST remain valid until the next unit
     arrives.

        Example: let us revisit the previous example.  Imagine now that
        you are in an airport watching the latest news report while you
        wait for your plane.  Airports are loud, so the news report is
        transcribed in the lower area of the screen.  This area displays
        two lines of text: the headlines and the words spoken by the
        news speaker.  As usual, the headlines are shown for a longer
        time than the rest.  This time is, in principle, unknown to the
        stream server.  A headline is just replaced when the next
        headline arrives.


     Additionally, samples of unknown duration SHALL NOT use features,
     such as scrolling or karaoke, which would need to know the
     duration of the sample up front.  Furthermore, only TYPE 5 units
     MAY follow units of unknown duration in the same aggregate payload.
     For other unit types it would otherwise not be possible to resolve
     their timestamp.

     For text stored in 3GP files, see Section 12.3 for details on how
     to extract the duration value.  For live streaming, live encoders
     SHALL assign appropriate values and units according to [1] and
     later releases.

   o TLEN (16 bits), "Text String Length", is a byte-count of the text
     string.  TLEN is needed by the decoder to know where the modifiers
     in the payload start.

   o Finally, the unit payload following the TLEN field consists of a
     string of characters encoded using either the UTF-8 or UTF-16
     encodings followed by zero or more modifiers.  If UTF-16 is used,
     the text string in the unit is different from that included in the
     original 3GP file sample in that the 16-bit byte order mark
     (0xFEFF, see [1]) preceding the actual characters MUST NOT be
     included.  Instead of this 16-bit mark, the U bit in the unit
     header MUST be used to indicate UTF-16 encoding.  The byte order
     mark MUST be added back at the receiver if the original text
     sample format is to be reestablished.
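
   The TYPE 1 layout described above can be sketched as follows.  This
   is an illustrative sketch, not a reference implementation.  The
   function names are hypothetical.  It assumes network byte order,
   that TLEN counts the text string without the byte order mark, and
   that LEN covers the LEN, SIDX, SDUR and TLEN fields plus the sample
   contents, which is consistent with the LEN value of 8 required for
   an empty text sample.

```python
import struct

UTF16_BOM = b"\xfe\xff"  # byte order mark stored in the 3GP sample


def build_type1_unit(text_bytes, modifiers, sidx, sdur, utf16):
    """Build a TYPE 1 unit from an already encoded text string."""
    if utf16 and text_bytes.startswith(UTF16_BOM):
        text_bytes = text_bytes[2:]  # the BOM is not sent; U=1 replaces it
    if not 0 <= sdur <= 0xFFFFFF:
        raise ValueError("SDUR must fit in 24 bits")
    length = 8 + len(text_bytes) + len(modifiers)
    first = (0x80 if utf16 else 0x00) | 1  # U bit, R=0, TYPE=1
    header = struct.pack("!BHB", first, length, sidx)
    header += sdur.to_bytes(3, "big") + struct.pack("!H", len(text_bytes))
    return header + text_bytes + modifiers


def parse_type1_unit(unit):
    """Split a TYPE 1 unit into its fields, text string and modifiers."""
    first, length, sidx = struct.unpack_from("!BHB", unit)
    sdur = int.from_bytes(unit[4:7], "big")
    (tlen,) = struct.unpack_from("!H", unit, 7)
    end = 1 + length  # LEN does not count the one-octet U/R/TYPE field
    return {
        "utf16": bool(first >> 7),
        "sidx": sidx,
        "sdur": sdur,  # zero signals unknown duration
        "text": unit[9:9 + tlen],
        "modifiers": unit[9 + tlen:end],
    }
```

   A receiver that needs the original 3GP sample format would prepend
   the byte order mark to the returned text when "utf16" is true.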

3.1.3 TYPE 2 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |          LEN( always >9)      |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                    SDUR                       | TOTAL | THIS  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               SLEN            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This header type is used to transport text string fragments:

   o The U, R, TYPE, SIDX, and SDUR fields are interpreted as above.
     The U, SIDX and SDUR fields are meaningful since partial text
     strings can also be displayed.

   o The LEN field (16 bits) has the same meaning as above.  The LEN
     field MUST be greater than nine (0x0009).

     As mentioned in Section 2.3.3 text strings MUST be split at
     character boundaries to allow the display of text fragments.
     Hence, as a minimum a text fragment MUST contain one character in
     either UTF-8 or UTF-16.  This is just a formalism, since following
     the fragmentation guidelines should produce much larger fragments.

     Note also, that TYPE 2 units do not contain an explicit text
     string length (like TLEN in TYPE 1); this can be obtained
     deductively from the LEN of the received fragments.  Also, the
     text string length is not needed to find the modifiers in a
     fragmented sample because the TYPE 3 header indicates the start of
     the modifiers.

   o The SLEN field (16 bits) indicates the size (in bytes) of the
     original (whole) text sample to which this fragment belongs.  This
     length comprises the text string plus any modifier boxes present.
     Clients MAY use SLEN to reserve buffer space for the remaining
     fragments of the text sample.

     For stored content, see Section 12.3 for details on how to find
     the SLEN value in a 3GP file.  For live content, the SLEN MUST be
     obtained during the sampling process.

   o The fields TOTAL (4 bits) and THIS (4 bits) indicate the total
     number of fragments into which the original text sample has been
     fragmented and the position of the current fragment in that
     sequence, respectively.  The usual "byte offset" field is not used
     here for two reasons: a) it would take one more byte and b) it
     does not provide any information on the character offset.
     UTF-8/16 text strings have, in general, a variable character
     length ranging from 1 to 6 bytes.  Therefore, the TOTAL/THIS
     solution is preferred.  It could also be argued that the LEN and
     SLEN fields be used for this purpose, but while they would provide
     information about the completeness of the text sample, they do not
     specify the order of the fragments.

   o Finally, the unit payload following the SLEN field consists of the
     fragment of the UTF-8/-16 encoded character string.
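
   Splitting a text string at character boundaries and building TYPE 2
   units can be sketched as follows.  This is a hedged sketch: the
   function names are hypothetical, only UTF-8 splitting is shown,
   THIS is assumed to count fragments starting from 1, and TOTAL is
   assumed to occupy the upper four bits of the octet it shares with
   THIS, as drawn in the header above.

```python
import struct


def split_utf8(text, max_bytes):
    """Split text into UTF-8 chunks of at most max_bytes each.

    Chunks always end on a character boundary, so that each fragment
    can be displayed on its own.
    """
    chunks, current = [], b""
    for char in text:
        encoded = char.encode("utf-8")
        if len(encoded) > max_bytes:
            raise ValueError("max_bytes is smaller than one character")
        if current and len(current) + len(encoded) > max_bytes:
            chunks.append(current)
            current = b""
        current += encoded
    if current:
        chunks.append(current)
    return chunks


def build_type2_unit(fragment, sidx, sdur, total, this, slen, utf16=False):
    """Build a TYPE 2 unit carrying one text string fragment."""
    if not fragment:
        raise ValueError("a fragment MUST contain at least one character")
    if not 1 <= this <= total <= 15:
        raise ValueError("TOTAL and THIS are 4-bit fields")
    first = (0x80 if utf16 else 0x00) | 2  # U bit, R=0, TYPE=2
    length = 9 + len(fragment)  # LEN, SIDX, SDUR, TOTAL/THIS, SLEN + text
    return (struct.pack("!BHB", first, length, sidx)
            + sdur.to_bytes(3, "big")
            + struct.pack("!BH", (total << 4) | this, slen)
            + fragment)
```

   Note that TOTAL counts all fragments of the text sample, so a sender
   would add the number of modifier fragments to the number of chunks
   returned by split_utf8() before filling in the TOTAL field.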

3.1.4 TYPE 3 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This header type is used to transport either the entire modifier
   contents present in a text sample or just the first fragment of these.
   This depends on whether the modifier boxes fit in the current RTP
   payload.

   If a text sample is fragmented this header MUST be used to transport
   the first fragment or the complete modifiers.

   In detail:

   o The U, R, TOTAL/THIS and LEN fields are used as above.  The LEN
     field MUST be greater than six (0x0006).

   o The TOTAL/THIS field has the same meaning as for TYPE 2.
     Therefore, if TOTAL=THIS, then all modifiers are included here.
     In this case, TOTAL=THIS MUST be greater than one, because TOTAL
     indicates the total number of fragments of the text sample
     (logically, always >1), not just of the modifiers.

     Otherwise, if TOTAL is different from THIS, this unit just
     contains the first fragment of the modifiers.

   o The SDUR has the same definition as above.  This field is needed
     to obtain the timestamp of subsequent units in an aggregate
     payload.

   Note that the SLEN and SIDX fields are not present.  This is because:
   a) these fragments do not contain text strings and b) these types of
   fragments are applied over text string fragments, which already
   contain this information.

3.1.5 TYPE 4 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This header type is prepended to modifier fragments, other than the
   first one.

   The U, R, TOTAL/THIS and LEN fields are used as above.  The LEN field
   MUST be greater than six (0x0006).

   Regarding the SDUR field and the absence of the SLEN and SIDX fields,
   the same reasons as for TYPE 3 apply.
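
   Since TYPE 3 and TYPE 4 share the same header layout, a receiver
   might parse both with one routine.  The following is a minimal
   sketch under stated assumptions: the function name and the returned
   dictionary keys are hypothetical, fields are in network byte order
   and TOTAL occupies the upper four bits of its octet.

```python
import struct


def parse_modifier_unit(unit):
    """Parse a TYPE 3 or TYPE 4 unit carrying modifier bytes."""
    first, length, total_this = struct.unpack_from("!BHB", unit)
    unit_type = first & 0x07
    if unit_type not in (3, 4):
        raise ValueError("not a TYPE 3 or TYPE 4 unit")
    if length <= 6:
        raise ValueError("LEN MUST be greater than six")
    total, this = total_this >> 4, total_this & 0x0F
    sdur = int.from_bytes(unit[4:7], "big")
    payload = unit[7:1 + length]  # LEN excludes the U/R/TYPE octet
    return {
        "type": unit_type,
        "total": total,
        "this": this,
        "sdur": sdur,
        "payload": payload,
        # A TYPE 3 unit with TOTAL=THIS carries all modifiers at once.
        "modifiers_complete": unit_type == 3 and total == this,
    }
```

   When "modifiers_complete" is false, the receiver would wait for the
   TYPE 4 units with higher THIS values before applying any modifier.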

3.1.6 TYPE 5 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |      LEN( always >3)          |   SIDX        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+




   This header type is used to transport (dynamic) sample descriptions.
   The LEN field MUST be greater than three (0x0003).  Every sample
   description MUST have its own TYPE 5 header.

   All servers and clients MUST implement this header, since it adds
   minimum complexity and it may increase the robustness of the
   streaming session.
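
   A TYPE 5 unit and the SIDX ranges defined for TYPE 1 can be sketched
   together as follows.  This is an illustration only.  The function
   names are hypothetical, and the sample description is treated as an
   opaque byte string whose contents are defined in [1].

```python
import struct


def classify_sidx(sidx):
    """Classify an SIDX value according to the ranges in Section 3.1.2."""
    if 0 <= sidx <= 127:
        return "dynamic"
    if 129 <= sidx <= 254:
        return "static"
    if sidx in (128, 255):
        return "reserved"
    raise ValueError("SIDX is an 8-bit field")


def build_type5_unit(sidx, sample_description):
    """Build a TYPE 5 unit carrying one in-band sample description."""
    if classify_sidx(sidx) != "dynamic":
        raise ValueError("in-band sample descriptions use dynamic SIDX")
    if not sample_description:
        raise ValueError("LEN MUST be greater than three")
    length = 3 + len(sample_description)  # LEN (2) + SIDX (1) + contents
    # First octet: U=0, R=0, TYPE=5.
    return struct.pack("!BHB", 5, length, sidx) + sample_description
```

   A sender carrying several sample descriptions in one payload would
   call build_type5_unit() once per description, since every sample
   description has its own TYPE 5 header.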


4. Resilient Transport

   Apart from the basic fragmentation guidelines described in the
   section above, the simplest option for packet loss resilient
   transport is repetition.

   A server MAY decide to use repetition as a measure for packet loss
   resilience.  Thereby, a server MAY send the same RTP packet payloads
   (RECOMMENDED) or just parts of it, i.e. single units.

   As for the case of complete payloads, repeating specific units is
   only allowed if exactly the same units are sent as in the first
   transmission.  Only then can a receiver use the already received and
   the newly repeated units to reconstruct the original text samples.
   Note that since the RTP timestamp is used to group together the
   fragments of a sample, care must be taken to preserve the timing of
   units when constructing new RTP packets.

        E.g. if a text sample was originally sent as a single non-
        fragmented text sample (one TYPE 1 unit), a repetition of that
        sample MUST be sent also as a single non-fragmented text sample
        in one unit.  Likewise, if the original text sample was
        fragmented and spread over several RTP packets, say a total of 3
        units, then the repeated fragments SHALL also have the same byte
        boundaries and use the same headers and bytes per fragment.

        In the latter case the 3 RTP packets originally containing the
        intended units can be repeated or new packets can be built, each
        containing a unit.  For simplicity, it is RECOMMENDED that
        the three original packets be re-sent, instead of building new
        packets.  If new packets are nevertheless built, care MUST be
        taken to preserve the timing of each unit, and the payload
        aggregation guidelines MUST be observed.

   With repetition, repeated units resolve to the same timestamp as
   their originals.  Where redundant units are available, the receiver
   SHOULD use those units received in the RTP packet with the highest
   sequence number.
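
   The selection among redundant units can be sketched as follows.
   This is a simplified sketch and not part of the payload format.  The
   function name and the tuple layout are hypothetical.  It assumes
   that sequence numbers have already been extended so that 16-bit
   wraparound does not affect the comparison, and that a unit is
   identified by its resolved timestamp and its THIS value (with a
   value chosen by the caller for unfragmented units).

```python
def select_units(received):
    """Keep, for each unit, the copy from the highest sequence number.

    received -- iterable of (seq, timestamp, this, payload) tuples
    """
    chosen = {}
    for seq, timestamp, this, payload in received:
        key = (timestamp, this)
        # Repeated units resolve to the same timestamp as the original.
        if key not in chosen or seq > chosen[key][0]:
            chosen[key] = (seq, payload)
    return {key: payload for key, (seq, payload) in chosen.items()}
```

   The resulting mapping holds one copy per unit, which the receiver
   can then order by timestamp and THIS to reassemble the text samples.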

   Regarding the RTP header fields:

   o if the whole RTP payload is repeated, all payload-specific fields
     in the RTP header (the M, TS and PT fields) MUST keep their
     original values.  The sequence number MUST be increased to comply
     with RTP.

   o in packets containing repeated units, the general rules in Section
     3 for assigning values to the RTP header fields apply.
     Particularly relevant here is to keep the value of the RTP
     timestamp to preserve the timing of the units.

   Apart from repetition, other mechanisms such as FEC [7],
   retransmission [11] or similar techniques SHOULD be used to cope with
   packet losses.  Note that interleaving as defined in RFC 2354 [8]
   SHOULD NOT be used, since text samples have unpredictable sizes.
   This makes it a difficult task to reorder them in such a way that
   both the correct timestamp calculation and the guaranteed distance
   between fragments are ensured (see Figure 2).  Additionally,
   interleaving may require fragmenting text samples more often than is
   actually recommended.

   Finally, an observation regarding sample descriptions: if sample
   descriptions for a given SIDX value are not available at the
   receiver, it is a matter of implementation whether the text sample
   contents are displayed.  For example, an application MAY provide a
   static default sample description to be used for these cases.  This
   is, however, an implementation issue and out of the scope of this
   document.


5. Congestion control

   Applications implementing this payload format SHALL implement
   congestion control.  Congestion control for RTP SHALL be implemented
   in accordance with RTP [3], and the applicable RTP profile, e.g.
   RTP/AVP [17].  The RTP profile under which this payload format is
   used defines an appropriate congestion control mechanism in different
   environments.  Following the rules under the profile, an RTP
   application can determine its acceptable bitrate and packet rate in
   order to be fair to other TCP or RTP flows.


6. Scene Description

6.1 Text rendering position and composition

   In order to stream timed text, either stored in a 3GP file or live,
   some initial layout information is needed by the client to correctly
   display the text.  These are the width, height and position of the
   text area in the client's display and the layer or proximity of the
   text to the user.

   These pieces of information MUST be conveyed in a reliable form
   prior to the start of the session.  An example of a reliable
   transport may be the out-of-band channel used for SDP.  Any SDP
   description containing a 3GPP timed text stream MUST include the
   parameters listed above.  Sections 7 and 8 provide details on the
   usage in SDP descriptions.

   For stored content, some values contained in the Track Header Box
   SHALL be used.  See Section 12.3 for details on finding these values
   in a 3GP file.  For live streaming, appropriate values SHALL be used.

6.2 SMIL usage

   The attributes contained in the Track Header Boxes of a 3GP file only
   specify the spatial relationship of the tracks within the given 3GP
   file.  If several media streams are sent, they require spatial
   synchronization.  For this purpose, SMIL [9] SHOULD be used.

   SMIL assigns regions in the display to each of those files and places
   the tracks within those regions.  The original track header
   information is used for each track within its region.  Therefore,
   even if SMIL scene description is used, the track header information
   pieces SHOULD be sent anyway as they represent the intrinsic media
   properties.

   See [1] and the 3GPP SMIL Language Profile in [16] for details.


7. MIME Type Registration

7.1 3GPP Timed Text MIME Registration

   MIME type: video

   MIME subtype: 3gpp-tt

   Required parameters:

   rate: the RTP timestamp clockrate is equal to the clockrate of the
        media.  If RTP packets are generated out of a 3GP file, the
        clockrate of the text media MUST be copied from the 3GP file,
        i.e. the clockrate is the value of the "timescale" parameter in
        the Media Header Box describing that text track.  Other tracks
        (audio/video/text) in the 3GP file may have their own clockrates
        as indicated in their corresponding Media Header Box.  For live
        encoding, a clockrate of 1000 Hz is RECOMMENDED but other values
        MAY be used.

   sver=<Z1(x1*256+y1)>, <Z2(x2*256+y2)>, ..., <Zi(xi*256+yi)>,...
        The parameter "sver" specifies the list of supported backwards-
        compatible versions of the timed text format specification (3GPP
        TS 26.245), which the sender supports (or is willing to accept).
        The first value is the current value used or the preferred
        value.  This MAY be followed by a comma-separated list of
        increasingly older versions that SHOULD be used as alternatives.
        The order is meaningful, being first most preferred and last
        least preferred.  Regarding the value calculation: "Zi" is the
        number of the Release, "xi" and "yi" are taken from the 3GPP
        specification version, i.e. vZi.xi.yi.  For example, for 3GPP TS
        26.245 v6.0.0, Zi(xi*256+yi)=6(0), the version value is "60".
        Note that "60" is the concatenation of the values Zi=6 and
        (xi*256+yi)=0 and not their product.

   width=<integer-value>, indicates the width in pixels of the text
        track or area where the text is actually displayed.  This is a
        16 bit integer.

   height=<integer-value>, indicates the height in pixels of the text
        track.  This is a 16 bit integer.

   tx=<integer-value>, indicates the horizontal translation offset in
        pixels of the text track with respect to the origin of the video
        track.  This is a 16 bit integer.

   ty=<integer-value>, indicates the vertical translation offset in
        pixels of the text track.  This is a 16 bit integer.

   layer=<integer-value>, indicates the proximity of the text track to
        the viewer.  Higher values mean closer to the viewer.  This
        parameter has no units.  This is a 16 bit integer.
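
   As an illustration of the "sver" value calculation described above,
   the following minimal Python sketch concatenates the Release number
   Zi with the decimal value of xi*256+yi.  The function names
   sver_value and sver_list are hypothetical and are not defined by this
   document; a plain comma is assumed as the list separator.

```python
def sver_value(release, x, y):
    """Return the "sver" value for specification version v<release>.<x>.<y>.

    The result is the concatenation of the release number and the
    decimal value of x*256+y, not their product.
    """
    if release < 0 or x < 0 or y < 0:
        raise ValueError("version components must be non-negative")
    return f"{release}{x * 256 + y}"


def sver_list(versions):
    """Format (release, x, y) tuples, most preferred first, as a list."""
    return ",".join(sver_value(*version) for version in versions)
```

   For 3GPP TS 26.245 v6.0.0, sver_value(6, 0, 0) yields "60", which
   matches the example given in the parameter definition.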

   Optional parameters:

   spldesc=<value> indicates the way the server sends the sample
        descriptions.  This parameter MAY be omitted, in which case the
        value "both" is assumed.  In detail:

        o "out": all sample descriptions are sent out-of-band, e.g. in
           the SDP.  This may be used when the total number of sample
           descriptions used is low.  This is useful, e.g., for those
           clients that want to choose a simple text stream.

        o "both": both in-band and out-of-band mechanisms MAY be used.
           Note that "spldesc=both" indicates that both in-band and
           out-of-band sample descriptions MAY be sent for that stream,
           and not that both are necessarily sent during a session.
           This is the default case.

   tx3g=<base64-value-1>, <base64-value-2>,...  This parameter MUST be
        used for conveying sample descriptions out-of-band.  The sample
        entries MAY be listed in any order and the list MAY be empty.
        The absence of this parameter is equivalent to an empty list of
        sample descriptions.  The <base64-value-i> represents the
        base64 encoding of the concatenation of the SIDX and the
        sample description for that SIDX, in this order.  The format of
        a sample description entry can be found in 3GPP TS 26.245
        Release 6 and later releases.  All servers and clients MUST
        understand this parameter and MUST be capable of using the
        sample description(s) contained in it.  Please refer to RFC 3548
        [6] for details on the base64 encoding.

   brand=<brand-name>, where <brand-name> indicates the "best use" of
        the original 3GP file from which the timed text contents are
        read.

   cbrand=<brand-name-1>,<brand-name-2>,...  "Compatible brands"
        contains a list of compatible brands for the 3GP file.

   mver=<version-value>, "Minor version" where <version-value> is a
        positive integer.  It identifies the version of the
        specification of the file brand.

        Note that the parameters "brand", "cbrand", and "mver" are
        merely informational, as they only provide information about the
        original 3GP file being read from.  Details on these parameters
        can be found in the 3GP file format section of 3GPP TS 26.234
        Release 5 specification and corresponding specifications in
        later Releases.
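
   The construction of the "tx3g" parameter value can be sketched as
   follows.  This is a minimal Python illustration, not a normative
   procedure: it assumes that the SIDX is carried as a single byte ahead
   of the sample description, and the function name tx3g_value is
   hypothetical.  The sample description bytes are taken as given (e.g.
   read from the "stsd" box of a 3GP file).

```python
import base64


def tx3g_value(entries):
    """Build a "tx3g" value from (sidx, sample_description_bytes) pairs.

    Each list item is the base64 encoding of the SIDX followed by the
    sample description for that SIDX, in this order.
    Assumption: the SIDX occupies exactly one byte.
    """
    items = []
    for sidx, description in entries:
        raw = bytes([sidx]) + bytes(description)  # ValueError if sidx > 255
        items.append(base64.b64encode(raw).decode("ascii"))
    return ",".join(items)
```

   An empty list of entries yields an empty string, which corresponds to
   an empty list of sample descriptions.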

   Encoding considerations: this type is only defined for transfer via
   RTP.

   Security considerations: please refer to Section 10 of RFCXXXX.

   Interoperability considerations: the 3GPP Timed Text media format for
   which this payload format is defined is specified in Release 6 of
   3GPP TS 26.245 "Transparent end-to-end packet switched streaming
   service (PSS); Timed Text Format (Release 6)".  The 3GPP file format
   (3GP) referred to in this document and the used SMIL language profile
   can be found in Release 5 of 3GPP TS 26.234 and in the corresponding
   specifications for later Releases.  Note also that 3GPP may in future
   Releases specify extensions or updates to the media format in a
   backwards-compatible way, e.g. new modifier boxes or extensions to
   the sample descriptions.  The payload format defined in RFCXXXX
   allows for such extensions.  For future 3GPP Releases of the Timed
   Text Format, the parameter "sver" is used to identify the exact
   specification used.

   Published specification: RFC XXXX

   Applications which use this media type: multimedia streaming
   applications.

   Additional information: the 3GPP Timed Text media format is specified
   in 3GPP TS 26.245 "Transparent end-to-end packet switched streaming
   service (PSS); Timed Text Format (Release 6)".  This document and
   future extensions to the 3GPP Timed Text format are publicly
   available at http://www.3gpp.org.

   Magic number(s): None.


   File extension(s): 3GPP Timed Text tracks are stored in files
   conforming to the 3GP file format.  The 3GPP file format (3GP)
   referred to in this document can be found in Release 5 of 3GPP TS
   26.234 and in the corresponding specifications for later Releases.

   Macintosh File Type Code(s): None.

   Person & email address to contact for further information:
   Jose Rey, rey@panasonic.de
   Yoshinori Matsui, matsui.yoshinori@jp.panasonic.com
   Audio/Video Transport Working Group.

   Intended usage: COMMON

   Author/Change controller:
   Jose Rey
   Yoshinori Matsui
   IETF AVT WG


8. SDP usage

8.1 Mapping to SDP

   The information carried in the MIME media type specification has a
   specific mapping to fields in SDP [4].  If SDP is used to specify
   sessions using this payload format, the mapping is done as follows:

   o The MIME type ("video") goes in the SDP "m=" as the media name.
     The "video" MIME Type is used as timed text is considered visual
     media.

       m=video <port number> RTP/<RTP profile> <dynamic payload type>

   o The MIME subtype ("3gpp-tt") and the timestamp rate go in SDP
     "a=rtpmap" line as the encoding name and (clock) rate,
     respectively:

       a=rtpmap:<payload type> 3gpp-tt/<rate>

   o The MANDATORY fmtp parameters "sver", "width", "height", "tx",
     "ty" and "layer" go in the SDP "a=fmtp" attribute by copying them
     directly from the MIME media type string as a semicolon separated
     list of parameter=value pairs.

   o The OPTIONAL parameters "spldesc", "tx3g", "brand", "cbrand" and
     "mver" go in the SDP "a=fmtp" attribute by copying them directly
     from the MIME media type string as a semicolon separated list of
     parameter=value pairs (note that some parameters MAY need to
     specify a list of values, e.g. "tx3g"):

       a=fmtp:<dynamic payload type> <parameter name>=<value>[
       ; <parameter name>=<value>]

   o Any unknown parameter SHALL be ignored.
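
   The mapping above can be sketched in Python as shown below.  This is
   an illustrative sketch only: the function name sdp_lines, the port,
   the payload type and all parameter values are hypothetical, and the
   "; " separator simply follows the "a=fmtp" template given above.

```python
def sdp_lines(port, profile, payload_type, rate, params):
    """Return the "m=", "a=rtpmap" and "a=fmtp" lines for a text stream.

    params maps fmtp parameter names to values; the insertion order of
    the mapping is preserved in the resulting "a=fmtp" line.
    """
    fmtp = "; ".join(f"{name}={value}" for name, value in params.items())
    return [
        f"m=video {port} RTP/{profile} {payload_type}",
        f"a=rtpmap:{payload_type} 3gpp-tt/{rate}",
        f"a=fmtp:{payload_type} {fmtp}",
    ]


# Hypothetical example values, chosen only for illustration.
example = sdp_lines(
    49170, "AVP", 98, 1000,
    {"sver": "60", "width": 176, "height": 40,
     "tx": 0, "ty": 144, "layer": 0},
)
```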

8.2 Parameter Usage in the SDP Offer/Answer Model

   In this section the meaning of the SDP parameters defined in this
   document within the Offer/Answer (O/A) [13] context is explained.

   In unicast, sender and receiver typically negotiate the streams, i.e.
   which codecs and parameter values are used in the session.  This is
   also possible in multicast to a lesser extent.

   As stated in the O/A model, some "fmtp" (payload-format-specific)
   parameters have a clear meaning and shall be included in the answer
   as present in the offer.  Other parameters may need to be set among
   parties, because it is not clear that offerer and answerer shall use
   the same values.

   These considerations apply to both 3GP file streaming and live
   streaming scenarios.

8.2.1 Unicast

   The clock rate parameter "rate" in the offer MUST be included
   verbatim in the answer or else the stream MUST be removed or the
   session MUST be rejected.

   The parameters "tx", "ty", "layer", "height", "width", "tx3g",
   "brand", "cbrand" and "mver" are declarative parameters.  This means
   that an offerer using these parameters only specifies which values
   are going to be used for the sent stream or which values the offerer
   prefers or is able to support for the received stream (recvonly or
   sendrecv streams).  Offerer and answerer MAY use different values;
   thereby an answerer MAY include these parameters in the answer or
   not.  If included, these values MAY be the same values or different.
   Upon receiving the answer, an offerer MUST be prepared to receive a
   stream with the values signalled in the answer, in the case of
   recvonly or sendrecv streams.

   The "spldesc" parameter is used to express an explicit preference (or
   setting) of the offerer for the timed text stream to be received (or
   sent): the text stream uses either a reduced number of out-of-band
   sample descriptions or it MAY use both in-band and out-of-band
   sample descriptions.  Therefore, if present, it MUST be used
   symmetrically, i.e. included verbatim in the answer.  Otherwise the
   stream MUST be removed or the session rejected.

   The timed text media format version of the stream (3GPP
   specification) MUST be negotiated.  This is accomplished with the
   "sver" parameter and it is negotiated as follows:

     For streams set to sendonly or sendrecv, the answerer MAY downgrade
     the "sver" parameter list by deleting version values, starting
     with the newer versions (first values).  An answerer MUST NOT add
     new values to this list.  Upon receiving the answer, the offerer
     MUST accordingly downgrade the stream, e.g. by not sending newer
     extensions.  For sendrecv streams, the offerer MUST additionally be
     prepared to receive a stream according to the version signalled by
     the answerer.

     For streams set to recvonly, the answerer MAY similarly modify the
     "sver" parameter.  Upon receiving the answer, the offerer MUST be
     ready to receive the stream in the signalled version as set by the
     answerer.

   In order to avoid failed negotiations it is RECOMMENDED for the
   offerer to include an exhaustive list of versions it supports.  On
   the other hand, since answerers MUST NOT add new values to the "sver"
   list, an offerer MAY intentionally restrict the version it wishes to
   receive or send by listing only the desired versions in the offer.
   I.e. either the answerer supports that particular version or it MUST
   remove the stream (or reject the session).
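
   The answerer side of the "sver" negotiation described above might be
   sketched as follows.  This Python fragment is a simplified
   illustration under stated assumptions: the function name answer_sver
   is hypothetical, version values are handled as opaque strings, and
   the answerer is assumed to support every version older than the
   newest one it supports.

```python
def answer_sver(offered, supported):
    """Downgrade an offered "sver" list for use in the answer.

    offered is ordered from newest (most preferred) to oldest.  Values
    are deleted from the front of the list until a version known to the
    answerer is found; no value is ever added.  None is returned when
    no offered version is supported, in which case the stream has to be
    removed or the session rejected.
    """
    remaining = list(offered)
    while remaining and remaining[0] not in supported:
        remaining.pop(0)  # drop the newest remaining version
    return remaining or None
```

   For example, with a hypothetical offer of ["62", "61", "60"] and an
   answerer that knows "61" and "60", the answer would carry "61,60".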

8.2.2 Multicast

   In this case all parameters MUST be used symmetrically in order for
   all participants to have the same view of the multicast session.
   Otherwise the stream MUST be removed or the session MUST be rejected.

8.3 Parameter Usage outside of Offer/Answer

   SDP may also be employed outside of the Offer/Answer context, for
   instance for multimedia sessions that are announced through the
   Session Announcement Protocol (SAP) [14], or streamed through the
   Real Time Streaming Protocol (RTSP) [15].

   In this case, the receiver of a session description MUST support the
   parameters and given values for the streams or else it MUST reject
   the session.  It is the responsibility of the sender of the session
   descriptions to define the session parameters so that the probability
   of unsuccessful session setup is minimized.  This is out of the scope
   of this document.


9. IANA Considerations

   IANA is requested to register the MIME subtype name "3gpp-tt" for the
   media type "video" as specified in Section 7 of this document.


10. Security considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [3].


   In particular, an attacker may invalidate the current set of valid
   sample descriptions at the client by means of repeating a packet with
   an old sample description, i.e. a replay attack.  This would mean
   that the display of the text would be corrupted, if displayed at all.
   Another form of attack may consist of sending redundant fragments
   whose boundaries do not match the exact boundaries of the originals.
   This may cause a decoder to crash.

   These types of attack may easily be avoided by using source
   authentication and integrity protection.

   Additionally, peers in a timed text session may desire to retain
   privacy in their communication, i.e. confidentiality.

   This payload format does not provide any mechanisms for achieving
   these.  Confidentiality, integrity protection and authentication have
   to be solved by a mechanism external to this payload format, e.g.
   SRTP [10].


11. References

11.1 Normative References

   [1]  Transparent end-to-end packet switched streaming service (PSS);
     Timed Text Format (Release 6), TS 26.245 v 0.1.6, Working Draft,
     July 2003.

   [2]  ISO/IEC 14496-1:2001/AMD5, "Information technology - Coding of
     audio-visual objects - Part 1: Systems, ISO Base Media File
     Format", 2003.

   [3]  H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A
     Transport Protocol for Real-Time Applications", RFC 3550, July
     2003.

   [4]  M. Handley, V. Jacobson, "SDP: Session Description Protocol",
     RFC 2327, April 1998.

   [5]  S. Bradner, "Key words for use in RFCs to indicate requirement
     levels," BCP 14, RFC 2119, IETF, March 1997.

   [6]  S. Josefsson (Ed.), "The Base16, Base32, and Base64 Data
     Encodings", RFC 3548, July 2003.

11.2 Informative References

   [7]  J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
     Forward Error Correction", RFC 2733, December 1999.

   [8]  C. Perkins, O. Hodson, "Options for Repair of Streaming Media",
     RFC 2354, June 1998.


   [9]  W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)",
     August, 2001.

   [10] M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M.
     Naslund, K. Norrman, "The Secure Real-Time Transport Protocol",
     RFC 3711, March 2004.

   [11] J. Rey et al., "RTP Retransmission Payload Format", draft-ietf-
     avt-rtp-retransmission-10.txt, work in progress, January 2004.

   [12] Van der Meer et al., "RTP Payload Format for Transport of MPEG-4
     Elementary Streams", RFC 3640, November 2003.

   [13] J. Rosenberg, H. Schulzrinne, "An Offer/Answer Model with the
     Session Description Protocol (SDP)", RFC 3264, June 2002.

   [14] M. Handley, et al., "Session Announcement Protocol", RFC 2974,
     October 2000.

   [15] H. Schulzrinne, et al., "Real Time Streaming Protocol (RTSP)",
     RFC 2326, April 1998.

   [16] Transparent end-to-end packet switched streaming service (PSS);
     Protocols and codecs (Release 5), TS 26.234 v 5.6.0, Working
     Draft, September 2003.

   [17] H. Schulzrinne, S. Casner, "RTP Profile for Audio and Video
     Conferences with Minimal Control", RFC 3551, July 2003.

   [18] F. Yergeau, "UTF-8, a transformation format of Unicode and ISO
     10646", RFC 2044, October 1996.

   [19] P. Hoffman, F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC
     2781, February 2000.


12. Annexes

12.1 Dynamic SIDX wraparound mechanism

   This mechanism MUST be implemented if the implementation uses TYPE 5
   units.

   As mentioned in Section 3.1.2, dynamic SIDX values remain active
   either during the entire duration of the session (if used just once)
   or in different intervals of it (if used more than once).  Although 64
   sample descriptions should cover the needs of most timed text
   applications, a wraparound mechanism to handle the exception is
   described here.  In the following, SIDX value means dynamic SIDX
   value.

   There is a sliding window of 64 active SIDX values.  Values within
   the window are active, all others are considered inactive.  An SIDX
   value becomes "active" if at least one sample description identified
   by that SIDX has been received.  Since sample descriptions MAY be
   sent redundantly, it is possible that a client receives a given SIDX
   several times.  However, the receiver SHALL ignore redundant sample
   descriptions and it MUST use the already cached copy.  The guard
   range of inactive values ensures that the correct association
   SIDX <-> sample description is always used.

   The following algorithm is used to maintain the dynamic SIDX values:

     Let X be the SIDX of the last received sample description.  Let Y
     be a value within the allowed range for dynamic SIDX: [0,127], and
     different from X.

        1. Initialize all dynamic SIDX values as inactive.  For stored
          content, read the sample description index in the Sample to
          Chunk box ("stsc") for that sample.  For live streaming, the
          first value MAY be zero or any other value in the interval
          above.  The initial value is SIDX=X.  Go to step 2.
        2. First in-band sample description with SIDX=X is received. Go
          to step 3.
        3. Set all SIDX=Y inactive if inside the interval [X+1
          modulo(128), X+64 modulo(128)].  Otherwise, set SIDX=Y as
          active.  Go to step 4.
        4. Wait for next sample description.  Upon reception of a sample
          description with SIDX=X do:
             a. If X is currently active, then wait for next SIDX (do
               nothing).
             b. Else go to step 3.

   Example,

        if X=4, any SIDX in the interval [5,68] is inactive.  Active
        SIDX values are in the complementary interval [69,127] plus
        [0,4].  Once the client is initialized, the interval of active
        SIDX values MUST change whenever a sample description with an
        inactive SIDX value is received.  E.g., if the client receives a
        SIDX=6, then the active interval is now different: [0,6] plus
        [71,127].  However, if the received SIDX is in the current valid
        interval no change SHALL be applied.  This means that at any
        instant a maximum of 64 SIDX values are valid, whereas the total
        of values used might be over 64.
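
   The algorithm and the example above can be sketched in Python as
   follows.  This is a minimal sketch and not a reference
   implementation: the names SidxWindow, active_values and
   on_sample_description are hypothetical, and caching of the sample
   descriptions themselves is left out.

```python
SIDX_VALUES = 128  # dynamic SIDX values lie in the range [0,127]
GUARD_SIZE = 64    # number of inactive values following X


def active_values(x):
    """Return the set of active SIDX values after receiving SIDX=x.

    The values in [x+1 modulo(128), x+64 modulo(128)] are inactive;
    all remaining values are active (step 3).
    """
    inactive = {(x + k) % SIDX_VALUES for k in range(1, GUARD_SIZE + 1)}
    return set(range(SIDX_VALUES)) - inactive


class SidxWindow:
    def __init__(self):
        self.active = set()  # step 1: all values start as inactive

    def on_sample_description(self, x):
        """Process a sample description with SIDX=x (steps 2 to 4).

        Returns True if the window moved, False if x was already
        active and nothing changed.
        """
        if x in self.active:
            return False
        self.active = active_values(x)
        return True
```

   Feeding SIDX=4 and then SIDX=6 into a SidxWindow reproduces the
   active intervals of the example: [69,127] plus [0,4], followed by
   [71,127] plus [0,6].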

12.2 Basics of the 3GP File Structure

   This section provides a coarse overview of the 3GP file structure.

   Each 3GP file consists of "Boxes".  Boxes start with a header, which
   indicates both size and type contained.  In general, a 3GP file
   contains the File Type Box (ftyp), the Movie Box (moov), and the
   Media Data Box (mdat).  The Movie Box and the Media Data Box, serving
   as containers, include their own boxes for each medium.  Similarly,
   each box type may include a number of boxes.  See the ISO Base Media
   File Format [2] for a complete list of possibilities.

   In the following, only those boxes that are useful for the purposes
   of this payload format are mentioned.

   The File Type Box identifies the type and properties of a 3GP file.
   The File Type Box contents comprise the major brand, the minor
   version and the compatible brands.  When streamed with RTP, these are
   communicated via out-of-band means, such as SDP.

   The Movie Box (moov) contains one or more Track Boxes (trak) which
   include information about each track.  A Track Box contains, among
   others, the Track Header Box (tkhd), the Media Header Box (mdhd) and
   the Media Information Box (minf).

   The Track Header Box specifies the characteristics of a single track,
   where a track is, in this case, the streamed text during a session.
   Exactly one Track Header Box is present for a track.  It contains
   information about the track, such as the spatial layout (width and
   height), the video transformation matrix and the layer number.  Since
   these pieces of information are essential and static, i.e. constant
   for the duration of the session, they MUST be sent prior to the
   transmission of any text samples.  See the ISO base media file format
   [2] for details about the definition of the conveyed information.

   The Media Header Box contains the timescale or number of time units
   that pass in one second, i.e. cycles per second or Hertz.  The Media
   Information Box includes the Sample Table Box (stbl) which itself
   contains the Sample Description Box (stsd), the Decoding Time to
   Sample Box (stts), the Sample Size Box (stsz) and the Sample to Chunk
   Box (stsc).  Sample descriptions for each text sample are encoded as
   "tx3g" sample entries in the Sample Description Box (stsd).

   The Sample Table Box (stbl) contains all the time and data indexing
   of the media samples in a track.  Using the tables here, it is
   possible to locate samples in time, determine their type, and
   determine their size, container, and offset into that container.

   Finally, the Media Data Box contains the media data itself.  In timed
   text tracks this box contains text samples.  Its equivalents for
   audio and video are audio and video frames, respectively.  The text
   sample consists of the text length, the text string, and one or
   several Modifier Boxes.  The text length is the size of the text in
   bytes.  The text string is the plain text to render.  A Modifier Box
   carries additional rendering information for the text, such as
   colour, font, etc.

12.3 Usage of 3GP file information for transport in RTP

   For the purpose of streaming timed text contents, some values in the
   boxes contained in a 3GP file are mapped to fields of this payload
   header.  This section explains where to find and how to use those
   values.

   From the Track Header Box (tkhd):

        o tx, ty: these values are, respectively, the third to last and
          second to last values in the transformation matrix.  These
          values are fixed-point 16.16 values, restricted to be integers
          (the lower 16 bits of each value shall be zero).  Therefore,
          only the first 16 bits are used.

        o width, height: they also have the same name in the box and
          the payload header.  Similarly as above, only the first 16
          bits are used, the rest is zero.

        o layer: all 16 bits are used.
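
   Extracting these values might look as follows in Python.  The sketch
   assumes that the nine matrix entries of the Track Header Box have
   already been parsed into unsigned 32-bit integers in the storage
   order {a,b,u,c,d,v,x,y,w} of the ISO base media file format [2], so
   that x and y sit at indices 6 and 7.  The function names are
   hypothetical and sign handling of the offsets is not covered.

```python
def fixed_16_16_to_int(value):
    """Return the integer part of a 16.16 fixed-point value.

    The lower 16 bits are required to be zero, so only the upper
    16 bits carry information.
    """
    if value & 0xFFFF:
        raise ValueError("fractional part must be zero")
    return (value >> 16) & 0xFFFF


def layout_from_tkhd(matrix, width, height, layer):
    """Map parsed Track Header Box fields to the layout parameters."""
    return {
        "tx": fixed_16_16_to_int(matrix[6]),
        "ty": fixed_16_16_to_int(matrix[7]),
        "width": fixed_16_16_to_int(width),
        "height": fixed_16_16_to_int(height),
        "layer": layer & 0xFFFF,  # all 16 bits are used
    }
```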

   From the Sample Table Box (stbl) the following information is carried
   in each RTP packet using this payload format:

        o the Sample Description Box (stsd): this stsd box provides
          information on the basic characteristics of text samples.
          Each entry is a sample entry box of type "tx3g".  An example
          of the information contained in a sample entry could be the
          font size or the background color.  These pieces of
          information are commonly used by many text samples during the
          session.  Each sample entry "tx3g" is transported either in-
          band or out-of-band.

        o the Decoding Time to Sample Box (stts): the 24 least
          significant bits of the "sample_delta" are mapped to the
          field SDUR (Text Sample Duration).

        o the Sample Size Box (stsz): the 16 least significant bits of
          the "sample_size" or "entry_size" (depending on whether the
          sample size is fixed or variable) indicate the length (in
          bytes) of the text string plus any modifier boxes that may be
          in that text sample.  This value is directly mapped to the
          SLEN field defined in the TYPE 2 header with an exception: if
          the text string is encoded using UTF-16, two bytes have to be
          subtracted from the sample size to account for the 16-bit
          byte order mark, which is not carried in RTP payloads.  At
          the receiver, the reverse operation MUST be done to re-
          assemble the text sample.

        o the Sample to Chunk Box (stsc): the value of the
          "sample_description_index" for that sample in the Sample to
          Chunk Box is mapped to the field SIDX (Text Sample Entry
          Index).  The Sample to Chunk Box associates each text sample
          with its corresponding sample description entry in the
          Sample Description Box (stsd, see above).  Since the sample
          description may vary during the session, the SIDX association
          is sent together with the text samples using this payload
          format.
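
   The field mappings listed above can be summarised in a short Python
   sketch.  It assumes that the values have already been read from the
   "stts", "stsz" and "stsc" boxes; the function name payload_fields and
   the utf16 flag are hypothetical.

```python
def payload_fields(sample_delta, sample_size, sample_description_index,
                   utf16):
    """Derive the SDUR, SLEN and SIDX values for one text sample.

    SDUR takes the 24 least significant bits of sample_delta.  SLEN
    takes the 16 least significant bits of the sample size, reduced by
    two bytes when the text is UTF-16 encoded, since the byte order
    mark is not carried in RTP payloads.
    """
    size = sample_size - 2 if utf16 else sample_size
    return {
        "SDUR": sample_delta & 0xFFFFFF,
        "SLEN": size & 0xFFFF,
        "SIDX": sample_description_index,
    }
```

   A receiver re-assembling the text sample would perform the reverse
   operation and add the two bytes back for UTF-16 text.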


13. Acknowledgements

   The authors would like to thank Magnus Westerlund, Dave Singer, Jan
   van der Meer and Colin Perkins for their comments and suggestions to
   this document.


14. Authors' Addresses

   Jose Rey                                     rey@panasonic.de
   Panasonic European Laboratories GmbH
   Monzastr. 4c
   D-63225 Langen, Germany
   Phone: +49-6103-766-134
   Fax:   +49-6103-766-166

   Yoshinori Matsui             matsui.yoshinori@jp.panasonic.com
   Matsushita Electric Industrial Co., LTD.
   1006 Kadoma
   Kadoma-shi, Osaka, Japan
   Phone: +81 6 6900 9689
   Fax:   +81 6 6900 9699


15. IPR Notices

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at ietf-
   ipr@ietf.org.



16. Full Copyright Statement

   Copyright (C) The Internet Society (2004).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


17. Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.