Internet Engineering Task Force                 Audio-Video Transport WG
INTERNET-DRAFT              J. Geagan, K. Gong, A. Periyannan, D. Singer
draft-singer-rtp-qtfile-00                          Apple Computer, Inc.
                                                          March 13, 1998
                                             Expires: September 13, 1998

            Support for RTP in a stored QuickTime Movie File

Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress.''

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).


Distribution of this document is unlimited.


Abstract

   This document proposes structures within a QuickTime movie file which
   permit easy transmission of the media content over RTP.  This
   specification is intended to assist those who wish to stream stored
   movies over RTP, those who wish to prepare movies for streaming, and
   those who wish to record into QuickTime while preserving
   RTP information. The bit-stream(s) of RTP packets are normally
   compliant with the RTP payload definitions for their content, and
   full inter-operability can be achieved. Each QuickTime media track
   within a movie is sent over a separate RTP session and synchronized
   using standard RTP techniques. This specification builds on the
   published QuickTime file format specification.

1 Introduction

   This document outlines how a set of sessions using the Real-time
   Transport Protocol (RTP) [1] may be transmitted by a server program



J. Geagan, K. Gong, A. Periyannan, D. Singer                    [Page 1]^L





Internet Draft         draft-singer-rtp-qtfile-00          March 12 1998


   by reading a QuickTime movie.  RTP is a generic protocol designed to
   carry real-time media data, along with synchronization information,
   over a datagram protocol (typically UDP over IP).

   QuickTime files form the storage basis of the QuickTime media
   architecture; however, it is not necessary to use the QuickTime
   software to read, construct, or stream RTP from the files. The file
   format, without support for streaming or RTP, is fully described in
   the published specification [2].

   The file format is capable of referring to media data in other files;
   this enables re-use of content. These other files need not be
   structured as QuickTime movies, and a number of 'foreign' formats can
   thus be streamed over RTP under this specification, provided that
   they can also be described by the QuickTime movie (i.e. described by
   the movie meta-data), and that the streaming server is willing and
   able to follow the links to these other files.

2 QuickTime File Format Overview

   This section gives a brief overview of the file format. Readers
   wanting a detailed description are encouraged to refer to the
   published specification [2].

   A fundamental underlying concept in the QuickTime file format is that
   the physical structure of the media data (the mapping of the media
   onto physical storage records) is independent of the logical
   structure of the media file. A QuickTime media composition is
   described by a set of "movie" meta-data; this meta-data provides
   declarative, structural/compositional, and temporal information about
   the actual media data.

   The media data may be in the same file as the descriptive logical
   data (i.e., with the "movie" meta-data) or in separate files. A movie
   structured into one file is commonly called "flat" or "self-
   contained". Movies which are not self-contained may reference some or
   all of their media data in other files.

   This separation between logical organization and physical
   organization makes the QuickTime file format ideally suited to
   optimization in different ways for different scenarios. When editing
   and compositing, this means that media data need not be copied or
   re-coded as edits are applied and media is re-ordered; the meta-data
   file may be extended and temporal mapping information adjusted. When
   editing is completed, the relevant media data and meta-data may be
   rewritten into a single, interleaved, optimized file for efficient
   local or network access. However, both the structured and the
   optimized files are valid QuickTime files, and both may be inspected,
   played, streamed, and reworked.

   The use of movies which are not self-contained enables the same basic
   media data to be used and re-used in any number of presentations.
   This same advantage applies when serving, as will be seen below.

   In both editing and streaming, this also permits any number of other
   files to be treated as part of a presentation without copying the
   media data which they contain. Editing can change and re-write just
   the meta-data in the movie file, which is much quicker than reading
   and re-writing all the media data.

   The QuickTime file is divided into a set of objects, called atoms.
   Each object starts with an atom header, which declares its size and
   type:


           class Atom {
                   int(32) size;
                   char    type[4];
                   int(8)  contents[];
           }


   The size field is a 32-bit integer giving the object's length in
   bytes, including the size and type header fields. There is also
   provision for 64-bit size fields. The
   type field is four characters (usually printable), to permit easy
   documentation and identification. The data in an object after the
   type field may be fields, a sequence of contained objects, or both.
   All field data are stored in big-endian format.
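
   As an illustration only, the following Python sketch walks a sequence
   of objects using the header layout described above. The function
   names are hypothetical. The treatment of 64-bit sizes (a 32-bit size
   of 1 followed, after the type, by a 64-bit size) is an assumption
   taken from the QuickTime file format documentation; it is not
   defined in this document.

```python
import struct

def read_atom_header(buf, pos=0):
    """Return (type, header_length, total_size) for the object at pos."""
    # Big-endian 32-bit size followed by a four-character type.
    size, raw_type = struct.unpack_from(">I4s", buf, pos)
    header_length = 8
    if size == 1:
        # Assumed convention: a 64-bit size follows the type field.
        (size,) = struct.unpack_from(">Q", buf, pos + 8)
        header_length = 16
    return raw_type.decode("latin-1"), header_length, size

def walk_atoms(buf, start=0, end=None):
    """Yield (type, contents_start, contents_end) for each object."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        atom_type, header_length, size = read_atom_header(buf, pos)
        if size < header_length or pos + size > end:
            break  # malformed or truncated object; stop walking
        yield atom_type, pos + header_length, pos + size
        pos += size
```

   Because the contents of an object may themselves be a sequence of
   objects, walk_atoms can be applied again to the returned range to
   descend one level.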

   A QuickTime file consists of a sequence of objects. The two highest-
   level objects are the media-data (mdat) and the meta-data (moov)
   atoms.

   The media-data object(s) contain the actual media (for example,
   sequences of sound samples or video frames).  The format of the media
   is not constrained by the file format, and the media samples are not
   usually structured as objects. Their format is described in the
   meta-data, not by any declarations physically contiguous with them.
   So, for example, in a movie
   consisting solely of motion-JPEG, JPEG frames are stored contiguously
   in the media data with no required intervening extra headers. The
   media data within the media data objects is logically divided into
   chunks; however, there are no explicit chunk markers.

   When the QuickTime file references media data in other files, it is
   not required that these 'secondary' files be formatted to this
   specification, since these media data files are formatted as if they
   were the contents of a media object.  Since the format here does not
   require any headers or other information physically contiguous with
   the media data, it is possible for the media data to be files which
   contain 'foreign' headers (e.g. UNIX ".au" files, or AVI files) and
   for the QuickTime meta-data to contain the appropriate declarative
   information and reference the media data in the 'foreign' file.  In
   this way the file format can be used to update, without copying,
   existing bodies of material in disparate formats. Thus editing and
   serving may be done directly from these files, greatly extending
   their utility. The QuickTime file format is a true unifying concept;
   it is both an established format and able to work with, include,
   and thereby bring forward, other established formats. (The full range
   of supported file types is large; consult the QuickTime web site
   <http://www.apple.com/quicktime> for more information.)

   Free space (e.g. space left behind by an editing operation) can also
   be described by an object at this level. Any software reading the
   file should ignore free space objects, as well as objects at any
   level which it does not understand; this permits extension of the
   file at any level by introducing new objects.

   The primary meta-data is the movie object. A QuickTime file normally
   has exactly one movie object; it is typically at the beginning or end
   of the file, to permit its easy location (although this is not
   required).
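
   The rule that unrecognized objects are skipped can be sketched as
   follows. This is a minimal Python illustration, not a complete
   reader: the function name is hypothetical, and 64-bit sizes are
   deliberately left out for brevity.

```python
import struct

def find_top_level(buf, wanted):
    """Return (contents_start, contents_end) of the first top-level
    object whose type equals wanted, or None if there is none."""
    pos = 0
    while pos + 8 <= len(buf):
        size, kind = struct.unpack_from(">I4s", buf, pos)
        if size < 8 or pos + size > len(buf):
            break  # malformed, truncated, or 64-bit size (not handled)
        if kind == wanted:
            return pos + 8, pos + size
        # Free space and any type we do not understand are skipped
        # simply by stepping over 'size' bytes.
        pos += size
    return None
```

   A reader written this way never needs a list of known types in
   order to stay aligned with the object boundaries.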

   The movie header provides basic information about the overall
   presentation (its creation date, overall timescale, and so on). In
   the sequence of contained objects there would normally be at least
   one track, which describes temporally presented data. A track is a
   media stream.

   The track header provides basic information about the track (its ID,
   timescale, and so on). Information at the track level is independent
   of the media type contained in the track. Objects contained in the
   track might be references to other tracks (e.g. for complex
   compositing), or edit lists.  In this sequence of contained objects
   there would normally be a media object, which describes the media
   which is presented when the track is played.

   The media object contains declarations of the exact presentation
   required by the track (e.g. that it is sampled audio, or MIDI, or
   orientation information for a 3D scene). The type of track is
   declared by its handler.

   Within the media information there is likewise a handler declaration
   for the data handler (which fetches media data), and a data
   information declaration. This defines which files contain the media
   data for this track; it is by using this declaration that movies may
   be built which span several files.  At the lowest level, a sample
   table is used which relates the temporal aspect of the track to the
   data stored in the file:



           class sampletable {
                   int(32)            size;
                   char               type[4] = 'stbl';
                   sampledescription  sd;
                   timetosample       tts;
                   syncsampletable    syncs;
                   sampletochunk      stoc;
                   samplesize         ssize;
                   chunkoffset        coffset;
           }


   The sample description contains information about the media (e.g. the
   compression formats used in video). The time-to-sample table relates
   time in the track to the sample (by index) which should be displayed
   at that time. The sync sample table declares which of these samples
   are sync (key) samples, that is, samples not dependent on other
   samples.

   The sample-to-chunk object declares, for a sample identified by its
   index, which chunk holds its media data and which sample description
   applies to it.

   The sample size table gives the size of each sample, and the chunk
   offset table gives the offset into the containing file of the start
   of each chunk. The chunk offset table can contain 32-bit or 64-bit
   file offsets for chunks, permitting the use of very large files.
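
   The way these three tables combine can be sketched in Python. The
   in-memory shapes used here are assumptions made for illustration:
   the sample-to-chunk table is a list of runs (first_chunk,
   samples_per_chunk, description_index) with 1-based chunk numbers,
   the sample size and chunk offset tables are plain lists, and sample
   indexes are 0-based.

```python
def locate_sample(sample_index, sample_to_chunk, sample_sizes,
                  chunk_offsets):
    """Return (file_offset, size, description_index) for a sample."""
    first_sample = 0  # index of the first sample covered by this run
    for i, (first_chunk, per_chunk, desc) in enumerate(sample_to_chunk):
        # A run lasts until the next run's first chunk, or to the end.
        if i + 1 < len(sample_to_chunk):
            chunks_in_run = sample_to_chunk[i + 1][0] - first_chunk
        else:
            chunks_in_run = len(chunk_offsets) - (first_chunk - 1)
        samples_in_run = chunks_in_run * per_chunk
        if sample_index < first_sample + samples_in_run:
            within = sample_index - first_sample
            chunk = (first_chunk - 1) + within // per_chunk
            first_in_chunk = sample_index - within % per_chunk
            # Add the sizes of the earlier samples in the same chunk.
            skipped = sum(sample_sizes[first_in_chunk:sample_index])
            return (chunk_offsets[chunk] + skipped,
                    sample_sizes[sample_index], desc)
        first_sample += samples_in_run
    raise IndexError("sample index out of range")
```

   Only indexing and addition are involved; there are no chunk markers
   in the media data to search for.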

   Walking this structure to find the appropriate data to display for a
   given time is straightforward, mostly involving indexing and adding.
   Using the sync table it is then also possible to back up to the
   preceding sync sample and roll forward, 'silently' accumulating
   deltas, to the desired starting point. Note that the tables which
   give sample timing, size, and position information are constructed
   in such a way that they are naturally compact.
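
   A hedged Python sketch of this walk follows. It assumes, for
   illustration, that the time-to-sample table is a list of runs
   (sample_count, sample_duration), that the sync sample table is a
   sorted list of 0-based sample indexes, and that times are expressed
   in the track's timescale; the function names are hypothetical.

```python
import bisect

def sample_at_time(time_to_sample, t):
    """Return the 0-based index of the sample displayed at time t."""
    sample = 0
    elapsed = 0
    for count, duration in time_to_sample:
        run_length = count * duration
        if t < elapsed + run_length:
            return sample + (t - elapsed) // duration
        sample += count
        elapsed += run_length
    raise ValueError("time is past the end of the track")

def preceding_sync(sync_samples, sample_index):
    """Return the last sync sample at or before sample_index."""
    pos = bisect.bisect_right(sync_samples, sample_index)
    if pos == 0:
        raise ValueError("no sync sample at or before this sample")
    return sync_samples[pos - 1]

def samples_to_decode(time_to_sample, sync_samples, t):
    """Samples to process, in order, to start display at time t."""
    target = sample_at_time(time_to_sample, t)
    start = preceding_sync(sync_samples, target)
    # Everything before 'target' is decoded silently.
    return range(start, target + 1)
```

   Run-length coding of the durations is what keeps the timing table
   compact: a constant-rate track needs a single (count, duration)
   entry.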


3 Support for streaming protocols

   The QuickTime file format supports streaming of media data over a
   network as well as local playback. The process of sending protocol
   data units is time-based, just like the display of time-based data,
   and is therefore suitably described by a time-based format. A
   QuickTime file or 'movie' which supports streaming includes
   information about the data units to stream. This information is
   included in additional tracks of the movie called "hint" tracks.

   Hint tracks contain instructions for a streaming server which assist
   in the formation of packets.  These instructions may contain
   immediate data for the server to send (e.g. header information) or
   reference segments of the media data.  These instructions are encoded
   in the QuickTime file in the same way that editing or presentation
   information is encoded in a QuickTime file for local playback.
   Instead of editing or presentation information, information is
   provided which allows a server to packetize the media data in a
   manner suitable for streaming using a specific network transport.

   The same media data is used in a QuickTime file which contains hints,
   whether it is for local playback, or streaming over a number of
   different transport types.  Separate 'hint' tracks for different
   transport types may be included within the same file and the media
   will play over all such transport types without making any additional
   copies of the media itself.  In addition, existing media can be
   easily made streamable by the addition of appropriate hint tracks for
   specific transports.  The media data itself need not be recast or
   reformatted in any way.

   This approach to streaming is more space efficient than an approach
   that requires that the media information be partitioned into the
   actual data units which will be transmitted for a given transport and
   media format. Under such an approach, local playback requires either
   re-assembling the media from the packets, or having two copies of the
   media: one for local playback and one for streaming.  Similarly,
   streaming such media over multiple transports using this approach
   requires a separate copy of the media data for each transport. This
   is much less space efficient than hint tracks, unless the media data
   must be heavily transformed to be streamed (e.g., by the application
   of error-correcting coding techniques, or by encryption).

   Support for streaming in the QuickTime file format is based upon the
   following three design parameters:

   (1) The media data is represented as a set of network-independent
   standard QuickTime tracks, which may be played, edited, and so on, as
   normal;

   (2) There is a common declaration and base structure for server hint
   tracks; this common format is protocol independent, but contains the
   declarations of which protocol(s) are described in the server
   track(s);

   (3) There is a specific design of the server hint tracks for each
   protocol which may be transmitted; all these designs use the same
   basic structure. For example, there may be designs for RTP (for the
   Internet) and MPEG-2 transport (for broadcast), or for new standard
   or vendor-specific protocols.

   The resulting streams, sent by the servers under the direction of the
   hint tracks, need contain no trace of QuickTime information. This
   design does not require that QuickTime, or its structures or
   declaration style, be used either in the data on the wire or in the
   decoding station. For example, a QuickTime file using H.261 video and
   DVI audio, streamed under RTP, results in a packet stream which is
   fully compliant with the IETF specifications for packing those
   codings into RTP.

   The hint tracks are built and flagged so that when the presentation
   is viewed directly (not streamed), they are ignored.

3.1 RTP Hint Tracks

   This section presents an example track format for streaming RTP from
   a QuickTime movie. For brevity, only the essential aspects are
   presented here.

   In RTP, each media stream is sent as a separate RTP stream;
   multiplexing is achieved by using the port-level multiplexing of the
   underlying transport (typically UDP), not by interleaving the data
   from multiple streams into a single RTP session. Therefore each
   media track in the movie has an associated
   RTP hint track. Each hint track contains a track reference back to
   the media track which it is streaming.

   This design decides the packet size at the time the hint track is
   created; therefore, in the sample description for the hint track (a
   data structure which can contain fields specific to the 'coding' -
   which in this case is a protocol), we indicate the chosen packet
   size. Note that it is valid for there to be several RTP hint tracks
   for each media track, with different packet size choices. Other
   protocols can be parameterized in a similar way. Similarly the time-
   scale for the RTP clock is provided in the sample description.

   The hint track is related to its base media track by a single track
   reference declaration. (The RTP specification does not permit
   multiplexing of media within a single RTP stream.) The sample
   description for RTP declares the maximum packet size which this hint
   track will generate. Partial session description (SAP/SDP)
   information is stored in the track.

   Each sample in the RTP hint track contains the instructions to send
   out a set of packets which must be emitted at a given time. The time
   in the hint track is emission time, not necessarily the media time of
   the associated media.
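
   As a sketch of how a server might derive its sending schedule, the
   Python fragment below turns the timing of the hint track's samples
   into emission times in seconds. It assumes that the hint track
   carries an ordinary time-to-sample table, represented here as a list
   of (sample_count, sample_duration) runs, and a track timescale in
   units per second; both the representation and the function name are
   illustrative.

```python
from fractions import Fraction

def emission_times(time_to_sample, timescale):
    """Yield the emission time, in seconds, of each hint sample."""
    t = 0
    for count, duration in time_to_sample:
        for _ in range(count):
            # Exact rational arithmetic avoids accumulating rounding
            # error over a long track.
            yield Fraction(t, timescale)
            t += duration
```

   The RTP time stamp placed in each packet is carried separately in
   the hint sample, so these emission times affect only when the server
   sends, not what the receiver sees as media time.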

   Notice that we now describe the internal structure of samples, which
   are media data, not meta data, in the terminology of this proposal.
   These need not be structured as objects.  Each sample contains two
   areas: the instructions to compose the packets, and any extra data
   needed when sending those packets (e.g. an encrypted version of the
   media data).


           struct RTPsample {
                   int(16)    packetcount;
                   RTPpacket  packets[packetcount];
                   byte       extradata[];
           }


   Each RTPpacket structure contains the information needed to send a
   single packet. In order to separate media time from emission time, an
   RTP time stamp is
   specifically included, along with data needed to form the RTP header.
   Other header information is supplied; the algorithms for forming the
   RTP header given the information here are simple. Then there is a
   table of construction entries:


           struct RTPpacket {
                   int(32)    RTPtime;
                   int(16)    partialRTPheader;
                   int(16)    RTPsequenceseed;
                   int(16)    entrycount;
                   dataentry  constructors[entrycount];
           }
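
   This document does not spell out the header algorithm. One plausible
   reading, shown here purely as an assumption, is that
   partialRTPheader holds the first 16 bits of the RTP fixed header [1]
   (version, padding, extension, CSRC count, marker, and payload type),
   and that the server adds per-session offsets to the sequence seed and
   time stamp and supplies the SSRC. The function and parameter names in
   this Python sketch are hypothetical.

```python
import struct

def rtp_header(partial_header, sequence_seed, rtp_time, ssrc,
               sequence_offset=0, time_offset=0):
    """Build a 12-byte RTP fixed header with no CSRC entries."""
    # Sequence number and time stamp wrap at 16 and 32 bits.
    sequence = (sequence_seed + sequence_offset) & 0xFFFF
    timestamp = (rtp_time + time_offset) & 0xFFFFFFFF
    return struct.pack(">HHII", partial_header, sequence, timestamp,
                       ssrc)
```

   For example, a partial header of 0x801F would denote RTP version 2
   with payload type 31 and the marker bit clear.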



   There are various forms of the constructor. Each constructor is 16
   bytes, to make iteration easier. The first byte is a union
   discriminator:

           struct dataentry {
                   int(8) entrytype;
                   switch entrytype {
                           case immediate:
                                   int(8)  bytecount;
                                   int(8)  bytestocopy[bytecount];
                           case mediasample:
                                   int(8)  reserved[5];
                                   int(16) length;
                                   int(32) mediasamplenumber;
                                   int(32) mediasampleoffset;
                           case hintsample:
                                   int(8)  reserved[5];
                                   int(16) length;
                                   int(32) hintsamplenumber;
                                   int(32) hintsampleoffset;
                   }
           }
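
   As a hedged illustration, the framing of the three structures above
   could be read as in the following Python sketch. It relies only on
   the field widths given in this section and on the fixed 16-byte
   constructor size; the function name and the dictionary keys are
   hypothetical, and the constructors are returned undecoded.

```python
import struct

ENTRY_SIZE = 16                           # every constructor is 16 bytes
PACKET_HEADER = struct.Struct(">IHHH")    # RTPtime, partial header,
                                          # sequence seed, entry count

def parse_rtp_sample(buf):
    """Return (packets, extradata) for one RTP hint sample."""
    (packet_count,) = struct.unpack_from(">H", buf, 0)
    pos = 2
    packets = []
    for _ in range(packet_count):
        rtp_time, partial, seed, entry_count = \
            PACKET_HEADER.unpack_from(buf, pos)
        pos += PACKET_HEADER.size
        end = pos + entry_count * ENTRY_SIZE
        if end > len(buf):
            raise ValueError("truncated hint sample")
        entries = [buf[i:i + ENTRY_SIZE]
                   for i in range(pos, end, ENTRY_SIZE)]
        packets.append({"rtp_time": rtp_time,
                        "partial_header": partial,
                        "sequence_seed": seed,
                        "entries": entries})
        pos = end
    # Whatever follows the last packet is the extradata area.
    return packets, buf[pos:]
```

   The fixed constructor size means the start of each packet can be
   found without decoding any of the entries that precede it.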



   The immediate mode permits the insertion of payload-specific headers
   (e.g. the RTP H.261 header). For hint tracks where the media is sent
   unchanged, the mediasample entry then specifies the bytes to copy
   from the media track, by giving the sample number, data offset, and
   length to copy. For complex cases (e.g. encryption or forward error
   correction), the transformed data would be placed into the hint
   samples, and then hintsample mode would be used. Note that in this
   case the bytes are copied from the extradata field of an RTPsample
   in the hint track itself.
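
   The following Python sketch shows one way a server could turn a
   packet's constructors into payload bytes. Several details are
   assumptions made for illustration: the numeric discriminator values
   (this document names the cases but assigns no numbers), the padding
   of immediate entries to 16 bytes, the use of dictionaries keyed by
   sample number, and the interpretation of hintsampleoffset as an
   offset into the whole hint sample.

```python
import struct

# Hypothetical discriminator values, chosen only for this sketch.
IMMEDIATE, MEDIASAMPLE, HINTSAMPLE = 1, 2, 3
ENTRY_SIZE = 16

def immediate(data):
    """Encode an immediate constructor (at most 14 bytes of data)."""
    if len(data) > ENTRY_SIZE - 2:
        raise ValueError("immediate data too long")
    return bytes([IMMEDIATE, len(data)]) + data.ljust(ENTRY_SIZE - 2,
                                                      b"\x00")

def from_sample(kind, number, offset, length):
    """Encode a mediasample or hintsample constructor."""
    # entrytype, 5 reserved bytes, 16-bit length, two 32-bit fields.
    return struct.pack(">B5xHII", kind, length, number, offset)

def build_payload(entries, media_samples, hint_samples):
    """Concatenate the bytes selected by a run of constructors."""
    out = bytearray()
    for pos in range(0, len(entries), ENTRY_SIZE):
        entry = entries[pos:pos + ENTRY_SIZE]
        kind = entry[0]
        if kind == IMMEDIATE:
            count = entry[1]
            if count > ENTRY_SIZE - 2:
                raise ValueError("bad immediate byte count")
            out += entry[2:2 + count]
        elif kind in (MEDIASAMPLE, HINTSAMPLE):
            length, number, offset = struct.unpack_from(">HII", entry, 6)
            source = (media_samples if kind == MEDIASAMPLE
                      else hint_samples)
            out += source[number][offset:offset + length]
        else:
            raise ValueError("unknown constructor type %d" % kind)
    return bytes(out)
```

   Note that the media data itself is never rewritten; the server only
   copies byte ranges that the hint track points at.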

   Note that these structures should be flexible enough to cover not
   only the standard RTP payloads (H.261, MPEG, etc.) but also private
   packings such as QuickTime-in-RTP [3], or the generic packing which
   is now being proposed [4].

   Notice that there is no requirement that successive packets transmit
   successive bytes from the media stream. For example, to conform with
   RTP-standard packing of H.261, it is sometimes required that a byte
   be sent at the end of one packet and also at the beginning of the
   next (when a macroblock boundary falls within a byte).
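
   The overlap can be pictured with a very small Python sketch. The
   byte values and ranges are invented for illustration; each
   (offset, length) pair stands for the range named by one mediasample
   constructor.

```python
def copy_ranges(sample, ranges):
    """Return the bytes selected by each (offset, length) pair."""
    return [sample[offset:offset + length] for offset, length in ranges]

frame = bytes(range(8))
# The second range starts on the last byte of the first one, so that
# byte is sent in both packets.
first, second = copy_ranges(frame, [(0, 5), (4, 4)])
```

   Because each constructor carries its own offset and length, such
   overlapping ranges need no special case in the hint format.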


4 Open Issues

   The following open issues need to be resolved:

   -      What information is needed about the tracks (e.g. average and
   peak data rate)?

   -      For tracks which use re-transmission, how should packet re-
   transmissions be marked, and how should they be treated when seeking?

Acknowledgments

   The authors would like to thank a number of people, particularly
   Peter Hoddie (Apple Computer), William Belknap (IBM Corporation),
   Christopher Walton (Netscape), Dave Pawson (Oracle), Ronald Jacoby
   (Silicon Graphics, Inc.), and Gerard Fernando and Michael Speer (Sun
   Microsystems).

References

   [1] H. Schulzrinne, et al., "RTP: A Transport Protocol for Real-
   Time Applications", IETF RFC 1889, January 1996.

   [2] Apple Computer, Inc., "QuickTime File Format Specification", May
   1996.
   <ftp://ftp.apple.com/Quicktime/devworld/QuickTime/mac/QuickTime.pdf>.

   [3] A. Jones, et al., "RTP Payload Format for QuickTime Media
   Streams", IETF Draft, draft-ietf-avt-qt-rtp-00.txt, July 22 1997,
   Expires: January 22 1998.

   [4] A. Periyannan, et al., "Delivering Media Generically over RTP",
   IETF Draft, draft-periyannan-generic-rtp-00.txt, March 13 1998,
   Expires: September 13 1998.

Authors' Contact Information
   Alagu Periyannan
   Email: alagu@apple.com
   Tel: (408) 862 5387
   Fax: (408) 974 0234

   Jay Geagan
   Email: geagan@apple.com
   Tel: (408) 862 6562

   Kevin Gong
   Email: kevin@apple.com
   Tel: (408) 974 4175

   David Singer
   Email: singer@apple.com
   Tel: (408) 974 3162

   Apple Computer, Inc.
   One Infinite Loop, MS:302-3MT
   Cupertino  CA 95014
   USA