DISPATCH                                                     A. Amirante
Internet-Draft                                      University of Napoli
Expires: June 21, 2012                                       T. Castaldi
                                                              L. Miniero
                                                                Meetecho
                                                             S P. Romano
                                                    University of Napoli
                                                       December 19, 2011

              Session Recording for Conferences using SMIL
                     draft-romano-dcon-recording-05

Abstract

   This document deals with session recording, specifically the
   recording of multimedia conferences, both centralized and
   distributed.  Each involved medium is recorded separately, and is
   then properly tagged.  A SMIL [W3C.CR-SMIL3-20080115] metadata file
   is used to put all the separate recordings together and handle
   their synchronization, as well as the possibly asynchronous opening
   and closure of media within the context of a conference.  This SMIL
   metadata can subsequently be used by an interested user, by means
   of a compliant player, in order to passively receive a playout of
   the whole multimedia conference session.  The motivation for this
   document comes from our experience with the Meetecho conferencing
   framework, for which we implemented a recording functionality.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 21, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Terminology
   4.  Recording
     4.1.  Audio/Video
     4.2.  Chat
     4.3.  Slides
     4.4.  Whiteboard
   5.  Tagging
     5.1.  SMIL Head
     5.2.  SMIL Body
       5.2.1.  Audio/Video
       5.2.2.  Chat
       5.2.3.  Slides
       5.2.4.  Whiteboard
   6.  Playout
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
   Authors' Addresses

1.  Introduction

   This document deals with session recording, specifically the
   recording of multimedia conferences, both centralized and
   distributed.  Each involved medium is recorded separately, and is
   then properly tagged.  Such a functionality is often required in
   conferencing systems, and is of great interest to the XCON
   [RFC5239] Working Group.  The motivation for this document comes
   from our experience with the Meetecho conferencing framework, for
   which we implemented a recording functionality.  Meetecho is a
   standards-based conferencing framework, and so we tried our best to
   implement recording in a standard fashion as well.

   In the approach presented in this document, a SMIL
   [W3C.CR-SMIL3-20080115] metadata file is used to put all the
   separate recordings together and handle their synchronization, as
   well as the possibly asynchronous opening and closure of media
   within the context of a conference.  This SMIL metadata can
   subsequently be used by an interested user, by means of a compliant
   player, in order to passively receive a playout of the whole
   multimedia conference session.

   The document presents the approach by sequentially describing the
   several required steps.  In Section 4 the recording step is
   presented, with an overview of how each involved medium might be
   recorded and stored for future use.  As explained in the following
   sections, existing approaches might be exploited to achieve these
   steps (e.g., MEDIACTRL [RFC5567]).  Then, in Section 5 the tagging
   process is described, by showing how each medium can be addressed
   in a SMIL metadata file, with specific focus upon the timing and
   inter-media synchronization aspects.  Finally, Section 6 is devoted
   to describing how a potential player for the recorded session can
   be implemented and what it is supposed to achieve.

2.  Conventions

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
   RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as
   described in BCP 14, RFC 2119 [RFC2119] and indicate requirement
   levels for compliant implementations.

3.  Terminology

   TBD.

4.  Recording

   When a multimedia conference is realized over the Internet, several
   media might be involved at the same time.  Besides, these media
   might come and go asynchronously during the lifetime of the same
   conference.  This makes it quite clear that, in case such a
   conference needs to be recorded in order to allow a subsequent,
   possibly offline, playout, these media need to be recorded in a
   format that is aware of all the timing-related aspects.  A typical
   example is a videoconference with slide sharing.  While audio and
   video have a life of their own, slide changes might be triggered at
   a completely different pace.  Besides, the start of a slideshow
   might occur much later than the start of the audio/video session.
   All these requirements must be taken into account when dealing with
   session recording in a conference.  It is also important that all
   the individual recordings be taken in a standard fashion, in order
   to achieve the maximum compatibility among different solutions and
   avoid any proprietary mechanism or approach that could prevent a
   successful playout later on.

   In this document, we present our approach towards media recording in
   a conference.  Specifically, we will deal with the recording of the
   following media:

   o  audio and video streams (in Section 4.1);
   o  text chats (in Section 4.2);
   o  slide presentations (in Section 4.3);
   o  whiteboards (in Section 4.4).

   Additional media that might be involved in a conference (e.g. desktop
   or application sharing) are not presented in this document, and their
   description is left to future extensions.

4.1.  Audio/Video

   In a conferencing system compliant with [RFC5239], audio and video
   streams contributed by participants are carried in RTP channels
   [RFC3550].  These RTP channels may or may not be secured (e.g., by
   means of SRTP/ZRTP).  Whether or not these channels are secured,
   however, is not an issue in this case.  In fact, as is usually the
   case, all the participants terminate their media streams at a
   central point (a mixer entity), with which they would have a
   secured connection.  This means that the mixer would get access to
   the unencrypted payloads, and would be able to mix and/or store
   them accordingly.

   From a high-level topology point of view, this is how a recorder
   for audio and video streams could be envisaged:

              SIP   +------------+ SIP
         /----------|   XCON AS  |--------
        /           +------------+        \
       /                   |MEDIACTRL      \
      /                    |                \
   +-----+              +-----+              +-----+
   |     |     RTP      |     |   RTP        |     |
   |UA-A +<------------>+Mixer+<------------>+UA-B |
   |     |              |     |              |     |
   +-----+              +-++--+              +-----+
                         |   |
              RTP UA-A   |   | RTP UA-B (Rx+Tx)
              (Rx+Tx)    V   V
                      +----------+
                      |          |
                      | Recorder |
                      |          |
                      +----------+

                      Figure 1: Audio/Video Recorder

      [Editors' Note: this is a slightly modified version of the
      topology proposed on the DISPATCH mailing list,
      http://www.ietf.org/mail-archive/web/dispatch/current/
      msg00256.html
      where the Application Server has been specialized in an XCON-aware
      AS, and the AS<->Mixer protocol is the Media Control Channel
      Framework protocol (CFW) specified in [RFC6230].]

   That said, actually recording audio and video streams in a conference
   may be accomplished in several ways.  Two different approaches might
   be highlighted:

   1.  recording each contribution from/to each participant in a
       separate file (Figure 2);
   2.  recording the overall mix (one for audio and one for video, or
       more if several mixes for the same media type are available) in
       a dedicated file (Figure 3).

                                +-------+
                                | UAC-C |
                                +-------+
                                    "
                            C (RTP) "
                                    "
                                    "
                                    v
  +-------+  A (RTP)           +----------+           B (RTP)  +-------+
  | UAC-A |===================>| Recorder |<===================| UAC-B |
  +-------+                    +----------+                    +-------+
                                    *
                                    *
                                    *
                                    ****> A.gsm, A.h263
                                    ****> B.g711, B.h264
                                    ****> C.amr

                  Figure 2: Recording individual streams

                                +-------+
                                | UAC-C |
                                +-------+
                                    "
                            C (RTP) "
                                    "
                                    "
                                    v
  +-------+  A (RTP)           +----------+           B (RTP)  +-------+
  | UAC-A |===================>| Recorder |<===================| UAC-B |
  +-------+                    +----------+                    +-------+
                                    *
                                    *
                                    *
                                    ****> (A+B+C).wav, (A+B+C).h263

                     Figure 3: Recording mixed streams

   Of the two, the second is probably more feasible.  In fact, the
   first would require a potentially vast amount of separate
   recordings, which would need to be subsequently muxed and
   correlated with each other.  Besides, within the context of a
   multimedia conference, most of the times the streams are already
   mixed for all the participants, and so recording the mix directly
   would be a clear advantage.  Such an approach, of course, assumes
   that all the streams pass through a central point where the mixing
   occurs: it is the case depicted in Figure 1.  The recording would
   take place at that point.  Such a central point, the mixer (which
   in this case would also act as the recorder, or as a frontend to
   it), might be a MEDIACTRL-based [RFC5567] Media Server.
   Considering the similar nature of audio and video (both being
   RTP-based and mixed by probably the same entity), they are analysed
   in the same section of this document.  The same applies to tagging
   and playout as well.  It is important to note that, in case any
   policy is involved (e.g., moderation by means of BFCP [RFC4582]),
   the mixer would take it into account when recording.  In fact, the
   same policies applied to the actual conference with respect to the
   delivery of audio and video to the participants need to be enforced
   for the recording as well.

   In a more general way, if the mixer does not support a direct
   recording of the mixes it prepares, recording a mix can be achieved
   by attaching the recorder entity (whatever it is) as a passive
   participant to the conference.  This would allow the recorder to
   receive all the involved audio and video streams already properly
   mixed, with policies already taken into consideration.  This approach
   is depicted in Figure 4.

                                +-------+
                                |  UAC  |
                                |   C   |
                                +-------+
                                   " ^
                           C (RTP) " "
                                   " "
                                   " " A+B (RTP)
                                   v "
   +-------+  A (RTP)           +--------+  A+C (RTP)         +-------+
   |  UAC  |===================>| Media  |===================>|  UAC  |
   |   A   |<===================| Server |<===================|   B   |
   +-------+         B+C (RTP)  +--------+           B (RTP)  +-------+
                                    "
                                    "
                                    " A+B+C (RTP)
                                    "
                                    v
                              +----------+
                              | Recorder |
                              +----------+
                                    *
                                    ****> (A+B+C).wav, (A+B+C).h263

                Figure 4: Recorder as a passive participant

   Whether or not the mixer is MEDIACTRL-based, it's quite likely that
   the AS handling the multimedia conference business logic has some
   control over the mixing involved.  This means it can request the
   recording of each available audio and/or video mix in a conference,
   if only by adding the passive participant as mentioned above.
   Besides, events occurring at the media level, or the business logic
   in the AS itself, allow the AS to take note of timing information
   for each of the recorded media.  For instance, the AS may take note
   of when the video mixing started, in order to properly tag the
   video recording in the tagging phase.  Both the recordings and the
   timing event list would subsequently be used in order to prepare
   the metadata information of the audio and video in the overall
   session recording description.  Such a phase is described in
   Section 5.2.1.

   In a MEDIACTRL Media Server, such a functionality might be
   accomplished by means of the Mixer Control Package
   [I-D.ietf-mediactrl-mixer-control-package].  At the end of the
   conference, URLs to the actual recordings would be made available
   for the AS to use.  The AS might subsequently access those
   recordings according to its business logic, e.g., to store them
   somewhere else (the MS storage might be temporary) or to implement
   an offline transcoding and/or mixing of all the recordings, in
   order to obtain a single file representative of the whole
   audio/video participation in the conference.  Practical examples of
   these scenarios are presented in [I-D.ietf-mediactrl-call-flows].

   Of course, if the recording of a mix is not possible or desired,
   one could still fall back to the first approach, that is,
   individually recording all the incoming contributions.  It is the
   case, for instance, of conferencing systems which don't implement
   video mixing, but just rely instead on a switching/forwarding of
   the potentially several video streams to each participant.  This
   functionality can also be achieved by means of the same control
   package previously introduced, since it allows for the recording of
   both mixes and individual connections.  Once the conference ends,
   the AS can then decide what to do with the recordings, e.g., mixing
   them all together offline (thus obtaining an overall mix) or
   leaving them as they are.  The tagging process would then take the
   decision into account, and address the resulting media accordingly.

4.2.  Chat

   What has been said about audio and video partially applies to text
   chats as well.  In fact, just as a central mixer is usually
   involved for audio and video, for instant messaging most of the
   times the contributions by all participants pass through a central
   node, from where they are forwarded to the other participants.  It
   is the case, for instance, of XMPP [RFC3920]- and MSRP
   [RFC4975]-based text conferences.  If so, recording of the text
   part of a conference is not hard to achieve either.  The AS just
   needs to implement some form of logging, in order to store all the
   messages flowing through the text conference central node, together
   with information on the senders of these messages and
   timing-related information.  Of course, the AS may not directly be
   the text conference mixer: the same considerations apply, however,
   in the sense that the remote mixer must be able to implement the
   aforementioned logging, and must be able to receive related
   instructions from the controlling AS.  Besides, considering the
   possibly protocol-agnostic nature of the conferencing system (as
   envisaged in [RFC5239]), several different instant messaging
   protocols may be involved in the same conference.  Just as the
   conferencing system would act as a protocol gateway during the
   lifetime of the conference (i.e., provide MSRP users with the text
   coming from XMPP participants and vice versa), all the
   contributions coming from the different instant messaging protocols
   would need to be recorded in the same log, and in the same format,
   to avoid ambiguity later on.

   An example of a recorder for instant messaging is presented in
   Figure 5.

                                +-------+
                                | UAC-C |
                                +-------+
                                    ^
                           C (MSRP) " '10.11.24 Hi!'
                                    "
                                    "
                                    v
  +-------+  A (XMPP)          +----------+           B (IRC)  +-------+
  | UAC-A |<==================>| Recorder |<==================>| UAC-B |
  +-------+  '10.11.26 Hey C'  +----------+ '10.11.30 Hey man' +-------+
                                    *
                                    *
                                    *     [..]
                                    ****> 10.11.24 <User C> Hi!
                                    ****> 10.11.26 <User A> Hey C
                                    ****> 10.11.30 <User B> Hey man
                                          [..]

                   Figure 5: Recording a text conference

   The same considerations already mentioned about optional policies
   involved apply to text conferences as well: i.e., if a UAC is not
   allowed to contribute text to the chat, this contribution is excluded
   both from the mix the other participants receive and from the ongoing
   recording.

   Considerations about the format of the recording are left to
   Section 5.2.2.  Until then, we just assume the AS has a way to record
   text conferences somehow in a format it is familiar with.  This
   format would subsequently be converted to another, standard, format
   that a player would be able to access.
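
   For instance, a minimal log in a hypothetical XML form (the AS is
   free to use any internal format at this stage) might be:

   <!-- Hypothetical AS-internal chat log: names are illustrative -->
   <chatlog conference="conf45" start="10.11.22">
     <message time="10.11.24" from="User C" proto="MSRP">Hi!</message>
     <message time="10.11.26" from="User A" proto="XMPP">Hey C</message>
     <message time="10.11.30" from="User B" proto="IRC">Hey man</message>
   </chatlog>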

4.3.  Slides

   Another medium typically available in a multimedia conference over
   the Internet is the slide presentation.  In fact, slides, whatever
   format they're in, are still the most common way of presenting
   something within a collaboration framework.  The problem is that,
   most of the times, these slides are deployed in a proprietary way
   (e.g., Microsoft PowerPoint and the like).  This means that,
   besides the recording aspect of the issue, the delivery itself of
   such slides can be problematic when considered in a standards-based
   conferencing framework.

   Considering that no standard way of implementing such a
   functionality is commonly available yet, we assume that a
   conferencing framework makes such slides available to the
   participants in a conference as a slideshow, that is, a series of
   static images whose appearance might be dictated by a dedicated
   protocol.  For instance, a presenter may trigger the change of a
   slide by means of an instant messaging protocol, providing each
   authorized participant with a URL from which to get the current
   slide, with optional metadata to describe its content.

   An example is presented in Figure 6.  The presenter has previously
   uploaded the presentation in a proprietary format.  The
   presentation has been converted to images, and a description of the
   new format has been sent back to the presenter (e.g., an XML
   metadata file).  At this point, the presenter makes use of XMPP to
   inform the other participants about the current slide, by providing
   an HTTP URL to the related image.

                              +-----------+
                              | Presenter |
                              +-----------+
                                   "
                           (XMPP)  " Current presentation: f44gf
                                   " Current slide number: 4
                                   " URL: http://example.com/f44gf/4.jpg
                                   "
                                   v
 +-------+  (XMPP)            +----------+            (XMPP)  +-------+
 | UAC-A |<===================| ConfServ |===================>| UAC-B |
 +-------+                    +----------+                    +-------+
     |                                                            |
     | HTTP GET (http://example.com/f44gf/4.jpg)                  |
     v                  HTTP GET (http://example.com/f44gf/4.jpg) |
                                                                  v

                  Figure 6: Presentation sharing via XMPP
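
   As a purely illustrative sketch, the XML metadata sent back to the
   presenter after the conversion might look as follows (the element
   and attribute names are hypothetical):

   <!-- Hypothetical converted-presentation description -->
   <presentation id="f44gf" slides="10"
                 base="http://example.com/f44gf/">
      <slide number="1" src="1.jpg"/>
      <slide number="2" src="2.jpg"/>
      [..]
      <slide number="10" src="10.jpg"/>
   </presentation>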

   Under this assumption, the recording of each slide presentation
   would be relatively trivial to achieve.  In fact, the AS would just
   need to have access to the set of images (with the optional
   metadata involved) of each presentation, and to the additional
   information related to presenters and to when each slide was
   triggered.  For instance, the AS may take note of the fact that
   slide 4 from presentation "f44gf" of the example above has been
   presented by UAC "spromano" from second 56 of the conference to
   second 302.  Properly recording all those events would allow for a
   subsequent tagging, thus allowing for the integration of this
   medium in the whole session recording description, together with
   the other media involved.  This phase will be described in
   Section 5.2.3.

4.4.  Whiteboard

   To conclude the overview on the analysed media, we consider a
   further medium which is quite commonly deployed in multimedia
   conferences: the shared whiteboard.  There are several ways of
   implementing such a functionality.  While some standard solutions
   exist, they are rarely used within the context of commercial
   conferencing applications, which usually prefer to implement it in
   a proprietary fashion.

   Without delving into a discussion on this aspect, suffice it to say
   that, for a successful recording of a whiteboard session, most of
   the times it is enough to just record the individual contributions
   of each involved participant (together with the usual
   timing-related information).  In fact, this would allow for a
   subsequent replay of the whiteboard session in an easy way.  Unlike
   audio and video, whiteboarding usually is a very lightweight
   medium, and so recording the individual contributions rather than
   the resulting mix (as we suggested in Section 4.1) is advisable.
   These contributions may subsequently be mixed together in order to
   obtain a standard recording (e.g., a series of images, animations,
   or even a low-framerate video).  An example of recording for this
   medium is presented in Figure 7.

                                +-------+
                                | UAC-C |
                                +-------+
                                    "
                           C (XMPP) " 10.11.20: line
                                    "
                                    "
                                    v
 +-------+  A (XMPP)          +-----------+          B (XMPP)  +-------+
 | UAC-A |===================>| WB server |<===================| UAC-B |
 +-------+  10.10.56: circle  +-----------+    10.12.30: text  +-------+
                                    *
                                    *
                                    *
                                    ****> 10.10.56: circle (A)
                                    ****> 10.11.20: line (C)
                                    ****> 10.12.30: text (B)

                 Figure 7: Recording a whiteboard session

   The recording process may be enriched by the population of a
   parallel event list.  For instance, such a list might include
   events such as the creation of a new whiteboard, the clearing of an
   existing whiteboard, or the adding of a background image that
   replaced the previously existing content.  Such events would be
   precious in a subsequent playout of the recorded steps, since they
   would allow for a more lightweight replication in case seeking is
   involved.  For instance, if 70 drawings have been done, but at
   second 560 of the conference the whiteboard has been cleared and
   since then only 5 drawings have been added, a viewer seeking to
   second 561 would just need the clear event and the 5 subsequent
   drawings to be replicated.  In any case, further discussion upon
   the tagging process of this medium is presented in Section 5.2.4.
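
   As a purely illustrative sketch, such an event list (in a
   hypothetical format, recalling the session of Figure 7) might look
   like this:

   <!-- Hypothetical whiteboard event list: names are illustrative -->
   <whiteboard id="wb12">
      <event time="10.10.56" user="A" type="draw">circle</event>
      <event time="10.11.20" user="C" type="draw">line</event>
      <event time="10.12.22" user="A" type="clear"/>
      <event time="10.12.30" user="B" type="draw">text</event>
   </whiteboard>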

5.  Tagging

   Once the different media have been recorded and stored, and their
   timing information collected somehow, this information needs to be
   properly tagged in order to allow intra-media and inter-media
   synchronization in case a playout is invoked.  Besides, it would be
   desirable to make use of standard means for achieving such a
   functionality.  For these reasons, we chose to make use of the
   Synchronized Multimedia Integration Language
   [W3C.CR-SMIL3-20080115], which fulfills all the aforementioned
   requirements, besides being a well-established W3C standard.  In
   fact, timing information is very easy to address using this
   specification, and VCR-like controls (start, pause, stop, rewind,
   fast forward, seek and the like) are all easily deployable in a
   player using the format.

   The SMIL specification provides means to address different media by
   using custom tags (e.g., audio, img, textstream and so on), and for
   each of these media the related timing can be easily described.
   The following subsections will describe how a SMIL metadata file
   could be prepared in order to map to the media recorded as
   described in Section 4.

   Specifically, considering how a SMIL file is assumed to be
   constructed, the head will be described in Section 5.1, while the
   body (with a different focus for each medium) will be presented in
   Section 5.2.
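
   As a minimal sketch, the overall structure of the resulting SMIL
   file would be the following:

   <?xml version="1.0" encoding="UTF-8"?>
   <smil>
     <head>
       <!-- layouts and regions, as described in Section 5.1 -->
     </head>
     <body>
       <!-- timed media elements, as described in Section 5.2 -->
     </body>
   </smil>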

5.1.  SMIL Head

   As specified in [W3C.CR-SMIL3-20080115], a SMIL file is composed of
   two separate sections: a head and a body.  The head, among other
   needed information, also includes details about the allowed layouts
   for a multimedia presentation.  Considering the amount of media
   that might have been involved in a single conference, properly
   constructing such a section definitely makes sense.  In fact, all
   the involved media need to be placed so as not to prevent access to
   other concurrent media within the context of the same recording.

   For instance, this is how a series of different media might be
   placed in a layout, according to different screen resolutions:

<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <head>
    <switch systemScreenSize="800X600">
      <layout>
        <root-layout width="800" height="600" background-color="black"/>
        <region id="image0" regionName="image" fit="fill" top="310" \
                left="370" width="400" height="350" />
        <region id="video0" regionName="video" top="0" left="370" \
                width="430" height="310" fit="fill" />
        <region id="chat0" regionName="chat" fit="fill" alt="chat" \
                top="410" left="370" width="400" height="-60"/>
        <region id="wb0" regionName="wb" top="0" left="0" width="370" \
                height="520"/>
      </layout>
    </switch>
    <switch systemScreenSize="1024X768">
      <layout>
        <root-layout width="1024" height="768" \
                     background-color="black"/>
        <region id="image1" regionName="image" fit="fill" top="310" \
                left="594" width="400" height="350"/>
        <region id="video1" regionName="video" top="0" left="594" \
                width="430" height="310" fit="fill"/>
        <region id="chat1" regionName="chat" fit="fill" alt="chat" \
                top="578" left="594" width="400" height="108"/>
        <region id="wb1" regionName="wb" top="0" left="0" width="594" \
                height="688"/>
      </layout>
    </switch>
[..]

   That said, it's important that this section of the SMIL file be
   constructed properly.  In fact, the layout description also contains
   explicit region identifiers, which are referred to when describing
   media in the body section.

   TBD. (?)

5.2.  SMIL Body

   The SMIL head section described previously is very important with
   regard to presentation-related settings, but does not contain any
   timing-related information.  Such information, in fact, belongs to
   a separate section of the SMIL file, the so-called body.  This body
   contains the information on all the media involved in the recorded
   session, and for each medium timing information is provided.  This
   timing information includes not only when each medium appears and
   when it goes away, but also details on the media lifetime as well.
   By correlating the timing information for each medium, a SMIL
   reader can infer inter-media synchronization and present the
   recorded session as it was conceived to appear.

   Besides, the involved media can be grouped in the body in order to
   implement sequential and/or parallel playback involving a subset of
   the available media.  This is made possible by making use of the
   <seq> and <par> elements.  The <par> element in particular is of
   great interest to this document, since in a multimedia conference
   many media are presented to participants at the same time.
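
   As an illustrative sketch, a <seq> wrapping two <par> groups would
   present two consecutive phases of a session, each with its own set
   of parallel media (the file names are hypothetical):

   <seq>
      <par>
         <!-- media playing together in the first phase -->
         <video src="part1.avi" region="video"/>
         <textstream src="part1.rt" region="chat"/>
      </par>
      <par>
         <!-- media playing together in the second phase -->
         <video src="part2.avi" region="video"/>
         <img src="slide1.jpg" region="image"/>
      </par>
   </seq>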

   That said, it is important to be able to separately address each
   involved medium.  To do so, SMIL makes use of well-specified
   elements.  For instance, a <video> element is used to state the
   presence of a video stream in the session.  Each of these elements
   can be further customized and configured by means of ad-hoc
   attributes.  For instance, the 'src' attribute in a <video> element
   means that the actual video stream source can be found at the
   provided address.

   The element for each medium is also the place where SMIL adds
   information on when the addressed medium comes into play.  This is
   done by means of two attributes called 'begin' and 'end',
   respectively.  As the names themselves suggest, the 'begin'
   attribute gives a temporal reference for the media start, while the
   'end' attribute specifies when the media ends.  For instance, an
   element formatted in the following way:

   <video src="http://www.example.com/conference45.avi" region="box12" \
          begin="15s" end="400s"/>

   means that a video stream (whose URL is provided in 'src') must be
   played in the session starting 15 seconds after the session
   beginning, and that it must end at second 400 (i.e., 385 seconds
   later).  This information is also used when seeking through a
   session.  For instance, if a user accessing the recording seeks to
   200 seconds after the beginning, the video will appear as well, at
   the relative time of 200-15=185 seconds.

   Considering the recorded media presented in Section 4, the
   construction of the following sections of the body will be
   described:

   o  audio/video streams (in Section 5.2.1);
   o  text chats (in Section 5.2.2);
   o  slide presentations (in Section 5.2.3);
   o  whiteboards (in Section 5.2.4).

5.2.1.  Audio/Video

   In SMIL, the element to describe an audio stream is <audio>, while
   for video the element is <video>.  Considering that these two
   stream types are handled in a very similar way, only video will be
   addressed.  This is an explicit choice, for two reasons: (i) video
   is slightly more complex to address than audio, and so treating
   video makes more sense; (ii) often offline encoders/muxers will
   place the recorded elementary audio and video streams in a single
   video container, which means both streams can actually be addressed
   in a single media file.

   That said, <video> is the element used in a SMIL body to state the
   presence of an audio/video stream.  Its timing, relative to the
   other media, might be expressed by making use of a <par>/<seq>
   aggregator.  In such an element, some attributes are of great
   relevance and should be included:

   o  'src', to address the actual video file to use (usually an HTTP
      URL);
   o  'begin' and 'end', for timing information (when the video should
      appear/disappear in the session);
   o  'region', to specify where the stream will need to appear in the
      layout as configured in the head (e.g., place it in the region
      called box12).

   All this information can easily be derived from the stream as
   recorded previously (optionally re-encoded and/or re-muxed),
   together with the timing information that is part of the event log.
   The 'src', in particular, can be any video file, which means that
   an encoding of the stream for a player is quite trivial to achieve.

   Besides, as mentioned in Section 4.1, recordings may be available as
   already mixed streams, or individual streams.  In case the recording
   is already mixed, then the tagging can be done as seen in the
   previous paragraph:

   <video src="http://www.example.com/conference45.avi" region="box12" \
          begin="15s" end="400s"/>

   where this element would state the presence of an audio/video
   stream, to appear in the specified region during the specified
   range of time.  In case several recordings are available, instead,
   the tagging would be a little more complex: in fact, the metadata
   would need to address the parallel playback of the different
   recordings, which would also need to reflect the actual lifetime of
   the original streams in the conference.  For instance, if UAC A
   joined the conference much before UAC B, its contributions would
   appear in the playout accordingly.  An example of how this could be
   achieved in a SMIL metadata file is presented here:

   <par>
      [..]
      <video src="http://www.example.com/userA.avi" region="box12" \
             begin="15s" end="400s"/>
      <video src="http://www.example.com/userB.avi" region="box16" \
             begin="230s" end="521s"/>
      [..]
   </par>

   These lines tell an interested player that the two specified video
   streams (whose URLs are provided in the respective 'src'
   attributes) must be played in parallel, and in different regions.
   However, video stream 'userA.avi' starts 15 seconds after the
   beginning of the conference, while 'userB.avi' starts after 230
   seconds, reflecting the appearance of these media in the conference
   itself.

5.2.2.  Chat

   Text in SMIL can be addressed in several different ways, the most
   common ones being the <text> and <textstream> elements.  <text>,
   however, usually deals only with static text content, that is, text
   without timing information (e.g., HTML).  For this reason,
   <textstream> should be used instead, since it allows text to appear
   and disappear in real-time.

   The attributes to configure the element are basically the same as
   the ones presented for <video> (src, region, begin, end).  The
   difference, however, lies in the file to refer to in the 'src'
   attribute.  In fact, if timing information is needed, a proper
   format for timed text is needed as well.  The <textstream> element
   supports RealText markup, which is a separate markup language for
   dealing with real-time text.  It is the format used, for instance,
   for subtitle captioning.  An example of RealText is presented in
   the following lines:

   <window width="340" height="160" wordwrap="true" loop="false" \
           bgcolor="white">
      <font color="black" face="Arial" size="+0">
         <Time begin="0:00:02.2"/><br/><User C>Hi
         <Time begin="0:00:04.5"/><br/><User A>Hey C
         <Time begin="0:00:08.1"/><br/><User B>Hey man
         [..]

   This example recalls Figure 5, where the first message (by User C)
   was sent at 10.11.24.  Assuming the text conference started at
   10.11.22, the log is converted to RealText and tagged accordingly
   (e.g., User C saying his first message two seconds after the
   conference started).  The RealText file can then be addressed in
   SMIL using the aforementioned <textstream> element:

 <par>
    [..]
    <textstream src="http://example.com/chats/conf45.rt" region="chat" \
                begin="0s" end="500s"/>
    [..]
 </par>

   Once the requirement on the file format is assessed, the next step
   is obvious.  Whatever format the chat in the conference has been
   recorded into, it needs to be converted to a RealText file in order
   to have it addressed in the resulting SMIL file.  The conversion is
   usually very trivial to achieve, considering that chat logs often
   carry the same information needed in a RealText file, except for
   the presentation format.

5.2.3.  Slides

   The easiest way to deal with a slideshow and/or a shared slide
   presentation is to make use of the <img> element.  In fact, as
   anticipated in Section 4.3, slides in a presentation are most often
   composed of static content, and can be treated as images.  This
   means that addressing a complete presentation in a SMIL file can be
   achieved by following these steps:

   1.  preparing a list of images reflecting the original
       presentations (e.g., 10 images for 10 slides, or more if any
       animation was involved);
   2.  preparing the timing-related information (e.g., when slide 1
       appeared, and when it was substituted by slide 2);
   3.  placing a series of <img> elements in the SMIL metadata to
       address the first two steps.

   An example of this, recalling the scenario depicted in Figure 6, is
   presented here:

   <par>
      [..]
      <img src="http://www.example.com/f44gf/1.jpg" region="image" \
           begin="0s" end="10s"/>
      <img src="http://www.example.com/f44gf/2.jpg" region="image" \
           begin="10s" end="18s"/>
      <img src="http://www.example.com/f44gf/3.jpg" region="image" \
           begin="18s" end="30s"/>
      [..]
   </par>

   The slideshow would usually be a sequence, and so a <seq> would
   seem the most apt way to address the presentation sharing.
   Nevertheless, timing information is very important, and it's quite
   likely that several additional media will flow in parallel with the
   slides (e.g., the video stream which includes the presenter
   talking).  That's why a <par> element is used in the example
   instead, which for brevity omits the other media involved.

5.2.4.  Whiteboard

   As anticipated in Section 4.4, no standard solution is usually
   deployed when talking of whiteboarding in a conferencing system.
   For this reason, the recording process suggested in Section 4.4 is
   just a timing-aware dump of all the interactions that occurred at
   the whiteboard level.  These interactions might subsequently be
   converted into a more common format such as, for instance, a video
   or an image slideshow.  In case of a video, the same considerations
   of Section 5.2.1 would apply, since the whiteboard recording would
   actually be a video itself.  In case it is converted to a
   slideshow, the tagging process would occur as explained in
   Section 5.2.3.

   However, SMIL also allows for custom, non-standard media to be
   involved in its metadata.  This can be achieved by means of the
   standard element <ref>, which is a generic media reference.  This
   element allows for the description and addressing of non-standard
   media (or at least media the chosen SMIL specification is not aware
   of), which could be implemented in a custom player.  This means
   that, if a whiteboard has been recorded in a proprietary way, and
   this way needs, for one reason or another, to be preserved, the
   <ref> element may be used to address it: in fact, the same
   attributes previously introduced (including 'src' and the others)
   are available to this element as well.  Of course, if this approach
   is used, only a player able to understand the proprietary media
   extension would be able to replay the recorded whiteboard session.
   To make the player aware of the format employed, a 'type' attribute
   could be added as well.

   An example of how the recorded whiteboard might be addressed is
   provided here:

   <par>
      [..]
         <ref src="http://example.com/wb/wb12.txt" region="wb" \
              type="myFormat"/>
      [..]
   </par>

6.  Playout

   Once the SMIL metadata has been properly prepared, a playout of the
   recorded conference is not difficult to achieve.  In fact, an
   interested user just needs to get a SMIL-aware player supporting
   the several file formats involved, that are: (i) audio/video; (ii)
   images; (iii) RealText; (iv) the whiteboarding session, whatever
   format it has been recorded into.  Considering the standard nature
   of SMIL and of almost all the media involved, the session is likely
   to be easily accessible to many players out there in the wild.
   Anyway, the 'type' attribute of each involved medium can be used to
   check whether the related media format is supported or not.

   Additional information provided in the SMIL head (e.g., the
   <switch> elements and the <layout> they suggest) provides guidance
   for players in presenting the addressed media in the expected way.

   The sequence an interested user needs to follow in order to access
   a recorded conference session can be summarized in the following
   simplified steps:

   o  the user retrieves the SMIL file associated with the conference
      she/he is interested in (e.g., by means of HTTP or other out-of-
      band mechanisms);
   o  the SMIL file is passed to a compliant media player (which could
      have been the means to get the SMIL file in the first place);
   o  the player parses the SMIL file and checks whether all the media
      are supported; apart from explicitly non-standard media (e.g.,
      the whiteboard), the player might check if the involved media
      files are encoded in a format it supports (e.g., a video file
      encoded in H.264/MP3);
   o  the player prepares the presentation screen; it makes use of the
      information in the <head> in order to choose the right layout;
      the choice may be automatic (e.g., according to the screen
      resolution) or guided by the user;
   o  the player starts retrieving each involved media file; it may
      either retrieve each file in its entirety, or start downloading
      and then start the playout almost immediately (e.g., buffering);
      it also listens for user-generated events, like the user
      pausing/resuming the playout, or seeking to a specific time in
      the conference; if any of these events occur, it takes the
      related action (e.g., seeking to the right time for each medium
      in the conference, taking the timing information from the SMIL
      file as well).

   A general overview of the scenario can be seen in Figure 8.

+------+ 1. START    +----------+                          +----------+
| User |------------>|   User   |------------------------->| Sessions |
|      |<------------| (player) |  2. get conf45.smil      | database |
+------+  6. SHOW    +----------+                          +----------+
                       |  |  |
                       |  |  |
                       |  |  |   3. get audios and videos  +-----------+
                       |  |  +---------------------------->| WebServer |
                       |  |                                |  (video)  |
                       |  |    4. get RealText files       +-----------+
                       |  +------------------------------->|  (text)   |
                       |    5. get slide images            +-----------+
                       +---------------------------------->|  (images) |
                                                           +-----------+

      Figure 8: Retrieving and playing a recorded conference session

   In this quite oversimplified scenario, an interested viewer chooses
   to start viewing a previously recorded conference.  She/he knows
   the address of the recorded session (http://example.com/conf45.smil)
   and passes it to her/his player (1.).  Starting the playout
   triggers the retrieval of the SMIL description (2.), which may be
   achieved by means of HTTP or any other protocol.  Once the player
   has access to the description, it starts retrieving the individual
   media resources addressed there (video in 3., chat in 4., slides in
   5.) and, according to the implementation of the player, it either
   waits for all the downloads to complete or just buffers a little
   while before starting the presentation to the user (6.).

7.  Security Considerations

   TBD.

8.  Acknowledgements

   The authors would like to thank...

9.  References

   [RFC2234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 2234, November 1997.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2434]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 2434,
              October 1998.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC5567]  Melanchuk, T., "An Architectural Framework for Media
              Server Control", RFC 5567, June 2009.

   [RFC6230]  Boulton, C., Melanchuk, T., and S. McGlashan, "Media
              Control Channel Framework", RFC 6230, May 2011.

   [I-D.ietf-mediactrl-mixer-control-package]
              McGlashan, S., Melanchuk, T., and C. Boulton, "A Mixer
              Control Package for the Media Control Channel Framework",
              draft-ietf-mediactrl-mixer-control-package-14 (work in
              progress), January 2011.

   [I-D.ietf-mediactrl-call-flows]
              Amirante, A., Castaldi, T., Miniero, L., and S. Romano,
              "Media Control Channel Framework (CFW) Call Flow
              Examples", draft-ietf-mediactrl-call-flows-07 (work in
              progress), July 2011.

   [RFC5239]  Barnes, M., Boulton, C., and O. Levin, "A Framework for
              Centralized Conferencing", RFC 5239, June 2008.

   [RFC4582]  Camarillo, G., Ott, J., and K. Drage, "The Binary Floor
              Control Protocol (BFCP)", RFC 4582, November 2006.

   [W3C.CR-SMIL3-20080115]
              Bulterman, D., "Synchronized Multimedia Integration
              Language (SMIL 3.0)", World Wide Web Consortium CR CR-
              SMIL3-20080115, January 2008,
              <http://www.w3.org/TR/2008/CR-SMIL3-20080115>.

   [RFC3920]  Saint-Andre, P., Ed., "Extensible Messaging and Presence
              Protocol (XMPP): Core", RFC 3920, October 2004.

   [RFC4975]  Campbell, B., Mahy, R., and C. Jennings, "The Message
              Session Relay Protocol (MSRP)", RFC 4975, September 2007.

Authors' Addresses

   Alessandro Amirante
   University of Napoli
   Via Claudio 21
   Napoli  80125
   Italy

   Email: alessandro.amirante@unina.it

   Tobia Castaldi
   Meetecho
   Via Carlo Poerio 89
   Napoli  80100
   Italy

   Email: tcastaldi@meetecho.com

   Lorenzo Miniero
   Meetecho
   Via Carlo Poerio 89
   Napoli  80100
   Italy

   Email: lorenzo@meetecho.com

   Simon Pietro Romano
   University of Napoli
   Via Claudio 21
   Napoli  80125
   Italy

   Email: spromano@unina.it
