CLUE WG                                            M. Duckworth, Ed.
Internet Draft                                                Polycom
Intended status: Informational                            A. Pepperell
Expires: June 24, 2013                                     Silverflare
                                                             S. Wenger
                                                                 Vidyo
                                                     December 24, 2012



               Framework for Telepresence Multi-Streams
                   draft-ietf-clue-framework-08.txt


Abstract

   This memo offers a framework for a protocol that enables devices
   in a telepresence conference to interoperate by specifying the
   relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on June 24, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in


Duckworth et al.         Expires June 24, 2013                [Page 1]


Internet-Draft       CLUE Telepresence Framework        December 2012


   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents




   1. Introduction...................................................3
   2. Terminology....................................................6
   3. Definitions....................................................6
   4. Overview of the Framework/Model................................9
   5. Spatial Relationships.........................................11
   6. Media Captures and Capture Scenes.............................12
      6.1. Media Captures...........................................12
         6.1.1. Media Capture Attributes............................12
      6.2. Capture Scene............................................15
         6.2.1. Capture scene attributes............................17
         6.2.2. Capture scene entry attributes......................18
      6.3. Simultaneous Transmission Set Constraints................19
   7. Encodings.....................................................20
      7.1. Individual Encodings.....................................21
      7.2. Encoding Group...........................................22
   8. Associating Media Captures with Encoding Groups...............24
   9. Consumer's Choice of Streams to Receive from the Provider.....25
      9.1. Local preference.........................................26
      9.2. Physical simultaneity restrictions.......................26
      9.3. Encoding and encoding group limits.......................26
      9.4. Message Flow.............................................27
   10. Extensibility................................................28
   11. Examples - Using the Framework...............................28
      11.1. Three screen endpoint media provider....................28
      11.2. Encoding Group Example..................................35
      11.3. The MCU Case............................................36
      11.4. Media Consumer Behavior.................................37
         11.4.1. One screen consumer................................37
         11.4.2. Two screen consumer configuring the example........38
         11.4.3. Three screen consumer configuring the example......38
   12. Acknowledgements.............................................39
   13. IANA Considerations..........................................39
   14. Security Considerations......................................39
   15. Changes Since Last Version...................................39
   16. Authors' Addresses...........................................42






1. Introduction

   Current telepresence systems, though based on open standards such
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate
   with each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams
   in a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases-02 and to meet
   the requirements in draft-ietf-clue-telepresence-requirements-01.

   This memo conceptually distinguishes between Media Providers and
   Media Consumers.  A Media Provider provides Media in the form of
   RTP packets; a Media Consumer consumes those RTP packets.  Media
   Providers and Media Consumers can reside in Endpoints or in
   middleboxes such as Multipoint Control Units (MCUs).  A Media
   Provider in an Endpoint is usually associated with the generation
   of media for Media Captures; these Media Captures are typically
   sourced from cameras, microphones, and the like.  Similarly, the
   Media Consumer in an Endpoint is usually associated with
   Renderers, such as screens and loudspeakers.  In middleboxes,
   Media Providers and Consumers can have the form of outputs and
   inputs, respectively, of RTP mixers, RTP translators, and similar
   devices.  Typically, telepresence devices such as Endpoints and
   middleboxes would perform as both Media Providers and Media
   Consumers, the former being concerned with those devices'
   transmitted media and the latter with those devices' received
   media.  In a few circumstances, a CLUE Endpoint or middlebox may
   include only Consumer or Provider functionality, such as recorder-
   type Consumers or webcam-type Providers.

   One initial motivation for this memo and its companion documents
   has been that Endpoints according to this memo can, and usually
   do, have multiple Media Captures and Media Renderers.  While
   previous system designs can deal with such a situation, what was
   missing was a mechanism that can associate the Media Captures with
   each other in space and time.  Further, due to the potentially
   large number of RTP flows required for a Multimedia Conference
   involving potentially many Endpoints, each of which can have many
   Media Captures and Media Renderers, a sensible system design is to
   multiplex multiple RTP media flows onto the same transport
   address, so as to avoid using the port number as a multiplexing
   point, with its associated shortcomings such as NAT/firewall
   traversal difficulties.




   While the actual mapping of those RTP flows to the header fields
   of the RTP packets is not the subject of this specification, the
   large number of possible permutations of sensible options a Media
   Provider may make available to a Media Consumer makes it desirable
   to have a mechanism that narrows down the number of possible
   options that a SIP offer-answer exchange has to consider.  Such
   information is made available using protocol mechanisms specified
   in this memo and companion documents, although it should be
   stressed that its use in an implementation is optional.  Also,
   there are aspects of the control of both Endpoints and
   middleboxes/MCUs that dynamically change during the progress of a
   call, such as audio-level based screen switching, layout changes,
   and so on, which need to be conveyed.  Note that these control
   aspects are complementary to those specified in traditional SIP
   based conference management such as BFCP.  Finally, all this
   information needs to be conveyed, and the notion of support for it
   needs to be established.  This is done by the negotiation of a
   "CLUE channel", a data channel negotiated early during the
   initiation of a call.  An Endpoint or MCU that rejects the
   establishment of this data channel, by definition, is not
   supporting CLUE based mechanisms, whereas an Endpoint or MCU that
   accepts it is required to use it to the extent specified in this
   memo and its companion documents.

   A very brief outline of the call flow used by a simple system in
   compliance with this memo is as follows.

   An initial offer/answer exchange establishes a CLUE channel
   between two Endpoints.  With the establishment of that channel,
   the endpoints have consented to use the CLUE protocol mechanisms
   and have to adhere to them.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities as specified herein (which will
   typically not be sufficient to set up all media).  The Consumer in
   the Endpoint receives the information provided by the Provider,
   and can use it for two purposes.  First, it can, but is not
   necessarily required to, use the information provided to tailor
   the SDP it is going to send during the following SIP offer/answer
   exchange, and its reaction to SDP it receives in that step.  It is
   often a sensible implementation choice to do so, as the
   representation of the media information conveyed over the CLUE
   channel can dramatically cut down on the size of SDP messages used
   in the O/A exchange that follows.  Second, it takes note of the
   spatial relationship associated with the Media that are described.




   It is often sensible to take that spatial relationship into
   account when tailoring the SDP.

   This CLUE exchange is followed by an SDP offer/answer exchange
   that not only establishes those aspects of the media that have not
   been "negotiated" over CLUE, but also has the side effect of
   setting up the media transmission itself, potentially involving
   security exchanges, ICE, and so on.  This step is plain vanilla
   SIP, with the exception that the SDP used herein can in most cases
   (but need not) be considerably smaller than the SDP a system would
   typically need to exchange if there were no pre-established
   knowledge about the Provider and Consumer characteristics.

   During the lifetime of a call, further exchanges can occur over
   the CLUE channel.  In some cases, those further exchanges can be
   dealt with by Provider or Consumer without any other protocol
   activity.  For example, voice-activated screen switching, signaled
   over the CLUE channel, ought not to lead to heavy-handed
   mechanisms like SIP re-invites.  However, in other cases, after
   the CLUE negotiation an additional offer/answer exchange may
   become necessary.  For example, if both sides decide to upgrade
   the call from a single screen to a multi-screen call and more
   bandwidth is required for the additional video channels, that
   could require a new O/A exchange.

   Numerous optimizations may be possible, and are the implementer's
   choice.  For example, it may be sensible to establish one or more
   initial media channels during the initial offer/answer exchange,
   which would allow, for example, for a fast startup of audio.
   Depending on the system design, it may be possible to re-use this
   established channel using only CLUE mechanisms, thereby avoiding
   further offer/answer exchanges.

   One aspect of the protocol outlined herein, and specified in
   normative detail in companion documents, is that it makes
   available to the Media Consumer information regarding the
   Provider's capabilities to deliver Media, along with attributes of
   that Media, such as spatial relationships.  The operation of the
   Renderer inside the Consumer is unspecified in that it can choose
   to ignore some information provided by the Provider, and/or not
   render media streams available from the Provider (although it has
   to follow the CLUE protocol and, therefore, has to "accept" the
   Provider's information).  All CLUE protocol mechanisms are
   optional in the Consumer in the sense that, while the Consumer
   must be able to receive (and, potentially, gracefully acknowledge)
   CLUE messages, it is free to ignore the information provided
   therein.  Obviously, this is not a particularly sensible design
   choice.

   Legacy devices are defined herein as those Endpoints and MCUs
   that do not support the setup and use of the CLUE channel.  The
   notion of a device being a legacy device is established during the
   initial offer/answer exchange, in which the legacy device will not
   understand the offer for the CLUE channel and will, therefore,
   reject it.  This rejection indicates to the CLUE-implementing
   Endpoint or MCU that the other side of the communication is not
   compliant with CLUE, and that it should fall back to whatever
   mechanism was used before the introduction of CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP transported)
   media; and the mechanisms described herein and in companion
   documents do not change the aspects of setting up those RTP flows
   and sessions.  However, it should be noted that forms of RTP
   multiplexing of multiple RTP flows onto the same transport address
   are developed concurrently with the CLUE suite of specifications,
   and it is widely expected that most, if not all, Endpoints or MCUs
   supporting CLUE will also support those mechanisms.  Some design
   choices made in this memo reflect this coincidence in spec
   development timing.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in RFC 2119
   [RFC2119].

3. Definitions

   The terms defined below are used throughout this memo and
   companion documents and they are normative.  In order to easily
   identify the use of a defined term, those terms are capitalized.

   Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-
   right are from the point of view of a person observing the
   rendered media.  They are the opposite of stage-left and stage-
   right.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media
   encoder.  Cameras and microphones are examples of capture devices.

   Capture Encoding: A specific encoding of a media capture, to be
   sent by a media provider to a media consumer via RTP.

   Capture Scene: a structure representing the scene that is captured
   by a collection of capture devices.  A capture scene includes
   attributes and one or more capture scene entries, with each entry
   including one or more media captures.

   Capture Scene Entry: a list of media captures of the same media
   type that together form one way to represent the capture scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Individual Encoding: A variable with a set of attributes that
   describes the maximum values of a single audio or video capture
   encoding.  The attributes include: maximum bandwidth and, for
   video, maximum macroblocks (for H.264), maximum width, maximum
   height, and maximum frame rate.

   Encoding Group: A set of encoding parameters representing a total
   media encoding capability to be sub-divided across potentially
   multiple Individual Encodings.



   Endpoint: The logical point of final termination through
   receiving, decoding and rendering, and/or initiation through
   capturing, encoding, and sending of media streams.  An endpoint
   consists of one or more physical devices which source and sink
   media streams, and exactly one [RFC4353] Participant (which, in
   turn, includes exactly one SIP User Agent).  In contrast to an
   endpoint, an MCU may also send and receive media streams, but it
   is neither the initiator nor the final terminator in the sense
   that Media is Captured or Rendered.  Endpoints can be anything from
   multiscreen/multicamera rooms to handheld devices.




   Front: the portion of the room closest to the cameras.  Going
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is
   tardy in requiring that media from the mixer be sent to EACH
   participant.  I think we have practical use cases where this is
   not the case.  But the bug (if it is one) is in 4353 and not
   herein.]



   Media: Any data that, after suitable encoding, can be conveyed
   over RTP, including audio, video or timed text.

   Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture (MC) may be the source of one or more
   capture encodings.  A Media Capture may also be constructed from
   other Media streams.  A middle box can express Media Captures that
   it constructs from Media streams it receives.

   Media Consumer: an Endpoint or middle box that receives Media
   streams.

   Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given
   vendor adheres to and expects the remote telepresence system(s)
   also to adhere to.

   Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Render: the process of generating a representation from Media,
   such as displayed motion video or sound emitted from loudspeakers.

   Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.






   Stage-Left and Right: For media captures, stage-left and stage-
   right are the opposite of camera-left and camera-right.  For the
   case of a person facing (and captured by) a camera, stage-left and
   stage-right are from the point of view of that person.

   Stream: a capture encoding sent from a media provider to a media
   consumer via RTP [RFC3550].

   Stream Characteristics: the media stream attributes commonly used
   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
   resolution, profile/level etc.) as well as CLUE specific
   attributes, such as the ID of a capture or a spatial location.

   Telepresence: an environment that gives non co-located users or
   user groups a feeling of (co-located) presence - the feeling that
   a Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and
   video signals of high fidelity.

   Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining
   visual elements from separate sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   The main goals include:

   o  Interoperability

   o  Extensibility

   o  Flexibility

   Interoperability is achieved by the media provider describing the
   relationships between media streams in constructs that are
   understood by the consumer, who can then render the media.
   Extensibility is achieved through abstractions and the generality
   of the model, making it easy to add new parameters.  Flexibility
   is achieved largely by having the consumer choose what content and
   format it wants to receive from what the provider is capable of
   sending.

   A transmitting endpoint or MCU describes specific aspects of the
   content of the media and the formatting of the media streams it
   can send (advertisement); the receiving end responds to the
   provider by specifying which content and media streams it wants to
   receive (configuration).  The provider then transmits the
   requested content in the specified streams.

   This advertisement and configuration occurs at call initiation but
   may also happen at any time throughout the conference, whenever
   there is a change in what the consumer wants or the provider can
   send.

   An endpoint or MCU typically acts as both provider and consumer at
   the same time, sending advertisements and sending configurations
   in response to receiving advertisements.  (It is possible to be
   just one or the other.)
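   The advertisement/configuration exchange described above can be
   sketched as follows.  This is an illustrative Python data-structure
   sketch only, not the CLUE wire format; the names Advertisement,
   Configuration, and configure are invented for the example.

   ```python
   from dataclasses import dataclass

   @dataclass
   class Advertisement:
       captures: list          # capture IDs the provider offers, e.g. "VC1"

   @dataclass
   class Configuration:
       chosen: list            # advertised captures the consumer asks for

   def configure(adv, wanted):
       """Consumer builds a configuration from an advertisement,
       keeping only captures that were actually advertised."""
       return Configuration([c for c in wanted if c in adv.captures])

   adv = Advertisement(captures=["VC1", "VC2", "VC3", "AC1"])
   cfg = configure(adv, ["VC1", "AC1", "VC9"])   # VC9 was never offered
   ```

   The same exchange may recur whenever either side's situation
   changes during the conference, as noted above.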

   The data model is based around two main concepts: a capture and an
   encoding.  A media capture (MC), such as audio or video, describes
   the content a provider can send.  Media captures are described in
   terms of CLUE-defined attributes, such as spatial relationships
   and purpose of the capture.  Providers tell consumers which media
   captures they can provide, described in terms of the media capture
   attributes.

   A provider organizes its media captures that represent the same
   scene into capture scenes.  A consumer chooses which media
   captures it wants to receive according to the capture scenes sent
   by the provider.

   In addition, the provider sends the consumer a description of the
   individual encodings it can send, in terms of the media attributes
   of the encodings; in particular, well-known audio and video
   parameters such as bandwidth, frame rate, and macroblocks per
   second.

   The provider also specifies constraints on its ability to provide
   media, and the consumer must take these into account in choosing
   the content and capture encodings it wants.  Some constraints are
   due to the physical limitations of devices - for example, a camera
   may not be able to provide zoom and non-zoom views simultaneously.
   Other constraints are system-based, such as maximum bandwidth and
   maximum macroblocks/second.
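   A consumer's check of its chosen encodings against such system-
   based constraints might look like the following sketch.  The
   bandwidth and macroblock figures are invented for illustration and
   carry no normative meaning.

   ```python
   def fits_constraints(chosen, max_bandwidth, max_macroblocks_per_sec):
       """Check whether a set of chosen capture encodings respects the
       provider's system-based constraints (summed over all encodings)."""
       total_bw = sum(e["bandwidth"] for e in chosen)
       total_mb = sum(e["macroblocks_per_sec"] for e in chosen)
       return total_bw <= max_bandwidth and total_mb <= max_macroblocks_per_sec

   # Illustrative numbers: a 1080p30-class and a 720p30-class stream.
   chosen = [
       {"bandwidth": 4_000_000, "macroblocks_per_sec": 244_800},
       {"bandwidth": 1_500_000, "macroblocks_per_sec": 108_000},
   ]
   ok = fits_constraints(chosen, 6_000_000, 489_600)
   ```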




   The following sections discuss these constructs and processes in
   detail, followed by use cases showing how the framework
   specification can be used.

5. Spatial Relationships

   In order for a consumer to perform a proper rendering, it is often
   necessary to provide the consumer with spatial information about
   the streams it is receiving.  CLUE defines a coordinate system
   that allows media
   providers to describe the spatial relationships of their media
   captures to enable proper scaling and spatial rendering of their
   streams.  The coordinate system is based on a few principles:

   o  Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o  Coordinates can either be in real, physical units
      (millimeters), have an unknown scale or have no physical scale.
      Systems which know their physical dimensions should always
      provide those real-world measurements.  Systems which don't
      know specific physical dimensions but still know relative
      distances should use 'unknown scale'.  'No scale' is intended
      to be used where Media Captures from different devices (with
      potentially different scales) will be forwarded alongside one
      another (e.g. in the case of a middle box).

      *  "millimeters" means the scale is in millimeters

      *  "Unknown" means the scale is not necessarily millimeters,
         but the scale is the same for every capture in the capture
         scene.

      *  "No Scale" means the scale could be different for each
         capture - an MCU provider that advertises two adjacent
         captures and picks sources (which can change quickly) from
         different endpoints might use this value; the scale could be
         different and changing for each capture.  But the areas of
         capture still represent a spatial relation between captures.

   o  The coordinate system is Cartesian X, Y, Z with the origin at a
      spot of the provider's choosing.  The provider must use the
      same coordinate system with same scale and origin for all
      coordinates within the same capture scene.






   The direction of increasing coordinate values is:
   X increases from camera left to camera right
   Y increases from front to back
   Z increases from low to high
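   The axis conventions above can be captured in a small sketch.  The
   Point type and the camera_left_of helper are hypothetical names
   introduced only for this example, and the coordinate values are
   invented 'millimeters'-scale figures.

   ```python
   from collections import namedtuple

   # CLUE axis conventions: X increases camera-left to camera-right,
   # Y increases front to back, Z increases low to high.
   Point = namedtuple("Point", ["x", "y", "z"])

   def camera_left_of(a, b):
       """True if point a lies further camera-left than point b."""
       return a.x < b.x

   # Two microphones at the same height and depth, one on each side.
   left_mic = Point(-800, 0, 1100)
   right_mic = Point(800, 0, 1100)
   ```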

6. Media Captures and Capture Scenes

   This section describes how media providers can describe the
   content of media to consumers.

6.1. Media Captures

   Media captures are the fundamental representations of streams that
   a device can transmit.  What a Media Capture actually represents
   is flexible:

   o  It can represent the immediate output of a physical source
      (e.g. camera, microphone) or 'synthetic' source (e.g. laptop
      computer, DVD player).

   o  It can represent the output of an audio mixer or video
      composer.

   o  It can represent a concept such as 'the loudest speaker'.

   o  It can represent a conceptual position such as 'the leftmost
      stream'.

   To distinguish between multiple instances, video and audio
   captures are numbered such as: VC1, VC2 and AC1, AC2.  VC1 and VC2
   refer to two different video captures and AC1 and AC2 refer to two
   different audio captures.

   Each Media Capture can be associated with attributes to describe
   what it represents.

6.1.1. Media Capture Attributes

   Media Capture Attributes describe static information about the
   captures.  A provider uses the media capture attributes to
   describe the media captures to the consumer.  The consumer will
   select the captures it wants to receive.  Attributes are defined
   by a variable and its value.  The currently defined attributes and
   their values are:

   Content: {slides, speaker, sl, main, alt}




   A field with enumerated values which describes the role of the
   media capture; it can be applied to any media type.  The values
   for this attribute are the same as the mediacnt values for the
   content attribute defined in [RFC4796].  This attribute can have
   multiple values, for example content={main, speaker}.

   Composed: {true, false}

   A field with a Boolean value which indicates whether or not the
   Media Capture is a mix (audio) or composition (video) of streams.

   This attribute is useful for a media consumer to avoid nesting a
   composed video capture into another composed capture or rendering.
   This attribute is not intended to describe the layout a media
   provider uses when composing video streams.

   Audio Channel Format: {mono, stereo}

   A field with enumerated values which describes the method of
   encoding used for audio.

   A value of 'mono' means the Audio Capture has one channel.

   A value of 'stereo' means the Audio Capture has two audio
   channels, left and right.

   This attribute applies only to Audio Captures.  A single stereo
   capture is different from two mono captures that have a left-right
   spatial relationship.  A stereo capture maps to a single RTP
   stream, while each mono audio capture maps to a separate RTP
   stream.

   Switched: {true, false}

   A field with a Boolean value which indicates whether or not the
   Media Capture represents the (dynamic) most appropriate subset of
   a 'whole'.  What is 'most appropriate' is up to the provider and
   could be the active speaker, a lecturer or a VIP.

   Point of Capture: {(X, Y, Z)}

   A field with a single Cartesian (X, Y, Z) point value which
   describes the spatial location, virtual or physical, of the
   capturing device (such as camera).






   When the Point of Capture attribute is specified, it must include
   X, Y and Z coordinates.  If the point of capture is not specified,
   it means the consumer should not assume anything about the spatial
   location of the capturing device.  Even if the provider specifies
   an area of capture attribute, it does not need to specify the
   point of capture.

   Point on Line of Capture: {(X,Y,Z)}

   A field with a single Cartesian (X, Y, Z) point value (virtual or
   physical) which describes a position in space of a second point on
   the axis of the capturing device; the first point being the Point
   of Capture (see above).  This point MUST lie between the Point of
   Capture and the Area of Capture.

   The Point on Line of Capture MUST be ignored if the Point of
   Capture is not present for this capture device.  When the Point on
   Line of Capture attribute is specified, it must include X, Y and Z
   coordinates.  These coordinates MUST NOT be identical to the Point
   of Capture coordinates.  If the Point on Line of Capture is not
   specified, no assumptions are made about the axis of the capturing
   device.

   Area of Capture:

   {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3,
   Y3, Z3), top right(X4, Y4, Z4)}

   A field with a set of four (X, Y, Z) points as a value which
   describe the spatial location of what is being "captured".  By
   comparing the Area of Capture for different Media Captures within
   the same capture scene a consumer can determine the spatial
   relationships between them and render them correctly.

   The four points should be co-planar.  The four points form a
   quadrilateral, not necessarily a rectangle.

   The quadrilateral described by the four (X, Y, Z) points defines
   the plane of interest for the particular media capture.

   If the area of capture attribute is specified, it must include X,
   Y and Z coordinates for all four points.  If the area of capture
   is not specified, it means the media capture is not spatially
   related to any other media capture (but this can change in a
   subsequent provider advertisement).


   For a switched capture that switches between different sections
   within a larger area, the area of capture should use coordinates
   for the larger potential area.
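
   For example, a consumer could derive a left-to-right render order
   for the video captures in a scene by comparing their areas of
   capture.  A sketch (illustrative representation, assuming a
   coordinate system in which X increases to the right):

```python
def centroid(area):
    """Centroid of an area of capture given as four (X, Y, Z) corners:
    bottom left, bottom right, top left, top right."""
    xs, ys, zs = zip(*area)
    return (sum(xs) / 4.0, sum(ys) / 4.0, sum(zs) / 4.0)

def render_order(captures):
    """Order (name, area) pairs left to right by the X coordinate of
    each area's centroid, assuming X increases to the right."""
    return sorted(captures, key=lambda capture: centroid(capture[1])[0])

# Illustrative areas for two adjacent captures (millimeter scale)
vc0 = ("VC0", ((-2011, 2850, 0), (-673, 3000, 0),
               (-2011, 2850, 757), (-673, 3000, 757)))
vc1 = ("VC1", ((-673, 3000, 0), (673, 3000, 0),
               (-673, 3000, 757), (673, 3000, 757)))
```

   Here render_order([vc1, vc0]) places VC0 to the left of VC1.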

   EncodingGroup: {<encodeGroupID value>}

   A field with a value equal to the encodeGroupID of the encoding
   group associated with the media capture.

   Max Capture Encodings: {unsigned integer}

   An optional attribute indicating the maximum number of capture
   encodings that can be simultaneously active for the media capture.
   If absent, this parameter defaults to 1.  The minimum value for
   this attribute is 1.  The number of simultaneous capture encodings
   is also limited by the restrictions of the encoding group for the
   media capture.

6.2. Capture Scene

   In order for a provider's individual media captures to be used
   effectively by a consumer, the provider organizes the media
   captures into capture scenes, with the structure and contents of
   these capture scenes being sent from the provider to the consumer.

   A capture scene is a structure representing the scene that is
   captured by a collection of capture devices.  A capture scene
   includes one or more capture scene entries, with each entry
   including one or more media captures.  A capture scene represents,
   for example, the video image of a group of people seated next to
   each other, along with the sound of their voices, which could be
   represented by some number of VCs and ACs in the capture scene
   entries.  A middle box may also express capture scenes that it
   constructs from media streams it receives.

   A provider may advertise multiple capture scenes or just a single
   capture scene.  A media provider might typically use one capture
   scene for main participant media and another capture scene for a
   computer generated presentation.  A capture scene may include more
   than one type of media.  For example, a capture scene can include
   several capture scene entries for video captures, and several
   capture scene entries for audio captures.

   A provider can express spatial relationships between media
   captures that are included in the same capture scene.  But there
   is no spatial relationship between media captures that are in
   different capture scenes.

   A media provider arranges media captures in a capture scene to
   help the media consumer choose which captures it wants.  The
   capture scene entries in a capture scene are different
   alternatives the provider is suggesting for representing the
   capture scene.  The media consumer can choose to receive all media
   captures from one capture scene entry for each media type (e.g.
   audio and video), or it can pick and choose media captures
   regardless of how the provider arranges them in capture scene
   entries.  Different capture scene entries of the same media type
   are not necessarily mutually exclusive alternatives.

   Media captures within the same capture scene entry must be of the
   same media type - it is not possible to mix audio and video
   captures in the same capture scene entry, for instance.  The
   provider must be capable of encoding and sending all media
   captures in a single entry simultaneously.  A consumer may decide
   to receive all the media captures in a single capture scene entry,
   but a consumer could also decide to receive just a subset of those
   captures.  A consumer can also decide to receive media captures
   from different capture scene entries.

   When a provider advertises a capture scene with multiple entries,
   it is essentially signaling that there are multiple
   representations of the same scene available.  In some cases, these
   multiple representations would typically be used simultaneously
   (for instance a "video entry" and an "audio entry").  In some
   cases the entries would conceptually be alternatives (for instance
   an entry consisting of 3 video captures versus an entry consisting
   of just a single video capture).  In this latter example, the
   provider would in the simple case end up providing to the consumer
   the entry containing the number of video captures that most
   closely matched the media consumer's number of display devices.

   The following is an example of 4 potential capture scene entries
   for an endpoint-style media provider:

   1.  (VC0, VC1, VC2) - left, center and right camera video captures

   2.  (VC3) - video capture associated with loudest room segment

   3.  (VC4) - video capture zoomed out view of all people in the
   room


   4.  (AC0) - main audio

   The first entry in this capture scene example is a list of video
   captures with a spatial relationship to each other.  Determination
   of the order of these captures (VC0, VC1 and VC2) for rendering
   purposes is accomplished through use of their Area of Capture
   attributes.  The second entry (VC3) and the third entry (VC4) are
   additional alternatives of how to capture the same room in
   different ways.  The inclusion of the audio capture in the same
   capture scene indicates that AC0 is associated with those video
   captures, meaning it comes from the same scene.  The audio should
   be rendered in conjunction with any rendered video captures from
   the same capture scene.
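
   The entry-selection behavior described above can be sketched in a
   few lines.  The list structure here is illustrative, not CLUE
   syntax; a simple endpoint consumer picks the video entry whose
   capture count best matches its number of displays:

```python
def pick_video_entry(video_entries, num_displays):
    """Choose the capture scene entry whose number of video captures
    most closely matches the consumer's number of display devices."""
    return min(video_entries,
               key=lambda entry: abs(len(entry) - num_displays))

# The video entries from the example above
scene_video_entries = [
    ["VC0", "VC1", "VC2"],  # left, center and right camera captures
    ["VC3"],                # loudest room segment
    ["VC4"],                # zoomed out view of all people in the room
]
```

   A 3-screen consumer would pick (VC0, VC1, VC2), while a
   single-screen consumer would pick one of the single-capture
   entries.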

6.2.1. Capture scene attributes

   Attributes can be applied to capture scenes as well as to
   individual media captures.  Attributes specified at this level
   apply to all constituent media captures.

   Description attribute - list of {<description text>, <language
   tag>}

   The optional description attribute is a list of human readable
   text strings which describe the capture scene.  If there is more
   than one string in the list, then each string in the list should
   contain the same description, but in a different language.  A
   provider that advertises multiple capture scenes can provide
   descriptions for each of them.  This attribute can contain text in
   any number of languages.

   The language tag identifies the language of the corresponding
   description text.  The possible values for a language tag are the
   values of the 'Subtag' column for the "Type: language" entries in
   the "Language Subtag Registry" at [IANA-Lan] originally defined in
   [RFC5646].  A particular language tag value MUST NOT be used more
   than once in the description attribute list.
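
   A provider-side check of that uniqueness rule might look like the
   following sketch, with the description attribute modeled as a list
   of (text, language-tag) pairs:

```python
def valid_description(descriptions):
    """True if every (text, language_tag) pair in the description
    attribute list uses a distinct language tag, as required above."""
    tags = [tag for _text, tag in descriptions]
    return len(tags) == len(set(tags))
```

   For example, [("Main room", "en"), ("Salle principale", "fr")] is
   valid, while two entries both tagged "en" are not.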

   Area of Scene attribute

   The area of scene attribute for a capture scene has the same
   format as the area of capture attribute for a media capture.  The
   area of scene is for the entire scene, which is captured by the
   one or more media captures in the capture scene entries.  If the
   provider does not specify the area of scene, but does specify
   areas of capture, then the consumer may assume the area of scene
   is greater than or equal to the outer extents of the individual
   areas of capture.
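
   That fallback amounts to taking the outer extents (per-axis minima
   and maxima) over every corner of every advertised area of capture.
   A sketch:

```python
def scene_extents(areas_of_capture):
    """Smallest per-axis bounds consistent with the advertised areas
    of capture.  The actual area of scene may be larger than this,
    never smaller."""
    points = [p for area in areas_of_capture for p in area]
    xs, ys, zs = zip(*points)
    return ((min(xs), max(xs)), (min(ys), max(ys)), (min(zs), max(zs)))
```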

   Scale attribute

   An optional attribute indicating if the numbers used for area of
   scene, area of capture and point of capture are in terms of
   millimeters, unknown scale factor, or not any scale, as described
   in Section 5.  If any media captures have an area of capture
   attribute or point of capture attribute, then this scale attribute
   must also be defined.  The possible values for this attribute are:

      "millimeters"

      "unknown"

      "no scale"

6.2.2. Capture scene entry attributes

   Attributes can be applied to capture scene entries.  Attributes
   specified at this level apply to the capture scene entry as a
   whole.

   Scene-switch-policy: {site-switch, segment-switch}

   A media provider uses this scene-switch-policy attribute to
   indicate its support for different switching policies.  In the
   provider's advertisement, this attribute can have multiple values,
   which means the provider supports each of the indicated policies.
   The consumer, when it requests media captures from this capture
   scene entry, should also include this attribute but with only the
   single value (from among the values indicated by the provider)
   indicating the consumer's choice for which policy it wants the
   provider to use.  If the provider does not support any of these
   policies, it should omit this attribute.

   The "site-switch" policy means all captures are switched at the
   same time to keep captures from the same endpoint site together.
   Let's say the speaker is at site A and everyone else is at a
   "remote" site.

   When the room at site A is shown, all the camera images from site
   A are forwarded to the remote sites.  Therefore, at each receiving
   remote site, all the screens display camera images from site A.
   This can be used to preserve full size image display, and also
   provide full visual context of the displayed far end, site A. In
   site switching, there is a fixed relation between the cameras in
   each room and the displays in remote rooms.  The room or
   participants being shown is switched from time to time based on
   who is speaking or by manual control.

   The "segment-switch" policy means different captures can switch at
   different times, and can be coming from different endpoints.
   Still using site A as where the speaker is, and "remote" to refer
   to all the other sites, in segment switching, rather than sending
   all the images from site A, only the image containing the speaker
   at site A is shown.  The camera images of the current speaker and
   previous speakers (if any) are forwarded to the other sites in the
   conference.

   Therefore the screens in each site are usually displaying images
   from different remote sites - the current speaker at site A and
   the previous ones.  This strategy can be used to preserve full
   size image display, and also capture the non-verbal communication
   between the speakers.  In segment switching, the display depends
   on the activity in the remote rooms - generally, but not
   necessarily based on audio / speech detection.

6.3. Simultaneous Transmission Set Constraints

   The provider may have constraints or limitations on its ability to
   send media captures.  One type is caused by the physical
   limitations of capture mechanisms; these constraints are
   represented by a simultaneous transmission set.  The second type
   of limitation reflects the encoding resources available -
   bandwidth and macroblocks/second.  This type of constraint is
   captured by encoding groups, discussed below.

   An endpoint or MCU can send multiple captures simultaneously,
   however sometimes there are constraints that limit which captures
   can be sent simultaneously with other captures.  A device may not
   be able to be used in different ways at the same time.  Provider
   advertisements are made so that the consumer will choose one of
   several possible mutually exclusive usages of the device.  This
   type of constraint is expressed in a Simultaneous Transmission
   Set, which lists all the media captures that can be sent at the
   same time.  This is easier to show in an example.



   Consider the example of a room system where there are 3 cameras,
   each of which can send a separate capture covering 2 persons each:
   VC0, VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes
   at the same time - it has to either show the space where 2
   participants sit or the whole 6 seats, but not both at the same
   time.

   Simultaneous transmission sets are expressed as sets of the MCs
   that could physically be transmitted at the same time (though it
   may not always make sense to do so).  In this example the two
   simultaneous sets are shown in Table 1.  The consumer must make
   sure that it chooses one and not more of the mutually exclusive
   sets.  A consumer may choose any subset of the media captures in a
   simultaneous set, it does not have to choose all the captures in a
   simultaneous set if it does not want to receive all of them.

                           +-------------------+
                           | Simultaneous Sets |
                           +-------------------+
                           | {VC0, VC1, VC2}   |
                           | {VC0, VC3, VC2}   |
                           +-------------------+

                Table 1: Two Simultaneous Transmission Sets

   A media provider includes the simultaneous sets in its provider
   advertisement.  These simultaneous set constraints apply across
   all the capture scenes in the advertisement.  The simultaneous
   transmission sets MUST allow all the media captures in a
   particular capture scene entry to be used simultaneously.
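
   A consumer can check a proposed choice of captures against these
   rules with logic like the following sketch: a choice is valid if
   all of its captures fall within a single advertised simultaneous
   set (any subset of a set is allowed):

```python
def choice_is_valid(chosen_captures, simultaneous_sets):
    """True if every chosen capture can be transmitted at the same
    time, i.e. the chosen captures are a subset of at least one
    advertised simultaneous transmission set."""
    chosen = set(chosen_captures)
    return any(chosen <= set(s) for s in simultaneous_sets)

# The two simultaneous sets from Table 1
table1_sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
```

   With the Table 1 sets, {VC0, VC3} is a valid choice but {VC1, VC3}
   is not, since VC1 and VC3 both require the middle camera.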

7. Encodings

   We have considered how providers can describe the content of media
   to consumers.  We will now consider how providers communicate
   information about their ability to send streams.  We introduce two
   constructs - individual encodings and encoding groups.  Consumers
   then map the media captures they want onto the encodings, with the
   encoding parameters they want.  This process is described below.






7.1. Individual Encodings

   An individual encoding represents a way to encode a media capture
   to become a capture encoding, to be sent as an encoded media
   stream from the media provider to the media consumer.  An
   individual encoding has a set of parameters characterizing how the
   media is encoded.

   Different media types have different parameters, and different
   encoding algorithms may have different parameters.  An individual
   encoding can be assigned to only one capture encoding at a time.

   The parameters of an individual encoding represent the maximum
   values for certain aspects of the encoding.  A particular
   instantiation into a capture encoding might use lower values than
   these maximums.

   The following tables show the variables for audio and video
   encoding.

   +--------------+---------------------------------------------------+
   | Name         | Description                                       |
   +--------------+---------------------------------------------------+
   | encodeID     | A unique identifier for the individual encoding   |
   | maxBandwidth | Maximum number of bits per second                 |
   | maxH264Mbps  | Maximum number of macroblocks per second: ((width |
   |              | + 15) / 16) * ((height + 15) / 16) *              |
   |              | framesPerSecond                                   |
   | maxWidth     | Video resolution's maximum supported width,       |
   |              | expressed in pixels                               |
   | maxHeight    | Video resolution's maximum supported height,      |
   |              | expressed in pixels                               |
   | maxFrameRate | Maximum supported frame rate                      |
   +--------------+---------------------------------------------------+

               Table 2: Individual Video Encoding Parameters

           +--------------+-----------------------------------+
           | Name         | Description                       |
           +--------------+-----------------------------------+
           | maxBandwidth | Maximum number of bits per second |
           +--------------+-----------------------------------+

               Table 3: Individual Audio Encoding Parameters

7.2. Encoding Group

   An encoding group includes a set of one or more individual
   encodings, plus some parameters that apply to the group as a
   whole.  By grouping multiple individual encodings together, an
   encoding group describes additional constraints on bandwidth and
   other parameters for the group.  Table 4 shows the parameters and
   individual encoding sets that are part of an encoding group.

   +-------------------+-----------------------------------------------+
   | Name              | Description                                   |
   +-------------------+-----------------------------------------------+
   | encodeGroupID     | A unique identifier for the encoding group    |
   | maxGroupBandwidth | Maximum number of bits per second relating to |
   |                   | all encodings combined                        |
   | maxGroupH264Mbps  | Maximum number of macroblocks per second      |
   |                   | relating to all video encodings combined      |
   | videoEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   | audioEncodings[]  | Set of potential encodings (list of           |
   |                   | encodeIDs)                                    |
   +-------------------+-----------------------------------------------+

                           Table 4: Encoding Group

   When the individual encodings in a group are instantiated into
   capture encodings, each capture encoding has a bandwidth that must
   be less than or equal to the maxBandwidth for the particular
   individual encoding.  The maxGroupBandwidth parameter gives the
   additional restriction that the sum of all the individual capture
   encoding bandwidths must be less than or equal to the
   maxGroupBandwidth value.

   Likewise, the sum of the macroblocks per second of each
   instantiated encoding in the group must not exceed the
   maxGroupH264Mbps value.
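
   These checks can be sketched as follows, with each requested
   capture encoding carrying the bandwidth and macroblock rate it
   would actually use, plus the maxBandwidth of the individual
   encoding it instantiates (a simplified model; per-encoding
   maxH264Mbps checks would be analogous):

```python
def within_group_limits(requests, max_group_bandwidth, max_group_h264_mbps):
    """requests: one (bandwidth, h264_mbps, max_bandwidth) tuple per
    instantiated capture encoding.  Each encoding must respect its own
    individual maximum, and the sums across the group must respect the
    group-wide maxima."""
    if any(bw > max_bw for bw, _mbps, max_bw in requests):
        return False
    total_bw = sum(bw for bw, _mbps, _max_bw in requests)
    total_mbps = sum(mbps for _bw, mbps, _max_bw in requests)
    return (total_bw <= max_group_bandwidth
            and total_mbps <= max_group_h264_mbps)
```

   For instance, two 720p30 encodings (108000 macroblocks per second
   each) at 4 Mbps and 2 Mbps fit a 6 Mbps group budget, but a third
   encoding would not.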

   The following diagram illustrates the structure of a media
   provider's Encoding Groups and their contents.

   ,-------------------------------------------------.
   |             Media Provider                      |
   |                                                 |
   |  ,--------------------------------------.       |
   |  | ,--------------------------------------.     |
   |  | | ,--------------------------------------.   |
   |  | | |          Encoding Group              |   |
   |  | | | ,-----------.                        |   |
   |  | | | |           | ,---------.            |   |
   |  | | | |           | |         | ,---------.|   |
   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
   |  `.| | |           | |         | `---------'|   |
   |    `.| `-----------' `---------'            |   |
   |      `--------------------------------------'   |
   `-------------------------------------------------'

                    Figure 1: Encoding Group Structure

   A media provider advertises one or more encoding groups.  Each
   encoding group includes one or more individual encodings.  Each
   individual encoding can represent a different way of encoding
   media.  For example one individual encoding may be 1080p60 video,
   another could be 720p30, with a third being CIF.


   While a typical 3 codec/display system might have one encoding
   group per "codec box", there are many possibilities for the number
   of encoding groups a provider may be able to offer and for the
   encoding values in each encoding group.

   There is no requirement for all encodings within an encoding group
   to be instantiated at once.

8. Associating Media Captures with Encoding Groups

   Every media capture is associated with an encoding group, which is
   used to instantiate that media capture into one or more capture
   encodings.  Each media capture has an encoding group attribute.
   The value of this attribute is the encodeGroupID for the encoding
   group with which it is associated.  More than one media capture
   may use the same encoding group.

   The maximum number of streams that can result from a particular
   encoding group constraint is equal to the number of individual
   encodings in the group.  The actual number of capture encodings
   used at any time may be less than this maximum.  Any of the media
   captures that use a particular encoding group can be encoded
   according to any of the individual encodings in the group.  If
   there are multiple individual encodings in the group, then the
   media consumer can configure the media provider to encode a single
   media capture into multiple different capture encodings at the
   same time, subject to the Max Capture Encodings constraint, with
   each capture encoding following the constraints of a different
   individual encoding.

   The Encoding Groups MUST allow all the media captures in a
   particular capture scene entry to be used simultaneously.

9. Consumer's Choice of Streams to Receive from the Provider

   After receiving the provider's advertised media captures and
   associated constraints, the consumer must choose which media
   captures it wishes to receive, and which individual encodings from
   the provider it wants to use to encode the captures.  Each media
   capture has an encoding group ID attribute which specifies which
   individual encodings are available to be used for that media
   capture.

   For each media capture the consumer wants to receive, it
   configures one or more of the encodings in that capture's encoding
   group.  The consumer does this by telling the provider the
   resolution, frame rate, bandwidth, etc. when asking for capture
   encodings for its chosen captures.  Upon receipt of this
   configuration command from the consumer, the provider generates a
   stream for each such configured capture encoding and sends those
   streams to the consumer.

   The consumer must have received at least one capture advertisement
   from the provider to be able to configure the provider's
   generation of media streams.

   The consumer is able to change its configuration of the provider's
   encodings any number of times during the call, either in response
   to a new capture advertisement from the provider or autonomously.
   The consumer need not send a new configure message to the provider
   when it receives a new capture advertisement from the provider
   unless the contents of the new capture advertisement cause the
   consumer's current configure message to become invalid.

   When choosing which streams to receive from the provider, and the
   encoding characteristics of those streams, the consumer needs to
   take several things into account: its local preference,
   simultaneity restrictions, and encoding limits.

9.1. Local preference

   A variety of local factors will influence the consumer's choice of
   streams to be received from the provider:

   o  if the consumer is an endpoint, it is likely that it would
      choose, where possible, to receive video and audio captures
      that match the number of display devices and audio system it
      has

   o  if the consumer is a middle box such as an MCU, it may choose
      to receive loudest speaker streams (in order to perform its own
      media composition) and avoid pre-composed video captures

   o  user choice (for instance, selection of a new layout) may
      result in a different set of media captures, or different
      encoding characteristics, being required by the consumer

9.2. Physical simultaneity restrictions

   There may be physical simultaneity constraints imposed by the
   provider that affect the provider's ability to simultaneously send
   all of the captures the consumer would wish to receive.  For
   instance, a middle box such as an MCU, when connected to a multi-
   camera room system, might prefer to receive both individual camera
   streams of the people present in the room and an overall view of
   the room from a single camera.  Some endpoint systems might be
   able to provide both of these sets of streams simultaneously,
   whereas others may not (if the overall room view were produced by
   changing the zoom level on the center camera, for instance).

9.3. Encoding and encoding group limits

   Each of the provider's encoding groups has limits on bandwidth and
   macroblocks per second, and the constituent potential encodings
   have limits on the bandwidth, macroblocks per second, video frame
   rate, and resolution that can be provided.  When choosing the
   media captures to be received from a provider, a consumer device
   must ensure that the encoding characteristics requested for each
   individual media capture fits within the capability of the
   encoding it is being configured to use, as well as ensuring that
   the combined encoding characteristics for media captures fit
   within the capabilities of their associated encoding groups.  In
   some cases, this could cause an otherwise "preferred" choice of
   capture encodings to be passed over in favour of different capture
   encodings - for instance, if a set of 3 media captures could only
   be provided at a low resolution then a 3 screen device could
   switch to favoring a single, higher quality, capture encoding.

9.4. Message Flow

   The following diagram shows the basic flow of messages between a
   media provider and a media consumer.  The usage of the "capture
   advertisement" and "configure encodings" messages is described
   above.  The consumer also sends its own capability message to the
   provider, which may contain information about its own capabilities
   or restrictions.

   Diagram for Message Flow

            Media Consumer                         Media Provider
            --------------                         ------------
                  |                                     |
                  |----- Consumer Capability ---------->|
                  |                                     |
                  |                                     |
                  |<---- Capture advertisement ---------|
                  |                                     |
                  |                                     |
                  |------ Configure encodings --------->|
                  |                                     |

   In order for a maximally-capable provider to be able to advertise
   a manageable number of video captures to a consumer, there is a
   potential use for the consumer, at the start of CLUE, to be able
   to inform the provider of its capabilities.  One example here
   would be the video capture attribute set - a consumer could tell
   the provider the complete set of video capture attributes it is
   able to understand and so the provider would be able to reduce the
   capture scene it advertises to be tailored to the consumer.

   TBD - the content of the consumer capability message needs to be
   better defined.  The authors believe there is a need for this
   message, but have not worked out the details yet.



10. Extensibility

   One of the most important characteristics of the Framework is its
   extensibility.  Telepresence is a relatively new industry and
   while we can foresee certain directions, we also do not know
   everything about how it will develop.  The standard for
   interoperability and handling multiple streams must be future-
   proof. The framework itself is inherently extensible through
   expanding the data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by
      defining additional types of captures in addition to audio and
      video.

   o  Adding new functionality, such as 3-D capture, will require
      additional attributes describing the captures.

   o  Adding new codecs, such as H.265, can be accomplished by
      defining new encoding variables.

   The infrastructure is designed to be extended rather than
   requiring new infrastructure elements.  Extension comes through
   adding to defined types.

   Assuming the implementation is in something like XML, adding data
   elements and attributes makes extensibility easy.

11. Examples - Using the Framework

   This section shows in more detail some examples of how to use the
   framework to represent typical cases for telepresence rooms.
   First an endpoint is illustrated, then an MCU case is shown.

11.1. Three screen endpoint media provider

   Consider an endpoint with the following description:

   3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be
      provided



   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.

   Video Captures:

   o  VC0- (the camera-left camera stream), encoding group=EG0,
      content=main, switched=false

   o  VC1- (the center camera stream), encoding group=EG1,
      content=main, switched=false

   o  VC2- (the camera-right camera stream), encoding group=EG2,
      content=main, switched=false

   o  VC3- (the loudest panel stream), encoding group=EG1,
      content=main, switched=true

   o  VC4- (the loudest panel stream with PiPs), encoding group=EG1,
      content=main, composed=true, switched=true

   o  VC5- (the zoomed out view of all people in the room), encoding
      group=EG1, content=main, composed=false, switched=false

   o  VC6- (presentation stream), encoding group=EG1, content=slides,
      switched=false

   The following diagram is a top view of the room with 3 cameras, 3
   displays, and 6 seats.  Each camera is capturing 2 people.  The
   six seats are not all in a straight line.

      ,-. D
     (   )`--.__        +---+
      `-' /     `--.__  |   |
    ,-.  |            `-.._ |_-+Camera 2 (VC2)
   (   ).'        ___..-+-''`+-+
    `-' |_...---''      |   |
    ,-.c+-..__          +---+
   (   )|     ``--..__  |   |
    `-' |             ``+-..|_-+Camera 1 (VC1)
    ,-. |            __..--'|+-+
   (   )|     __..--'   |   |
    `-'b|..--'          +---+
    ,-. |``---..___     |   |
   (   )\          ```--..._|_-+Camera 0 (VC0)
    `-'  \             _..-''`-+
     ,-. \      __.--'' |   |
    (   ) |..-''        +---+
     `-' a


   The two points labeled b and c are intended to be at the midpoint
   between the seating positions, and where the fields of view of the
   cameras intersect.

   The plane of interest for VC0 is a vertical plane that intersects
   points 'a' and 'b'.

   The plane of interest for VC1 intersects points 'b' and 'c'. The
   plane of interest for VC2 intersects points 'c' and 'd'.

   This example uses an area scale of millimeters.

   Areas of capture:

       bottom left    bottom right  top left         top right
   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
   VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
   VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   VC6 none

   Points of capture:
   VC0 (-1678,0,800)




   VC1 (0,0,800)
   VC2 (1678,0,800)
   VC3 none
   VC4 none
   VC5 (0,0,800)
   VC6 none

   In this example, the right edge of the VC0 area lines up with the
   left edge of the VC1 area.  It does not have to be this way; there
   could be a gap or an overlap.  Note also that the distance from a
   to b equals the distance from b to c and the distance from c to d.
   All these distances are 1346 mm, which is the planar width of each
   area of capture for VC0, VC1, and VC2.
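   As an illustrative (non-normative) check, the 1346 mm planar width
   can be computed directly from the corner coordinates given above.
   The following Python sketch is explanatory only and is not part of
   the model:

```python
import math

# Bottom-edge endpoints of each area of capture, copied from the
# "Areas of capture" table above (x, y in millimeters; z omitted).
areas = {
    "VC0": ((-2011, 2850), (-673, 3000)),
    "VC1": ((-673, 3000), (673, 3000)),
    "VC2": ((673, 3000), (2011, 2850)),
}

def planar_width(left, right):
    """Straight-line distance along the bottom edge of the area."""
    return math.hypot(right[0] - left[0], right[1] - left[1])

for name, (left, right) in areas.items():
    print(name, round(planar_width(left, right)))  # each prints 1346
```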

   Note that the text in parentheses (e.g. "the camera-left camera
   stream") is not part of the model; it is just explanatory text for
   this example and is not included in the model with the media
   captures and attributes.  Also, the "composed" boolean attribute
   does not say anything about how a capture is composed, so the
   media consumer cannot tell based on this attribute that VC4 is
   composed of a "loudest panel with PiPs".

   Audio Captures:

   o  AC0 (camera-left), encoding group=EG3, content=main, channel
      format=mono

   o  AC1 (camera-right), encoding group=EG3, content=main, channel
      format=mono

   o  AC2 (center), encoding group=EG3, content=main, channel
      format=mono

   o  AC3 (a simple pre-mixed mono audio stream from the room),
      encoding group=EG3, content=main, channel format=mono

   o  AC4 (mono audio stream associated with the presentation video),
      encoding group=EG3, content=slides, channel format=mono

   Areas of capture:

       bottom left    bottom right  top left         top right





   AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
   AC1 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
   AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
   AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
   AC4 none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates that it is not possible to use all the
   VCs at the same time.  VC5 cannot be used at the same time as VC1,
   VC3 or VC4.  Also, using every member in a set simultaneously may
   not make sense - for example VC3 (loudest) and VC4 (loudest with
   PiP).  (In addition, there are encoding constraints that make
   choosing all of the VCs in a set impossible.  VC1, VC3, VC4, VC5
   and VC6 all use EG1, and EG1 has only 3 ENCs.  This constraint
   shows up in the encoding groups, not in the simultaneous
   transmission sets.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.
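   The rule that a selection of video captures is transmittable only
   if it is a subset of at least one simultaneous transmission set
   can be sketched as follows.  This is an explanatory illustration
   only; the function name is invented for this example:

```python
# Simultaneous transmission sets from the example above.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]

def is_simultaneously_valid(chosen):
    """True if the chosen captures are a subset of some set."""
    return any(set(chosen) <= s for s in SIMULTANEOUS_SETS)

# VC0+VC1+VC2 is transmittable, but VC1 and VC5 cannot be combined:
# is_simultaneously_valid(["VC0", "VC1", "VC2"]) is True
# is_simultaneously_valid(["VC1", "VC5"]) is False
```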

   Encoding Groups:

   This example has three encoding groups associated with the video
   captures.  Each group can have up to 3 encodings, with each
   potential encoding having a progressively lower specification.  In
   this example, 1080p60 transmission is possible (as ENC0 has a
   maxMbps value compatible with that) as long as it is the only
   active encoding in the group (as maxMbps for the entire encoding
   group is also 489600).  Significantly, as up to 3 encodings are
   available per group, it is possible to transmit some video
   captures simultaneously that are not in the same entry in the
   capture scene, for example VC1 and VC3 at the same time.

   It is also possible to transmit multiple capture encodings of a
   single video capture.  For example VC0 can be encoded using ENC0
   and ENC1 at the same time, as long as the encoding parameters
   satisfy the constraints of ENC0, ENC1, and EG0, such as one at
   1080p30 and one at 720p30.
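   A consumer's feasibility check against these per-encoding and per-
   group constraints can be sketched as follows.  This is explanatory
   only; the helper names are invented, and maxH264Mbps is
   interpreted as 16x16 macroblocks per second, consistent with the
   values in Figure 2:

```python
# Per-encoding and group limits for EG0, from Figure 2 (explanatory).
ENCODINGS = {
    "ENC0": {"maxH264Mbps": 489600, "maxBandwidth": 4000000},
    "ENC1": {"maxH264Mbps": 108000, "maxBandwidth": 4000000},
}
GROUP = {"maxGroupH264Mbps": 489600, "maxGroupBandwidth": 6000000}

def h264_mbps(width, height, fps):
    # maxH264Mbps counts 16x16 macroblocks per second.
    return (width // 16) * ((height + 15) // 16) * fps

def fits(config):
    """config maps encoding ID -> (width, height, fps, bandwidth)."""
    total_mb = total_bw = 0
    for enc, (w, h, fps, bw) in config.items():
        rate = h264_mbps(w, h, fps)
        if rate > ENCODINGS[enc]["maxH264Mbps"]:
            return False
        if bw > ENCODINGS[enc]["maxBandwidth"]:
            return False
        total_mb += rate
        total_bw += bw
    return (total_mb <= GROUP["maxGroupH264Mbps"]
            and total_bw <= GROUP["maxGroupBandwidth"])

# VC0 as 1080p30 on ENC0 plus 720p30 on ENC1 fits within EG0, since
# 244800 + 108000 <= 489600 and 4 Mbit/s + 2 Mbit/s <= 6 Mbit/s.
```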






   encodeGroupID=EG0, maxGroupH264Mbps=489600,
   maxGroupBandwidth=6000000
       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupH264Mbps=489600,
   maxGroupBandwidth=6000000
       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000
   encodeGroupID=EG2, maxGroupH264Mbps=489600,
   maxGroupBandwidth=6000000
       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxH264Mbps=489600, maxBandwidth=4000000
       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxH264Mbps=108000, maxBandwidth=4000000
       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxH264Mbps=61200, maxBandwidth=4000000

                Figure 2: Example Encoding Groups for Video

   For audio, there are five potential encodings available, so all
   five audio captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
       encodeID=ENC9, maxBandwidth=64000
       encodeID=ENC10, maxBandwidth=64000
       encodeID=ENC11, maxBandwidth=64000
       encodeID=ENC12, maxBandwidth=64000
       encodeID=ENC13, maxBandwidth=64000

                Figure 3: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the capture scenes for this
   provider. Recall that a capture scene is composed of alternative
   capture scene entries covering the same scene.  Capture Scene #1





   is for the main people captures, and Capture Scene #2 is for
   presentation.

   Each row in the table is a separate entry in the capture scene.

                           +------------------+
                           | Capture Scene #1 |
                           +------------------+
                           | VC0, VC1, VC2    |
                           | VC3              |
                           | VC4              |
                           | VC5              |
                           | AC0, AC1, AC2    |
                           | AC3              |
                           +------------------+

                           +------------------+
                           | Capture Scene #2 |
                           +------------------+
                           | VC6              |
                           | AC4              |
                           +------------------+

   Different capture scenes are distinct from each other and non-
   overlapping; a consumer can choose an entry from each capture
   scene.  In this case the three captures VC0, VC1 and VC2 are one
   way of representing the video from the endpoint; these three
   captures should appear adjacent to each other.  Alternatively,
   another way of representing the capture scene is with the single
   capture VC3, which automatically shows the person who is talking.
   Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different entries of audio in Capture
   Scene #1 represent the "same thing", in that one way to receive
   the audio is with the 3 audio captures (AC0, AC1, AC2), and
   another way is with the mixed AC3.  The Media Consumer can choose
   an audio capture entry it is capable of receiving.

   The spatial ordering is conveyed by the media capture attributes
   area of capture and point of capture.
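   For example, a consumer could derive the left-to-right display
   order by sorting captures on the x coordinate of their areas of
   capture.  This is an explanatory sketch, not part of the model:

```python
# Bottom-edge x extents of each area of capture, from the tables
# above (millimeters; camera-left captures have negative x).
area_bottom = {
    "VC0": ((-2011, 2850), (-673, 3000)),
    "VC1": ((-673, 3000), (673, 3000)),
    "VC2": ((673, 3000), (2011, 2850)),
}

def center_x(capture):
    (x1, _), (x2, _) = area_bottom[capture]
    return (x1 + x2) / 2

# Sorting by the center x coordinate yields the display order.
left_to_right = sorted(area_bottom, key=center_x)
# left_to_right == ["VC0", "VC1", "VC2"]
```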

   A Media Consumer would likely want to choose a capture scene entry
   to receive based in part on how many streams it can simultaneously
   receive.  A consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1




   (VC0, VC1, VC2) and not receive the other entries.  A consumer
   that can receive only one people stream would probably choose one
   of the other entries.

   If the consumer can receive a presentation stream too, it would
   also choose to receive the only entry from Capture Scene #2 (VC6).

11.2. Encoding Group Example

   This is an example of an encoding group to illustrate how it can
   express dependencies between encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600,
   maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
   maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
   maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two
   simultaneous 1080p30 capture encodings (maxH264Mbps for 1080p30 is
   244800), and each encoding is capable of a maxFrameRate of 60
   frames per second (fps).  To achieve the maximum resolution (1920
   x 1088) the frame rate is limited to 30 fps.  However 60 fps can
   be achieved at a lower resolution if required by the consumer.
   Although the encoding group is capable of transmitting up to
   6 Mbit/s, no individual video encoding can exceed 4 Mbit/s.
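   The 244800 and 489600 figures follow from macroblock arithmetic (a
   16x16 pixel macroblock grid times the frame rate).  An explanatory
   check:

```python
def h264_mbps(width, height, fps):
    # maxH264Mbps counts 16x16 macroblocks per second; 1088 rows is
    # exactly 68 macroblocks high.
    return (width // 16) * (height // 16) * fps

assert h264_mbps(1920, 1088, 30) == 244800   # one 1080p30 encoding
assert h264_mbps(1920, 1088, 60) == 489600   # exceeds VIDENC0's cap
assert 2 * h264_mbps(1920, 1088, 30) == 489600  # two 1080p30 fit EG0
```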

   This encoding group also allows up to 3 audio encodings, AUDENC<0-
   2>. It is not required that audio and video encodings reside
   within the same encoding group, but if so then the group's overall
   maxBandwidth value is a limit on the sum of all audio and video
   encodings configured by the consumer.  A system that does not wish
   or need to combine bandwidth limitations in this way should
   instead use separate encoding groups for audio and video in order
   for the bandwidth limitations on audio and video to not interact.

   Audio and video can be expressed in separate encoding groups, as
   in this illustration.





   encodeGroupID=EG0, maxGroupH264Mbps=489600,
   maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
   maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
   maxFrameRate=60,
                         maxH264Mbps=244800, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

11.3. The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations that can
   be associated (e.g. lip-synced) with any combination of video
   captures at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer    |
   | VC1, VC2           | video capture for 2 screen consumer         |
   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:






        +------------------+--------------------------------------+
        | Capture Scene #2 | note                                 |
        +------------------+--------------------------------------+
        | VC10             | video capture for presentation       |
        | AC1              | presentation audio to accompany VC10 |
        +------------------+--------------------------------------+

11.4. Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three screen
   endpoint described in the previous section.

   The receive side of a call needs to balance its requirements,
   based on number of screens and speakers, its decoding capabilities
   and available bandwidth, and the provider's capabilities in order
   to optimally configure the provider's streams.  Typically it would
   want to receive and decode media from each capture scene
   advertised by the provider.

   A sane, basic algorithm might be for the consumer to go through
   each capture scene in turn and find the collection of video
   captures that best matches the number of screens it has (this
   might include consideration of screens dedicated to presentation
   video display rather than "people" video) and then decide between
   alternative entries in the video capture scenes based either on
   hard-coded preferences or user choice.  Once this choice has been
   made, the consumer would then decide how to configure the
   provider's encoding groups in order to make best use of the
   available network bandwidth and its own decoding capabilities.
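   Such an algorithm might be sketched as follows.  This is
   explanatory only; the entry-selection policy shown is one
   plausible hard-coded preference, not mandated by this framework:

```python
def choose_entries(capture_scenes, num_screens):
    """For each capture scene (a list of entries, each a list of
    captures), pick the largest entry that does not exceed the
    number of available screens, else fall back to the smallest."""
    chosen = []
    for scene in capture_scenes:
        fitting = [e for e in scene if len(e) <= num_screens]
        if fitting:
            chosen.append(max(fitting, key=len))
        else:
            chosen.append(min(scene, key=len))
    return chosen

# Video entries of Capture Scenes #1 and #2 from section 11.1.
scene1_video = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]
scene2_video = [["VC6"]]
# A 3-screen consumer picks the 3-capture entry plus the
# presentation entry:
# choose_entries([scene1_video, scene2_video], 3)
#   == [["VC0", "VC1", "VC2"], ["VC6"]]
```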

11.4.1. One screen consumer

   VC3, VC4 and VC5 are all different entries by themselves, not
   grouped together in a single entry, so the receiving device should
   choose between one of those.  The choice would come down to
   whether to see the greatest number of participants simultaneously
   at roughly equal precedence (VC5), a switched view of just the
   loudest region (VC3) or a switched view with PiPs (VC4).  An
   endpoint device with a small amount of knowledge of these
   differences could offer a dynamic choice of these options, in-
   call, to the user.







11.4.2. Two screen consumer configuring the example

   Mixing systems with an even number of screens, "2n", and those
   with "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed
   per screen or whether more than 2 streams can be received and
   spread across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as
      per the 1 screen consumer case above) and either leave one
      screen blank or use it for presentation if / when a
      presentation becomes active.

   2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2
      screens, either with each capture scaled to 2/3 of a screen and
      the centre capture split across the 2 screens or, as would be
      necessary if there were large bezels on the screens, with each
      stream scaled to 1/2 the screen width and height and a 4th
      "blank" panel.  This 4th panel could potentially be used for
      any presentation that becomes active during the call.

   3. Receive 3 streams, decode all 3, and use control information
      indicating which was the most active to switch between showing
      the left and centre streams (one per screen) and the centre and
      right streams.

   For an endpoint capable of all 3 methods of working described
   above, again it might be appropriate to offer the user the choice
   of display mode.

11.4.3. Three screen consumer configuring the example

   This is the most straightforward case - the consumer would look to
   identify a set of streams to receive that best matches its
   available screens, and so VC0 plus VC1 plus VC2 should match
   optimally.  The spatial ordering gives sufficient information for
   the correct video capture to be shown on the correct screen, and
   the consumer would either need to divide a single encoding group's
   capability by 3 to determine what resolution and frame rate to
   configure the provider with, or to configure the individual




   video captures' encoding groups with what makes most sense (taking
   into account the receive-side decode capabilities, overall call
   bandwidth, the resolution of the screens, plus any user
   preferences such as motion vs. sharpness).
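   As an explanatory sketch of the divide-by-3 arithmetic (assuming,
   hypothetically, that all three captures shared one encoding group
   with maxGroupH264Mbps=489600):

```python
def h264_mbps(width, height, fps):
    # maxH264Mbps counts 16x16 macroblocks per second.
    return (width // 16) * (height // 16) * fps

group_limit = 489600            # maxGroupH264Mbps for the group
per_stream = group_limit // 3   # 163200 macroblocks/s per capture

# 1080p30 does not fit three-way within one group, but 720p30 does:
# h264_mbps(1920, 1088, 30) == 244800 > 163200
# h264_mbps(1280, 720, 30) == 108000 <= 163200
```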

12. Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to
   thank Stephen Botzko for helpful discussions on audio.

13. IANA Considerations

   TBD

14. Security Considerations

   TBD

15. Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 06 to 07:

   1. Ticket #9.  Rename Axis of Capture Point attribute to Point on
      Line of Capture.  Clarify the description of this attribute.

   2. Ticket #17.  Add "capture encoding" definition.  Use this new
      term throughout document as appropriate, replacing some usage
      of the terms "stream" and "encoding".

   3. Ticket #18.  Add Max Capture Encodings media capture attribute.

   4. Add clarification that different capture scene entries are not
      necessarily mutually exclusive.

   Changes from 05 to 06:

   1. Capture scene description attribute is a list of text strings,
      each in a different language, rather than just a single string.

   2. Add new Axis of Capture Point attribute.

   3. Remove appendices A.1 through A.6.




   4. Clarify that the provider must use the same coordinate system
      with same scale and origin for all coordinates within the same
      capture scene.

   Changes from 04 to 05:

   1. Clarify limitations of "composed" attribute.

   2. Add new section "capture scene entry attributes" and add the
      attribute "scene-switch-policy".

   3. Add capture scene description attribute and description
      language attribute.

   4. Editorial changes to examples section for consistency with the
      rest of the document.

   Changes from 03 to 04:

   1. Remove sentence from overview - "This constitutes a significant
      change ..."

   2. Clarify a consumer can choose a subset of captures from a
      capture scene entry or a simultaneous set (in section "capture
      scene" and "consumer's choice...").

   3. Reword first paragraph of Media Capture Attributes section.

   4. Clarify a stereo audio capture is different from two mono audio
      captures (description of audio channel format attribute).

   5. Clarify what it means when coordinate information is not
      specified for area of capture, point of capture, area of scene.

   6. Change the term "producer" to "provider" to be consistent (it
      was just in two places).

   7. Change name of "purpose" attribute to "content" and refer to
      RFC4796 for values.

   8. Clarify simultaneous sets are part of a provider advertisement,
      and apply across all capture scenes in the advertisement.

   9. Remove sentence about lip-sync between all media captures in a
      capture scene.




   10.   Combine the concepts of "capture scene" and "capture set"
      into a single concept, using the term "capture scene" to
      replace the previous term "capture set", and eliminating the
      original separate capture scene concept.

   Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session
              Description Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies",
              RFC 5117, January 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [IANA-Lan] IANA, "Language Subtag Registry",
              <http://www.iana.org/assignments/
              language-subtag-registry>.









16. Authors' Addresses

   Mark Duckworth (editor)
   Polycom
   Andover, MA  01810
   USA

   Email: mark.duckworth@polycom.com



   Andrew Pepperell
   Silverflare
   Uxbridge, England
   UK

   Email: apeppere@gmail.com



   Stephan Wenger
   Vidyo, Inc.
   433 Hackensack Ave.
   Hackensack, N.J. 07601
   USA

   Email: stewe@stewe.org



















