CLUE WG                                                        C. Groves
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: March 14, 2013                                           Huawei
                                                      September 10, 2012

                     CLUE media capture description


Abstract

   This memo discusses how media captures are described, in particular
   the content attribute in the current CLUE framework document, and
   proposes several alternatives.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 14, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Groves, et al.           Expires March 14, 2013                 [Page 1]

Internet-Draft           New capture attributes           September 2012

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  4
   3.  Issues with Content attribute  . . . . . . . . . . . . . . . .  4
     3.1.  Ambiguous definition . . . . . . . . . . . . . . . . . . .  4
     3.2.  Multiple functions . . . . . . . . . . . . . . . . . . . .  5
     3.3.  Limited Stream Support . . . . . . . . . . . . . . . . . .  5
     3.4.  Insufficient information for individual parameters . . . .  5
     3.5.  Insufficient information for negotiation . . . . . . . . .  5
   4.  Capture description attributes . . . . . . . . . . . . . . . .  6
     4.1.  Presentation . . . . . . . . . . . . . . . . . . . . . . .  7
     4.2.  View . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
     4.3.  Language . . . . . . . . . . . . . . . . . . . . . . . . .  8
     4.4.  Role . . . . . . . . . . . . . . . . . . . . . . . . . . .  8
     4.5.  Priority . . . . . . . . . . . . . . . . . . . . . . . . .  9
     4.6.  Others . . . . . . . . . . . . . . . . . . . . . . . . . .  9
       4.6.1.  Dynamic  . . . . . . . . . . . . . . . . . . . . . . .  9
       4.6.2.  Embedded Text  . . . . . . . . . . . . . . . . . . . . 10
       4.6.3.  Supplementary Description  . . . . . . . . . . . . . . 10
       4.6.4.  Telepresence . . . . . . . . . . . . . . . . . . . . . 11
   5.  Summary  . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   6.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11
   7.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 11
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 12
     9.1.  Normative References . . . . . . . . . . . . . . . . . . . 12
     9.2.  Informative References . . . . . . . . . . . . . . . . . . 12
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 12

1.  Introduction

   One of the fundamental aspects of the CLUE framework is the concept
   of media captures.  Media captures are sent from a provider to a
   consumer.  The consumer then selects which captures it is interested
   in and replies back to the provider.  The question is: how does the
   consumer choose between what may be many different media captures?

   In order to choose between the different media captures, the
   consumer must have enough information regarding what each media
   capture represents so as to distinguish between the media captures.

   The CLUE framework draft currently defines several media capture
   attributes which provide information regarding the capture.  The
   draft indicates that: "Media Capture Attributes describe static
   information about the captures.  A provider uses the media capture
   attributes to describe the media captures to the consumer.  The
   consumer will select the captures it wants to receive.  Attributes
   are defined by a variable and its value."

   One of the media capture attributes is the content attribute.  As
   indicated in the draft: "it is a field with enumerated values which
   describes the role of the media capture and can be applied to any
   media type.  The enumerated values are defined by [RFC4796].  The
   values for this attribute are the same as the mediacnt values for the
   content attribute in [RFC4796].  This attribute can have multiple
   values, for example content={main, speaker}."

   [RFC4796] defines the values as:

   o  slides: the media stream includes presentation slides.  The media
      type can be, for example, a video stream or a number of instant
      messages with pictures.  Typical use cases for this are online
      seminars and courses.  This is similar to the 'presentation' role
      in H.239.

   o  speaker: the media stream contains the image of the speaker.  The
      media can be, for example, a video stream or a still image.
      Typical use cases for this are online seminars and courses.

   o  sl: the media stream contains sign language.  A typical use case
      for this is an audio stream that is translated into sign language,
      which is sent over a video stream.

   o  main: the media stream is taken from the main source.  A typical
      use case for this is a concert where the camera is shooting the
      performer.

   o  alt: the media stream is taken from the alternative source.  A
      typical use case for this is an event where the ambient sound is
      separated from the main sound.  The alternative audio stream could
      be, for example, the sound of a jungle.  Another example is the
      video of a conference room, while the main stream carries the
      video of the speaker.  This is similar to the 'live' role in
      H.239.

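   As an illustration of the current mechanism, the following sketch
   shows how a consumer might extract these values from an SDP media
   description.  The Python code and the SDP fragment are illustrative
   only and are not part of any proposal in this memo.

```python
# Sketch: extract RFC 4796 "content" values from SDP media sections.
# The SDP fragment below is illustrative only.
def parse_content_values(sdp):
    """Return the content-value lists for each a=content: line, in order."""
    prefix = "a=content:"
    values = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # RFC 4796 allows multiple comma-separated values.
            values.append([v.strip() for v in line[len(prefix):].split(",")])
    return values

example_sdp = """\
m=video 49170 RTP/AVP 96
a=content:main,speaker
m=video 49172 RTP/AVP 96
a=content:slides
"""

print(parse_content_values(example_sdp))
```

   Note how a single stream may carry several values at once (e.g.
   "main,speaker"), which is part of the ambiguity discussed below.
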
   Whilst the above values appear to be a simple way of conveying the
   content of a stream, the contributors believe that there are multiple
   issues that make the use of the existing "Content" tag insufficient
   for CLUE and multi-stream telepresence systems.  These issues are
   described in Section 3.  Section 4 proposes new capture description
   attributes.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Issues with Content attribute

3.1.  Ambiguous definition

   There is ambiguity in the definitions that may cause problems for
   interoperability.  A clear example is "slides", which could be any
   form of presentation media.  Another example is the difference
   between "main" and "alt".  In a telepresence scenario the room would
   be captured by the "main cameras" and a speaker would be captured by
   an alternative "camera".  This runs counter to the definitions of
   "main" and "alt" above.

   Another example is a university use case where:

   The main site is a university auditorium which is equipped with three
   cameras.  One camera is focused on the professor at the podium.  A
   second camera is mounted on the wall behind the professor and
   captures the class in its entirety.  The third camera is co-located
   with the second, and is designed to capture a close up view of a
   questioner in the audience.  It automatically zooms in on that
   student using sound localization.

   For the first camera, it is not clear whether to use "main" or
   "speaker".  According to the definition and example of "speaker" in
   RFC 4796, it may be more appropriate to use "speaker" here.  For the
   third camera it could fit the definition of "main" or "alt" or
   "speaker".

3.2.  Multiple functions

   It appears that the definitions cover disparate functions.  "Main"
   and "alt" appear to describe the source from which media is sent.
   "Speaker" indicates a role associated with the media stream.
   "Slides" and "sl" (sign language) indicate the actual content.  In
   addition, some prioritization is indirectly applied to these values.
   For example, the IMTC document on best practices for H.239 indicates
   a display priority between "main" and "alt".  This mixing of
   functions per code point can lead to ambiguous behaviour and
   interoperability problems.  It is also an issue when extending the
   values.

3.3.  Limited Stream Support

   The values above appear to be defined based on the small number of
   video streams that are typically supported by legacy video
   conferencing, e.g. a main video stream (main), a secondary one (alt)
   and perhaps a presentation stream (slides).  It is not clear how
   these values scale when many media streams are present.  For example,
   if there are several main streams and several presentation streams,
   how would an endpoint distinguish between them?

3.4.  Insufficient information for individual parameters

   Related to the above point is that some individual values do not
   provide sufficient information for an endpoint to make an educated
   decision on the content.  For example, sign language (sl): if a
   conference provides multiple streams, each one containing a sign
   interpretation in a different sign language, how does an endpoint
   distinguish between the languages if "sl" is the only label?  Also,
   for accessible services, other functions such as real-time captioning
   and video description (where an additional audio channel is used to
   describe the conference for vision-impaired people) should be
   considered.

   Note: SDP provides a language attribute.

3.5.  Insufficient information for negotiation

   CLUE negotiation is likely to occur at the start of session
   initiation.  At this point in time only a very simple set of SDP
   (i.e. a limited media description) may be available (depending on the
   call flow).  In most cases the supported media captures may be agreed
   upon before the full SDP information for each media stream is
   available.  The effect of
   this is that detailed information would not be available for the
   initial decision about which capture to choose.  The obvious solution
   is to provide "enough" data in the CLUE provider messages so that a
   consumer can choose the appropriate media captures.  The current CLUE
   framework already partly addresses this through the "Content"
   attribute however based on the current "Content" values it appears
   that the information is not sufficient to fully describe the content
   of the captures.

   The purpose of the CLUE work is to supply enough information for
   negotiating multiple streams.  CLUE framework
   [I-D.ietf-clue-framework] addresses the spatial relation between the
   streams, but it appears that it does not provide enough information
   about the semantic content of the streams to allow interoperability.

   Some information is available in SDP and may be available before the
   CLUE exchange, but some information is still missing.

4.  Capture description attributes

   As indicated above, it is proposed to introduce new attributes that
   allow the definition of various pieces of information providing
   metadata about a particular media capture.  Each piece of information
   should be described in a way that it supplies only one atomic
   function.  It should also be applicable in a multi-stream
   environment, and extensible to allow new information elements to be
   introduced in the future.

   As an initial list the following attributes are proposed for use as
   metadata associated with media captures.  Further attributes may be
   identified in the future.

   This document proposes to remove the "Content" attribute.  Rather
   than describing the "source device" in this way it may be better to
   describe its characteristics, i.e.:

   o  An attribute to indicate "Presentation" rather than the value
      "slides".

   o  An attribute to describe the "Role" of a capture rather than the
      value "Speaker".

   o  An attribute to indicate the actual language used rather than a
      value "Sign Language".  This is also applicable to multiple audio
      captures in different languages.

   o  With respect to "main" and "alt" in a multiple stream environment
      it's not clear these values are needed if the characteristics of
      the capture are described.  An assumption may be that a capture is
      "main" unless described otherwise.
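   To illustrate the direction of this proposal, the following sketch
   models a capture carrying the atomic attributes discussed in this
   section instead of a single "Content" tag.  The attribute names
   follow this section; the field types and the example captures are
   assumptions for illustration only.

```python
# Sketch: a media capture described by atomic attributes rather than
# a single "Content" tag.  Field names follow this section; types and
# the example captures are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MediaCapture:
    capture_id: str
    media_type: str                     # "audio", "video", "text", ...
    presentation: Optional[str] = None  # e.g. "presentation.slides"
    view: Optional[str] = None          # e.g. "room", "lectern"
    language: Optional[str] = None      # RFC 5646 subtag, e.g. "en"
    role: Optional[str] = None          # e.g. "speaker", "chairman"
    priority: Optional[int] = None      # relative priority, if assigned

# A capture with no explicit marking can be assumed to be "main".
room = MediaCapture("VC1", "video", view="room")
slides = MediaCapture("VC2", "video", presentation="presentation.slides")

print(room.view, slides.presentation)
```

   Each field supplies one atomic function, so the combinations that a
   single "Content" code point had to cover are expressed separately.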

   Note: CLUE may have missed a media type "text".  What about real-time
   captioning or a real-time text conversation associated with a video
   meeting?  It is a text-based service.  It is not necessarily a
   presentation stream.  It is not audio or visual, but it is a valid
   component of a conference.

   The sections below contain an initial list of attributes.

4.1.  Presentation

   This attribute indicates that the capture originates from a
   presentation device, that is one that provides supplementary
   information to a conference through slides, video, still images, data
   etc.  Where more information is known about the capture it may be
   expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

4.2.  View

   The "Area of capture" attribute provides a physical indication of the
   region that a media capture captures.  However, the consumer does
   not know what this physical region relates to.  In discussions on the
   IETF mailing list it is apparent that some people propose to use the
   "Description" attribute to describe a scene.  This is a free-text
   field and as such can be used to signal any piece of information.
   This leads to problems with interoperability if this field is
   automatically processed.  For interoperability purposes it is
   proposed to introduce a set of keywords that could be used as a basis
   for the selection of captures.  It is envisaged that this list would
   be extensible to allow for future uses not covered by the initial
   specification.  Therefore it is proposed to introduce a number of
   keywords (that may be expanded) indicating what the spatial region
   relates to, i.e. room, table, etc.  This is an initial description
   of an attribute introducing these keywords.

   This attribute provides a textual description of the area that a
   media capture captures.  This provides supplementary information in
   addition to the spatial information (i.e. area of capture) regarding
   the region that is captured.

   Room - Captures the entire scene.

   Table - Captures the conference table with seated participants.

   Individual - Captures an individual participant.

   Lectern - Captures the region of the lectern including the presenter
   in a classroom-style conference.

   Audience - Captures a region showing the audience in a
   classroom-style conference.

   Others - TBD
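   As an illustration, a consumer could use these keywords directly for
   automated capture selection, which a free-text "Description" field
   does not permit.  The sketch below is illustrative only; the capture
   identifiers are hypothetical.

```python
# Sketch: filtering captures by the proposed "view" keywords.
# The keyword set follows Section 4.2; capture ids are hypothetical.
VIEW_KEYWORDS = {"room", "table", "individual", "lectern", "audience"}

def captures_with_view(captures, wanted):
    """Return the captures whose view attribute matches 'wanted'."""
    if wanted not in VIEW_KEYWORDS:
        raise ValueError(f"unknown view keyword: {wanted}")
    return [c for c in captures if c.get("view") == wanted]

captures = [
    {"id": "VC1", "view": "room"},
    {"id": "VC2", "view": "lectern"},
    {"id": "VC3", "view": "audience"},
]

print([c["id"] for c in captures_with_view(captures, "lectern")])
```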

4.3.  Language

   As indicated in the discussion in Section 3.4, captures may be
   offered in different languages in the case of multi-lingual and/or
   accessible conferences.  It is important to allow the remote end to
   distinguish between them.  It is noted that SDP already contains a
   language attribute; however, this may not be available at the time
   that an initial CLUE message is sent.  Therefore a language attribute
   is proposed for CLUE.

   This indicates which language is associated with the capture.  For
   example: it may provide a language associated with an audio capture
   or a language associated with a video capture when sign
   interpretation or text is used.  The possible values for a language
   tag are the values of the 'Subtag' column for the "Type: language"
   entries in the "Language Subtag Registry" defined in [RFC5646].
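   As an illustration, a consumer with an ordered language preference
   list could select a capture by matching these subtags.  The sketch
   below is illustrative only; the capture identifiers are hypothetical.

```python
# Sketch: selecting a capture by its RFC 5646 language subtag.
# Capture identifiers are hypothetical.
def pick_by_language(captures, preferred):
    """Return the first capture matching the ordered preference list."""
    for lang in preferred:
        for c in captures:
            if c.get("language") == lang:
                return c
    return None

captures = [
    {"id": "AC1", "language": "en"},   # English audio
    {"id": "AC2", "language": "fr"},   # French audio
    {"id": "VC3", "language": "ase"},  # American Sign Language video
]

# German is unavailable, so the French capture is chosen.
print(pick_by_language(captures, ["de", "fr"])["id"])
```

   The same mechanism distinguishes between several sign-language
   captures, which the bare "sl" label of [RFC4796] cannot do.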

4.4.  Role

   The original definition of "Content" allows the indication that a
   particular media stream is related to the speaker.  CLUE should also
   allow this identification for captures.  In addition, with the advent
   of XCON there may be other formal roles that may be associated with
   media/captures.  For instance, a remote end may like to always view
   the floor controller.  It is envisaged that a remote end may also
   choose captures depending on the role of the person(s) captured.  For
   example, the people at the remote end may wish to always view the
   chairman.  This attribute indicates that the capture is associated
   with an entity that has a particular role in the conference.  The
   values are:

   Speaker - indicates that the capture relates to the current speaker.

   Floor - indicates that the capture relates to the current floor
   controller of the conference.

   Chairman - indicates that the capture relates to the chairman of the
   meeting.

   Others - TBD

4.5.  Priority

   As has been highlighted in discussions on the CLUE mailing list,
   there appears to be some desire to provide a relative priority
   between captures when multiple alternatives are supplied.  This
   priority can be used to determine which captures contain the most
   important information (according to the provider).  This may be
   important in cases where the consumer has limited resources and can
   only render a subset of the captures.  Priority may also be
   advantageous in congestion scenarios, where media from one capture
   may be favoured over other captures in any control algorithms.  This
   could be supplied via "ordering" in a CLUE data structure; however,
   this may be problematic if people assume some spatial meaning behind
   the ordering.  For example, given three captures VC1, VC2, VC3, it
   would be natural to send VC1, VC2, VC3 if the images are composed
   this way.  However, if your boss sits in the middle view, the
   priority may be VC2, VC1, VC3.  Explicit signalling is better.

   Additionally, there are currently no hints as to the relative
   priority among captures from different capture scenes.  In order to
   prevent any misunderstanding with implicit ordering, a numeric
   priority may be assigned to each capture.

   The "priority" attribute indicates a relative priority between
   captures.  For example, it is possible to assign a priority between
   two presentation captures that would allow a remote endpoint to
   determine which presentation is more important.  Priority is assigned
   at the individual capture level.  It represents the provider's view
   of the relative priority between the captures that have been assigned
   a priority.  The same priority number may be used across multiple
   captures, indicating that they are equally important.  If no priority
   is assigned, no assumptions regarding the relative importance of the
   capture can be made.
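   As an illustration, explicit priorities allow a consumer to order
   captures without inferring meaning from the ordering of a data
   structure.  The sketch below assumes that a lower number means a
   higher priority; this memo does not fix the polarity.

```python
# Sketch: ordering captures by an explicit "priority" attribute.
# Assumption: lower number = higher priority; captures without a
# priority sort last, and equal numbers mean equal importance.
def by_priority(captures):
    unassigned = float("inf")
    return sorted(captures, key=lambda c: c.get("priority", unassigned))

captures = [
    {"id": "VC1", "priority": 2},
    {"id": "VC2", "priority": 1},  # e.g. the boss in the middle seat
    {"id": "VC3"},                 # no priority assigned
]

print([c["id"] for c in by_priority(captures)])
```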

4.6.  Others

4.6.1.  Dynamic

   In the framework it has been assumed that the capture point is a
   fixed point within a telepresence session.  However depending on the
   conference scenario this may not be the case.  In tele-medical or
   tele-education cases a conference may include cameras that move
   during the conference.  For example: a camera may be placed at
   different positions in order to provide the best angle to capture a
   work task, or a conference may include a camera worn by a
   participant.  This would have the effect of changing the capture
   point, capture axis and area of capture.  In order that the remote
   endpoint can choose to layout/render the capture appropriately, an
   indication of whether the camera is dynamic should be included in the
   initial capture description.

   This attribute indicates that the spatial information related to the
   capture may be dynamic and change through the conference.  Thus
   captures may be characterised as static, dynamic or highly dynamic.
   The capture point of a static capture does not move for the life of
   the conference.  The capture point of a dynamic capture is
   characterised by a change in position followed by a reasonable period
   of stability.  Highly dynamic captures are characterised by a capture
   point that is constantly moving.  This may assist an endpoint in
   determining the correct display layout.  If the "area of capture",
   "capture point" and "line of capture" attributes are included with
   dynamic or highly dynamic captures, they indicate the spatial
   information at the time a CLUE message is sent.  No information
   regarding future spatial information should be assumed.
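   As an illustration, an endpoint could map the three categories to
   layout behaviour as sketched below.  The rendering hints are
   assumptions for illustration; they are not defined by this memo.

```python
# Sketch: mapping the static/dynamic/highly dynamic categories to
# layout behaviour.  The hint strings are illustrative assumptions.
def layout_hint(category):
    hints = {
        "static": "fixed placement; spatial info valid for the session",
        "dynamic": "re-check spatial info after each CLUE update",
        "highly dynamic": "treat spatial info as a momentary snapshot",
    }
    if category not in hints:
        raise ValueError(f"unknown category: {category}")
    return hints[category]

print(layout_hint("dynamic"))
```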

4.6.2.  Embedded Text

   In accessible conferences textual information may be added to a
   capture before it is transmitted to the remote end.  In the case
   where multiple video captures are presented the remote end may
   benefit from the ability to choose a video stream containing text
   over one that does not.

   This attribute indicates that a capture provides embedded textual
   information.  For example the video capture may contain speech to
   text information composed with the video image.  This attribute is
   only applicable to video captures and presentation streams with
   visual information.

4.6.3.  Supplementary Description

   Some conferences utilise translators or facilitators that provide an
   additional audio stream (i.e. a translation or description of the
   conference).  These persons may not be pictured in a video capture.
   Where multiple audio captures are presented it may be advantageous
   for an endpoint to select a supplementary stream instead of, or in
   addition to, an audio feed associated with the participants from a
   main video capture.  Therefore an attribute is proposed for this.
   Depending on the results of the discussion of the source device this
   parameter may be another value for the source.

   This indicates that a capture provides an additional description of
   the conference.  For example, an additional audio stream that
   provides a
   commentary on the conference, providing supplementary information
   (e.g. a translation) or extra information to participants in
   accessible conferences.

4.6.4.  Telepresence

   In certain use case scenarios it is important to maintain a feeling
   of "telepresence" associated with captures when they are played at
   the remote end.  For example, in medical use cases it is important to
   maintain the colour of images.  It is important to note that CLUE is
   used to describe multi-stream conferences.  These may or may not be
   "telepresence" conferences.  Alternatively, it could be assumed that
   all captures possess this attribute and that the only captures not
   subject to processing to create "telepresence" are those marked with
   "presentation".  We did discuss the aspect of how an endpoint
   determines whether a capture relates to a computer-generated image or
   a real environment.  An endpoint may apply different image processing
   depending on the source, i.e. it may or may not apply image
   processing to adjust lighting levels for a telepresence experience.

   This parameter indicates that "telepresence" should be associated
   with the capture, i.e. real-world environmental conditions are
   associated with this capture.  Lighting, spatial and timing
   information are important aspects of the telepresence session.  The
   remote end should apply the appropriate capture processing to
   maintain the integrity of this information.  For example, the colour
   information associated with the original capture is important and
   should be replicated when displayed/played.

5.  Summary

   The main proposal is to remove the "Content" attribute in favour of
   describing the characteristics of captures in a more functional
   (atomic) way, using the above attributes as the metadata describing a
   capture.

6.  Acknowledgements

   place holder

7.  IANA Considerations

8.  Security Considerations


9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
              "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-06 (work in progress),
              July 2012.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

Authors' Addresses

   Christian Groves


   Weiwei Yang
   P.R. China


   Roni Even
   Tel Aviv,

