CLUE                                                      C. Groves, Ed.
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: August 22, 2013                                          Huawei
                                                       February 18, 2013

                     CLUE media capture description


   This memo discusses how media captures are described and in
   particular the content attribute in the current CLUE framework
   document and proposes several alternatives.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   ( in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Groves, et al.           Expires August 22, 2013                [Page 1]

Internet-Draft       CLUE media capture description        February 2013

1.  Introduction

   One of the fundamental aspects of the CLUE framework is the concept
   of media captures.  The media captures are sent from a provider to a
   consumer.  This consumer then selects which captures it is interested
   in and replies back to the consumer.  The question is how does the
   consumer choose between what may be many different media captures?

   In order to be able to choose between the different media captures
   the consumer must have enough information regarding what the media
   capture represents and to distinguish between the media captures.

   The CLUE framework draft currently defines several media capture
   attributes which provide information regarding the capture.  The
   draft indicates that Media Capture Attributes describe static
   information about the captures.  A provider uses the media capture
   attributes to describe the media captures to the consumer.  The
   consumer will select the captures it wants to receive.  Attributes
   are defined by a variable and its value.

   One of the media capture attributes is the content attribute.  As
   indicated in the draft it is a field with enumerated values which
   describes the role of the media capture and can be applied to any
   media type.  The enumerated values are defined by RFC 4796 [RFC4796].
   The values for this attribute are the same as the mediacnt values for
   the content attribute in RFC 4796 [RFC4796].  This attribute can have
   multiple values, for example content={main, speaker}.

   RFC 4796 [RFC4796] defines the values as:

        slides: the media stream includes presentation slides.  The
        media type can be, for example, a video stream or a number of
        instant messages with pictures.  Typical use cases for this are
        online seminars and courses.  This is similar to the
        'presentation' role in H.239.

        speaker: the media stream contains the image of the speaker.
        The media can be, for example, a video stream or a still image.
        Typical use cases for this are online seminars and courses.

        sl: the media stream contains sign language.  A typical use case
        for this is an audio stream that is translated into sign
        language, which is sent over a video stream.

   Whilst the above values appear to be a simple way of conveying the
   content of a stream the Contributors believe that there are multiple
   issues that make the use of the existing "Content" tag insufficient
   for CLUE and multi-stream telepresence systems.  These issues are

Groves, et al.           Expires August 22, 2013                [Page 2]

Internet-Draft       CLUE media capture description        February 2013

   described in section 3.  Section 4 proposes new capture description

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   this document are to be interpreted as described in RFC 2119

   This document draws liberally from the terminology defined in the
   CLUE Framework [I-D.ietf-clue-framework]

3.  Issues with Content attribute

3.1.  Ambiguous definition

   *There is ambiguity in the definitions that may cause problems for
   interoperability.  A clear example is "slides" which could be any
   form of presentation media.  Another example is the difference
   between "main" and "alt".  In a telepresence scenario the room would
   be captured by the "main cameras" and a speaker would be captured by
   an alternative "camera".  This runs counter with the definition of

   Another example is a university use case where:

   The main site is a university auditorium which is equipped with three
   cameras.  One camera is focused on the professor at the podium.  A
   second camera is mounted on the wall behind the professor and
   captures the class in its entirety.  The third camera is co-located
   with the second, and is designed to capture a close up view of a
   questioner in the audience.  It automatically zooms in on that
   student using sound localization.

   For the first camera, it's not clear whether to use "main" or
   "speaker".  According to the definition and example of "speaker" in
   RFC 4796 [RFC4796], maybe it's more proper to use "speaker" here?
   For the third camera it could fit the definition of "main" or "alt"

3.2.  Multiple functions

   It appears that the definitions cover disparate functions.  "Main"
   and "alt" appear to describe the source from which media is sent.
   "Speaker" indicates a role associated with the media stream.

Groves, et al.           Expires August 22, 2013                [Page 3]

Internet-Draft       CLUE media capture description        February 2013

   "Slides" and "Sign Language" indicates the actual content.  Also
   indirectly some prioritization is applied to these parameters.  For
   example: the IMTC document on best practices for H.239 indicates a
   display priority between "main" and "alt".  This mixing of functions
   per code point can lead to ambiguous behavior and interoperability
   problems.  It also is an issue when extending the values.

3.3.  Limited Stream Support

   The values above appear to be defined based on a small number of
   video streams that are typically supported by legacy video
   conferencing.  E.g. a main video stream (main), a secondary one (alt)
   and perhaps a presentation stream (slides).  It is not clear how this
   value scales when many media streams are present.  For example if you
   have several main streams and several presentation streams how would
   an endpoint distinguish between them?

3.4.  Insufficient information for individual parameters

   Related to the above point is that some individual values do not
   provide sufficient information for an endpoint to make an educated
   decision on the content.  For example: Sign language (sl) - If a
   conference provides multiple streams each one containing a sign
   interpretation in a different sign language how does an endpoint
   distinguish between the languages if "sl" is the only label?  Also
   for accessible services other functions such a real time captioning
   and video description where an additional audio channel is used to
   describe the conference for vision impaired people should be

   Note: SDP provide a language attribute.

3.5.  Insufficient information for negotiation

   CLUE negotiation is likely to be at the start of a session
   initiation.  At this point of time only a very simple set of SDP
   (i.e. limited media description) may be available (depending on call
   flow).  In most cases the supported media captures may be agreed upon
   before the full SDP information for each media stream.  The effect of
   this is that detailed information would not be available for the
   initial decision about which capture to choose.  The obvious solution
   is to provide "enough" data in the CLUE provider messages so that a
   consumer can choose the appropriate media captures.  The current CLUE
   framework already partly addresses this through the "Content"
   attribute however based on the current "Content" values it appears
   that the information is not sufficient to fully describe the content
   of the captures.

Groves, et al.           Expires August 22, 2013                [Page 4]

Internet-Draft       CLUE media capture description        February 2013

   The purpose of the CLUE work is to supply enough information for
   negotiating multiple streams.  CLUE framework [I-D.ietf-clue-
   framework] addresses the spatial relation between the streams but it
   looks like it does not provide enough information about the semantic
   content of the stream to allow interoperability.

   Some information is available in SDP and may be available before the
   CLUE exchange but there are still some information missing.

4.  Capture description attributes

   As indicated above it is proposed to introduce a new attribute/s that
   allows the definition of various pieces of information that provide
   metadata about a particular media capture.  This information should
   be described in a way that it only supplies one atomic function.  It
   should also be applicable in a multi-stream environment.  It should
   also be extensible to allow new information elements to be introduced
   in the future.

   As an initial list the following attributes are proposed for use as
   metadata associated with media captures.  Further attributes may be
   identified in the future.

   This document propose to remove the "Content" attribute.  Rather than
   describing the "source device" in this way it may be better to
   describe its characteristics. i.e.

        An attribute to indicate "Presentation" rather than the value

        An attribute to describe the "Role" of a capture rather than the
        value "Speaker".

        An attribute to indicate the actual language used rather than a
        value "Sign Language".  This is also applicable to multiple
        audio streams.

        With respect to "main" and "alt" in a multiple stream
        environment it's not clear these values are needed if the
        characteristics of the capture are described.  An assumption may
        be that a capture is "main" unless described otherwise.

   Note: CLUE may have missed a media type "text".  How about a real
   time captioning or a real time text conversation associated with a
   video meeting?  It's a text based service.  It's not necessarily a
   presentation stream.  It's not audio or visual but a valid component
   of a conference.

Groves, et al.           Expires August 22, 2013                [Page 5]

Internet-Draft       CLUE media capture description        February 2013

   The sections below contain an initial list of attributes.

4.1.  Presentation

   This attribute indicates that the capture originates from a
   presentation device, that is one that provides supplementary
   information to a conference through slides, video, still images, data
   etc.  Where more information is known about the capture it may be
   expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

4.2.  View

   The Area of capture attribute provides a physical indication of a
   region that the media capture captures.  However the consumer does
   not know what this physical region relates to.  In discussions on the
   IETF mailing list it is apparent that some people propose to use the
   "Description" attribute to describe a scene.  This is a free text
   field and as such can be used to signal any piece of information.
   This leads to problems with interoperability if this field is
   automatically processed.  For interoperability purposes it is
   proposed to introduce a set of keywords that could be used as a basis
   for the selection of captures.  It is envisaged that this list would
   be extendable to allow for future uses not covered by the initial
   specification.  Therefore it is proposed to introduce a number of
   keywords (that may be expanded) indicating what the spatial region
   relates to?  I.e.  Room, table, etc. this is an initial description
   of an attribute introducing these keywords.

   This attribute provides a textual description of the area that a
   media capture captures.  This provides supplementary information in
   addition to the spatial information (i.e. area of capture) regarding
   the region that is captured.

        Room - Captures the entire scene.

        Table - Captures the conference table with seated participants

        Individual - Captures an individual participant

        Lectern - Captures the region of the lectern including the
        presenter in classroom style conference

        Audience - Captures a region showing the audience in a classroom
        style conference.

Groves, et al.           Expires August 22, 2013                [Page 6]

Internet-Draft       CLUE media capture description        February 2013

        Others - TBD

4.3.  Language

   Captures may be offered in different languages in case of multi-
   lingual and/or accessible conferences.  It is important to allow the
   remote end to distinguish between them.  It is noted that SDP already
   contains a language attribute however this may not be available at
   the time that an initial CLUE message is sent.  Therefore a language
   attribute is needed in CLUE to indicate the language used by the

   This indicates which language is associated with the capture.  For
   example: it may provide a language associated with an audio capture
   or a language associated with a video capture when sign
   interpretation or text is used.

   An example where multiple languages may be used is where a capture
   includes multiple conference participants who use different

   The possible values for the language tag are the values of the
   'Subtag' column for the "Type: language" entries in the "Language
   Subtag Registry" defined in RFC 5646 [RFC5646].

4.4.  Role

   The original definition of "Content" allows the indication that a
   particular media stream is related to the speaker.  CLUE should also
   allow this identification for captures.  In addition with the advent
   of XCON there may be other formal roles that may be associated with
   media/captures.  For instance: a remote end may like to always view
   the floor controller.  It is envisaged that a remote end may also
   chose captures depending on the role of the person/s captured.  For
   example: the people at the remote end may wish to always view the
   chairman.  This indicates that the capture is associated with an
   entity that has a particular role in the conference.  It is possible
   for the attribute to have multiple values where the capture has
   multiple roles.

   The values are grouped into two types: Person roles and Conference

4.4.1.  Person Roles

   The roles are related to the titles of the person/s associated with
   the capture.

Groves, et al.           Expires August 22, 2013                [Page 7]

Internet-Draft       CLUE media capture description        February 2013

        Manager - Indicates that the capture is assigned to a person
        with a senior position.

        Chairman- indicates who the chairman of the meeting is.

        Secretary - indicates that the capture is associated with the
        conference secretary.

        Lecturer - indicates that the capture is associated with the
        conference lecturer.

        Audience - indicates that the capture is associated with the
        conference audience.


4.4.2.  Conference Roles

   These roles are related to the establishment and maintenance of the
   multimedia conference and is related to the conference system.

        Speaker - indicates that the capture relates to the current

        Controller - indicates that the capture relates to the current
        floor controller of the conference.


   An example is:

              AC1 [Role=Speaker]
              VC1 [Role=Lecturer,Speaker]

4.5.  Priority

   As has been highlighted in discussions on the CLUE mailing list there
   appears to be some desire to provide some relative priority between
   captures when multiple alternatives are supplied.  This priority can
   be used to determine which captures contain the most important
   information (according to the provider).  This may be important in
   case where the consumer has limited resources and can on render a
   subset of captures.  Priority may also be advantageous in congestion
   scenarios where media from one capture may be favoured over other
   captures in any control algorithms.  This could be supplied via
   "ordering" in a CLUE data structure however this may be problematic
   if people assume some spatial meaning behind ordering, i.e. given
   three captures VC1, VC2, VC3: it would be natural to send VC1,VC2,VC3

Groves, et al.           Expires August 22, 2013                [Page 8]

Internet-Draft       CLUE media capture description        February 2013

   if the images are composed this way.  However if your boss sits in
   the middle view the priority may be VC2,VC1,VC3.  Explicit signalling
   is better.

   Additionally currently there are no hints to relative priority among
   captures from different capture scenes.  In order to prevent any
   misunderstanding with implicit ordering a numeric number that may be
   assigned to each capture.

   The "priority" attribute indicates a relative priority between
   captures.  For example it is possible to assign a priority between
   two presentation captures that would allow a remote endpoint to
   determine which presentation is more important.  Priority is assigned
   at the individual capture level.  It represents the provider's view
   of the relative priority between captures with a priority.  The same
   priority number may be used across multiple captures.  It indicates
   they are equally as important.  If no priority is assigned no
   assumptions regarding relative important of the capture can be

4.6.  Others

4.6.1.  Dynamic

   In the framework it has been assumed that the capture point is a
   fixed point within a telepresence session.  However depending on the
   conference scenario this may not be the case.  In tele-medical or
   tele-education cases a conference may include cameras that move
   during the conference.  For example: a camera may be placed at
   different positions in order to provide the best angle to capture a
   work task, or may include a camera worn by a participant.  This would
   have an effect of changing the capture point, capture axis and area
   of capture.  In order that the remote endpoint can chose to layout/
   render the capture appropriately an indication of if the camera is
   dynamic should be indicated in the initial capture description.

   This indicates that the spatial information related to the capture
   may be dynamic and change through the conference.  Thus captures may
   be characterised as static, dynamic or highly dynamic.  The capture
   point of a static capture does not move for the life of the
   conference.  The capture point of dynamic captures is categorised by
   a change in position followed by a reasonable period of stability.
   High dynamic captures are categorised by a capture point that is
   constantly moving.  This may assist an endpoint in determining the
   correct display layout.  If the "area of capture", "capture point"
   and "line of capture" attributes are included with dynamic or highly
   dynamic captures they indicate spatial information at the time a CLUE
   message is sent.  No information regarding future spatial information

Groves, et al.           Expires August 22, 2013                [Page 9]

Internet-Draft       CLUE media capture description        February 2013

   should be assumed.

4.6.2.  Embedded Text

   In accessible conferences textual information may be added to a
   capture before it is transmitted to the remote end.  In the case
   where multiple video captures are presented the remote end may
   benefit from the ability to choose a video stream containing text
   over one that does not.

   This attribute indicates that a capture provides embedded textual
   information.  For example the video capture may contain speech to
   text information composed with the video image.  This attribute is
   only applicable to video captures and presentation streams with
   visual information.

   The EmbeddedText attribute contains a language value according to RFC
   5646 [RFC5646] and may use a script sub-tag.  For example:


   Which indicates embedded text in Chinese written using the simplified
   Chinese script.

4.6.3.  Complementary Feed

   Some conferences utilise translators or facilitators that provide an
   additional audio stream (i.e. a translation or description of the
   conference).  These persons may not be pictured in a video capture.
   Where multiple audio captures are presented it may be advantageous
   for an endpoint to select a complementary stream instead of or
   additional to an audio feed associated with the participants from a
   main video capture.

   This indicates that a capture provides additional description of the
   conference.  For example an additional audio stream that provides a
   commentary of a conference that provides complementary information
   (e.g. a translation) or extra information to participants in
   accessible conferences.

   An example is where an additional capture provides a translation of
   another capture:

                   AC1 [Language = English]
                   AC2 [ComplementaryFeed = AC1, Language=Chinese]

   The complementary feed attribute indicates the capture to which it is
   providing additional information.

Groves, et al.           Expires August 22, 2013               [Page 10]

Internet-Draft       CLUE media capture description        February 2013

5.  Summary

   The main proposal is a to remove the Content Attribute in favour of
   describing the characteristics of captures in a more
   functional(atomic) way using the above attributes as the attributes
   to describe metadata regarding a capture.

6.  Acknowledgements

   This template was derived from an initial version written by Pekka
   Savola and contributed by him to the xml2rfc project.

7.  IANA Considerations

   This memo includes no request to IANA.

8.  Security Considerations


9.  Changes and Status Since Last Version

   Changes from 00 to 01:

   1.      Changed source to XML.

   2.      4.1 Presentation : No comments or concerns.  No changes.

   3.      4.2 View : No comments or concerns.  No changes.

   4.      4.3 Language: There were comments that multiple languages
           need to be supported e.g. audio in one, embedded text in
           another.  The text need to be clear whether it is supported
           or preferred language however it was clarified it is neither.
           Its the language of the content/capture.  It was also noted
           that different speakers using different languages could talk
           on the main speakers capture therefore language should be a
           list.  Seemed to be support for this.  Text was adapted

   5.      4.4 Role: There were a couple of responses for support for
           this attribute.  The actual values still need some work.  It
           was noted that there were two possible sets of roles: One
           group related to the titles of the person: i.e.  Boss,

Groves, et al.           Expires August 22, 2013               [Page 11]

Internet-Draft       CLUE media capture description        February 2013

           Chairman, Secretary, Lecturer, Audience.  Another group
           related to conference functions: i.e.  Conference initiator,
           controller, speaker.  Text was adapted accordingly.

   6.      4.5 Priority: No direct comment on the proposal.  There
           appeared to be some interest in a prioritisation scheme
           during discussions on the framework.  No changes.

   7.      4.6.1 Dynamic : No comments or concerns.  No changes.

   8.      4.6.2 Embedded text: There was a comment that "text" media
           capture was needed.  It was also indicated that it should be
           possible to associate a language with embedded text.  It
           should be possible to also specify language and script. e.g.
           Embedded text could have its own language.  Text adapted

   9.      4.6.3 Supplementary Description: There were comments that it
           could be interpreted as a free text field.  The intention is
           that its more of a flag.  A better name could be
           "Complementary feed"?  There was also a comment that perhaps
           a specific "translator flag" is needed.  It was noted the
           usage was like: AC1 Language=English or AC2 Supplementary
           Description = TRUE, Language=Chinese.  Text updated

   10.     4.6.4 Telepresence There were a couple of comments
           questioning the need for this parameter.  Attribute removed.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

              Duckworth, M., Pepperell, A., and S. Wenger, "Framework
              for Telepresence Multi-Streams",
              draft-ietf-clue-framework-08 (work in progress).

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

Groves, et al.           Expires August 22, 2013               [Page 12]

Internet-Draft       CLUE media capture description        February 2013

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

Authors' Addresses

   Christian Groves (editor)


   Weiwei Yang


   Roni Even
   Tel Aviv,


Groves, et al.           Expires August 22, 2013               [Page 13]