CLUE WG C. Groves
Internet-Draft W. Yang
Intended status: Informational R. Even
Expires: March 14, 2013 Huawei
September 10, 2012
CLUE media capture description
draft-groves-clue-capture-attr-00.txt
Abstract
This memo discusses how media captures are described and in
particular the content attribute in the current CLUE framework
document and proposes several alternatives.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 14, 2013.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Groves, et al. Expires March 14, 2013 [Page 1]
Internet-Draft New capture attributes September 2012
Table of Contents
1. Introduction
2. Terminology
3. Issues with Content attribute
  3.1. Ambiguous definition
  3.2. Multiple functions
  3.3. Limited Stream Support
  3.4. Insufficient information for individual parameters
  3.5. Insufficient information for negotiation
4. Capture description attributes
  4.1. Presentation
  4.2. View
  4.3. Language
  4.4. Role
  4.5. Priority
  4.6. Others
    4.6.1. Dynamic
    4.6.2. Embedded Text
    4.6.3. Supplementary Description
    4.6.4. Telepresence
5. Summary
6. Acknowledgements
7. IANA Considerations
8. Security Considerations
9. References
  9.1. Normative References
  9.2. Informative References
Authors' Addresses
1. Introduction
One of the fundamental aspects of the CLUE framework is the concept
of media captures. The media captures are sent from a provider to a
consumer. The consumer then selects which captures it is interested
in and replies to the provider. The question is: how does the
consumer choose between what may be many different media captures?
In order to choose between the different media captures, the consumer
must have enough information about what each media capture represents
and how to distinguish between them.
The CLUE framework draft currently defines several media capture
attributes which provide information regarding the capture. The
draft indicates that "Media Capture Attributes describe static
information about the captures. A provider uses the media capture
attributes to describe the media captures to the consumer. The
consumer will select the captures it wants to receive. Attributes
are defined by a variable and its value."
One of the media capture attributes is the content attribute. As
indicated in the draft it is "a field with enumerated values which
describes the role of the media capture and can be applied to any
media type. The enumerated values are defined by [RFC4796]. The
values for this attribute are the same as the mediacnt values for the
content attribute in [RFC4796]. This attribute can have multiple
values, for example content={main, speaker}."
[RFC4796] defines the values as:
o slides: the media stream includes presentation slides. The media
type can be, for example, a video stream or a number of instant
messages with pictures. Typical use cases for this are online
seminars and courses. This is similar to the 'presentation' role
in H.239.
o speaker: the media stream contains the image of the speaker. The
media can be, for example, a video stream or a still image.
Typical use cases for this are online seminars and courses.
o sl: the media stream contains sign language. A typical use case
for this is an audio stream that is translated into sign language,
which is sent over a video stream.
o main: the media stream is taken from the main source. A typical
use case for this is a concert where the camera is shooting the
performer.
o alt: the media stream is taken from the alternative source. A
typical use case for this is an event where the ambient sound is
separated from the main sound. The alternative audio stream could
be, for example, the sound of a jungle. Another example is the
video of a conference room, while the main stream carries the
video of the speaker. This is similar to the 'live' role in
H.239.
Whilst the above values appear to be a simple way of conveying the
content of a stream, the Contributors believe that there are multiple
issues that make the use of the existing "Content" tag insufficient
for CLUE and multi-stream telepresence systems. These issues are
described in Section 3. Section 4 proposes new capture description
attributes.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119] and
indicate requirement levels for compliant implementations.
3. Issues with Content attribute
3.1. Ambiguous definition
There is ambiguity in the definitions that may cause problems for
interoperability. A clear example is "slides" which could be any
form of presentation media. Another example is the difference
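between "main" and "alt". In a telepresence scenario the room would
be captured by the "main" cameras and a speaker would be captured by
an "alternative" camera. This runs counter to the definition of
"alt", whose example treats the video of the conference room as the
alternative source and the video of the speaker as the main one.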
Another example is a university use case where:
The main site is a university auditorium which is equipped with three
cameras. One camera is focused on the professor at the podium. A
second camera is mounted on the wall behind the professor and
captures the class in its entirety. The third camera is co-located
with the second, and is designed to capture a close up view of a
questioner in the audience. It automatically zooms in on that
student using sound localization.
For the first camera, it's not clear whether to use "main" or
"speaker". According to the definition and example of "speaker" in
RFC 4796, it may be more appropriate to use "speaker" here. The
third camera could fit the definition of "main", "alt" or
"speaker".
3.2. Multiple functions
It appears that the definitions cover disparate functions. "Main"
and "alt" appear to describe the source from which media is sent.
"Speaker" indicates a role associated with the media stream.
"Slides" and "Sign Language" indicate the actual content. Some
prioritization is also applied to these parameters indirectly. For
example, the IMTC document on best practices for H.239 indicates a
display priority between "main" and "alt". This mixing of functions
per code point can lead to ambiguous behavior and interoperability
problems. It is also an issue when extending the values.
3.3. Limited Stream Support
The values above appear to be defined based on the small number of
video streams that are typically supported by legacy video
conferencing, e.g. a main video stream (main), a secondary one (alt)
and perhaps a presentation stream (slides). It is not clear how this
value scales when many media streams are present. For example, if
there are several main streams and several presentation streams, how
would an endpoint distinguish between them?
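The scaling problem can be illustrated with a small, non-normative
sketch. The capture identifiers (VC1 to VC5) and the helper names are
hypothetical and are not taken from the CLUE framework; the sketch only
assumes that each capture carries a single RFC 4796 content value.

```python
from collections import defaultdict

# Hypothetical advertisement: capture ids mapped to RFC 4796 content values.
captures = {
    "VC1": "main",
    "VC2": "main",
    "VC3": "main",
    "VC4": "slides",
    "VC5": "slides",
}


def group_by_content(caps):
    """Group capture ids by their content value."""
    groups = defaultdict(list)
    for capture_id, content in caps.items():
        groups[content].append(capture_id)
    return dict(groups)


def ambiguous_values(caps):
    """Return the content values that are shared by more than one capture."""
    groups = group_by_content(caps)
    return sorted(value for value, ids in groups.items() if len(ids) > 1)


# Every value listed here labels several captures, so the content value
# alone gives a consumer nothing to choose between them.
print(ambiguous_values(captures))
```

With this advertisement both "main" and "slides" are reported as
ambiguous, which is the situation described above.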
3.4. Insufficient information for individual parameters
Related to the above point is that some individual values do not
provide sufficient information for an endpoint to make an educated
decision on the content. For example, take sign language (sl): if a
conference provides multiple streams, each one containing a sign
interpretation in a different sign language, how does an endpoint
distinguish between the languages if "sl" is the only label? Also,
for accessible services, other functions should be supported, such as
real time captioning and video description, where an additional audio
channel is used to describe the conference for vision impaired
people.
Note: SDP provides a language attribute.
3.5. Insufficient information for negotiation
CLUE negotiation is likely to take place at the start of session
initiation. At this point in time only a very simple set of SDP
(i.e. limited media description) may be available (depending on call
flow). In most cases the supported media captures may be agreed upon
before the full SDP information for each media stream is available. The effect of
this is that detailed information would not be available for the
initial decision about which capture to choose. The obvious solution
is to provide "enough" data in the CLUE provider messages so that a
consumer can choose the appropriate media captures. The current CLUE
framework already partly addresses this through the "Content"
attribute; however, based on the current "Content" values, it appears
that the information is not sufficient to fully describe the content
of the captures.
The purpose of the CLUE work is to supply enough information for
negotiating multiple streams. The CLUE framework
[I-D.ietf-clue-framework] addresses the spatial relation between the
streams but it does not appear to provide enough information about
the semantic content of the streams to allow interoperability. Some
information is available in SDP and may be available before the CLUE
exchange, but some information is still missing.
4. Capture description attributes
As indicated above, it is proposed to introduce new attributes that
allow the definition of various pieces of information that provide
metadata about a particular media capture. Each attribute should be
defined so that it conveys only one atomic function. It should be
applicable in a multi-stream environment, and it should be extensible
to allow new information elements to be introduced in the future.
As an initial list the following attributes are proposed for use as
metadata associated with media captures. Further attributes may be
identified in the future.
This document proposes to remove the "Content" attribute. Rather than
describing the "source device" in this way it may be better to
describe its characteristics, i.e.:
o An attribute to indicate "Presentation" rather than the value
"Slides"
o An attribute to describe the "Role" of a capture rather than the
value "Speaker".
o An attribute to indicate the actual language used rather than a
value "Sign Language". This is also applicable to multiple audio
streams.
o With respect to "main" and "alt" in a multiple stream environment
it's not clear these values are needed if the characteristics of
the capture are described. An assumption may be that a capture is
"main" unless described otherwise.
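As a non-normative illustration, the replacement of the "Content"
values by separate attributes might be sketched as below. The class
CaptureDescription, its field names and the function
from_content_value are hypothetical and are introduced here only for
illustration; neither this memo nor the CLUE framework defines such an
encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptureDescription:
    """Hypothetical container for some of the attributes proposed here."""
    presentation: bool = False
    role: Optional[str] = None
    language: Optional[str] = None  # a language subtag, e.g. "ase"


def from_content_value(content: str,
                       language: Optional[str] = None) -> CaptureDescription:
    """Translate one RFC 4796 content value into separate attributes."""
    if content == "slides":
        return CaptureDescription(presentation=True)
    if content == "speaker":
        return CaptureDescription(role="speaker")
    if content == "sl":
        # "sl" does not say which sign language is used, so the language
        # has to come from elsewhere (here: the caller).
        return CaptureDescription(language=language)
    if content in ("main", "alt"):
        # Assumption: a capture is "main" unless described otherwise, so
        # no dedicated attribute is set for either value.
        return CaptureDescription()
    raise ValueError(f"unknown content value: {content!r}")


print(from_content_value("sl", language="ase"))
```

The "sl" branch shows the gap discussed in Section 3.4: the legacy
value does not carry the language, so it has to be supplied separately.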
Note: CLUE may have missed a media type "text". Consider real time
captioning or a real time text conversation associated with a video
meeting. It is a text based service. It is not necessarily a
presentation stream. It is not audio or visual but it is a valid
component of a conference.
The sections below contain an initial list of attributes.
4.1. Presentation
This attribute indicates that the capture originates from a
presentation device, that is, one that provides supplementary
information to a conference through slides, video, still images, data
etc. Where more information is known about the capture it may be
expanded hierarchically to indicate the different types of
presentation media, e.g. presentation.slides, presentation.image etc.
Note: It is expected that a number of keywords will be defined that
provide more detail on the type of presentation.
4.2. View
The Area of capture attribute provides a physical indication of the
region that the media capture captures. However the consumer does
not know what this physical region relates to. In discussions on the
IETF mailing list it is apparent that some people propose to use the
"Description" attribute to describe a scene. This is a free text
field and as such can be used to signal any piece of information.
This leads to problems with interoperability if this field is
automatically processed. For interoperability purposes it is
therefore proposed to introduce a set of keywords, indicating what
the spatial region relates to (e.g. room, table), that could be used
as a basis for the selection of captures. It is envisaged that this
list would be extendable to allow for future uses not covered by the
initial specification. This is an initial description of an
attribute introducing these keywords.
This attribute provides a textual description of the area that a
media capture captures. This provides supplementary information in
addition to the spatial information (i.e. area of capture) regarding
the region that is captured.
Room - Captures the entire scene.
Table - Captures the conference table with seated participants.
Individual - Captures an individual participant.
Lectern - Captures the region of the lectern including the presenter
in a classroom style conference.
Audience - Captures a region showing the audience in a classroom
style conference.
Others - TBD
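A consumer could use such keywords for automatic selection, which is
not possible with a free text description. The following is a minimal
sketch under stated assumptions: the enumeration View, the function
select_by_view and the capture identifiers are hypothetical, and the
keyword spellings are illustrative only.

```python
from enum import Enum


class View(Enum):
    """Initial keyword list from this section (illustrative spellings)."""
    ROOM = "room"
    TABLE = "table"
    INDIVIDUAL = "individual"
    LECTERN = "lectern"
    AUDIENCE = "audience"


def select_by_view(captures, preferred):
    """Return ids of captures whose view is in 'preferred'.

    Results are ordered by the consumer's preference order first and by
    advertisement order second.
    """
    selected = []
    for view in preferred:
        selected.extend(cid for cid, v in captures.items() if v is view)
    return selected


# The university use case from Section 3.1, described with view keywords.
advertised = {
    "VC1": View.LECTERN,     # camera focused on the professor
    "VC2": View.AUDIENCE,    # camera capturing the whole class
    "VC3": View.INDIVIDUAL,  # camera zooming in on a questioner
}

print(select_by_view(advertised, [View.LECTERN, View.AUDIENCE]))
```

Here the three cameras that were hard to label with "main", "alt" or
"speaker" each receive a distinct keyword.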
4.3. Language
As indicated in the discussion in Section 3.4, captures may be offered
in different languages in the case of multi-lingual and/or accessible
conferences. It is important to allow the remote end to distinguish
between them. It is noted that SDP already contains a language
attribute; however, this may not be available at the time that an
initial CLUE message is sent. Therefore a language attribute is
proposed for CLUE.
This indicates which language is associated with the capture. For
example: it may provide a language associated with an audio capture
or a language associated with a video capture when sign
interpretation or text is used. The possible values for a language
tag are the values of the 'Subtag' column for the "Type: language"
entries in the "Language Subtag Registry" defined in [RFC5646].
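The sketch below illustrates, in a non-normative way, how a consumer
might use such an attribute to pick between several sign language or
audio captures. The function select_by_language and the capture
identifiers are hypothetical. The subtags used ("en", "fr", "ase" for
American Sign Language and "bfi" for British Sign Language) are
entries of the Language Subtag Registry; the comparison is
case-insensitive because [RFC5646] treats tags as case-insensitive.

```python
from typing import Dict, List, Optional


def select_by_language(captures: Dict[str, Optional[str]],
                       wanted: str) -> List[str]:
    """Return ids of captures whose language subtag equals 'wanted'.

    Captures without a language attribute (None) never match.
    """
    wanted = wanted.lower()
    return [
        capture_id
        for capture_id, subtag in captures.items()
        if subtag is not None and subtag.lower() == wanted
    ]


# Hypothetical advertisement: two audio captures and two sign
# interpretation video captures, plus one capture with no language.
advertised = {
    "AC1": "en",
    "AC2": "fr",
    "VC4": "ase",
    "VC5": "bfi",
    "VC6": None,
}

print(select_by_language(advertised, "ASE"))
```

With only the "sl" content value, VC4 and VC5 would be
indistinguishable; with a language subtag the consumer can select the
one it needs.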
4.4. Role
The original definition of "Content" allows the indication that a
particular media stream is related to the speaker. CLUE should also
allow this identification for captures. In addition, with the advent
of XCON there may be other formal roles that may be associated with
media/captures. For instance, a remote end may like to always view
the floor controller. It is envisaged that a remote end may also
choose captures depending on the role of the person(s) captured. For
example, the people at the remote end may wish to always view the
chairman. This attribute indicates that the capture is associated
with an entity that has a particular role in the conference. The
values are:
Speaker - indicates that the capture relates to the current speaker
Floor - indicates that the capture relates to the current floor
controller of the conference
Chairman - indicates that the capture relates to the chairman of the
meeting
Others - ?
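As a non-normative sketch, a remote end that wishes to "always view
the chairman" could express this as an ordered list of wanted roles.
The enumeration Role, the function pick_by_role and the fallback to
the current speaker are assumptions made for illustration only.

```python
from enum import Enum


class Role(Enum):
    """Role values listed in this section (illustrative spellings)."""
    SPEAKER = "speaker"
    FLOOR = "floor"
    CHAIRMAN = "chairman"


def pick_by_role(captures, wanted_roles):
    """Return the first capture id matching the earliest wanted role.

    'captures' maps capture ids to a Role or to None when no role is
    advertised. Returns None when no capture matches any wanted role.
    """
    for role in wanted_roles:
        for capture_id, advertised_role in captures.items():
            if advertised_role is role:
                return capture_id
    return None


advertised = {"VC1": None, "VC2": Role.SPEAKER, "VC3": Role.CHAIRMAN}

# Prefer the chairman; fall back to the current speaker.
print(pick_by_role(advertised, [Role.CHAIRMAN, Role.SPEAKER]))
```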
4.5. Priority
As has been highlighted in discussions on the CLUE mailing list there
appears to be some desire to provide some relative priority between
captures when multiple alternatives are supplied. This priority can
be used to determine which captures contain the most important
information (according to the provider). This may be important in
cases where the consumer has limited resources and can only render a
subset of captures. Priority may also be advantageous in congestion
scenarios where media from one capture may be favoured over other
captures in any control algorithms. This could be supplied via
"ordering" in a CLUE data structure; however, this may be problematic
if people assume some spatial meaning behind ordering. For example,
given three captures VC1, VC2, VC3, it would be natural to send
VC1,VC2,VC3 if the images are composed this way. However if your
boss sits in the middle view the priority may be VC2,VC1,VC3.
Explicit signalling is therefore preferable.
Additionally there are currently no hints to relative priority among
captures from different capture scenes. In order to prevent any
misunderstanding with implicit ordering it is proposed that a numeric
priority may be assigned to each capture.
The "priority" attribute indicates a relative priority between
captures. For example it is possible to assign a priority between
two presentation captures that would allow a remote endpoint to
determine which presentation is more important. Priority is assigned
at the individual capture level. It represents the provider's view
of the relative priority between captures that have a priority. The
same priority number may be used across multiple captures; this
indicates that they are equally important. If no priority is
assigned, no assumptions regarding the relative importance of the
capture can be made.
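A consumer with limited rendering resources might apply the priority
as sketched below. This is a non-normative illustration under two
loudly stated assumptions that this memo does not settle: a lower
number is taken to mean a more important capture, and captures without
a priority are considered only after all captures that have one. The
function name choose_captures is hypothetical.

```python
from typing import Dict, List, Optional


def choose_captures(priorities: Dict[str, Optional[int]],
                    max_captures: int) -> List[str]:
    """Pick up to 'max_captures' capture ids using the provider priority.

    Assumption: a lower number means a more important capture.
    Assumption: captures with no priority (None) are appended last, in
    advertisement order, since nothing can be assumed about them.
    Captures with equal priority keep their advertisement order.
    """
    ranked = sorted(
        (cid for cid, prio in priorities.items() if prio is not None),
        key=lambda cid: priorities[cid],
    )
    unranked = [cid for cid, prio in priorities.items() if prio is None]
    return (ranked + unranked)[:max_captures]


# Captures are advertised in spatial order, but the middle view is the
# most important one according to the provider.
advertised = {"VC1": 2, "VC2": 1, "VC3": 2}

print(choose_captures(advertised, 2))
```

The advertisement order stays VC1,VC2,VC3, so no spatial meaning is
lost, while the explicit numbers yield the preference VC2,VC1,VC3.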
4.6. Others
4.6.1. Dynamic
In the framework it has been assumed that the capture point is a
fixed point within a telepresence session. However depending on the
conference scenario this may not be the case. In tele-medical or
tele-education cases a conference may include cameras that move
during the conference. For example: a camera may be placed at
different positions in order to provide the best angle to capture a
work task, or may include a camera worn by a participant. This would
have an effect of changing the capture point, capture axis and area
of capture. In order that the remote endpoint can choose to lay out/
render the capture appropriately, whether the camera is dynamic
should be indicated in the initial capture description.
This indicates that the spatial information related to the capture
may be dynamic and change through the conference. Thus captures may
be characterised as static, dynamic or highly dynamic. The capture
point of a static capture does not move for the life of the
conference. The capture point of a dynamic capture is characterised
by a change in position followed by a reasonable period of stability.
Highly dynamic captures are characterised by a capture point that is
constantly moving. This may assist an endpoint in determining the
correct display layout. If the "area of capture", "capture point"
and "line of capture" attributes are included with dynamic or highly
dynamic captures they indicate spatial information at the time a CLUE
message is sent. No information regarding future spatial information
should be assumed.
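The three categories could be modelled as in the following
non-normative sketch. The names Mobility, AdvertisedCapture and
layout_policy, as well as the three policy strings, are hypothetical;
they only illustrate that the advertised spatial information is a
snapshot whose usefulness for layout depends on the category.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Mobility(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"
    HIGHLY_DYNAMIC = "highly-dynamic"


@dataclass
class AdvertisedCapture:
    capture_id: str
    # Capture point as known when the CLUE message was sent (a snapshot).
    capture_point: Tuple[float, float, float]
    mobility: Mobility


def layout_policy(capture: AdvertisedCapture) -> str:
    """Return a hypothetical layout policy for the capture."""
    if capture.mobility is Mobility.STATIC:
        # The capture point holds for the life of the conference.
        return "fixed"
    if capture.mobility is Mobility.DYNAMIC:
        # Stable between moves: lay out now, revisit on a new message.
        return "refresh-on-update"
    # Constantly moving: do not derive the layout from the snapshot.
    return "position-independent"


room_camera = AdvertisedCapture("VC1", (0.0, 0.0, 1.5), Mobility.STATIC)
worn_camera = AdvertisedCapture("VC2", (2.0, 1.0, 1.7), Mobility.HIGHLY_DYNAMIC)

print(layout_policy(room_camera), layout_policy(worn_camera))
```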
4.6.2. Embedded Text
In accessible conferences textual information may be added to a
capture before it is transmitted to the remote end. In the case
where multiple video captures are presented the remote end may
benefit from the ability to choose a video stream containing text
over one that does not.
This attribute indicates that a capture provides embedded textual
information. For example the video capture may contain speech to
text information composed with the video image. This attribute is
only applicable to video captures and presentation streams with
visual information.
4.6.3. Supplementary Description
Some conferences utilise translators or facilitators that provide an
additional audio stream (i.e. a translation or description of the
conference). These persons may not be pictured in a video capture.
Where multiple audio captures are presented it may be advantageous
for an endpoint to select a supplementary stream instead of, or in
addition to, an audio feed associated with the participants from a
main video capture. Therefore an attribute is proposed for this.
Depending on the results of the discussion of the source device this
parameter may be another value for the source.
This indicates that a capture provides additional description of the
conference, for example an additional audio stream that provides a
commentary on the conference, giving supplementary information
(e.g. a translation) or extra information to participants in
accessible conferences.
4.6.4. Telepresence
In certain use case scenarios it is important to maintain a feeling
of "Telepresence" associated with captures when they are played at
the remote end. For example, in medical use cases it is important to
maintain the colour of images. It is important to note that CLUE is
used to describe multi-stream conferences. These may or may not be
"telepresence" conferences. Alternatively it could be assumed that
all captures possess this attribute and that the only captures not
subject to processing to create "telepresence" are those marked with
"presentation". A related aspect is how an endpoint determines
whether a capture relates to a computer generated image or a real
environment. An endpoint may apply different image processing
depending on the source, i.e. it may or may not apply image
processing to adjust lighting levels for a telepresence experience.
This parameter indicates that "telepresence" should be associated
with the capture. E.g. real world environmental conditions are
associated with this capture. Lighting, spatial and timing
information are important aspects of the telepresence session. The
remote end should apply the appropriate capture processing to maintain
the integrity of this information. For example, the colour related
information associated with the original capture is important and
should be replicated when displayed/played.
5. Summary
The main proposal is to remove the Content attribute in favour of
describing the characteristics of captures in a more functional
(atomic) way, using the above attributes to describe metadata
regarding a capture.
6. Acknowledgements
place holder
7. IANA Considerations
TBD
8. Security Considerations
TBD.
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
9.2. Informative References
[I-D.ietf-clue-framework]
Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
"Framework for Telepresence Multi-Streams",
draft-ietf-clue-framework-06 (work in progress),
July 2012.
[RFC4796] Hautakorpi, J. and G. Camarillo, "The Session Description
Protocol (SDP) Content Attribute", RFC 4796,
February 2007.
[RFC5646] Phillips, A. and M. Davis, "Tags for Identifying
Languages", BCP 47, RFC 5646, September 2009.
Authors' Addresses
Christian Groves
Huawei
Australia
Email: Christian.Groves@nteczone.com
Weiwei Yang
Huawei
P.R. China
Email: tommy@huawei.com
Roni Even
Huawei
Tel Aviv,
Israel
Email: roni.even@mail01.huawei.com