CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                              M. Duckworth
Expires: January 4, 2012                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            M. Gorzynski
                                                 HP Visual Collaboration
                                                            July 3, 2011


               Framework for Telepresence Multi-Streams
                  draft-romanow-clue-framework-00.txt
Abstract
This memo offers a framework for a protocol that enables devices in a
telepresence conference to interoperate by specifying the
relationships between multiple RTP streams.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 4, 2012.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Two Necessary Functions
   5.  Protocol Features
   6.  Stream Content
       6.1.  Media capture
       6.2.  Attributes
       6.3.  Capture Set
   7.  Choosing Streams
       7.1.  Physical Simultaneity
       7.2.  Encoding Groups
             7.2.1.  Sample video encoding group specification #1
             7.2.2.  Sample video encoding group specification #2
   8.  Media provider behavior
   9.  Putting it together - using the Capture Set
   10. Media consumer behaviour
       10.1.  One screen receiver configuring the example
              capture-side device above
       10.2.  Two screen receiver configuring the example
              capture-side device above
       10.3.  Three screen receiver configuring the example
              capture-side device above
       10.4.  Configuration of sender streams by a receiver
       10.5.  Advertisement of capabilities sent by receiver to sender
   11. Acknowledgements
   12. IANA Considerations
   13. Security Considerations
   14. Informative References
   Appendix A.  Attributes
       A.1.  Purpose
             A.1.1.  Main
             A.1.2.  Presentation
       A.2.  Audio mixed
       A.3.  Audio Channel Format
             A.3.1.  Linear Array
             A.3.2.  Stereo
             A.3.3.  Mono
       A.4.  Audio Linear Position
       A.5.  Video Scale
       A.6.  Video composed
       A.7.  Video Auto-switched
   Appendix B.  Spatial Relationship
       B.1.  Spatial relationship of audio with video
   Appendix C.  Capture sets for the MCU Case
   Authors' Addresses
1. Introduction
Current telepresence systems, though based on open standards such as
RTP and SIP, cannot easily interoperate with each other. A major
factor limiting the interoperability of telepresence systems is the
lack of a standardized way to describe and negotiate the use of the
multiple streams of audio and video comprising the media flows. This
draft provides a framework for a protocol to enable interoperability
by handling multiple streams in a standardized way. It is intended
to support the use cases described in
draft-ietf-clue-telepresence-use-cases-00 and to meet the
requirements in draft-romanow-clue-requirements-xx.
The solution described here is strongly focused on what is being done
today, rather than a vision of future conferencing. However, the
highest priority has been given to creating an extensible framework
to make it easy to add new information needed to accommodate future
conferencing functionality.
The purpose of this effort is to make it possible to handle multiple
streams of media in such a way that a satisfactory user experience is
possible even when participants are on different vendor equipment and
when they are using devices with different types of communication
capabilities. Information about the relationship of media streams
must be communicated so that audio/video rendering can be done in the
best possible manner. In addition, it is necessary to choose which
media streams are sent.
This first draft of the CLUE framework introduces the basic approach.
The draft is deliberately as simple as possible in order to focus
discussion on that approach.  Some of the more descriptive material
has been put into appendices in this version, in order to keep the
framework material from being overwhelmed by detail.  In addition,
only the basic mechanism is described here.  In subsequent drafts,
additional mechanisms consistent with the basic approach will be
added to handle more use cases.
Several important use cases require additional mechanisms in order
to be handled.  Nonetheless, we feel that it is better to go step by
step, and we are deferring that material until the next version of
the model.  It will provide a good illustration of how to use the
extensible features of the framework to handle new use cases.
If you look at this framework from the perspective of trying to
catch it out and see where it breaks down in a special case, you
will easily succeed.  But we urge you to hold that perspective
temporarily in order to concentrate on how this model works in
common cases, and how it can be expanded to other use cases.
[Edt. Similarly, some of the wording is not as precise and accurate
as might be possible. Although of course this is very important, it
might be useful to postpone definition issues temporarily where
possible in order to concentrate on the framework.]
After the following Definitions, two short sections introduce key
concepts.  The body of the text comprises three sections that deal
in turn with stream content, choosing streams, and an implementation
example.  The media provider and media consumer behavior are
described in separate sections as well.  Several appendices describe
further details for using the framework.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
3. Definitions
The definitions marked with an "*" are new; all the others are from
draft-wenger-clue-definitions-00-01.txt.
*Audio Capture: Media Capture for audio. Denoted as ACn.
Capture Device: A device that converts audio and video input into
an electrical signal, in most cases to be fed into a media
encoder.  Cameras and microphones are examples of capture
devices.
Capture Scene: the scene that is captured by a collection of
Capture Devices. A Capture Scene may be represented by more than
one type of Media. A Capture Scene may include more than one
Media Capture of the same type. An example of a Capture Scene is
the video image of a group of people seated next to each other,
along with the sound of their voices, which could be represented
by some number of VCs and ACs. A middle box may also express
Capture Scenes that it constructs from Media streams it receives.
A Capture Set includes Media Captures that all represent some
aspect of the same Capture Scene. The items (rows) in a Capture
Set represent different alternatives for representing the same
Capture Scene.
Conference: used as defined in [RFC4353], A Framework for
Conferencing within the Session Initiation Protocol (SIP).
*Encoding Group: A set of encoding parameters representing one or
more media encoders. An Encoding Group describes constraints on
encoding parameters used for mapping Media Captures to encoded
Streams.
Endpoint: The logical point of final termination through
receiving, decoding and rendering, and/or initiation through
capturing, encoding, and sending of media streams. An endpoint
consists of one or more physical devices which source and sink
media streams, and exactly one [RFC4353] Participant (which, in
turn, includes exactly one SIP User Agent). In contrast to an
endpoint, an MCU may also send and receive media streams, but it
is not the initiator nor the final terminator in the sense that
Media is Captured or Rendered. Endpoints can be anything from
multiscreen/multicamera rooms to handheld devices.
Endpoint Characteristics: include placement of Capture and
Rendering Devices, capture/render angle, resolution of cameras and
screens, spatial location and mixing parameters of microphones.
Endpoint characteristics are not specific to individual media
streams sent by the endpoint.
Left: to be interpreted as a stage direction, see also
[StageDirection(Wikipedia)] (Edt. note: needs more clarification)
MCU: Multipoint Control Unit (MCU) - a device that connects two or
more endpoints together into one single multimedia conference
[RFC5117]. An MCU includes an [RFC4353] Mixer. Edt. Note:
RFC4353 is tardy in requiring that media from the mixer be sent
to EACH participant. I think we have practical use cases where
this is not the case. But the bug (if it is one) is in 4353 and
not herein.
Media: Any data that, after suitable encoding, can be conveyed
over RTP, including audio, video or timed text.
*Media Capture: a source of Media, such as from one or more
Capture Devices. A Media Capture may be the source of one or more
Media streams. A Media Capture may also be constructed from other
Media streams. A middle box can express Media Captures that it
constructs from Media streams it receives.
*Media Consumer: an Endpoint or middle box that receives Media
streams
*Media Provider: an Endpoint or middle box that sends Media
streams
Model: a set of assumptions a telepresence system of a given
vendor adheres to and expects the remote telepresence system(s)
also to adhere to.
Right: to be interpreted as a stage direction, see also
[StageDirection(Wikipedia)] (Edt. note: needs more clarification)
Render: the process of generating a representation from a media,
such as displayed motion video or sound emitted from loudspeakers.
*Simultaneous Transmission Set: a set of media captures that can
be transmitted simultaneously from a Media Sender.
Spatial Relation: The arrangement in space of two objects, in
contrast to relation in time or other relationships. See also
Left and Right.
*Stream: RTP stream as in RFC 3550.
Stream Characteristics: include media stream attributes commonly
used in non-CLUE SIP/SDP environments (such as: media codec, bit
rate, resolution, profile/level etc.) as well as CLUE specific
attributes (which could include for example and depending on the
solution found: the ID or spatial location of a capture device a
stream originates from).
Telepresence: an environment that gives non co-located users or
user groups a feeling of (co-located) presence - the feeling that
a Local user is in the same room with other Local users and the
Remote parties. The inclusion of Remote parties is achieved
through multimedia communication including at least audio and
video signals of high fidelity.
*Video Capture: Media Capture for video. Denoted as VCn.
Video composite: A single image that is formed from combining
visual elements from separate sources.
4. Two Necessary Functions
In simplified terms, here is a description of the functions in a
telepresence conference.
1. Capture media
2. FIGURE OUT WHICH MEDIA STREAMS TO SEND (CHOOSING STREAMS)
3. Encode it
4. ADD SOME NOTES (STREAM CONTENT)
5. Package it
6. Send it
7. Unpack it
8. Decode it
9. Understand the notes
10. Render the stream content according to the notes
This gross oversimplification is to show clearly that there are only
2 functions that the CLUE protocol needs to accomplish - choose which
streams the sender should send to the receiver, and add the right
information to the streams that get sent. The framework/model that
we are presenting can be understood as addressing these two issues.
5. Protocol Features
Central to the framework are media stream providers and media stream
consumers.  The provider's job is to advertise its capabilities (as
described here) to the consumer, whose job it is to configure the
provider's encodings (as described below).  Both providers and
consumers can send and receive information; that is, we do not have
one party exclusively as the sender and one as the receiver, but all
parties have both sending and receiving parts to them.  Most devices
function as both a media provider and as a media consumer.
For two devices to communicate bidirectionally, with media flowing in
both directions, both devices act as both a media provider and a
media consumer.  The protocol exchange shown later in the "Choosing
Streams" section, including the hints, announcement and request
messages, happens twice, independently, between the two
bidirectional devices.
For brevity we will sometimes refer to the media stream provider as
the "sender" and the media stream consumer as the "receiver".
Both endpoints and MCUs, or more generally "middleboxes", can be
media senders and receivers.
The protocol resulting from the framework will be declarative rather
than negotiated.  What this means here is that information is passed
in either direction, but there is no formalized or explicit
agreement between participants in the protocol.
6. Stream Content
This section describes the structure for communicating information
between senders and receivers.  The figure below illustrates how the
information to be communicated is organized.  Each construct is
discussed in the sections below.  This diagram is for reference.
Diagram for Stream Content
                        +---------------+
                        |               |
                        |  Capture Set  |
                        |               |
                        +-------+-------+
                      _..-'     |     `-._
                  _.-'          |         `-._
              _.-'              |             `-._
   +----------------+  +----------------+  +----------------+
   | Media Capture  |  | Media Capture  |  | Media Capture  |
   | Audio or Video |  | Audio or Video |  | Audio or Video |
   +----------------+  +----------------+  +----------------+
          .'    `.
        .'        `.
     ,-----.      ,---------.
   ,' Encode`.  ,'           `.
  (   Group  ) (  Attributes  )
   `.       ,'  `.           ,'
     `-----'      `---------'
6.1. Media capture
A media capture, as defined in Section 3, is a fundamental concept
of the model.  Media can be captured in different ways, for example
by various arrangements of cameras and microphones.  The model uses
the terms "video capture" (VC) and "audio capture" (AC) to refer to
sources of media streams.  To distinguish between multiple instances,
they are numbered; for example, VC1, VC2 and VC3 could refer to
three different video captures that can be used simultaneously.
Media captures are dynamic. They can come and go in a conference -
and their parameters can change. A sender can advertise a new list
of captures at any time. Both the media sender and media receiver
can send "their messages" (i.e., capture set advertisements, stream
configurations) any number of times during a call, and the other end
is always required to act on any new information received (e.g.,
stopping streams it had previously configured that are no longer
valid).
A media capture can be a media source such as video from a specific
camera, or it can be more conceptual such as a composite image from
several cameras, or an automatic dynamically switched capture
choosing from several cameras depending on who is talking or other
factors.
A media capture is described by Attributes and associated with an
Encode Group. Audio and video captures are aggregated into Capture
Sets.
6.2. Attributes
Audio and video capture attributes carry the information about
streams and their relationships that a sender or receiver wants to
communicate. [Edt: We do not mean to duplicate SDP, if an SDP
description can be used, great.]
The attributes of media streams refer to the current state of a
stream, rather than the capabilities of a video capture device,
which are covered by the encode capabilities described below.
The mechanism of Attributes makes the framework extensible.
Although we are defining some attributes now based on the most
common use cases, new attributes can be added for new use cases as
they arise.  If the model does not do something you want it to,
chances are that defining an attribute will handle your case.
We describe attributes by variables and their values. The current
attributes are listed below. The variable is shown in parentheses,
and the values follow after the colon:
o (Purpose): main audio, main video, presentation
o (Audio mixed): true, false
o (Audio Channel Format): linear array, mono, stereo, tbd
o (Audio linear position): integer 0 to 100
o (Video scale): integer indicating scale
o (Video composed): true, false
o (Video auto-switched): true, false
The attributes listed here are discussed in Appendix A, in order to
keep the emphasis of this draft on the overall approach, rather than
the more specific details.
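As a purely illustrative, non-normative example of this
variable/value form, the attributes of two hypothetical captures
could be written as the following Python sketch; the dictionary keys
simply restate the variables listed above.

   # Non-normative sketch: attribute variables and values for two
   # hypothetical captures, using the variables listed above.
   vc_attributes = {"purpose": "main video",
                    "video_auto_switched": True,
                    "video_composed": False}
   ac_attributes = {"purpose": "main audio",
                    "audio_mixed": False,
                    "audio_channel_format": "linear array",
                    "audio_linear_position": 0}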
6.3. Capture Set
A sender describes its ability to send alternatives of media streams
by defining capture sets.
A capture set is a list of media captures expressed in rows. Each
row of the capture set or list consists of either a single capture or
groups of captures. A group means the individual captures in the
group are spatially related, and the order of the captures within the
group, along with attribute values, defines the spatial ordering of
the captures. Spatial relationships are discussed in detail in
Appendix B.
The items (rows) in a capture set represent different alternatives
for representing the same Capture Scene. For example the following
are alternative ways of capturing the same Capture Scene - two
cameras each viewing half of a room, or one camera viewing the whole
room, or one stream that automatically captures the person in the
room who is currently speaking. Each row of the Capture Set contains
either a single media capture or one group of media captures.
The following example shows a capture set for an endpoint media
sender where:
o (VC0 - left camera capture, VC1 - center camera capture, VC2 -
right camera capture)
o (VC3 - capture associated with the loudest speaker)
o (VC4 - zoomed out view of all people in the room.)
o (AC0 - room audio)
The first item in this capture set example is a group of video
captures with a spatial relationship to each other. VC1 is to the
left of VC2, and VC0 is to the left of VC1. VC3 and VC4 are other
alternatives of how to capture the same room in different ways. The
audio capture is included in the same capture set to indicate AC0 is
associated with those video captures, meaning the audio should be
rendered along with the video in the same set.
The idea is to have sets of captures that represent the same
information ("information" in this context might be a set of people
and their associated audio / video streams, or might be a
presentation supplied by a laptop, perhaps with an accompanying audio
commentary).  Spatial ordering of media captures is expressed here
simply as a left to right ordering among the media captures in a
group in the set.
A media receiver could choose one row of each media type (e.g., audio
and video) from a capture set. For example a three stream receiver
could choose the first video row plus the audio row, while a single
stream receiver could choose the second or third video row plus the
audio row. An MCU receiver might choose to receive multiple rows.
The simultaneity groups and encoding groups as discussed in the next
section apply to media captures listed in capture sets. The
simultaneity groups and encoding groups MUST allow all the Media
Captures in a particular group to be used simultaneously.
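As a non-normative illustration of this structure, the example
capture set above could be modeled as a simple list of rows, where a
row with several captures is a spatially ordered group.  The Python
sketch below is illustrative only and implies nothing about the
eventual message syntax.

   # Non-normative sketch: a capture set as a list of rows.  A row
   # with several captures is a spatially related group, ordered
   # left to right; rows are alternatives for the same Capture Scene.
   capture_set = [
       ["VC0", "VC1", "VC2"],  # left, center, right camera captures
       ["VC3"],                # capture associated with the loudest
       ["VC4"],                # zoomed out view of all people in room
       ["AC0"],                # room audio
   ]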
7. Choosing Streams
The following diagram shows the flow of information messages between
a media provider and a media consumer. The provider sends
information about its capabilities (as specified in this section),
then the consumer chooses which streams it wants, which we refer to
as "configure". Optionally, the consumer may send hints to the
provider about its own capabilities, in which case the provider might
tailor its announcements to the consumer.
Diagram for Choosing Streams
   Media Receiver                           Media Sender
   --------------                           ------------
         |                                        |
         |--------------- Hints ---------------->|
         |                                        |
         |<----- Capabilities (announce) --------|
         |                                        |
         |------- Configure (request) ---------->|
         |                                        |
In order for appropriate streams to be sent from senders to
receivers, certain characteristics of the multiple streams must be
understood by both senders and receivers. Two separate aspects of
streams suffice to describe the necessary information to be shared by
senders and receivers. The first aspect we call "physical
simultaneity" and the other aspect we refer to as "encoding group".
These are described in the following sections.
7.1. Physical Simultaneity
An endpoint or MCU can send multiple captures simultaneously.
However, there may be constraints that limit which captures can be
sent simultaneously with other captures.
Physical or device simultaneity refers to the fact that a device may
not be able to be used in different ways at the same time.  This
shapes the way that offers are made from the sender.  The offers are
made so that the receiver will choose one of several possible usages
of the device.  This is easier to show with an example.
Consider the example of a room system with 3 cameras, each of which
can send a separate capture covering 2 persons each: VC0, VC1, VC2.
The middle camera can also zoom out and show all 6 persons, VC3.
But the middle camera cannot be used in both modes at the same time;
it has to show either the space where 2 participants sit or all 6
seats.  We refer to this as a physical device simultaneity
constraint.
The following illustration shows 3 cameras with 4 video streams. The
middle camera can be used as main video zoomed in on 2 people or it
could be used in zoomed out mode and capture the whole endpoint. The
idea here is that the middle camera cannot be used for both zoomed in
and zoomed out captures simultaneously. This is a constraint imposed
by the physical limitations of the devices.
Diagram for Simultaneity
               +----------+   VC2
               | Camera 3 |---------->
               +----------+
                                VC3 (zoomed out)
                               ---------->
               +----------+   /
               | Camera 2 |--<
               +----------+   \  VC1 (zoomed in)
                               ---------->
               +----------+   VC0
               | Camera 1 |---------->
               +----------+

   VC0 - video zoomed in on 2 people   VC2 - video zoomed in on 2 people
   VC1 - video zoomed in on 2 people   VC3 - video zoomed out on 6 people
Simultaneous transmission sets can be expressed as sets of the VCs
that could physically be transmitted at the same time, though it may
not make sense to do so.
In this example the two simultaneous sets are:
o {VC0, VC1, VC2}
o {VC0, VC3, VC2}
In this example VC0, VC1 and VC2 can be sent OR VC0, VC3 and VC2.
Only one set can be transmitted at a time. These are physical
capabilities describing what can physically be sent at the same time,
not what might make sense to send. For example, in the second set
both VC0 and VC2 are redundant if VC3 is included.
In describing its capabilities, the provider must take physical
simultaneity into account and send a list of its simultaneity groups
to the consumer.
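A non-normative sketch of how a consumer might check a proposed
selection of captures against the provider's simultaneity groups
follows; the set contents are taken from the example above.

   # Non-normative sketch: a selection of captures is physically
   # sendable only if it is a subset of at least one advertised
   # simultaneous transmission set.
   simultaneous_sets = [{"VC0", "VC1", "VC2"},
                        {"VC0", "VC3", "VC2"}]

   def is_sendable(selection):
       return any(selection <= s for s in simultaneous_sets)

   assert is_sendable({"VC0", "VC2"})       # allowed by either set
   assert not is_sendable({"VC1", "VC3"})   # middle camera conflict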
7.2. Encoding Groups
The second aspect of multiple streams that must be understood by
senders and receivers in order to create the best experience
possible, i.e., for the "right" or "best" streams to be sent, is the
encoding characteristics of the possible streams that can be sent.
Just as there are constraints imposed on the multiple streams by
physical limitations, there are also constraints due to encoding
limitations.  These are described in an Encoding Group as follows.
An encoding group is an attribute of a video capture (VC) as
discussed above.
An encoding group has the variables shown in the following table.
+--------------+----------------------------------------------------+
| Name | Description |
+--------------+----------------------------------------------------+
| maxBandwidth | Maximum number of bits per second relating to a |
| | single video encoding |
| maxMbps | Maximum number of macroblocks per second relating |
| | to a single video encoding: ((width + 15) / 16) * |
| | ((height + 15) / 16) * framesPerSecond |
| maxWidth | Video resolution's maximum supported width, |
| | expressed in pixels |
| maxHeight | Video resolution's maximum supported height, |
| | expressed in pixels |
| maxFrameRate | Maximum supported frame rate |
+--------------+----------------------------------------------------+
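The maxMbps formula in the table can be made concrete with a short,
non-normative Python sketch; the "+ 15" terms round partial
macroblocks up to whole 16x16 macroblocks.

   # Non-normative sketch of the macroblocks-per-second formula above.
   def macroblocks_per_second(width, height, frames_per_second):
       return ((width + 15) // 16) * ((height + 15) // 16) \
           * frames_per_second

   # 1080p30 as used in the examples below:
   assert macroblocks_per_second(1920, 1088, 30) == 244800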
An encoding group is the basic method of describing encoding
capability. There may be multiple encoding groups per endpoint. For
example, each video capture device might have an associated encoding
group that describes the video streams that can result from that
capture.
An encoding group EG<n> comprises one or more potential encodings
ENC<n>. For example,
   EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000
Here, the encoding group is EG0.  It can transmit up to two 1080p30
encodings (the macroblocks-per-second value for 1080p30 is 244800),
and it is capable of transmitting a maxFrameRate of 60 frames per
second (fps).  To achieve the maximum resolution (1920 x 1088) the
frame rate is limited to 30 fps.  However, 60 fps can be achieved at
a lower resolution if required by the receiver.  Although the
encoding group is capable of transmitting up to 6 Mbit/s, no
individual video encoding can exceed 4 Mbit/s.
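These figures can be checked with the sketch above: two 1080p30
encodings exactly consume EG0's maxMbps budget, and a single
encoding's maxMbps of 244800 permits 1920 x 1088 at 30 fps or, for
example, 1280 x 720 at 60 fps.

   # Non-normative check of the EG0 arithmetic.
   assert 2 * macroblocks_per_second(1920, 1088, 30) == 489600
   assert macroblocks_per_second(1280, 720, 60) <= 244800  # 60 fps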
This encoding group also allows up to 3 audio encodings,
AUDIO_ENC<0-2>. It is not required that audio and video encodings
reside within the same encoding group, but if so then the group's
overall maxBandwidth value is a limit on the sum of all audio and
video encodings configured by the receiver. A system that does not
wish or need to combine bandwidth limitations in this way should
instead use separate encoding groups for audio and video in order for
the bandwidth limitations on audio and video to not interact.
Here is an example written with separate audio and video encode
groups.
   VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
   AUDIO_EG0: maxBandwidth=500000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000
The following two sections describe further examples of encoding
groups. In the first example, the capability parameters are the same
across ENCs. In the second example, they vary.
7.2.1. Sample video encoding group specification #1
An endpoint that has 3 similar video capture devices would advertise
3 encoding groups, each able to transmit up to two 1080p30
encodings, as follows:
   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
A remote receiver configures some or all of the specific encodings
such that:
o The parameter values configured for each active ENC<n> do not
  exceed that encoding's maxWidth, maxHeight or maxFrameRate
o The total bandwidth of the configured ENC<n> encodings does not
  exceed the maxBandwidth of the encoding group
o The sum of the "macroblocks per second" values of each configured
encoding does not exceed the maxMbps of the encoding group
There is no requirement for all encodings within an encoding group to
be activated when configured by the receiver.
Depending on the sender's encoding methods, the receiver may be able
to request fixed encode values or choose encode values below the
maxima offered.  We will discuss receiver behavior in more detail in
a section below.
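A non-normative Python sketch of the three receiver-side checks
above, reusing the macroblocks_per_second() helper from Section 7.2
(the field names here are invented for illustration):

   # Non-normative sketch: check a receiver's proposed configuration
   # (a list of encodings) against one encoding group's limits.
   def valid_configuration(group, encodings):
       within_each = all(e["width"] <= e["maxWidth"] and
                         e["height"] <= e["maxHeight"] and
                         e["frame_rate"] <= e["maxFrameRate"]
                         for e in encodings)
       total_bw = sum(e["bandwidth"] for e in encodings)
       total_mbps = sum(macroblocks_per_second(e["width"],
                                               e["height"],
                                               e["frame_rate"])
                        for e in encodings)
       return (within_each and
               total_bw <= group["maxBandwidth"] and
               total_mbps <= group["maxMbps"])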
7.2.2. Sample video encoding group specification #2

An endpoint that has 3 video capture devices whose capability
parameters vary across the potential encodings within each group
might advertise 3 encoding groups as follows, each with one
full-resolution encoding and two lower-resolution encodings:

   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000
       ENC2: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000
       ENC2: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000
       ENC2: maxWidth=1280, maxHeight=720, maxFrameRate=30,
             maxMbps=108000, maxBandwidth=1500000

A remote receiver configures some or all of the specific encodings
subject to the same constraints as in the previous example: the
parameter values configured for each active ENC<n> do not exceed
that encoding's maxWidth, maxHeight or maxFrameRate, and the
configured encodings together do not exceed the group's maxBandwidth
or maxMbps.  As before, there is no requirement for all encodings
within an encoding group to be activated when configured by the
receiver, and the receiver may be able to request fixed encode
values or choose encode values below the maxima offered.
8. Media provider behavior
In summary, the sender's capability announcement message includes
the following (a sketch of such a message follows the list):
o the list of captures and their attributes
o the list of capture sets
o the list of physical simultaneity groups
o the list of the encoding groups
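Purely as an illustration, the four lists could be carried in a
structure such as the following Python sketch; the field names are
invented here, and the actual message encoding is out of scope for
this draft.

   # Non-normative sketch of a capability announcement; the values
   # are drawn from examples elsewhere in this draft.
   announcement = {
       "captures": {
           "VC0": {"purpose": "main", "encoding_group": "EG0"},
           "AC0": {"purpose": "main",
                   "channel_format": "linear array"},
       },
       "capture_sets": [
           [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["AC0"]],
       ],
       "simultaneous_sets": [{"VC0", "VC1", "VC2"},
                             {"VC0", "VC3", "VC2"}],
       "encoding_groups": {
           "EG0": {"maxMbps": 489600, "maxBandwidth": 6000000},
       },
   }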
9. Putting it together - using the Capture Set
This section shows how to use the framework to represent a typical
case for telepresence rooms.
Appendix C includes an additional example showing the MCU case.
[Edt. It is in the Appendix just to allow the body of the document
to focus on the basic ideas.  It can be brought into the main text
in a later draft.]
Consider an endpoint with the following characteristics:
o 3 cameras, 3 displays, a 6 person table
o Each video device can provide one capture for each 1/3 section of
the table
o A single capture representing the active speaker can be provided
o A single capture representing the active speaker with the other 2
captures shown picture in picture within the stream can be
provided
o A capture showing a zoomed out view of all 6 seats in the room can
be provided
The audio and video captures for this endpoint can be described as
follows.  The encoding group specifications can be found above in
Section 7.2.2, Sample video encoding group specification #2.
Video Captures:

1. VC0 (the left camera stream), encoding group: EG0, attributes:
   purpose=main; auto-switched=no
2. VC1 (the center camera stream), encoding group: EG1, attributes:
   purpose=main; auto-switched=no
3. VC2 (the right camera stream), encoding group: EG2, attributes:
   purpose=main; auto-switched=no
4. VC3 (the loudest panel stream), encoding group: EG1, attributes:
   purpose=main; auto-switched=yes
5. VC4 (the loudest panel stream with PiPs), encoding group: EG1,
   attributes: purpose=main; composed=true; auto-switched=yes
6. VC5 (the zoomed out view of all people in the room), encoding
   group: EG1, attributes: purpose=main; auto-switched=no
7. VC6 (presentation stream), encoding group: EG1, attributes:
   purpose=presentation; auto-switched=no

Summary of video captures: 3 codecs; the center one (EG1) is used
for the center camera stream, the presentation stream, the
auto-switched streams, and the zoomed views.  [Edt. It is arbitrary
that for this example the alternative views are on EG1 - they could
have been spread out; it was not a necessary choice.]
Audio Captures:

o AC0 (left), attributes: purpose=main; channel format=linear
  array; linear position=0

o AC1 (right), attributes: purpose=main; channel format=linear
  array; linear position=100

o AC2 (center), attributes: purpose=main; channel format=linear
  array; linear position=50

o AC3, a simple pre-mixed audio stream from the room (mono),
  attributes: purpose=main; channel format=linear array; linear
  position=50; mixed=true

o AC4, the audio stream associated with the presentation video
  (mono), attributes: purpose=presentation; channel format=linear
  array; linear position=50
The physical simultaneity information is:
{VC0, VC1, VC2, VC3, VC4, VC6}
{VC0, VC2, VC5, VC6}
Any selection within one set can physically be sent at the same
time.  This is strictly what is possible from the devices.  However,
using every member in the set simultaneously may not make sense; for
example VC3 (loudest) and VC4 (loudest with PiP).  (In addition,
there are encoding constraints that make choosing all of the VCs in
a set impossible.  VC1, VC3, VC4, VC5 and VC6 all use EG1, and EG1
has only 3 ENCs.  This constraint shows up in the capture list, not
in the physical simultaneity list.)
In this example there are no restrictions on which audio captures can
be sent simultaneously.
The following tables represent the capture sets for this sender.
Recall that a capture set is composed of alternative captures
covering the same scene.  Capture Set #1 is for the main people
captures, and Capture Set #2 is for presentation.
+----------------+
| Capture Set #1 |
+----------------+
| VC0, VC1, VC2 |
| VC3 |
| VC4 |
| VC5 |
| AC0, AC1, AC2 |
| AC3 |
+----------------+
+----------------+
| Capture Set #2 |
+----------------+
| VC6 |
| AC4 |
+----------------+
Different capture sets are distinct from each other and
non-overlapping.  A receiver chooses a capture row from each capture
set.  In this case the three captures VC0, VC1 and VC2 are one way
of representing the video from the endpoint.  These three captures
should appear adjacent to each other.  Alternatively, another way of
representing the Capture Scene is with the capture VC3, which
automatically shows the person who is talking.  Similarly for the
VC4 and VC5 alternatives.
As in the video case, the different rows of audio in Capture Set #1
represent the "same thing", in that one way to receive the audio is
with the 3 linear position audio captures (AC0, AC1, AC2), and
another way is with the single channel monaural format AC3. The
Media Consumer would choose the one audio capture row it is capable
of receiving.
The spatial ordering is understood by the left to right ordering
among the VC<n>s on the same row of the table.
The receiver finds the "row" it wants within each capture set
section of the table.  It then configures the streams according to
the encoding groups for that row.
A Media Receiver would likely want to choose a row to receive based
in part on how many streams it can simultaneously receive. A
receiver that can receive three people streams would probably prefer
to receive the first row of Capture Set #1 (VC0, VC1, VC2) and not
receive the other rows. A receiver that can receive only one people
stream would probably choose one of the other rows.
If the receiver can receive a presentation stream too, it would also
choose to receive the only row from Capture Set #2 (VC6).
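For concreteness, the two capture sets in the tables above can be
written in the illustrative row-list form sketched in Section 6.3:

   # Non-normative sketch: the example sender's capture sets as rows.
   capture_set_1 = [["VC0", "VC1", "VC2"],   # three camera captures
                    ["VC3"],                 # loudest panel
                    ["VC4"],                 # loudest panel with PiPs
                    ["VC5"],                 # zoomed out view
                    ["AC0", "AC1", "AC2"],   # linear array audio
                    ["AC3"]]                 # pre-mixed room audio
   capture_set_2 = [["VC6"],                 # presentation video
                    ["AC4"]]                 # presentation audio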
10. Media consumer behaviour
The receive side of a call needs to balance its requirements, based
on number of screens and speakers, its decoding capabilities and
available bandwidth, and the sender's capabilities in order to
optimally configure the sender's streams. Typically it would want to
receive and decode media from each capture set advertised by the
sender.
A sane, basic, algorithm might be for the receiver to go through each
capture set in turn and find the collection of video captures that
best matches the number of screens it has (this might include
consideration of screens dedicated to presentation video display
rather than "people" video) and then decide between alternative rows
in the video capture sets based either on hard-coded preferences or
user choice. Once this choice has been made, the receiver would then
decide how to configure the sender's encode groups in order to make
best use of the available network bandwidth and its own decoding
capabilities.
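A minimal, non-normative sketch of such an algorithm, assuming the
row-list representation used in Section 9 and using capture name
prefixes to separate video rows from audio rows, might be:

   # Non-normative sketch: per capture set, pick the video row whose
   # size best matches the available screens, plus one audio row.
   def choose_rows(capture_set, num_screens, max_audio_streams):
       video = [r for r in capture_set if r[0].startswith("VC")]
       audio = [r for r in capture_set if r[0].startswith("AC")]
       chosen = []
       for rows, limit in ((video, num_screens),
                           (audio, max_audio_streams)):
           fitting = [r for r in rows if len(r) <= limit]
           if fitting:
               chosen.append(max(fitting, key=len))
           elif rows:
               chosen.append(min(rows, key=len))
       return chosen

   # A three screen receiver applied to Capture Set #1 above yields
   # [["VC0", "VC1", "VC2"], ["AC0", "AC1", "AC2"]].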
10.1. One screen receiver configuring the example capture-side device
above
A one screen receiver cannot usefully render the three-capture row
(VC0, VC1, VC2), so it would choose a single video capture from
Capture Set #1: VC3, VC4 or VC5, each of which represents the whole
scene in a single stream.  Which of these it picks could be based on
hard-coded preferences or user choice (for example, the
auto-switched loudest speaker view VC3 versus the zoomed out view
VC5).  For audio, it would choose the single pre-mixed capture AC3
if it cannot render the three channel linear array row.  It would
then configure the chosen captures' encoding group (EG1 in this
example) for the resolution, frame rate and bit rate that best match
its decode capability and available bandwidth.
10.2. Two screen receiver configuring the example capture-side device
above
Mixing systems with an even number of screens, "2n", and those with
"2n+1" cameras (and vice versa) is always likely to be the
problematic case. In this instance, the behaviour is likely to be
determined by whether a "2 screen" system is really a "2 decoder"
system, i.e., whether only one received stream can be displayed per
screen or whether more than 2 streams can be received and spread
across the available screen area. To enumerate 3 possible behaviours
here for the 2 screen system when it learns that the far end is
"ideally" expressed via 3 capture streams:
1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as
per the 1 screen receiver case above) and either leave one screen
blank or use it for presentation if / when a presentation becomes
active
2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2
   screens, either with each capture scaled to 2/3 of a screen and
   the centre capture split across the 2 screens or, as would be
   necessary if there were large bezels on the screens, with each
   stream scaled to 1/2 the screen width and height, leaving a 4th
   "blank" panel.  This 4th panel could potentially be used for any
   presentation that became active during the call.
3. Receive 3 streams, decode all 3, and use control information
indicating which was the most active to switch between showing
the left and centre streams (one per screen) and the centre and
right streams.
For an endpoint capable of all 3 methods of working described above,
again it might be appropriate to offer the user the choice of display
mode.
10.3. Three screen receiver configuring the example capture-side device
above
This is the most straightforward case: the receiver would look to
identify a set of streams to receive that best matches its available
screens, and so VC0 plus VC1 plus VC2 would match optimally.  The
spatial ordering gives sufficient information for the correct video
capture to be shown on the correct screen.  The receiver would then
either divide a single encode group's capability by 3 to determine
what resolution and frame rate to configure the sender with, or
configure the individual video captures' encode groups with whatever
makes most sense (taking into account the receive side decode
capabilities, overall call bandwidth, the resolution of the screens,
plus any user preferences such as motion versus sharpness).
10.4. Configuration of sender streams by a receiver
After receiving a set of video capture information from a sender and
making its choice of what media streams to receive based on the
receiver's own capabilities and any sender-side simultaneity
restrictions, the receiver needs to essentially configure the sender
to transmit the chosen set.
The expectation is that this message will enumerate each of the
encoding groups, and the potential encoders within those groups,
that the receiver wishes to be active (this may well be a subset of
the complete set available).  For each such encoder within an
encoding group, the receiver would specify the video capture (i.e.,
VC<n> as described above) along with the specifics of the video
encoding required, i.e., width, height, frame rate and bit rate.  At
this stage, the receiver would also provide RTP demultiplexing
information as required to distinguish each stream from the others
being configured by the same mechanism.
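A non-normative sketch of such a configure message for the three
screen receiver of Section 10.3, with invented field names and the
RTP demultiplexing details elided:

   # Non-normative sketch: one configured encoding per camera
   # capture, each within the limits advertised for its encoding
   # group in Section 7.2.2.
   configure = [
       {"group": "EG0", "encoding": "ENC0", "capture": "VC0",
        "width": 1920, "height": 1088, "frame_rate": 30,
        "bandwidth": 2000000, "rtp_demux": None},  # demux elided
       {"group": "EG1", "encoding": "ENC0", "capture": "VC1",
        "width": 1920, "height": 1088, "frame_rate": 30,
        "bandwidth": 2000000, "rtp_demux": None},
       {"group": "EG2", "encoding": "ENC0", "capture": "VC2",
        "width": 1920, "height": 1088, "frame_rate": 30,
        "bandwidth": 2000000, "rtp_demux": None},
   ]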
10.5. Advertisement of capabilities sent by receiver to sender
In order for a maximally capable sender to be able to advertise a
manageable number of video captures to a receiver, there is
potential value in the receiver informing the sender of its
capabilities at the start of the CLUE exchange.  One example would
be the video capture attribute set: a receiver could tell the sender
the complete set of video capture attributes it is able to
understand, and the sender could then tailor the capture set it
advertises to that receiver.
11. Acknowledgements
We want to thank Stephen Botzko for helpful discussions on audio.
12. IANA Considerations
TBD
13. Security Considerations
TBD
14. Informative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261,
June 2002.
[RFC4353] Rosenberg, J., "A Framework for Conferencing with the
Session Initiation Protocol (SIP)", RFC 4353,
February 2006.
[RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
January 2008.
[StageDirection(Wikipedia)]
Wikipedia, "Blocking (stage)", May 2011,
<http://en.wikipedia.org/wiki/Stage_direction#Stage_directions>.
Appendix A. Attributes
This section discusses the attributes and their values in more
detail, and many have additional details provided elsewhere in the
draft. In general, the way to extend the solution to handle new
features is by adding attributes and/or values.
A.1. Purpose
A variable with enumerated values describing the purpose or role of
the Media Capture. It could be applied to any media type. Possible
values: main, presentation, others TBD.
A.1.1. Main
The audio or video capture is of one or more people participating in
a conference (or where they would be if they were there). It is of
part or all of the Capture Scene.
A.1.2. Presentation
A.2. Audio mixed
A.3. Audio Channel Format
The "channel format" attribute of an Audio Capture indicates how the
meaning of the channels is determined. It is an enumerated variable
describing the type of audio channel or channels in the Aucio
Capture. The possible values of the "channel format" attribute are:
o linear array (linear position)
o stereo
o TBD - other possible future values (to potentially include other
things like 3.0, 3.1, 5.1 surround sound and binaural)
All ACs in the same row of a Capture Set MUST have the same value of
the "channel format" attribute.
A.3.1. Linear Array
An AC with channel format = "linear array" has exactly one audio
channel. For the "linear array" channel format, there is another
required attribute to specify position within the array. This is the
"linear position" attribute, which is an integer value within the
range 0 to 100. 0 means leftmost, 100 means rightmost, with other
values spaced equally between. A value of 50 means in the center,
spatially.  Any AC can have any value; even multiple ACs in a
capture set row can have the same value.  The 0-100 linear position is
intentionally dimensionless, since we are presuming that receivers
will use different sized video displays, and the audio spatial
location can be adjusted at the receiving side to correspond to the
displays.
The linear position value is fixed until the receiver asks for a
different AC from the capture set, which may be triggered by the
provider sending an updated capture set.
The streams being sent might be correlated (that is, someone talking
might be heard in multiple captures from the same room). Echo
cancellation and stream synchronization in receivers should take this
into account.
With three audio channels representing left, center, and right:
AC0 - channel format = linear array; linear position = 0
AC1 - channel format = linear array; linear position = 50
AC2 - channel format = linear array; linear position = 100
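As a non-normative receiver-side sketch, the dimensionless linear
position can be mapped onto the horizontal extent of the rendered
video; the pixel-based width here is an assumption of the example.

   # Non-normative sketch: map linear position (0 = leftmost,
   # 100 = rightmost) onto the width of the rendered video.
   def audio_x_offset(linear_position, render_width):
       return (linear_position / 100.0) * render_width

   # Three screens of 1920 pixels each (5760 total): AC1 at linear
   # position 50 renders at the center, x = 2880.0.
   assert audio_x_offset(50, 5760) == 2880.0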
A.3.2. Stereo
An AC with channel format = "stereo" has exactly two audio channels,
left and right, as part of the same AC. [Edt: should we mention RFC
3551 here? The channel format may be related to how Audio Captures
are mapped to RTP streams. This stereo is not the same as the effect
produced from two mono ACs one from the left and one from the right.
]
A.3.3. Mono
An AC with channel format="mono" has exactly one audio channel.
This can be represented as an audio linear position with a single
member at a single integer location.  [Edt. Mono can be represented
as a particular case of a linear array with one element.]
A.4. Audio Linear Position
An integer valued variable from 0 - 100, where 0 signifies the left
and 100 signifies the right.
A.5. Video Scale
An optional integer valued variable indicating the spatial scale of
the video capture, for example centimeters for horizontal image
width.
A.6. Video composed
An optional Boolean variable indicating whether the VC is
constructed by composing multiple other video captures together into
a single stream with multiple panes.  (This could indicate for
example a continuous presence view of multiple images in a grid, or
a large image with smaller picture-in-picture images in it.)
A.7. Video Auto-switched
A Boolean variable indicating whether the offered VC is
auto-switched: its content varies depending on some rule, switching
between possible source captures.  The most common example of this
is sending the video capture associated with the "loudest" speaker
according to an audio detection algorithm.
Appendix B. Spatial Relationship
Here is an example of a simple capture set with three video captures
in one row and three audio channels in another row:
(VC0, VC1, VC2)
(AC0, AC1, AC2)
The three ACs together in a row indicate those channels are spatially
related to each other, and spatially related to the VCs in the same
capture set.
Multiple Media Captures of the same media type are often spatially
related to each other. Typically multiple Video Captures should be
rendered next to each other in a particular order, or multiple audio
channels should be rendered to match different speakers in a
particular way. Also, media of different types are often associated
with each other, for example a group of Video Captures can be
associated with a group of Audio Captures meaning they should be
rendered together.
Media Captures of the same media type are associated with each other
by grouping them together in a single row of a Capture Set. Media
Captures of different media types are associated with each other by
putting them in different rows of the same Capture Set.
For video the spatial relationship is horizontal adjacency in one
dimension. So Video Captures can be described as being adjacent to
each other, in a horizontal row, ordered left to right. When VCs are
grouped together in a capture set row, it means they are horizontally
adjacent to each other, such that when more than one of them are
rendered together they should be rendered next to each other in the
proper order. The first VC in the group is the leftmost (from the
point of view of a person looking at the rendered images), and so on
towards the right.
[Edt: Additional attributes can be added, such as the ability to
handle two dimensional array instead of just a one dimensional row of
video images.]
Audio Captures that are in the same Capture Set as Video Captures
are spatially related to them: the multiple audio channels should be
rendered so that the overall audio field covers roughly the same
horizontal extent as the rendered video.  This gives a reasonable
spatial correlation between audio and video.  A more exact
relationship is out of scope of this framework.
B.1. Spatial relationship of audio with video
A row of audio is spatially related to a row of video in the same
capture set. The audio and video should be rendered such that they
appear spatially coincident. Audio with a linear position of 0
corresponds to the leftmost side of the group of VCs in the same
capture set. Audio with a linear position of 50 corresponds to the
center of the group of VCs. Audio with a linear position of 100
corresponds to the rightmost side of the group of VCs.
Likewise, for stereo audio, the spatial extent of the audio should be
coincident with the spatial extent of the corresponding video.
Appendix C. Capture sets for the MCU Case
This shows how an MCU might express its Capture Sets, intending to
offer different choices for receivers that can handle different
numbers of streams.  A single audio capture stream is provided for
all single and multi-screen configurations; it can be associated
(e.g., lip-synced) with any combination of video captures at the
receiver.
+--------------------+---------------------------------------------+
| Capture Set #1 | note |
+--------------------+---------------------------------------------+
| VC0 | video capture for single screen receiver |
| VC1, VC2 | video capture for 2 screen receiver |
| VC3, VC4, VC5 | video capture for 3 screen receiver |
| VC6, VC7, VC8, VC9 | video capture for 4 screen receiver |
| AC0 | audio capture representing all participants |
+--------------------+---------------------------------------------+
If / when a presentation stream becomes active within the conference,
the MCU might re-advertise the available media as:
+----------------+--------------------------------------+
| Capture Set #2 | note |
+----------------+--------------------------------------+
| VC10 | video capture for presentation |
| AC1 | presentation audio to accompany VC10 |
+----------------+--------------------------------------+
Authors' Addresses
Allyn Romanow
Cisco Systems
San Jose, CA 95134
USA
Email: allyn@cisco.com
Mark Duckworth
Polycom
Andover, MA 01810
US
Email: mark.duckworth@polycom.com
Andrew Pepperell
Cisco Systems
Langley, England
UK
Email: apeppere@cisco.com
Brian Baldino
Cisco Systems
San Jose, CA 95134
US
Email: bbaldino@cisco.com
Mark Gorzynski
HP Visual Collaboration
Corvallis, OR
USA
Email: mark.gorzynski@hp.com