Minutes for CLUE at interim-2012-clue-2

Meeting Minutes ControLling mUltiple streams for tElepresence (clue) WG
Title Minutes for CLUE at interim-2012-clue-2
State Active
Last updated 2012-06-19

Meeting Minutes

CLUE WG Interim Meeting (June 7-8, 2012)
Stockholm, Sweden
Hosted by Ericsson
Meeting summary by Mary Barnes (version 4, June 19, 2012)

Detailed meeting notes by Bo Burman, Keith Drage, Charles Eckel, Roni Even, Rob
Hansen, Andy Pepperell, Allyn Romanow and Magnus Westerlund

1) There can be multiple entries for the "text" attribute in the capture scene,
each with a unique "language" attribute, in the framework and data model.

2) Remove appendices from framework. A.1 is out of scope. A.2 and A.3 are
related to A.1, thus out of scope. A.4 and A.5 have been superseded by
individual drafts which capture the issues and solution options in more detail.
Any consensus around these proposals will be considered in terms of
additions/updates to the framework, as appropriate.

3) Add an optional element to the framework (and data model) for the "axis of
capture" to aid in properly rendering in 3D scenarios.

4) RTP topologies: support per Magnus' presentation: p2p, distributed endpoint,
3 types of mixers.

5) Content-type: reference RFC4796, describe and limit the values (TBD as to
whether it's just "main" and "slides") that are used & specify semantics
specific to CLUE.  Must ensure that the element values are extensible.

6) Data model: general agreement with the basic approach, format and content of
the data model.

7) Criteria: will not continue working on this.  Information will be used to
inform the decision when we work through the signaling solution.

Action Items:
- Magnus: send text to clarify concerns with regard to the framework and
  encoding groups (per slide 6 of FW presentation).
- Roni: work with Jonathan to produce one RTP document including topologies.
- Mark: update the framework based on discussions/conclusions. Remove
  appendices per discussion.
- Stephan: send text that needs to be added to the framework for item 3) above.
- Charles: once new RTP doc is available, forward to SIPREC as FYI and for any
  feedback.
- Mary: send link to IMTC document to CLUE.
- Allyn: develop and submit an initial call flow document for discussion.
- Roni: submit use cases for switched capture.

New Issues/Tickets:
- FW needs more detail on switched capture (BUT this needs to wait until there
  is more agreement on the RTP mapping and usage).
- Content-type: decide whether we have a value in SDP and, if so, describe how
  things work with the value in the framework.
- Add a ticket for VAD.

CLUE Interim meeting minutes 09:00-12:00 (Thursday, June 7, 2012)
Notetakers: Magnus Westerlund, Keith Drage

Notes by Magnus Westerlund

9:10 – Framework
Mark Duckworth Presenting

Consensus on allowing multiple scene descriptions to provide alternative
language versions of the description.
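
As an illustration only, the agreed behaviour (several descriptions per
capture scene, at most one per language) could be sketched as below. The class
name, method names and language tags are hypothetical and are not taken from
the framework or the data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CaptureScene:
    """Hypothetical capture scene holding one description per language."""
    scene_id: str
    # Maps a language tag (e.g. "en") to the human-readable description.
    descriptions: Dict[str, str] = field(default_factory=dict)

    def add_description(self, language: str, text: str) -> None:
        # Each language may appear only once in a scene.
        if language in self.descriptions:
            raise ValueError(f"duplicate language: {language}")
        self.descriptions[language] = text

    def describe(self, preferred: List[str]) -> Optional[str]:
        # Return the first description matching the consumer's preferences.
        for language in preferred:
            if language in self.descriptions:
                return self.descriptions[language]
        return None


scene = CaptureScene("main-room")
scene.add_description("en", "Main conference room")
scene.add_description("sv", "Stora konferensrummet")
print(scene.describe(["fr", "sv", "en"]))  # falls back from "fr" to "sv"
```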

Question about the Content attribute: what values are allowed? All in the
registry or a limited set applicable to CLUE use cases? Roni Even: we should
describe the ones that are applicable. Botzko: we need to describe
interoperability. Espen: we may for once have the possibility to use some that
have been difficult before. Jonathan: some values don’t make sense, but it
depends on where we use them. If we use them in SDP there is a different
backwards compatibility story compared to the data model. For backwards
compatibility we don’t need to use all. We should clarify what they mean, as
the RFC is vague on definitions. Roni Even: we don’t get interoperability by
just including the content tags. Keith: if we define something, we are creating
an alternative definition. Thus the choices are to create new values with the
definitions we want or to update the RFC. Allyn: it causes confusion if you
allow multiple tags, including the ones that are confusing. Which of main and
speaker should be used? CLUE doesn’t need that distinction at this stage.
Putting this issue on the list of issues needing future discussion.

Magnus asked about simulcast, i.e. providing multiple encodings of a particular
media capture. Magnus thinks the language is not clear. Jonathan is worried
about whether this works or creates a combinatorial explosion. Roni thinks it
works but you need to take care when describing multiple encodings.

Roni: we need to clarify that simultaneous sets apply across all capture
scenes. Stephen B thinks we have a bit of a problem with simultaneously
possible. Mark responded that encoding groups do have simultaneous
applicability. Andy: the group constraints provide a limit over the
combinations.

Charles asked whether a capture can only have audio or video. Then you appear
to be limited in describing. To be taken later.

Jonathan: if you have 15 different configurations, do you need 15 different
encoding groups? Andy: no, you rather use the group constraints to ensure the
boundaries and then the consumer asks for what it wants.
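
A minimal sketch of the idea that group constraints bound the combinations
while the consumer simply asks for what it wants. It assumes that each
encoding has an individual maximum and that the group has a total maximum,
both expressed as bandwidth in kbps. The class, field names and numbers are
hypothetical, not taken from the framework.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EncodingGroup:
    """Hypothetical encoding group: per-encoding limits plus a group-wide cap."""
    max_per_encoding_kbps: Dict[str, int]  # encoding id -> individual maximum
    max_total_kbps: int                    # limit across the whole group

    def accepts(self, request_kbps: Dict[str, int]) -> bool:
        # Every requested encoding must exist and respect its own maximum.
        for enc_id, kbps in request_kbps.items():
            limit = self.max_per_encoding_kbps.get(enc_id)
            if limit is None or kbps > limit:
                return False
        # The sum of all requested encodings must respect the group maximum.
        return sum(request_kbps.values()) <= self.max_total_kbps


group = EncodingGroup({"enc1": 4000, "enc2": 4000, "enc3": 4000},
                      max_total_kbps=6000)
print(group.accepts({"enc1": 4000, "enc2": 2000}))  # within both limits
print(group.accepts({"enc1": 4000, "enc2": 4000}))  # exceeds the group total
```

With this shape the provider does not enumerate 15 configurations; it states
the limits once and any request inside them is acceptable.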

Ticket #8 is an instance of Ticket #10. Espen: given that you make an end-point
with two screens. If you don’t really care it would be good to provide the
consumer with the provider’s preferred set. Roni: why put effort into this? The
consumer can always choose something. Andy: we shouldn’t get into the thinking
that the provider knows better than the consumer. That will complicate the
model. More discussion.

Switched capture example. Keith: why is the switch a problem for the CLUE
system? Jonathan will talk about this later. Mary: do we need to add the
proposal to the framework? Appears not, based on mailing list feedback. Roni
commented that you don’t know how a media capture is represented on RTP level.
Charles thinks we should have something in the framework. Stephen: let the RTP
discussion mature a bit.

A.1 video layout arrangement. Allyn thinks there are two separate issues in
this slide. Mark: this is within a single piece of composed media capture,
there is a higher layer composition. That resolved Allyn’s concern and she is
fine with removing the text. Keith: if there is only announcement configuration
and no negotiation this can’t happen. The meeting agrees to remove.

Christer Holmberg, A.1 does not prevent source selection. But if A.2 and A.3
are being removed we can’t do source selection.  After discussion agreed that
source selection is currently out of scope for clue. It is for future and wider

A.4: remove it from the framework and continue discussion in the individual
draft until agreement exists in the WG to include it in the solution/framework.

A.5: Roni says that the VAD issues and the audio rendering tag are not related.
Jonathan clarified that there are two issues and both are discussed in the
framework. Will create a new ticket for the VAD and then discussion continues
about the audio rendering tag.

Switched Attribute
Questions if

Roni: isn’t this what Magnus asked before about simulcast? Andy: no, that is
already supported by the framework.

Paul: when you introduce presentation this gets messy. Put it in another
scene. Andy: it may still be a capture. If one stops presenting then one would
remove the capture. There may be multiple capture scenes.

Need for having multi source synch.

Charles: How do you specify N number of captures provided?

Jonathan: why do the N most important captures need to be tied to switching?
Why not have a capture entry list explicitly indicate that it is a
prioritized list?

Stephen supports Jonathan: should be explicit, rather than implicit.

Magnus asked if there isn’t a number of different sets of conceptual captures.

Roni, asked if you are not trying to realize a full mesh camera. Nothing
prevents you from doing a full mesh conference. Andy, responded that there is

Jonathan thinks it is an important feature, but it is being done the wrong way.
It also forces too much logic/policy into the middlebox. It is important to
ensure the right balance between the middlebox doing things and providing
enough information for the end-point.

Charles: how strict is the ordering in this set? Andy: if a provider knows
that a consumer gets 4 streams, then it doesn’t reorder the VCs within the set
it provides, it only replaces the ones that are active.

Stephen Botzko thinks this is a new kind of capture type. Troubled by the
consumer providing the layout to the provider. Andy: there are differences in
the layout that affects what. Stephen: prefer that we provide the information
and the consumer can act on the situation, rather than being told by provider.

Roni Even, Andy: the case to resolve is the case when the most significant room
is changing between a 3-camera room and a one-camera room.

How does the consumer compose? Andy: based on the assumption that a consumer
can decode multiple streams and scale and compose the media according to desire.

No conclusions about this draft. Will be part of later priority discussion.

Audio tag
Stephen B thinks tags are probably a good thing. But for audio you probably need
to be able to address individual channels.

Jonathan, the simple cases can probably be done with RTCP. For the complex
cases it is insufficient.

Espen, a question of how to associate audio and video.  Andy: the consumer
tells the provider  how to associate this by the tag.

Roni: is the assumption that each audio has its own SSRC?
Stephens case

Stephen: the issue is that switching in the middle is going to lose information.

Jonathan, this appears to be quite similar to the previous presentation. We
need the information.

Keith, I am concerned that we can’t really treat the directionality information
for audio like video and cameras. The microphones are more omnidirectional and
will pick up audio

Conclusions: Interesting problem but not clear if it is the right solution.
Continue discussion.

Notes by Keith Drage:
   Thursday, June 7th
   09:00-09:10 Status, agenda bash (Chairs)

   09:10-09:55 Framework: draft-ietf-clue-framework-05.txt
     1) Overview of changes (Mark Duckworth) (15)
     2) Open issues (Mark, WG) (30)


Mark Duckworth presenting

Slide 3

Magnus - Do we allow multiple language attributes or just a single one?
Mark: Currently only allow one; however, there has been a suggestion to allow
multiple scene descriptions, each with a different language. Magnus and Roni
support. General conclusion of room: adopt.

Charles - Content attribute currently dependent on RFC 4796 - do we use all or
limit only to those applicable to CLUE?
Mark: No resolution to previous discussion.
Roni: Values can be used by applications. Unless you have something that
describes what they mean, there are interoperability issues.
Allyn: Just limit our usage to two tags.
Stephen: Defining interop doesn't mean limiting the usage of other values.
Espen: Support values already there.
Jonathan: Some of the 4796 values don't make sense in CLUE.
Keith: Side issue: note that sign language doesn't give which version is used.
Main point: either need to update 4796 if we decide to use SDP; if one doesn't
do this, then one is essentially defining a parallel set of values that is not
4796.
Roni: Main and speaker are being used in some systems right now.

Come back to this issue in more detail later.

Slide 6

Magnus: How to capture simultaneous transfer in multiple encodings?
Mark: Associate media capture in a group that has more than one encoding in it.
Andy: Elaborated
Jonathan: Possible explosion of combinations - work through a complicated case.

Roni (back on slide 5): Encoding group - clarify set is across all capture
scenes, not one.
Stephen: Based on this structure, do have a problem if want to indicate
multiple simultaneous encodings.
Mark: Everything in the encoding group has to be simultaneous.
Andy: Encoding groups can either be specified with an individual max or say a
total max is ...

Slide 7

Captured for discussion later

Slide 8

Espen: Wants to be able to let the provider indicate the best option to choose.
Roni: ...
Mark: First part of first bullet framework already allows.
Andy: Should not capture two screens out of three where the scene offers three.
Also not clear that the provider should tell the consumer to use.

Slide 9

Cover later in RTP discussion.
Chair indicates nothing to add to framework at this time.
Charles: One of the other drafts talks about this which is Jonathan's draft.

Slide 10

Proposal to remove from appendix, as no longer an issue. Meeting agreed to

Slide 11

Proposal to remove these issues. Ticket #1 has been closed. Meeting agreed to

Slide 12

See Hansen draft and deal with the issue based on that draft as that is a
better explanation. Remove this issue as a result. Meeting agreed to remove.

Slide 13

See Romanow draft on audio-rendering which is a better explanation of the
problem. Roni: Not the same problem. Meeting agreed to remove A.5 but raise a
new ticket to cover the Roni issue.

  09:55-10:15  Framework: Proposals

   10:15-10:30: Break (with coffee/snack)

   10:30-11:30 Framework: Proposals

     3) Switched Capture attribute & spatial coordinates (Andy Pepperell)


Andy Pepperell presenting.

Slide 4

Paul: Value judgement that 1 through 4 is better than 5.

Slide 5

Paul: When include presentation then gets confusing. Particularly if in same

Slide 8

Jonathan (Supported by Steve): This is a new attribute. Overloading "switched"
is not the way to go.

Steve: Do not want to see a new advertisement every time someone joins the

Slide 16

Magnus: Need to identify what the conceptual capture is. Get the impression
have several different types. Site switching, segment switching.

Roni: Looks like using multistream work in order to achieve a meshed conference.
Andy: This is not about all the streams going to everyone.
Jonathan: Don't like the way of proposing coordinates, although this is an
important feature.
Charles: How strict is the ordered list?
Steve: Thinks this is another kind of capture scene entry rather than
"switched". Not really enabling a full meshed conference. Is concerned about
the layout. Knowing the layout is not putting enough control on the layout to
the endpoint. Does not want the provider to make the decision that a 2 x 2
layout is required - endpoint decision.
Roni: Letting the consumer create some displays that can be

     1) Audio Rendering Tag (Andy Pepperell) (20)


Andy Pepperell presenting.

Steve: In general having RTP tags is a good idea. Two specific things to keep
in mind. Need to create tags for the individual channel, e.g. stereo. Two
screens with audio channel tag.

Espen: How to associate audio with video. There is only implicit knowledge
between audio and video.

Roni: Does each audio have its own SSRC? If so, then can map using SRC name.

Andy: If have multiple cameras then is the relation of audio defined.

Thursday, June 7, 2012 3:00-18:00 (Notetakers: Andy, Allyn/Rob)

Notes by Andy Pepperell
Note: some gaps due to trips to the microphone.

Afternoon session

Rob: Consumer spatial information

The need for switching: conference may have 100+ participants, 1000+ captures
Consumers generally want to receive the active speakers
It may not be possible to provide the spatial information from the originator
to all receivers

Multiscreen layout concerns – switching 3 captures out to consumers could go
wrong if render order not known. Solved by provider-side spatial information.

Roni: not sure I understand what you want to achieve here. No problem for the
provider to provide co-ordinates.

Rob: MCU ↔ MCU case still problematic, as neither consumer nor provider in a
position to give co-ordinates to the other party

Gyubong: How does provider use the consumer “Area of Display” information? Rob:
if the provider knew the consumer was rendering a 2x2 layout it would know not
to put the top right capture next to the bottom left (i.e. it would know they
were not adjacent).
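
Rob's 2x2 example can be illustrated with a small adjacency check. This is a
sketch only: it assumes the consumer's layout is a regular grid whose cells
are numbered row by row, which is an assumption made here and not something
the draft specifies. The function name is hypothetical.

```python
def horizontally_adjacent(index_a: int, index_b: int, columns: int) -> bool:
    """True if two grid cells sit side by side in the same row."""
    row_a, col_a = divmod(index_a, columns)
    row_b, col_b = divmod(index_b, columns)
    return row_a == row_b and abs(col_a - col_b) == 1


# 2x2 layout, cells numbered row by row:
#   0 1
#   2 3
print(horizontally_adjacent(0, 1, columns=2))  # top left / top right
print(horizontally_adjacent(1, 2, columns=2))  # top right / bottom left
```

Cells 1 and 2 are consecutive in the numbering but not adjacent on screen,
which is the information the provider lacks without the layout.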

Roni: Consumer needs to convey information to the provider to get a meaningful
advertisement. Provider won't send correct advertisement unless it has layout
information from consumer. If you give the user control over layout, they will
change it every second, and CLUE should cope.

Jonathan: we need to figure out whether the layout is controlling the streams
being sent (e.g. groups) or the other way round, i.e. layout is determined by
who's loudest.

Charles: could use chunks of ordered speakers so the provider could advertise
up to 25 loudest speaker captures organised in chunks of 3.

Keith: how much is reverse advertisement or negotiation, and can we re-use the
provider advertisement on the consumer side?

Rob: we did consider various cut-down neighbor-oriented protocols but the
number of additional attributes etc. needed made it less complex to just go for
a co-ordinate scheme instead.

Jonathan: really only makes sense for real displaying endpoints rather than
cascaded cases. Need to consider receiver-driven case and if it can also apply
to MCU cascading case.

Roni: can look at what MCUs do today when forming layouts, and apply that. They
look at the windows they want and 1+5 etc. layouts. It's up to them to know how
to build it, and it's a similar thing we want here, right streams, right
resolutions etc. Andy: that works well until some sources have multiple
captures (camera streams) which have adjacency restrictions, at which point
some layouts become invalid.

Steve: Is “area of display” a number of screens hint, or a layout? Rob: it's
related to layouts and not necessarily physical displays. Steve: so it applies
to MCU cases too? Rob: yes, but there are still problem cases.

Paul: in the cascaded case, I'd presume that the receiving MCU would want all
raw captures not switched captures? Rob: problem with bandwidth and not being
able to receive all the raw captures, so restricted to a loudest speaker subset.

Stephane: whiteboard drawing: 3 camera room with 3 cameras in a linear row, but
an unusual screen arrangement on the receiving endpoint, 2 screens above a 3rd.
Without render-side co-ordinates, MCU might send out 3 captures which would be
rendered wrongly. Andy: believe it's an invalid example because consumer
wouldn't ask for 3 captures “in a row” if it knew its 3 screens were so
arranged, and in fact this is more like a 2 screen endpoint with a separate
presentation screen.

Charles: imagine a vertical list of captures with the most important at the
top. Could ask for, say, 10 cells in blocks of 3 and the MCU knowing not to
split capture groups across rows. Andy: would you need to be updated when, say,
a 4 camera system joined? Charles: yes.
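
One possible reading of Charles's "cells in blocks of 3" idea, sketched as a
greedy packing: groups are taken in order of importance and a group is never
split across rows. The function name, the skip rule for groups wider than a
row, and the sample data are assumptions made for illustration.

```python
from typing import List, Tuple


def pack_rows(groups: List[Tuple[str, int]], row_width: int,
              max_cells: int) -> List[List[str]]:
    """Place capture groups, most important first, never splitting a group."""
    rows: List[List[str]] = [[]]
    used_cells = 0  # cells consumed so far, counting padding left at row ends
    for name, count in groups:
        if count > row_width:
            # Cannot fit in any row. The notes suggest the consumer would
            # need to be updated in this case; the sketch just skips it.
            continue
        if len(rows[-1]) + count > row_width:
            # Start a new row; remaining cells in the old row stay empty.
            used_cells += row_width - len(rows[-1])
            rows.append([])
        if used_cells + count > max_cells:
            break
        rows[-1].extend(f"{name}/{i}" for i in range(count))
        used_cells += count
    return [row for row in rows if row]


# Sites ordered by importance, each with its number of camera captures.
sites = [("A", 3), ("B", 1), ("C", 3), ("D", 2)]
for row in pack_rows(sites, row_width=3, max_cells=10):
    print(row)
```

In this run site D is left out because the cell budget is exhausted once the
padding after site B is counted.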

Steve: could see how when we ask for a capture scene the consumer could provide
a layout description. More about layouts than physical monitors.

Stephane: you might want both: information on physical characteristics of
monitors, where they are, what angles they're at, etc. and perhaps also want
the wishful thinking of the consumer, how he would like to see things arranged.

Paul: does this have to be done at the time of selection or can it be a

Stephane: might choose to allocate different areas of my screen for video
conference, and change this over time. The equivalent of “where are the screens
nailed to the wall” can change dynamically, so needs to be at selection time
rather than a new capability.

Espen: given that the user chooses a type of layout, that might not be the best
choice if you dial into a conference with a 3 camera endpoint. Might need to
factor in some information on who's in the conference. None of the examples
have covered what information you need up front to decide what layout is best.

Roni: thinking about if it's a capability or a selection, it depends on how
it's going to be used.

Paul: it seems like the proposal is to change the model in a fundamental way.
Looks like we're talking about removing the capability message and moving it
into the selection message.

Stephan: Axis of capture
Room layouts: 2 different room layouts. One way is with multiple cameras in the
centre of the room, capturing a semi-circle. The other way is with a camera
attached to each screen in a straight line.

If you display a picture captured from a side panel of an “ellipsoidal” room on
a “linear” room side panel the picture becomes distorted (specifically, a 20
degree angle error).

Proposal: information about the capture axis allows consumer-side geometric
correction. An axis in 3D space is defined by two 3D points. We already have an
(optional) point for the camera position. Add one more optional point in the
camera definition, the “axis of capture point” in 3D space, solely to define
the axis of capture. So as long as the point of capture is there, the addition
of a single 3D co-ordinate will give the axis of capture, and the renderer can
correct. Just using the center of the area of capture doesn't work – the new
axis of capture point does not need to be on the plane of the area of capture.
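
The geometry behind the proposal can be sketched as follows: two 3D points
give a direction, and the renderer can compare that direction with the one it
would otherwise assume. The function names, the coordinate convention (y
pointing straight ahead) and the sample points are assumptions for
illustration, not part of the proposal.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def axis_direction(point_of_capture: Point, axis_point: Point) -> Point:
    """Unit vector from the camera position toward the axis-of-capture point."""
    delta = tuple(a - p for p, a in zip(point_of_capture, axis_point))
    length = math.sqrt(sum(c * c for c in delta))
    if length == 0:
        raise ValueError("the two points must differ to define an axis")
    return tuple(c / length for c in delta)


def angle_between_degrees(u: Point, v: Point) -> float:
    """Angle between two unit vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


camera = (0.0, 0.0, 0.0)
# Axis point chosen so the camera is rotated 20 degrees away from straight ahead.
axis_point = (math.sin(math.radians(20)), math.cos(math.radians(20)), 0.0)
assumed = (0.0, 1.0, 0.0)  # what a "linear" room renderer would assume
actual = axis_direction(camera, axis_point)
print(round(angle_between_degrees(assumed, actual), 1))  # correction in degrees
```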

Christer: PURPOSE: Define signaling criteria, in order to determine what
transport mechanism is most suitable for transporting CLUE related information.
Non-CLUE specific information will be transported using existing mechanisms

Keith: do we mean “Whether the CLUE information and the media description (SDP)
need to be in the same dialogue” rather than message. I don't think the
advertisement needs to be a dialogue.

Allyn: Data Model. Andy did an initial version for previous interim, but
organised in terms of messages rather than data structures. Newer version, with
Mary's input, organised more in terms of basic information structures.

Top-level “CLUE-info” element; includes capture-description, capture-scene,
simultaneous-sets, stream-description, encoding, encoding-group. Not all
messages would use all elements. Charles: in encoding-group, do you have audio
or video encodings or is it an and? Allyn: this is just a structure definition,
once you come to use the structure it gets concretized to inclusion of actual

Paul: looks to be an example of an XML document rather than a schema for an XML
document (which would be harder to read). If I was to map this to XML, there
would be a top-level CLUE-info structure in each message, with messages'
elements drawn from the set of defined elements. Would now need to construct
individual messages' definitions.
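
Paul's reading (each message carries a top-level CLUE-info structure whose
children come from one shared set of defined elements) could be sketched like
this. The element names are the ones listed in these notes; the function name
and the dictionary representation are hypothetical.

```python
from typing import Any, Dict

# Element names as listed in the notes for the top-level CLUE-info element.
DEFINED_ELEMENTS = frozenset({
    "capture-description", "capture-scene", "simultaneous-sets",
    "stream-description", "encoding", "encoding-group",
})


def build_clue_info(elements: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Wrap a subset of the defined elements in a top-level CLUE-info structure."""
    unknown = set(elements) - DEFINED_ELEMENTS
    if unknown:
        raise ValueError(f"undefined elements: {sorted(unknown)}")
    # Not every message uses every element, so any subset is accepted.
    return {"CLUE-info": dict(elements)}


message = build_clue_info({"capture-scene": [], "encoding-group": []})
print(sorted(message["CLUE-info"]))
```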

Roni: mapping RTP streams to CLUE media captures. Jonathan: should be clear
that SDP describes RTP sessions, not streams. Multipoint signalling is based on
centralized conference server using one of the RFC5117 topologies: Topo-Mixer,
Topo-video-switch-MCU or Topo-RTCP-Terminating-MCU. Magnus: you seem to be
trying to define how to put media captures into SDP. Rob: if using fixed SSRCs
in SDP, middle boxes need to modify SSRC values, and so need to re-encrypt when
using SRTP (rather than just re-authenticate, which is cheaper)

Jonathan: Captures from “same room” need to be synchronized. Andy: could be
some cases where streams (even within a “main” capture scene) might not be
synchronizable. Can provide encoding ID in RTP header extension, or use a
consumer-chosen ID here instead (to allow the consumer to put some structure on
the IDs for fast filtering, and to allow encoding IDs to potentially be more
verbose). Steve: what about transcoded audio plus forwarded video?

Magnus: RTP topologies. Signaling can restrict what topologies are supported,
and thus the functionality a CLUE system may have. Circular dependencies.

Mary: issues for discussion on Friday: do we need capabilities exchange?

Notes by Allyn Romanow
Afternoon June 7

Rob, Consumer Spatial Information
His draft has a specific point and solution.
Switching is needed. Don’t have spatial information.
Incorrect rendering can occur.
If you don’t have Left Right Center, you don’t know the order in which to put
the captures. Could split up people incorrectly. If you have switched without
coords, don’t know how. Switched with coords works fine. Issues though:
increases the number of offers and the number of possible layouts.

Roni- what’s the issue?
If the consumer wants 12 streams laid out 3 above, rest below.
Rob - What about 2x2?
We don’t have the facility to allow the consumer to specify. That is what is
being proposed here. Roni thinks we should not talk about it here. This is not
composed. Roni is not sure we should be discussing this.

Steve B – in pt to pt case, coords enough for rendering. With multipoint we
lose this info. want to have provider do layout, get info to him to have him do
it. Steve wants consistency.  Doesn’t want to do things in different ways  for
switched captures and non-switched captures.

Stefan – generally supports Rob’s proposal. Brings up cascaded MCUs vs where
there is a receiving endpoint. They don’t have the same information.  Makes it
necessary to distinguish between endpoint consumer and middlebox consumer

Andy -Consumer side and provider side coords.  If mcu to mcu neither side has
real coords. Needs more work.

Rob – consumer capability message  could be another way to solve the problem

Roni- makes sense to say the number of screens the consumer has. He sees the
need for capability message in order to form the provider advertisement. Layout
inside a screen is not fixed in time. Would there be a later opportunity to
change? Rob – that’s why he wants to provide information  from the consumer
request- because of real-time changes. How often does layout change? User
pushing buttons.

Jonathan- which is cause and which is effect?  Does the source depend on layout
or layout depend on source?  Are we looking for a sender gets it right or
receiver gets it right?
 Rob - this is provider gets it right and receiver provides hints

Charles - sketches a strips and chunks approach. Up to N streams, ordered list,
switched. Send 10 streams of chunk size 3. Put forth by provider. Rob – we
wanted a more abstract approach. Don’t think provider driven strips would work.

Espen – what kind of info does provider need? Consumer says 1 large and one
group. wants 2 and 2 groups.  What does the receiver need to know to group
streams properly? Rob – number of attributes to describe strips is large,  can
figure it  out using coords – no need for explicit grouping. Lets provider know
which streams should be put where. Nothing new in protocols.

Jonathan- Only makes sense for endpoints,  not for cascaded cases. Doesn’t want
2 solutions -- one for cascaded and one for endpoints. A receiver based
solution would do both. Where the receiver sends the spatial info in real time.
We should think about it. If one solution works for cascaded case, we should do

Roni – what does MCU need to do to build layout? - find out what size of
window, what image resolution.  Then it is up to them how to build it. This Is
a similar thing. Get streams at right resolutions, then you can build it.

Steve B- prefers spatial constraints on the ordered list on sender side. Layout
within each display? Or of the display?

Paul K – in cascaded case. Doesn’t MCU want raw captures not switched? Rob –
bandwidth constraints may make it not possible.

Stefan example.  Not supported without getting info about geometry on receiver

Charles – Consider a different scheme, one with a prioritized group, specify
number of screens and layout from the list. Rob - this assumes adjacency only.
Andy – this constrains the layouts the consumer can do. What happens when a 4
screen endpoint joins? It changes all the layouts. An MCU works this way, but
it has all the knowledge. Here we are one step away. Would have to have new
messages to say what the maximum number is as it changes.

Roni – agrees with Stefan.

Steve B. – when ask for a capture scene optionally provide a layout. Can see
how it helps.

Paul K- does this need to be done at selection time or can it be as a consumer
capability message before the provider advertisement? Stefan – the placements
on a monitor are not nailed down, for a big screen. Very dynamic so needs to be
at selection time. Rob - Needs to be based on what’s available, which is
learned from the provider advertisement.

Espen-what kind of information do you need up front?
Rob – this should be a subsequent discussion
Roni – a capability or selection? Depends on how it is going to be used. He
thinks it helps provider to know what to offer. Paul – seems like proposal is
to change model in a fundamental way and get rid of capability message. Rob- he
hasn’t said anything about capability message. Paul- restricts functionality to
those entries that are switched.

Keith- we are discussing capability vs selection message, but he doesn’t know
what these mean. Comment – they are in the framework

Andy – doesn’t think this is changing fundamental nature of things. Also wrt
the initial capability message, we haven’t discussed it yet. We want an
asynchronous message for this purpose. Selection, already has the  max
resolution and hints for layouts and video encoding, etc. He doesn’t think the
proposal fundamentally changes  the model or makes the  case for a need for the
capability message.

Mary’s recap of specific points.
Allyn says do we have consensus that this is an important issue?  Even though
we don’t have agreement on how it should be solved.

Stefan Task #9 Axis of camera
2 different arrangements of cameras, screens, tables, etc. for 2 different
telepresence systems. In one, it’s arranged in a semicircle; in another it is
flat to the wall with the presentation screen on top. In the semicircle, 3
cameras are in the center. In the flat arrangement, the cameras are not in the
center but are far away from each other. We want to make these 2 systems
interoperable. If one captures from center and if other assumes it’s a flat
scenario, then the rendering is 20 degrees off the capture angle. People are

Stefan’s proposal – if we know the axis of capture, the geometric relation of
the cameras, we can do render side geometric correction. Only the render side
knows its own display technology.

We have already in our capture data structure the point of capture. Add an
optional additional point in camera definition “axis of capture point” in 3D
space. Solely to define axis of capture.

Need an additional data point.
Details need more defining.

Stefan - a function of the camera itself. Pick any point on the line and it
defines the line. Keith - not sure if we need to do this. Stefan – need to know
the problem exists in order to do something about it. Andy - works well. But
there are issues with middle boxes not carrying the data through. We have to
figure it out.

Data Model, Christer
Define signaling criteria in order to determine which transport mechanism is
best for transporting CLUE related info. Signaling and transport criteria –
what he proposed on the ML. Is this useful? Do we already know?

Keith – Asks whether clue info and media description in SDP need to be in same
message. Why message? Does it need to be a dialogue?

Allyn Clue Data Model

Allyn: Andy put together an initial data model constructed in terms of
messages. Mary didn't think constructing it in this fashion was optimal, and
suggested instead structuring it after RFC 6501 (XCON).

Allyn: Two initial points. Firstly, the aim is to decide whether this is a
sensible methodology for using the model that we want to continue with. The
second was that the model includes a few new elements, which are listed in
capital letters- we shouldn't consider this now, it's a distraction. Ignore for

Allyn: The initial description may appear to match the provider advertisement,
and some may wonder where the consumer request is; remember that these aren't
messages, but are the elements from which messages will later be defined.

Allyn now goes on to describe the capture description. This contains a number
of elements, which aren't defined as belonging to specific messages, but are
instead usable in a range of messages. The description includes a new element
DERIVED, which is an evolution of the original 'composed' element. A second new
element, part of 'spatial-description', is NATIVE-ASPECT-RATIO, which was not

Allyn: Any thoughts so far?

Allyn now describes elements in , which contains elements matching the
framework. Recently 'capture-scene-text' and 'capture-scene-language' have been
added. 'capture-scene-spatial-description' has also been added based on
previous discussion.

Next,  entry is quickly covered, which has its own element because it doesn't
fit into anywhere else.

Finally,  is discussed, which contains  elements. Jonathan notes that  should
be in caps, as it comes from a draft and isn't part of a framework.
AUDIO-RENDERING-ID is also a new element, based on the audio tagging draft.

Roni: Wouldn't it make sense to divide this structure into codec-specific and
general attributes, particularly because of the max-H264-Mbps element?

Allyn: We'll need to discuss that specifically separately, though it sounds

There was a question about encoding-id and whether it appeared in SDP - others
opined that no, it was to match it to captures.

Finally,  contained a set of elements.

Allyn: I think we should establish if this is a good representation of the work
we've been doing

Jonathan: I haven't gone through it in detail, but capitalised sections aside
it seems a faithful representation of the framework document

Charles: I thought  was at the capture-scene level? Though I like it better

Others said that this wasn't the case and that media-type must only be
consistent between capture scene entries.

Mark: I agree with Jonathan that this is good. But these don't include the
messages - when the messages are defined will they be part of the data model?

Mary: No, they will be separate but will draw on the data model.

Paul: This is similar to an XML document rather than the schema for an XML
document; it's easier to read but less tightly defined. My only concern is that
because messages will contain some elements but not others this is closer to
schema definition, with a separate document to define how the messages are
constructed from these elements.

Roni Mapping RTP Streams
Is there the same or different SSRC when zooming? In usual conferencing, not in
CLUE? Roni says same SSRC.
Comment: if the stream continues, not if there are stops and starts.
Impossible to distinguish between a single device that zooms in and out and two
devices, one zoomed in and the other zoomed out.
A given SSRC can provide one or more video captures over time, which fits with
the switched model.
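
One way to picture a single SSRC carrying different captures over time is a
per-SSRC timeline of switch events. The sketch below is illustrative only: the
class name, the capture labels (VC0, VC1) and the timestamps are assumptions
made here, and nothing in it reflects an agreed CLUE mechanism.

```python
import bisect


class SsrcCaptureTimeline:
    """Tracks which capture a single SSRC carries over time (hypothetical)."""

    def __init__(self):
        self._times = []     # switch times, kept sorted
        self._captures = []  # capture carried from the matching time onward

    def switch(self, time, capture_id):
        """Record that the SSRC starts carrying capture_id at the given time."""
        index = bisect.bisect_right(self._times, time)
        self._times.insert(index, time)
        self._captures.insert(index, capture_id)

    def capture_at(self, time):
        """Return the capture carried at the given time, or None if none yet."""
        index = bisect.bisect_right(self._times, time) - 1
        return self._captures[index] if index >= 0 else None


timeline = SsrcCaptureTimeline()
timeline.switch(0, "VC0")
timeline.switch(30, "VC1")
timeline.switch(45, "VC0")
```

A receiver holding only the SSRC cannot tell from the stream alone whether the
change at time 30 was a zoom on one device or a switch to another, which is the
ambiguity raised in the discussion.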

Jonathan- We are doing multiplexing SSRCs, single RTP session
Need for multiple RTP sessions sometimes – for backward compatibility,
decomposed hardware

Andy- wouldn’t backward compatibility be best with a few mlines. Not
introducing extra mlines. Roni wants 1 mline per simultaneous capture.

Jonathan RTP Usage for CLUE- what’s new in draft since Paris
•       Added a new requirement: the need to synchronize even for switched
captures
•       Description of his architecture
•       Proposed Architecture
Media requirement #12-  need for synchronizing
Correlate advertisement and requests …
Multiple transport flows
Coupling between sources and captures for switched captures is loose
Loose and dynamic

Single RTP session on a single UDP transport per media type.  Source

MCU is a translator for static and switched, and an RTP mixer for locally
generated composited captures
 Steve – if audio is mixed but video is switched, this case needs to be
 described and examined.

Switched – means provider chosen, not necessarily forwarded rather than coded

RTP Topologies Magnus
Differences between topologies and issues
Signaling topologies and functionality all interrelate.
Functionality->signaling->topologies ->

What RTP functionality a given topology enables
Derive correct requirements for signaling

Evaluation criteria
•       Security- key management, who has the keys, source authentication
End to end verifiable
Trust in central node

•       Congestion control
One or multiple receivers of the same RTP stream
Media aggregate adjustments

•       Bandwidth consumption
•       Media quality
•       Distribution of complexity

Topologies outline-  list
•       Point to point
•       Distributed end-point
•       Mesh multi-unicast
•       Media-mixer – own SSRC
•       Media switching mixer- this is what Polycom does. The handling of SSRCs
varies. That is why it’s relevant
•       Source projection mixer- this is what Vidyo does
•       Relay (transport translator), RFC 5117. With mixers, trust middle box.
Here in common session. Anyone can claim to be another source. Relay could do
some basic checking. But could do more. TESLA (RFC 4082) or similar is needed.
Won’t see media content.
There is possibility of congestion if any are congested it creates. Have to
share capacity on all paths. Have to explicitly manage. Share bandwidth.
Complexity - all in endpoints. Not negotiable by offer answer
•       Selective forwarding switch. Not supported today by RTP. Switch turns
individual source on and off based on policy. Congestion – detection confused
by disappearing and reappearing sources. If we want to do this, we need to
extend RTP. Heard of people wanting to implement this. Not need keys but
optimize content
•       End-point forwarding. Not for CLUE, for rtcweb.
•       Any source multicast
•       Source specific multicast. Lecture hall use case from use case doc,
maybe, more work needed

Do we need to select supported topologies?
Does clue signaling need to take all into consideration?
Jonathan- some of the CLUE assumptions of how to use MCU rules out some of the

Summary and plans for tomorrow
Mary’s notes
Do signaling and call flow tomorrow

1.      Do we need consumer capability message?
2.      Content type – what should we do? Options
3.      Relationship between capture scene entries p.13 framework doc
4.      Ticket #8. How to differentiate between different capture scenes
5.      Switched capture – discuss solution options
6.      Consumer layout – Rob

Data Model
1.      Evaluation of criteria against data elements
2.      Differentiate between encoding parameters from codec specific
parameters?
3.      Do we agree with the basic approach, format and content of
the data model

1.      What topologies should we support?

Mary will email the list, wants answers by 6 am.
Start at 8:45

Friday, June 8, 2012 9am-noon  (Notetakers: Charles Eckel, Roni Even)

Notes by Charles Eckel

9am start of meeting

Agenda Bash
Mary: Should we have a discussion on call flows
Discussion: We are not ready/prepared to have a useful discussion.
Action Item: Allyn to prepare draft to guide discussion at future

Framework (FW):

1) Do we need a consumer capability message?
           Media Consumer                         Media Provider
           --------------                         ------------
                 |                                     |
                 |----- Consumer Capability ---------->|
                 |                                     |
                 |                                     |
                 |<---- Capture advertisement ---------|
                 |                                     |
                 |                                     |
                 |------ Configure encodings --------->|
                 |                                     |

Discussion regarding historical purpose and current thoughts on value of
consumer capabilities messages.
How many people think we need the message, as currently defined in
framework: 0
How many people think we should remove it: 13
How many people think we need more information: 7
Action Item: Take to the list to validate removal

5) Switch capture:
- discuss proposed solution options:
- e.g., suggestion to use a new attribute for ordering - versus
overloading attributes.
- encoding groups


6) Consumer layout and capture selection (i.e., discuss Rob's proposal)

Andy's proposal (summary): based on overloading switched attribute plus
the absence of any spatial relationship of media captures to mean that
set 'n' media captures is something that can be rendered in a useful
fashion even if only a subset of the media captures within the capture
scene is received.
Rob and Andy agree that this and Rob's layout draft both address the
same problem, and both propose similar solutions.
Consumer has ability to convey its layout preferences such that provider
can try to align what it sends.
Overloading switched this way is problematic because very possible to
have capture scene with combination of media captures both with and
without spatial information.
Paul and Steve: this is a model change from provider advertising and
consumer choosing to consumer advertising and provider filling.
This adds flexibility at the expense of consistency.
RTP/RTCP can be used to convey actual spatial information, outside of
provider advertisement or consumer capabilities message. However, this
requires waiting some amount of time to learn this information before
making video visible; else may need to relocate or change layout.
Consumer layout coordinate is attempt to minimize or remove this need
while facilitating rendering video in real-time.
Action Item: Andy and/or Rob to update drafts to include use
case/requirements/problem statement to help clarify the problem they are
trying to solve; then have discussion of solution.

4) Ticket #8
How do consumers differentiate between multiple capture scenes?
Paul sent an e-mail on this previously, no response.
Discussion raised more problems, no solutions.
Is user selection based on textual description sufficient - no.
Action Item: request group to review Paul's e-mail and provide comments.

Lunch at noon.
End of notes.

Notes by Roni Even

Agenda bashing
Should we discuss call flow.
Christer – are we ready for this work.
It looks like people are not sure how to progress
Allyn volunteered to write a draft for next meeting with a call flow proposal.
Do we need a consumer capability message.
Discussed at the microphone
Paul: Trying to reduce the set of configuration in the advertisements.
Roni: we will need a consumer request message if no capability
Jonathan: need some basic capabilities, like I can do CLUE, clues version XX
Keith: You can start with either of the message
Andy: Initial thought was for the consumer to say which attributes it supports.

Summary: some people think that we do not need the consumer capability. We need
to have flexibility in the advertisement and configuration. For the sense of
the room: Remove the consumer capability message from the framework – 13. 0
leave it. 6 do not think we can decide now.

Switched capture:
Andy / Mark discussion: On the spatial relation similar to what is suggested in
Rob’s draft which describes the consumer layout. Andy: the distinction is if
the provider can provide based on the offered layout (Rob’s) which may cause
the provider for example in the case when the layout is 2 by 2 to not send a 3
camera view that cannot be divided while Andy suggests that the consumer will
choose the views and build the layout based on the priority from the provider.

The discussion is about what information should be conveyed, who decides what
to send, and how you know what happens when a switch occurs and respond to it
fast enough. We need to have the application usage agreed and then the optional
solution. There is Rob’s and Andy’s draft. Roni proposed another way that is
based on source selection. Roni to post use case/s

Ticket #8
Paul – example of different meeting room
Roni – example of medical, Rob – example of security
Question: Is user selection based on the text description enough to address the
topic. Need use cases.

Friday, June 8th, 2012 - 13:00-16:00  (Notetakers: Bo Burman, Rob Hansen)

Notes by Bo Burman
Issue prio 4: RTP Topologies
Topology media mixer supported. New topologies media switching mixer and the
source projection mixer, supported?
Jonathan: Support all these models if you want to support certain
functionalities.
Roni: Support at RTP layer is different than on application layer.
Magnus: Think we need to support all three. Main difference between media
switching and source projection is what you have to do to support identity. You
need additional information about identity and capture in both cases.
Roni: How to handle RTCP?
Magnus: Mixer will have a role where it has to manipulate RTCP information. It
is partly implementation choices.
Roni: Will need to describe RTCP behavior.
Magnus: Yes. There is a congestion control WS, including BoF, in Vancouver.
Jonathan: There are cascading issues and all CLUE-enabled boxes will have to
work together. The topologies that CLUE decides to support will not only have
to work independently, but will also need to work together. The only
interesting case is the switched capture.
Jonathan: The fixed case collapses to a switched which never switches.
Paul: Also a media mixer that does not mix or transcode.
Magnus: Andy or Rob made the point yesterday that it will need to support
different topologies simultaneously. They degenerate together, which is one of
the reasons that we need to support all of them.
Paul: Does it matter to the receiver which model the sender uses?
Magnus: Slightly. At least the number of simultaneously active SSRCs differs.
Mary: We need to reflect this in a document. Roni and Jonathan work together
and update the existing document. Magnus will contribute by reviewing.
Paul: Do we need different RTP extensions for the different models to map RTP
to media captures? That should be investigated and go into the document.
Jonathan: We also need to ensure that the signaling supports multiple
transports, like for the distributed end-point or for different QoS.
Magnus: We seem to be in agreement for distributed end-point. What about mesh?
Are there CLUE information problems there? Think that we need to discuss it.
Jonathan: Don’t think there will be CLUE problems. We may have to discuss it.
Keith: So far charters excluded mesh. My assumption is that we should exclude
it. My understanding is that all work made in DCON excluded mesh.
Gonzalo: There was not enough energy in the DCON BoF to start mesh work.
Paul: Consensus call: humming unanimously opposes bringing up support for mesh
in CLUE.
Mary: If someone thinks we should support, they need to bring a draft.
Magnus: I would like to conclude that the type of Relay described in the RTP
Topologies presentation is not in scope.
Charles: The end-point forwarding reminds me of siprec. Will look into that.
Jonathan: We probably want to consider how we siprec CLUE.
Mary: We could put an issue on the tracker.
Keith: Suggest a paragraph in the framework that siprec can start to work with,
since it is not in CLUE charter.
Mary: We have no use case and nothing in the charter. Siprec likely does not
have it in their charter either.
Paul: At some level siprec could already deal with CLUE RTP data. It has
extensible mechanism to deal with metadata.
Roni: siprec was motivated by people doing recording systems, not end-points.
Charles: some of CLUE complexity does not apply to siprec. Regarding RTP
models, the thing discussed here will be a good starting point for siprec,
probably based on an updated CLUE RTP draft.
Keith: I believe we should have text in framework.
Mary: Disagree. When we have a draft we could send that fyi to siprec.
Jonathan: One endpoint in CLUE is a recording device.
Mary: When we have call signaling and rtp outlined for CLUE we could send fyi
to siprec.
Magnus: What about multicast?
Paul: I don’t know anyone that made multicast work with SIP.
Keith: It is described in H.361, but was never updated to really make it work.
Magnus: Summary, p2p, distributed end-point, all three mixer models should be
supported in CLUE.

Issue prio 5: Content Type

 *   Borrow 4796 and refine semantics in clue
 *   Update 4796 with additional semantics to ensure interop
 *   Reference imtc document, if published informational / AD-sponsored document
 *   Something new

Charles: If it was an accessible IMTC document, would that be sufficient? It is
a separate issue.
Mary: Yes. It would be good if Charles could look into that.
Roni: How do we define values? If separate from 4796, we can define other
values. Do we signal the information in two places, in SIP and elsewhere?
Keith: What do we need to have the same values for?
Jonathan: The IMTC document discusses procedures to handle fallback. The one
benefit it would have is for CLUE to be backwards compatible, which is somewhat
different than defining values.
Steve: IMTC procedure is a legacy interop procedure.
Paul: If that is a requirement, we need to write it down, explicitly.
Mark: Is there yet another issue with the IMTC document, using BFCP? Do we need
BFCP floor control, even when not talking with other CLUE end-point? Seems to
be agreement.
Keith: Want link to document (Charles posted it to mailing list). If we need
new semantics or new values for content type, that should be in 4796bis, not in
CLUE.
Paul: Whatever you call it, is there an implication to use floor control that
has a certain type, but not for other types?
Charles: Don’t think so. If people see use for new values, they are welcome to
define them.
Jonathan: I think there is an issue with BFCP. Don’t know how to say that in
CLUE, since CLUE side is all provider side and consumer side. Don’t know how to
hook a BFCP to a capture in CLUE.
Keith: BFCP is getting special roles. What does that imply regarding media from
that end-point? It has to be defined per media type.
Charles: Talking about BFCP is premature at this point in time, even before we
have a data model.
Keith: We can progress CLUE signaling semantics independently from what is
available in SDP today.
Jonathan: Don’t know how to fit use of BFCP into our framework. That is a
problem. IMTC document has nothing to do with this. If we define a structure
where this is impossible, that would be a problem.
Mary: Floor control is considered out of scope for CLUE.
Roni: That was for conference control, not for handling presentations.
Allyn: Agree with Jonathan that our framework should allow using BFCP. Propose
putting it as a Task item. Want to use only “main” and “slides”, “sl” and
extensions (not “speaker”), subset of 4796.
Keith: Valid to look into how to support existing floor control in CLUE, not to
define our own floor control.
Paul: Ask Jonathan to write something of the BFCP concerns to the mailing list.
Mary: Reference 4796, describe and limit the values that are used and define
semantics for CLUE.
Keith: The only way of doing this is to update 4796.
Roni: That CLUE defines more precise semantics does not require changing
anything in 4796.
Mary: a) reference, b) update, c) something new (don’t reference). Raise of
hands: 11 a, 1 b, 1 c. Majority support option a.
Roni: Do we have the value in the SDP?
Keith: Think we should start fresh regarding content attribute for CLUE.
Mary: People don’t want alt and speaker, want main and slides, and are mostly
neutral on sl (sign language). Only use values main and slides; raised hands 8.
We concluded that we’re going to reference 4796, but don’t have full consensus
and will take it to the list.

Issue prio 6: Do we agree with the basic approach, format (structure) and
content of the data model? Mary: Is this a good start? 12 raised hands.

Issue prio 7: Relation between capture scene entries
Charles: How is the capture scene entry represented, especially in relation to
a capture scene? Allyn: See the draft.

Issue prio 8: Evaluation of criteria against data elements
Christer: Is this data model to be seen as a single entity, or should we
start to map it to various protocols? Mary: Will not continue working on this.
We record also criteria in framework and in the data model, when we work
through the signaling solution.

Issue prio 9: Relationships between capture scene entries [some confusion in
which order we discuss issues here!]
Charles: Some media captures could be more useful than others, how can the
provider announce that? See mail.
Mark: Think AC0,1,2 could be more useful if we pursue audio tag draft.
Andy: Why not use 0-2 if you have those rendering capabilities? Choice of VC3
or 4 could be user-based, what he/she wants. Choice of AC3 may be bandwidth.
Roni: VC4 can be identified as being full room from spatial information. All
combinations may be useful.
Jonathan: Agree with Charles. Consumer needs sufficient information.
Steve: You can distinguish since you have the spatial information. VC3 and
AC0-2 can be good for a mobile or limited device.
Charles: Saw it as a problem. Some people see it as a problem, some not. Need
to think about it.
Rob: Maybe VC3 is less appropriate for spatial audio than VC4 since VC3 is
switched, meaning that sometimes the spatial audio will come from the left,
sometimes right, but video will always be front. We would need to express that
we get a segment switched into a single video.
Roni: Renderers are allowed to use the information differently.

Conclusions (notes outside of conclusions as presented to the meeting)
Roni: Move conclusions switched capture priority to below call flows.
Mary: Update milestones to reflect that use cases and requirements were decided
to be left open until we have a framework that is mature enough.

Notes by Rob Hansen (13:00-16:00, June 8, 2012)

1pm: Meeting restarts

Topic: What RTP topologies should we support?

Identified three topologies as in use:

-media mixer
-media switching mixer
-source projection mixer

no one initially felt any of these three could be discarded as not desirable to

the RTCP behavior for all the topologies was fully defined

Jonathan brought up that cascading was an issue, and that it was important that
in CLUE middle boxes with different topologies should work together - there was
general agreement on this

Paul brought up the question of whether a given topology would be used for
different use-cases in CLUE, or whether all of CLUE should function with each
topology. Generally it was felt that the latter was the case, though.

It was proposed that the media mixer topology had a strong connection with
offering composed captures in CLUE and was the only sensible way to do it, so
the difficult case was that switching could be done either by media switching
or source projection.

Magnus reminded people that Andy brought up the point that a consumer might
receive packets from multiple topologies (eg, both switched and composed
mixers). People opined that this was true, but that hopefully it didn't make
sense for a single middle box to combine media switching and source projection
mixer modes.

Paul wanted to know if the consumer had to operate differently to cope with
different topologies - Magnus said that there would be slight differences based
on source identification.

Mary asked if all of this was documented - Roni said no, but that he and
Jonathan would work on a single document to support those topologies,
potentially with Magnus and Allyn's help.

there was consensus that we would need to support the three topologies
initially stated, and that the document would examine the mechanisms for doing
this, along with ensuring that the methodology would work for scenarios such as
decomposed captures.

Magnus asked if there was interest in supporting a multi-unicast mesh
topology. There was general agreement that this would require little change on
the RTP level. There was debate at what this would mean at the SIP level, and
Keith pointed out that within IETF mesh topologies were
currently out of scope. A consensus call was made, and there was unanimous
consensus that this was not something we should pursue. The chairs stated that
this conclusion would be put on the list, and anyone who disagreed should also
be willing to do some of the work on this issue.

Charles believed that the end-point forwarding topology shouldn't be
immediately dismissed, and that there was a larger question of how SIPREC and
CLUE would interact. Mary pointed out that there was no current use-case for
recording, and questioned whether there was a requirement to add one. Roni
suggested that SIPREC was designed to be transparent at the SIP level, and that
it was their responsibility to add support for CLUE if they felt it had. There
was discussion of the next step to take; Mary proposed that once we had made
progress this be sent to SIPREC for review to see if the RTP and signalling
would fit well with SIPREC.

Multicast was proposed and there was general feeling that no one wanted to
explore this.

As such the valid topologies for consideration were concluded to be
point-to-point, distributed endpoint and all three mixer models.

1:50pm - Topic: content-type

There was discussion on the approaches for making changes to content-type.
Options included defining semantics for RFC 4796 for CLUE, updating RFC 4796
with semantics, finding a way to publish the updated IMTC document on the
issue, or doing something new entirely.

There was discussion of whether there was actually value in using RFC
4796; for fields in CLUE there's a less strong requirement to reuse RFC 4796.

It was concluded that while we have a general requirement to interoperate with
current SIP devices, it would be valuable to explicitly write down that we
want to ensure that existing methods for content in SIP should continue to work
when connected to a CLUE device.

Charles said that he would look at making the IMTC document publicly

Paul asked if there should be a requirement for media types labelled as
'presentation' to have BFCP linkages. This led into a discussion of how BFCP
and CLUE would interact, though there were no firm conclusions and it was felt
that we were too early to come to a conclusion. Jonathan was concerned that he
couldn't see any way to attach BFCP floors to the CLUE framework, and that
before we got too much further we should ensure that we weren't designing an
architecture that was incompatible with BFCP floors. It was felt that those
with time and interest should look at how BFCP and CLUE would interact.

Returning to RFC 4796 there was procedural debate about how, if we referenced
RFC 4796, the best way to do so in IETF.

A hum was taken with three options:

a) reference RFC4796, describe and limit the values that are used & specify
semantics specific to CLUE. Must ensure that the element values are extensible
b) update RFC4796 with semantics for the purposes of CLUE
c) something new (values and semantics) (don't reference RFC 4796)

there was strong, but non-unanimous consensus for (a).

Having done so there was a consensus call for whether we should limit the
RFC4796 values to 'main' and 'slides'. Those who expressed opinions agreed, and
no one actively disagreed, but some in the room were concerned that there
wasn't yet sufficient information, and as such there wasn't sufficient support
for this to be considered consensus.
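
The option (a) approach, combined with the proposed restriction to 'main' and
'slides', can be sketched as a small validator. This is an illustration under
stated assumptions: the names RFC4796_VALUES, CLUE_ALLOWED and
check_content_value are hypothetical, the restriction itself did not reach
consensus at the meeting, and the extensions argument merely stands in for
whatever extensibility mechanism is eventually chosen.

```python
# Values defined by RFC 4796 for the SDP content attribute.
RFC4796_VALUES = {"slides", "speaker", "sl", "main", "alt"}

# Subset proposed in the room (no consensus yet); hypothetical name.
CLUE_ALLOWED = {"main", "slides"}


def check_content_value(value, extensions=frozenset()):
    """Classify a value as 'allowed', 'extension', 'excluded' or 'unknown'."""
    if value in CLUE_ALLOWED:
        return "allowed"
    if value in extensions:
        return "extension"
    if value in RFC4796_VALUES:
        # Defined by RFC 4796 but outside the proposed CLUE subset.
        return "excluded"
    return "unknown"
```

Keeping the RFC 4796 set separate from the CLUE subset reflects the idea of
referencing the RFC rather than updating it: CLUE narrows the values it uses
without changing what the RFC defines.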

2:40 - break

2:55 - Topic: Whether the approach, format and content being taken with the
data model.

Allyn provided a bit of clarification and then there was a call for consensus
on whether people thought the current document was a good starting point. There
was consensus that this was a good approach, though consensus was not strong
enough for this to be made into a clue document without further iterations.

It was agreed that the requirements criteria would not be folded into the data
model, as there isn't good overlap between them. It was agreed that creating
the requirements criteria was useful and that it would serve as a good
reference when making protocol decisions, and could help to resolve disputes.

3:10 - Topic: Charles' query on selecting capture scene entries.

Charles questioned if there was enough information at present to choose between
capture scene entries, and if so what the criteria of selection would be.
Specifically, it was a question that involved how you would differentiate
between a composed view giving a view of the entire room, and a switched
capture showing one-third of the room at any given time; in the first case it
makes sense to have spatial audio left-centre-right, while for the latter it
may make more sense to centre everything to mono.

3:30 - Wrap up

Mary presented a set of conclusions, action items and new issues/tickets. A
list giving the way forward for documents was presented.

There was a brief discussion of SCTP over UDP as there was some concern that
this had been discussed but not documented anywhere. The discussion resolved
that there was not yet a conclusion that there would be a separate data
channel, but that if we did use a separate data channel it would be a strong

The way forward for documents was:

Framework (Mark to make updates)

Data model (Allyn to make further progress)

RTP: Roni and Jonathan to work together and include topologies (Magnus to

Call Flows: Allyn will put together an initial call flow document

Switched Capture: need to further discuss this

Audio Rendering tag: more discussion needed