Minutes from CLUE Interim


Date:    Tuesday Oct 11 – Wednesday Oct 12, 2011

Location:         Polycom office, 100 Minuteman Road, Andover, Massachusetts

Chairs:            Mary Barnes, Paul Kyzivat

Note Takers:    Allyn Romanow, Roni Even, Spencer Dawkins

Minutes Editor:           Paul Kyzivat & Mary Barnes

Recorded playback:    Tuesday / Wednesday:

Attendees (in no particular order):

Stephan Wenger, John Leslie,  Mark Duckworth, Roni Even, Mary Barnes, Paul Kyzivat, Gonzalo Camarillo, Jonathan Lennox, Steve Botzko, Spencer Dawkins, Andy Pepperell, Brian Baldino, Hadriel Kaplan,  Allyn Romanow,  Bo Burman, Gyubong Oh, Marshall Eubanks 

Webex:   Alan Johnston, Espen Berger, Gerard Fernando, Shida Schubert, Stephane Cazeaux


·      Use Cases:   The pre-WGLC review for draft-ietf-clue-telepresence-use-cases did not garner a lot of WG feedback.  We really need folks to thoroughly review this document.

·      Requirements:  draft-ietf-clue-telepresence-requirements is about ready for a pre-WGLC review.  We also need to consider how comprehensive the document needs to be and whether everything in the solution must be mapped back to the requirements.  The general sense was that the latter is not required.

·      Framework:    draft-romanow-clue-framework-01 was agreed as a WG draft (pending additional WG feedback on mailing list)

·      General:

      Based on the inter-dependencies between the various documents, we will likely progress them as a set.   Although, we still want to do early pre-WGLCs to ensure the requirements and use cases are as complete as possible as we start the detailed framework and protocol work.  

      Discussion of signaling protocol for CLUE led to the suggestion that there is a need for a BoF for TCP-over-UDP since CLUE needs something like this and several other WGs have similar needs.


Summary of Action Items:

Note: these are derived from the issue summary discussed at the end of the meeting, as well as actions gathered during the meeting.

·      Use Cases:

·      Requirements:

·      Framework:

    1. Point/areas of capture  (more discussion on mailing list + add text with assumptions) (Mark/Brian):

                                               i.     Origin/coordinate system (more discussion on mailing list)  (Marshall)

                                             ii.     Screens vs cameras  (for the origination of coordinate system – see above)

                                            iii.     Consider addition of 3rd dimension as part of the origin/coordinate system

    2. Layout (including component/switching algorithms) (more discussion)  (Mark)
    3. Source selection – e.g. ability to pin sources (Need specific use case, are there new requirements?, how functionality is accomplished with framework) (Jonathan)
    4. Attributes for capture sets – each level of the hierarchy should have attributes (need text per Roni’s issues)
    5. Describe composed picture so endpoint can decompose and put “these things” on “several things” – more like a tree structure (i.e. reuse spatial coordinates) (Stephan to provide text proposal to mailing list)
    6. Section 6.1.2 – need renderer description (more discussion on mailing list)  (Stephan)
    7. VAD (general approach agreed, need more details – how to convey, etc.?)  (Andy)
    8. Clarify architectural model in framework (framework authors) (See “Architectural Model” slide in http://trac.tools.ietf.org/wg/clue/trac/raw-attachment/wiki/WikiStart/Framework-comments-roni-v1.ppt)
    9. Clarify how to define extensions (draft authors)
    10. Separate policy management from attributes (draft authors)

·      General:

a.     Signaling (Stephan/Roni/Marshall to put together a draft)

b.     Relation to RTP draft (Jonathan/Allyn based on presentation)

c.     Definition of multi-view (Roni – send text)


·       Tuesday:

o   08:30-09:00 Coffee/getting settled

o   09:00-09:10 Chair - status, agenda bash

o   09:10-09:50 Discussion of any remaining open issues for the use cases and requirements

o   09:50-10:00 Short coffee break

o   10:00-11:30 Framework:

§   Overview of changes, Areas of Capture, Composition & Switching Algorithms (60) (Mark Duckworth)

§   Comments & Issues (40) (Roni Even)

o   11:30-13:00 Lunch

o   13:00-15:00 Framework:

§   Telepresence Coordinate System (30) (Marshall Eubanks)

§   Voice Activity Detection (VAD) (30) (Andy Pepperell)

§   Message exchange model (provider/consumer) vs SDP offer/answer (30) (Allyn Romanow)

§   Issue discussion (30) (Allyn Romanow)

o   15:00-15:20 Break

o   15:20-16:30 Issue discussion (cont.)

o   16:30-17:00  CLUE and RTCWEB (Mary, as individual)

o   17:00-18:00 Summary and identification of key items for discussion on Wednesday

·       Wednesday (note: this agenda was finalized based on the outcome of Tuesday's meeting):

o   08:30-08:45 Plans for the day

o   08:45-09:40  Relation to RTP (Jonathan)

o   09:40-10:30  SDP/Signaling (Stephan)

o   10:30-10:45 Break

o   10:45-11:40  Point of view/areas of capture (Brian)

o   11:40-12:30  Source Selection (Jonathan)

o   12:30-13:00 Discussion of way forward and  summary of action items


Note Takers & Raw Notes:



http://trac.tools.ietf.org/wg/clue/trac/raw-attachment/wiki/WikiStart/Interim-Oct 2011-Tuesday-notes-allyn.docx

Tues PM



Wed AM




The following is a collation and reformatting of the raw notes.

Notes from Tuesday AM (Allyn)

Action Items marked AI

Attendees - Introductions – Stephan Wenger, John Leslie, Mark Duckworth, Roni Even, Mary Barnes, Paul K, Gonzalo, Jonathan, Steve Botzko, Spencer Dawkins, Andy Pepperell, Brian Baldino, Hadriel Kaplan, Bo, .. see roster

Note Well

Agenda, see slides

Use Cases

Not enough review

AI:  Roni -- Data collaboration – whatever can be done over RTP is OK. Maybe add text saying that any RTP payload can be used. See what’s happening in RTCWEB on the same topic.

Mary - Won’t send use cases to publication because we might want to add something

Gonzalo – Send drafts as a bundle when ready


Requirements

Ready to be reviewed

How comprehensive does it need to be? Does not need to  map back everything in the solution

AI:  draft authors, numbered bullets

1.     Necessary to add a requirement for updating properties. The term “capabilities” is problematic here. Add this to the requirements. How it should be worded needs to be worked out.


2.     Adding a third dimension. Needs more discussion on the list. How it should be worded. Reqmt 1b not the right place for it. 1b is about ordering

What does depth mean in terms of images?

3.     Multi-view—should be in reqmts or why is it in the document. Also in the use-cases.  Add a requirement for this. Roni will draft

Are the requirements for the basic functionality or also the (possible)  extensions?

Should we mark some requirement as not for immediate specification. We should document ideas that are for later.

There are 2 types of multi-view: the kind described in the use cases, and there is also 3-D. The requirement can reference the specific use case. Multi-view is not a good term because it has multiple meanings.

We need a definition of multi-view. Want 2 cameras looking at the same object.

3-D further out. Not included now.


4.     Should we add Jonathan’s case – wants to permanently view another endpoint “pin”.

Mark – use case 3.3 mentions switching video and mentions manual control.

AI:  Jonathan Should add to the use case. Then drive requirement from there. Jonathan will send to mailing list.


Framework Discussion

Mark Duckworth presentation of  New Stuff in draft see slides

More detailed spatial information about media captures

Additional topics –VAD, media source selection, composition and switching algs – switched and composition how to better describe, selecting algs. Between EP and middle box

Brian presents Area and Point of Capture

Assign coordinate system to a room, independent of any camera and microphone placement.

X coord. Left to right, z low to high, y front to back

Camera right and camera left. Front is closer to the camera

Where is the origin? Brian says it’s the implementer’s choice, it doesn’t matter

No point in describing parts of the room that are irrelevant, but nothing prevents it

In the diagram, example with field of view 1, 0,0 is bottom right hand corner

Area of capture: the segment the camera captures

Begin and end for each of the 3 coords x, y, z

What is the info in the second 2 coords useful for? How will they be used?

Stephan- very useful. May want to render a gap, depending on your application. 

Make it be able to turn, allowing for curve, geometric correction is possible. This is not new or unused technology.

Roni- the gap between the cameras.. if someone stands between the 2 cameras, it should look good. Current systems do that.

Brian – enables you to get in there with a ruler, may not require this.

Does this degrade gracefully? Allows you to be as accurate as you want to be. But all systems don’t require this level of accuracy.

If you have a curved table, systems don’t care about representing a curved table. Even for 2 rows of curved tables, won’t care.

Want to enable detail, but not require it.

Bevel, actual size. We haven’t drilled down how relates to CLUE’s charter and interoperability.

Roni – renderer can render what it wants. But CLUE provides info for the best quality.

Jonathan- units are arbitrary or mm? can be either

This example is real world millimeters

What is optional and what is mandatory? Who really knows this info? people who set it up. Not the people who write software. Can’t make it mandatory. Systems won’t have this info accurately.

We should be fail-soft  - as information is removed, less info sent, should still have  good info, just not as accurate

Roni- looking at high quality system interoperability.

Purpose of Origin – for preserving ordering
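
As an illustration of the area-of-capture discussion above, the following Python sketch models a capture as begin/end ranges on each of the three room axes and derives the left-to-right ordering and the gap between adjacent captures. The class name, function names, and sample numbers are hypothetical and do not come from the framework draft; units are assumed to be millimeters, as in the presented example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AreaOfCapture:
    """Hypothetical model: begin/end on each room axis, in millimeters."""
    name: str
    x: tuple  # (begin, end), left to right
    y: tuple  # (begin, end), front to back
    z: tuple  # (begin, end), low to high


def left_to_right(captures):
    """Order captures by where their x range begins."""
    return sorted(captures, key=lambda c: c.x[0])


def gaps_between(captures):
    """Return the x-axis gap between each pair of adjacent captures.

    A positive value is an uncaptured strip; zero or negative means the
    captures touch or overlap.
    """
    ordered = left_to_right(captures)
    return [
        (a.name, b.name, b.x[0] - a.x[1])
        for a, b in zip(ordered, ordered[1:])
    ]


# Made-up three-camera room, listed out of order on purpose.
room = [
    AreaOfCapture("center", (1100, 2000), (0, 1500), (0, 1200)),
    AreaOfCapture("left", (0, 1000), (0, 1500), (0, 1200)),
    AreaOfCapture("right", (2100, 3000), (0, 1500), (0, 1200)),
]

order = [c.name for c in left_to_right(room)]
gaps = gaps_between(room)
```

A consumer that only needs ordering can stop at left_to_right; one that wants to render the gap between monitors can use the extra numbers, which matches the "enable detail, but not require it" sentiment in the notes.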


Discussion of Composition and switching algs. Raise questions not necessarily settle now

Booleans for switched and composed. Is it enough or not? People have suggested more info  needs to be conveyed. 

Roni has questions about this in the preso he will give

Jonathan – the concept that the media capture is switched,  is it the right concept?  It’s more that you are receiving a particular capture. Doesn’t think that switched is the right way to describe it.  Switching between captures. What is implied lower down in protocol is different than this. Capture with different sources that are switched.  Is it one capture where the source is switched, or are you switching between captures? How does this relate to RTP? 

Roni presentation on major issues

Capture area. The real scene is curved

Concerned about the gap between monitors when the captures are displayed. Wants to have information so this is seamless

Where is the point of origin? Current draft not clear on this

What would be a good way to describe what you want?

Jonathan – doesn’t matter where you put the origin as long as you support negative numbers. Only need to know the relative position of the cameras. But don’t need the origin point.
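
Jonathan's point can be checked with a small sketch: moving the origin shifts every coordinate by the same amount, so the relative positions of the cameras do not change, provided negative values are allowed. The camera names and numbers below are made up for illustration.

```python
def relative_offsets(positions):
    """Return each camera's x offset from the leftmost camera."""
    leftmost = min(positions.values())
    return {name: x - leftmost for name, x in positions.items()}


def shift_origin(positions, new_origin_x):
    """Re-express the same positions relative to a different origin."""
    return {name: x - new_origin_x for name, x in positions.items()}


# Hypothetical camera x positions in millimeters, origin at the room's edge.
cameras = {"left": 500, "center": 1500, "right": 2500}

# Same cameras with the origin moved to the center camera; the left
# camera now has a negative coordinate.
recentered = shift_origin(cameras, 1500)

same_layout = relative_offsets(cameras) == relative_offsets(recentered)
```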

Roni - Importance of knowing the gap between video captures. Where the gaps are depends on the vendor. The renderer can make a better experience if it knows.

Doesn’t think millimeters should be a Boolean

Architectural  model

First 2 bullets - Agree with arch in the doc but it should be better explained

AI – draft authors – clarify  draft, as per Roni’s comments

One issue is no difference between capture set and scene.  Scene is not necessary.

Andy – thinks EP isn’t useful in the specification, makes sense to talk about it, but it doesn’t add anything to specification

Media stream providers and consumers, EP does both, rather than an element of the model.

Roni – need to define media consumer and provider.

Andy- Can have just audio or video, streaming or recording.  EP just happens to do both. Or an EP could be one or the other.

Jonathan- agrees EP not important. For example, presentation isn’t connected to a room.

Discussion of what’s a CLUE endpoint. Something running CLUE.

Roni - Add an SDP param TPcap – what attributes you support.  Agree to hold discussion of implementation as a separate conversation

Need to define how to define extensions – agree

AI – draft authors – clarify how to define extensions

Composed captures  – Roni wants more information than just the image is composed. 


Andy- this is a precomposed image. A middlebox would like to know this information so that it won’t further compose the image.


If image is precomposed of most active speakers, it would change often, and you wouldn’t want a heavyweight message mechanism, would want something more dynamic like VAD.  Never thought of sending info of what is in the composition. Just whether the stream is composed or not.


Roni says it’s in RTCP today. SSRC, CSRC, not adequate.


Andy- Need a new orthogonal dynamic channel.  VAD for example.

Whether capture is switched  and/or precomposed is static so it is appropriate to describe in capture set

Don’t want to change every time there is a change in speaker for active speaker


Need both channels – fast changing and slower changing


Could see offering different composed streams

This is the current speaker composition, this is the current presentation composition, for example


AI – draft authors, make changes as suggested

Document is unclear on second point in the slides


Document needs to cover the third point. Should be added to framework, and maybe not mandatory


Relevant to endpoint and to MCU. Associate a media capture with endpoint


4th point – okay


Notes from Tuesday PM (Allyn)

Updated agenda


 Marshall presentation 3-D locations: a coordinate system for telepresence

Slide with 3 screens, 3 co-located cameras, Axis of symmetry of the particular unit.  This gives  an origin and a coordinate system.


Question – what has an axis of symmetry? Answer  each system


Wants polar coordinates for what cameras are seeing


What doesn’t fit in this model? A person with a podium with camera off to the side.


If there is a podium? It breaks the symmetry


There is math, even in the simple cases, not clear we have to specify the origin. What is gained?

Origin is useful only to know where things you care about are

If  it’s implementers choice, doesn’t make anything harder


A podium cannot be rendered in this particular set up.

Similarly, additional capture points inside the room


Origin important for polar coords, not so much for Cartesian


Axis of symmetry- everyone has a middle..


Multiple cameras and screens  for different sitting arrangements


 We need to discuss multi-row further


The finest grained info- Location of the camera and the direction it’s pointing and its FOV


Should not be a function of the physical walls of the room


Andy- doesn’t replace the capture region, Cartesian positions of the cameras


Paul – preferred origin, doesn’t believe that people will agree where it is


Steve thinks it makes sense - Where rows are, as distance from the camera.


Allyn questions ability to specify rows within one camera view


Allyn – meta issue - understands Marshall’s description is correct, but is it usable in this context?

Same comment for defining origin, feels not necessary for the renderer


Hadriel – thinks don’t need to have this info to match mics to videos. Thinks can get that info specifically without having to do analysis to determine it.


Jonathan- this way of describing relative captures breaks down for synthetic captures, such as presentations. He wants a way to describe those. And wants to be able to use the same language for synthetic and real.

Marshall’s only works for physical-based captures. Jonathan doesn’t want 2 languages one for synthetic and another for physical

MCU decides who goes on which screen.. or sharing right and left monitor. This language doesn’t make sense.

Language that makes sense is 2 dimensional. Start with simple case, which doesn’t do trigonometry or geometric transforms


Gonzalo – what about zooming cameras? What about focal length?

Marshall doesn’t like this aspect of the framework. Thinks won’t work well for this case


What about  when the view is changing all the time?


Stephen B. – relative adjacency wasn’t sufficient  to match audio and video


Mark – It is possible to represent a curved plane as flat. The renderer needs to know same info in both cases. The 1 dimensional description will work for synthetic captures


Andy- wants to say the same thing. Wants to have 0-99 mandatory absolutely needed. Needed for segment switching. Essential.

 With  origin, position, focal length..  optional.


Marshall wants to build a structure that can be used for a long time, so that it will be there later when people need it.


Roni- on synthetic, presentation, and people cases. Agrees can have simple and more info. Not enough text about presentation and the MCU case in the draft. Requirements different.

AI – draft authors – clarify as per Roni’s suggestion


Jonathan – doesn’t want to have 2 different languages, want the full geo to be an extension of the simpler language…



Andy VAD -  A New thing in framework draft

See slides

Want middle boxes to be able to switch without having to decode audio

Determine active speaker for intra-room segment switching


VAD algs must be consistent

                  Standardizing algorithm


Don’t want to disadvantage a receiver that has only one audio stream


Details- video linear range system, middle box receiving multiple captures needs to figure out which is loudest overall and which is loudest within a particular room. Wants to know the energy level associated with that position.


Being able to determine which capture is from the active speaker. How determine which video is active.

What position(s) hold the loudest speaker as well as over all VAD. Consumer can determine which is loudest whether receives all or one pre-mixed.
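
One way to picture the selection described above is the sketch below, in which a middle box receives an energy level per capture and picks the loudest capture overall and within each room, without decoding any audio. The report format (room, capture, level) and the convention that a larger number means louder are assumptions made for this example; how the level is actually conveyed was left open at the meeting.

```python
def loudest_overall(reports):
    """Return the (room, capture) pair with the highest reported level."""
    room, capture, _level = max(reports, key=lambda r: r[2])
    return room, capture


def loudest_per_room(reports):
    """Return a dict mapping each room to its loudest capture."""
    best = {}
    for room, capture, level in reports:
        if room not in best or level > best[room][1]:
            best[room] = (capture, level)
    return {room: capture for room, (capture, _level) in best.items()}


# Hypothetical reports: (room, capture, energy level), larger is louder.
reports = [
    ("room-A", "AC0", 12),
    ("room-A", "AC1", 40),
    ("room-A", "AC2", 7),
    ("room-B", "AC0", 33),
    ("room-B", "AC1", 5),
]

overall = loudest_overall(reports)
per_room = loudest_per_room(reports)
```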


Tagging active position in a room, audio energy is not the only way – for example, could have buttons on a pad to choose who is active.

Stream configure message specify VAD alg. Provider says which algs,  consumer chooses which

Through an RTP header extension

Security of VAD info, is there something that needs to be decoded? Does it need to be encrypted? Traffic  analysis could be done.


Marshall suggests talking with Colin Perkins to get his feedback

AI – draft authors, add to draft as suggested


Roni – the number of audio and video streams is not one to one


Jonathan is currently working on header encryption and energy levels


Jonathan – don’t want anything on the ACs. But for segment switching. Want it on the VC- the energy of the audio for this VC.  What do you want the semantics to be? Rather than a correlation of audio and video, he thinks what is wanted is info that the speaker is in this camera.

Andy – issue with that if have a 2 VC case.

Thinks doesn’t work. MCU receives from many multi-video has to compare audio.

One receiver with many audios

See Jonathan’s draft for carrying energy in the header. Avtext Jonathan sent reference to the list


Brian - Didn’t want to do it in video, might want to flow control it off


Roni- agrees shouldn’t be in the video.  Important to associate, but not mix by putting audio in video stream.


AI – draft authors – fill out VAD


Paul – asks about effect of different audio levels.

Stephen B thinks there is a bias.


Bo – it’s a  real time position stream.


Jonathan and Andy- what about AC0, AC1, AC2 – do they have the same VAD info?

Rather than AC0 VAD just for itself, etc.

But then have a challenge with AC3- that has to combine the 3

Either the same in each or each describes itself


Paul – if replicate info in all of them.

For further discussion


Jonathan – think about. Understand case of getting soundscape right.


Gonzalo – they are setting up an SDP directorate




Messaging – Allyn

Difference in messaging type between SDP’s offer answer and CLUE’s “publish and subscribe”. One conversation in SDP, 2 in CLUE.

Way to characterize the difference is that offer answer is symmetric, and CLUE’s proposal is for asymmetric


Roni doesn’t like term “capabilities” would rather use the term “proposal”


Gonzalo- think about backwards compatibility when think whether will use SDP or not

Stephan- will have real time updates

Jonathan- advertise all the available scenes. Don’t want  a full update. Want partial updates.

Partial state update rules out SDP.
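
The difference between a full and a partial update can be sketched as follows. The message shape (a dict of capture identifiers to attribute dicts, with None marking removal) is purely hypothetical; the meeting did not define a message format.

```python
import copy


def apply_partial_update(state, update):
    """Return a new state with only the captures named in `update` changed.

    `state` and `update` map capture ids to attribute dicts. A value of
    None in `update` removes that capture. Captures not mentioned are left
    untouched, which is the property a full-state replacement lacks.
    """
    new_state = copy.deepcopy(state)
    for capture_id, attrs in update.items():
        if attrs is None:
            new_state.pop(capture_id, None)
        else:
            new_state.setdefault(capture_id, {}).update(attrs)
    return new_state


# Hypothetical advertised state.
advertised = {
    "VC0": {"switched": False, "composed": False},
    "VC1": {"switched": True, "composed": False},
}

# Only VC1 changes and VC2 is added; VC0 is not resent.
updated = apply_partial_update(
    advertised,
    {"VC1": {"composed": True}, "VC2": {"switched": False, "composed": True}},
)
```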



Messages can happen whenever


Roni and Stephan want to talk about the signaling and transport tomorrow. They feel this is an important part of the framework.

Mary suggests they make slides for tomorrow


List of Framework  Issues to discuss- see slides

1.     Point/area of capture - Brian

2.     Layout – Mark

3.     Source selection - Jonathan

4.     Attributes for capture sets (along with those already for media capture) – per the hierarchy each level should have attributes - Roni

5.     Describe composed picture, rather than just a flag, so endpoint can decompose and put these things on several things (i.e., reuse mechanism for spatial coordinates, etc.), more like tree structure – Stephan will write up something

6.     Consumer description is not fleshed out

7.     Relations to RTP – Jonathan

8.     Definition of multi-view

9.     Origin/coordinates

10.  Screens vs cameras – Marshall. Does the media producer need to know/do anything with screens?

11.  Multiple row use case

12.  VAD – input and output


Priority voting – for what to cover, choose 5.


13, 3, 2, 1 plus Stephan’s signaling and transport


Mary RTCWEB presentation


Overview – doesn’t need to interoperate with CLUE


Proposal for CLUE wrt RTCWEB

·       Ensure that SDP usage is compatible and consistent to ensure that CLUE and RTCWEB do not define 2 separate ways of doing the same thing

·       Evaluate usages of SDP/RTP as framework is being developed

·       CLUE needs to consider RTC WEB decisions in terms of handling multi-streams-

o   Multiplex over a single rtp session

o   Or multiple rtp sessions

Stephan says CLUE spec doesn’t need to reference RTCWEB


Hadriel - CLUE should make sure that RTCWEB doesn’t require  a symmetric model

Sending data in the media path- rtp extension headers

Browser support audio video only

Data channel will look different, will be RTCWEB specific

TCP over UDP for example

Need congestion control

Covering of Framework Issues


Layout – Mark Duckworth


Switching.. describe a media capture as being switched. Can be from an EP or an MCU. Framework just has a Boolean. No info on the algorithm for how it is being switched.

Called auto-switched.

Roni – has an issue with this

What info is he looking for? How often switched, differentiation on the side of the provider.  Wants the consumer to know the basis of switching, so it can choose which algorithm it wants.

AI – authors add in as Roni suggests


Jonathan – multipoint case. Scene switching, if have different geometries, need to know the concrete scene. Has a lot of implications. Whether have a switched capture. Or switching among captures.

Have all info

If switching, want to tell which scene you are seeing

Means need to know geometry for each room

If MCU switches and you don’t know, you can’t do meaningful adjustments. Rely on what the MCU does for you

Request the current loud speaker.

Partial update for new scene in conversation. Capture 38. At run time gets info, you are now seeing 38.


Spencer – from Booleans to enums. Is this capping innovation? If not on the list, will you know what to do? These are the 34 things you can do.. restrictive


Roni- description of options can use

How do you describe new options

A lot of algs can decide how to switch


Stephen - Separate negotiation of policy from attributes need to know

Want MCU to be able to know whether what it’s receiving is composed


AI – authors Separate policy management from attributes


Want to see the current speaker and the current preso

Switched capture


Jonathan – site switching. How do you ask for site switching?

might need other attributes for composition

for switching.. don’t need to describe on the forward  path


Andy- should there be attributes of capture set rows?


This isn’t just for multipoint,

Consider 3-1 point to point case for switching


Mark – where is this leading? What’s the use case?

Roni- value in getting info about what the switching alg.  Either by the provider who can offer different algs to the consumer.


Meta information about the streams. Getting 3 switched streams, which is which? Depending on policy, which is which?

Tied to RTP mapping

Would a light weight message be fast enough?

Ask for 3 active speaker, get 3 or 2 +1. Sending a dynamic update. But figure out how to do it. Which participant this is.


Discussion of RTP

Jonathan- How do you bind SSRC to a virtual capture?


Assumption demux on SSRC or not. Andy is assuming demuxing on something other than SSRC – what was Jonathan’s issue?

payload, CSRC. Jonathan was assuming demux on SSRC.


Distinction between real and virtual source.


Potential decryption issues in the MCU


Have identified some constraints.. switching and composition needs to be known

Need to know original source, distinction between virtual and real source. Messaging and RTP implications.


If you know the actual source, need to know what it is – roster list … way we bind?

Tricky to act at high speed..


Static pinning- and he becomes loudest speaker, want to get  the data only once.  Individual mute.

Someone needs to write some text.

This is intertwined with what we are talking about for tomorrow


MCU wants to know if composed or original. This is a Boolean.


Auto-switch Boolean and auto-switch with policy. Haven’t totally decided if auto-switch Boolean is okay


Plans for tomorrow

Issue 13, signaling and transport, break,

Area capture- Represent things in the simple way. Agree on this. To be able to use the same language.

Source selection

Notes from Tuesday PM (Roni)

3D Locations -  Marshall

Capture plane is the line that goes through the camera and the origin is on the middle camera.

What about podium how do you put it in the room

Andy- is the axis per system or capture set

Marshall – per system.

Andy – what is the axis for presentation.

Steve- each capture set has its own axis

Jonathan – there is enough math in the solution, why specify the origin

Marshall – the reason to specify the origin, is that the origin will not be consistent.

Jonathan- what you care is where the cameras are. He is not sure if it is a problem if it is anywhere in the room.

Stephan – if there is no screen for the podium there is no way to render correctly. Suggest to leave out the podium use case. The system on the other side will not have the monitor to display. Suggest to limit the scope. So a podium does not have a position

Steve – if we define the origin as is, nothing breaks and it has some value in the document. The concern is additional capture points in the room.

Marshall – are the other capture points important and do you need to have the relation between them

Gonzalo – polar co-ordinate requires the origin.

Marshall – important if the cameras are not in the center.

Steve – You can have a system that captures a room that is like a U with cameras on all sides. It is not a TP system.

Mark – is there a value to know the difference between the camera positions.

Jonathan – simple to just know the order – left to right, while the more complicated for those who want to do the geometry.

Steve- we have not done the multi row case enough.

Marshall – center of camera and FOV should be conveyed.

Andy – the area of capture  is not the line of the seating but the volume.

Brian – the Cartesian  coordinates can be used the same

Paul – people may choose different origin. The receiver may have to calculate the axis based on his system.

Steve – think that the idea to have the distance of the rows based on the capture line is good

Marshall – the systems may be constructed in arcs. We can allow to describe this but think it is not necessary

Allyn – what is the importance of the distance to row if one camera

Steve- helps with identifying the speaker

Allyn – not clear if the information on the area and origin is useful for the renderer.

Marshall – two camera system and the other side 3 screen. Do you want to know where to render.

Mark – wants to have eye contact.

Hadriel – the simplicity of just knowing left and right. Easy to know if the origin is in the center camera.

Jonathan – does not work for composite picture  which do not have a camera location.

Gonzalo – what about the focal length of the camera and the zoom and pan of the camera?

Marshall – there is no far end camera control.

Steve – went from relative left to right to help match audio to video. Need to keep it.

Mark – if the consumer maps everything to straight axis. But if we use the curves the simple subtraction of only one axis will not work.

Marshall – there are specific cases that may be treated cameras.

Brian – need the simple linear for the simple left to right case and the polar arc case for the ones who want it.

Jonathan – Should have a similar representation and not two different ones for the 3 screen system and the presentation or composed systems.

VAD in clue - Andy

Marshall – talk with Colin of RTP guys.

Jonathan – maybe you want to say that this is the level of audio for this video capture. Or say the speaker is in this camera

Andy – if you put it on a composite video you do not know which one it is.

Roni – the number of audio captures can be different from video

Brian – the audio level should be in the audio stream.

Roni – need to be able to associate the audio to video but not by having the level in the video stream

CLUE messaging model – Allyn

Started with two of the messages. The provider sends what it can send, the consumer selects, and the provider sends; there is a similar exchange on the other side.
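
The exchange described above can be sketched for one direction as follows; the function names and the dict-based messages are hypothetical and only illustrate the order of the steps.

```python
def advertise(provider_captures):
    """Step 1: the provider lists what it is able to send."""
    return {"type": "advertisement", "captures": sorted(provider_captures)}


def configure(advertisement, wanted):
    """Step 2: the consumer selects from what was advertised."""
    chosen = [c for c in advertisement["captures"] if c in wanted]
    return {"type": "configure", "captures": chosen}


def streams_to_send(configuration):
    """Step 3: the provider sends only the selected captures."""
    return list(configuration["captures"])


# One direction of the call; the other direction runs the same three
# steps with the roles reversed, giving two conversations in total.
adv = advertise({"VC0", "VC1", "VC2", "AC0"})
cfg = configure(adv, wanted={"VC1", "AC0", "VC9"})
sent = streams_to_send(cfg)
```

Note that a capture the consumer asks for but the provider never advertised (VC9 here) is simply not selected.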

Stephan – no decision if to use SDP.

Gonzalo – need backward compatibility.

Stephan – we may need to change the mode during a call.

Jonathan – want partial state update which rules out SDP.

Paul – there will be SDP offer/answer. It is about the extra information. It is not independent from SDP.

Stephan – the messaging and SDP relation should be part of the framework.

Stephan – how do we get the CLUE information over the wire.

Framework issue


The composition will be deferred till Stephan provides text.

Auto switch – provide information about the auto switch algorithm

Jonathan – suggest information about what is the area covered in the switch. Are you switching between captures, provide the capture you provide will help.

The switching and composition need to be known by middle box and there need to be a way to describe the streams.

Steve proposes one Boolean – composed or not. About auto switch – we defer it until we discuss RTP relation.

Notes from Wednesday AM (Spencer)

RTP Issues - Jonathan Lennox

SSRC Multiplexing - Asserting that CLUE should use a single RTP session per media type, multiplexing sources by SSRC.

Why not per capture set? Number of capture sets could also be large and asymmetric.

SSRC should be associated with "real" resources? SSRC is the switched source, CSRC is the physical source?

Not possible for individual cameras to generate their own streams? Is that just not done? Is because of key distribution, which is a hard problem anyway.

Would you change SSRCs at the middlebox, or just switch them? Matters because of multiplexing ...

SSRC as a way of identifying who you're seeing, having a switching middlebox change that is a problem. But can't devices change their SSRCs at any time? If there's a collision, or ...

This is a layering thing - I'm interested in left screen/right screen. Relationship needs to be in the signaling layer. This is turning into a protocol discussion, but we need to handle the relationship.

The same thing is happening in RTCWeb. Jonathan - we should have these conversations in AVTCORE, not here.

Do we have to have one m-line with all the choices? If you have two m-lines with one RTP session, that's when we have problems.

Steven - binding SSRCs to captures should happen up front.

Andy thought this was orthogonal to SSRC multiplexing.

Roni - we had this conversation in RTCWeb.

Jonathan - when you send a request to see a capture, you say "add this byte as an RTP extension header" as well. You get your own demux tag and can do what you need to do based on the tag you provided. Don't need to standardize meaning, just pick the right demux tag.
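A minimal sketch of the demux-tag idea Jonathan describes, assuming the tag is a single byte chosen by the receiver when it requests a capture. The class and method names are hypothetical, and nothing here models a real RTP header extension format; it only shows that the receiver needs no standardized meaning for the tag, just its own lookup table.

```python
from typing import Optional


class DemuxTable:
    """Receiver-side table mapping receiver-chosen tags to captures."""

    def __init__(self) -> None:
        self._by_tag: dict[int, str] = {}
        self._next_tag = 0

    def request(self, capture_id: str) -> int:
        """Allocate a one-byte tag to include in the request for a capture."""
        if self._next_tag > 255:
            raise ValueError("no free one-byte tags left")
        tag = self._next_tag
        self._next_tag += 1
        self._by_tag[tag] = capture_id
        return tag

    def route(self, tag: int) -> Optional[str]:
        """Return the capture a tagged packet belongs to, or None if unknown."""
        return self._by_tag.get(tag)


table = DemuxTable()
left_tag = table.request("left-camera")
speaker_tag = table.request("current-speaker")
```

Because each receiver picks its own tags, a sender serving several receivers would have to remember which tag each receiver asked for.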

Hadriel trying to avoid changing the format of RTP packets :-)

Paul - sender has to keep per-receiver RTP state because the demux byte would be different for each receiver. Issue for RTCWeb was interop with devices that don't do multiplexing.

Stephen - we have a set of RTP sources that need to be grouped.

Jonathan thinks we need to know this mapping before you can display anything.

Roni - issue is when someone joins a conference. How do you get the state of the conference?

Some semantics of SDP descriptions of heavily-muxed sessions get confusing (this is a problem for AVTCORE and MMUSIC, not for CLUE).

We need RFC 5576 source descriptions in SDP if CLUE is encoded in SDP - less clear if it's not.

Andy and Mark - separating m-lines by roles is the wrong thing to do.

Jonathan - don't have two m-lines unless you have two physical sources - then you have to.

Hadriel - if there's a valid reason you have to do something, you have to do it - doesn't matter why.

H.281 FECC is broken - assumes you know which camera you're controlling, no way to specify which stream you're controlling in a session.

Options to Transport CLUE Messages - Stephan

Chartered to use SIP, but we're doing a three-stage handshake, not offer-answer. Would be desirable to have one exchange, but that's not compatible with the current framework draft.

Is there a conceptual difference between the initial information communicated, and the information communicated during the session lifetime? We think not.

Could piggyback on INFO, UPDATE, Re-INVITE, could do content indirection using multi-MIME body, use CLUE stream as SIP-negotiated "media" stream.

Re-INVITE is the wrong answer because current devices re-initialize EVERYTHING when they see re-INVITEs.

SIP-negotiated assumed to use UDP and MSRP. CLUE messages may not fit into one MTU. ICE-TCP could be an alternative. Proposing BFCP-like handshake.
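To make the MTU concern concrete, here is a small arithmetic sketch of how many datagrams a CLUE message would span. The 1200-byte payload budget and the message sizes are assumed figures for illustration only; the real budget depends on the path MTU and header overhead.

```python
def datagrams_needed(message_bytes: int, payload_budget: int) -> int:
    """Number of datagrams needed to carry a message of the given size."""
    if message_bytes < 0 or payload_budget <= 0:
        raise ValueError("invalid sizes")
    # Integer ceiling division; an empty message still costs one datagram.
    return max(1, (message_bytes + payload_budget - 1) // payload_budget)


# Assumed figure: about 1200 bytes of usable payload per datagram.
PAYLOAD_BUDGET = 1200

small = datagrams_needed(800, PAYLOAD_BUDGET)
large = datagrams_needed(4096, PAYLOAD_BUDGET)
```

Once a message spans more than one datagram, loss of any single piece loses the whole message, which is why a reliable or fragmenting transport comes up in this discussion.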

Not constrained to use SDP, XML is "natural candidate". Gonzalo, Roni, Stephan to write -00 draft (but Gonzalo is not advocating this).

Hadriel - why are we doing this again? Take this to RAI and tell them MSRP doesn't work?

How often are these messages? every few messages, but may be bursty.

What's wrong with INFO? Maybe nothing ... the INFO problem was old-style INFO, without packages.

What's wrong with MEDIACTRL? Would be a good candidate, but not sure anyone has looked at MEDIACTRL. MEDIACTRL is TCP-based ("that won't work").

Paul - similar discussions in SIPREC. Looked at SIP (same questions, same constraints). Discarded INFO because if you could piggyback on INVITE, you could use the same mechanism on UPDATE/Re-INVITE. Decided to use SIP over TCP in restricted environment.

Stephan - don't think we should care about size of CLUE packet.

Steven - m-line approach will allow peer-to-peer, but do people need that?

*Spencer - still concerned about application-by-application UDP-ization. Need to solve this via indirection. TCP over UDP or something. ("RTCWeb is going to do this anyway")

Jonathan - some of this maps to SIP SUBSCRIBE/NOTIFY operation (partial updates of a roster, etc.)

Gonzalo - need to look at burstiness.

Allyn - also did analysis of MEDIACTRL and came to same conclusions.

How big can these messages get?

Do we have a bunch of people who have a preference for SIP and a bunch of people who have a preference for media streams?

Need to know what we're sending before we can do anything.

Stephen - in some cases, we're looking at CLUE messages that are multiple KBs. Have to design for that.

Allyn - "reliable UDP"?

Spencer - we had the conversation about TCP-over-UDP in like 2007 during LEDBAT discussions and decided not to go that way. It's four years later now. How much longer are we going to go the way we're going?

Gonzalo - we should have a BOF about this.

Stephen - with 100 meeting rooms, you'll be transmitting often.

Marshall - we need to worry about congestion collapse, because we send out different CLUE messages when participants fall out of the conference :-(

Mark and Brian - Capture topics

No need for receiver to render as an exact duplicate of sending room - just make it useful.

Stephen - if you can do geometric correction, we need to accommodate that. But we'll discuss offline.

Marshall - this is MUST versus MAY - X position is a MUST. We need adjacencies. Everything else is MAY.

Stephen - but don't restrict your syntax so much you can't include Y and Z - it will be difficult to add this later.

Do we want to allow both Cartesian and polar systems, but not require both of them?

Steven - the reason this works in practice is that camera angles are almost always perpendicular to the participants.

Stephen - Looking for a requirement to support curved conference tables, etc.

Andy - if you have XYZ for camera and for participants, you can figure anything out.

Roni - need to be very clear about assumptions and what happens when they don't apply.

Brian - need to enable highly accurate representation, but don't want to require that.

Brian - if you're not doing gap correction, you don't need a perfect description.

Brian and Roni will chat about what it means to have a "good rendering experience".

Hadriel - why do we have to define units for XY? why would we have anything BUT millimeters? Whenever you have options, you have interop problems. Even in synthetic scenes - synthesize XY, too.

Mark - receiver would like to know that the XY is NOT millimeters.

Jonathan/Steven - you want people to be actual size. You can blow up presentations, but not people.

Hadriel - are moveable cameras out of scope? How does that work when cameras move around?

Brian - could give view of entire potential range, could provide updates, could punt on first down.

Hadriel - are moveable tables out of scope? Having a harder and harder time believing XY will ever be meaningful.

Mark - moveable cameras today have a small number of presets - you don't support all possible XY values.

Need to talk about "good experience" one of these days ... :D

Brian - want to provide something simple, allow for something more complex ...

Steven - don't negotiate simple versus complex - subset so you have something simple.

Mark - answer also depends on where camera is, and where it's pointing.

Stephen - just pick an "undefined" value - don't negotiate whether you send something or not.

Roni - various X values need to use the SAME units (so 100-wide objects are all to scale), and Y values need to use the same units as X values.
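One way to picture the points raised so far (X position required, adjacency needed, Y and Z optional with an "undefined" value, all in the same units) is the sketch below. The `CaptureArea` type, its field names and the sample coordinates are hypothetical, not a proposed CLUE syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaptureArea:
    """Horizontal extent of a capture; Y and Z are optional."""
    name: str
    x_left: float
    x_right: float
    y: Optional[float] = None  # None stands in for an "undefined" value
    z: Optional[float] = None


def left_to_right(areas: list[CaptureArea]) -> list[CaptureArea]:
    """Order captures by their left edge."""
    return sorted(areas, key=lambda a: a.x_left)


def adjacent_pairs(
    areas: list[CaptureArea], tolerance: float = 0.0
) -> list[tuple[str, str]]:
    """Pairs of captures whose edges meet, within a tolerance in the same units."""
    ordered = left_to_right(areas)
    return [
        (a.name, b.name)
        for a, b in zip(ordered, ordered[1:])
        if abs(b.x_left - a.x_right) <= tolerance
    ]


room = [
    CaptureArea("right", 100, 200),
    CaptureArea("left", -100, 0),
    CaptureArea("center", 0, 100),
]
order = [a.name for a in left_to_right(room)]
pairs = adjacent_pairs(room)
```

The adjacency check only works if every X value is expressed in the same units, which is the constraint Roni raises.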

Allyn - concerned that we might build a system that's extremely correct but not usable. Don't want to put in things that no one will use. We're talking about things that implementers aren't using today, and aren't planning to use in the future.

Marshall - this is a red herring - there's no CLUE yet, right?

Stephen - but stuff can be on roadmaps we can't talk about, and we might just not know the future.

Marshall - of course no one is using this information, because there's no protocol to carry it (no CLUE yet).

Allyn - but implementers say they don't know how to use this information even if they DID have it now.

Hadriel - think of "simple" from an installer's perspective. If it's complicated to install, no one will use it. If people don't use it, that's not useless, it's dangerous. If people are going to use it, put it in.

Source Selection in CLUE - Jonathan Lennox

"Static" source selection - "pinning". "the current speaker and my boss's reaction"

Senders and receivers don't want to send two copies of the boss's stream if the boss is ALSO the speaker. Do we need to specify what happens when the boss starts speaking?
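A sketch of one possible answer to the duplicate-stream question, assuming sources are identified by simple names: send the current speaker plus any pinned sources, and collapse repeats. The function name and identifiers are hypothetical, and this is an illustration rather than a behavior the group agreed on.

```python
def streams_to_send(current_speaker: str, pinned: list[str]) -> list[str]:
    """Sources to send: the speaker first, then pinned sources, without repeats."""
    result: list[str] = []
    for source in [current_speaker, *pinned]:
        if source not in result:
            result.append(source)
    return result


normal = streams_to_send("alice", ["boss"])
boss_speaking = streams_to_send("boss", ["boss"])
```

When the boss becomes the speaker, this version leaves one stream, so the receiver would still need to decide what to show in the slot that the pinned stream used to fill.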

Scene information from CLUE plus roster information from something like XCON - associate both? potentially a lot of information as people join/leave.

Steven - how far do we need to go in the first version of the protocol? most controls don't get used much.

Jonathan - my understanding is that this is cultural.

Steven - might be.

Jonathan - there are cases where I'm permanently looking at someone small, and switching between images that are big.

Marshall - if you don't understand information, you don't switch - that's the proposal.

Roni - this is video, right? but not always.

Allyn - need to think about interaction with rosters.

Rosters are "out of scope" - what does that mean for CLUE in practice?

What if you offer multiple views of yourself? :D

Conversely - "I never want to see/hear this person".

Paul - remember that we want to see/don't want to see captures, not people.

We're back to assuming XCON or something similar.

Jonathan - use case coverage of this is sketchy, requirements don't mention this at all.

Chair Discussion - Mary and Paul

Ready to adopt Framework draft? Is this a data model? and/or a framework?

Part of document is framework, some is protocol - split them? attributes are the data model.

Is there going to be an update before we adopt as clue-00? We'll look at issues before we decide.

Roni thinks there will be many changes - would like to see revisions.

Does anyone disagree with adopting now and making changes to WG draft?

Humming to adopt - strong sense of the room to adopt now.

 (I stopped typing because Mary was going through the slide with our open issues list)