Minutes from CLUE Interim
Location: Polycom office, 100 Minuteman Road, Andover, Massachusetts
Chairs: Mary Barnes, Paul Kyzivat
Note Takers: Allyn Romanow, Roni Even, Spencer Dawkins
Minutes Editor: Paul Kyzivat & Mary Barnes
Recorded playback: Tuesday / Wednesday:
Attendees (in no particular order):
Stephan Wenger, John Leslie, Mark Duckworth, Roni Even, Mary Barnes, Paul Kyzivat, Gonzalo Camarillo, Jonathan Lennox, Steve Botzko, Spencer Dawkins, Andy Pepperell, Brian Baldino, Hadriel Kaplan, Allyn Romanow, Bo Burman, Gyubong Oh, Marshall Eubanks
Webex: Alan Johnston, Espen Berger, Gerard Fernando, Shida Schubert, Stephane Cazeaux
· Use Cases: The pre-WGLC review for draft-ietf-clue-telepresence-use-cases did not garner a lot of WG feedback. We really need folks to thoroughly review this document.
· Requirements: draft-ietf-clue-telepresence-requirements is about ready for a pre-WGLC review. We also need to consider how comprehensive the document needs to be and whether everything in the solution must be mapped back to the requirements. The general sense was that the latter is not required.
· Framework: draft-romanow-clue-framework-01 was agreed as a WG draft (pending additional WG feedback on mailing list)
– Based on the inter-dependencies between the various documents, we will likely progress them as a set. Although, we still want to do early pre-WGLCs to ensure the requirements and use cases are as complete as possible as we start the detailed framework and protocol work.
– Discussion of signaling protocol for CLUE led to the suggestion that there is a need for a BoF for TCP-over-UDP since CLUE needs something like this and several other WGs have similar needs.
Summary of Action Items:
Note: these are derived from the issue summary discussed at the end of the meeting, as well as actions gathered during the meeting.
· Use Cases:
i. Origin/coordinate system (more discussion on mailing list) (Marshall)
ii. Screens vs cameras (for the origination of coordinate system – see above)
iii. Consider addition of 3rd dimension as part of the origin/coordinate system
a. Signaling (Stephan/Roni/Marshall to put together a draft)
b. Relation to RTP draft (Jonathan/Allyn based on presentation)
c. Definition of multi-view (Roni – send text)
o 08:30-09:00 Coffee/getting settled
o 09:00-09:10 Chair - status, agenda bash
o 09:10-09:50 Discussion of any remaining open issues for the use cases and requirements
o 09:50-10:00 Short coffee break
o 10:00-11:30 Framework:
§ Overview of changes, Areas of Capture, Composition & Switching Algorithms (60) (Mark Duckworth)
§ Comments & Issues (40) (Roni Even)
o 11:30-13:00 Lunch
o 13:00-15:00 Framework:
§ Telepresence Coordinate System (30) (Marshall Eubanks)
§ Voice Activity Detection (VAD) (30) (Andy Pepperell)
§ Message exchange model (provider/consumer) vs SDP offer/answer (30) (Allyn Romanow)
§ Issue discussion (30) (Allyn Romanow)
o 15:00-15:20 Break
o 15:20-16:30 Issue discussion (cont.)
o 16:30-17:00 CLUE and RTCWEB (Mary, as individual)
o 17:00-18:00 Summary and identification of key items for discussion on Wednesday
· Wednesday (note: this agenda was finalized based on the outcome of Tuesday's meeting):
o 08:30-08:45 Plans for the day
o 08:45-09:40 Relation to RTP (Jonathan)
o 09:40-10:30 SDP/Signaling (Stephan)
o 10:30-10:45 Break
o 10:45-11:40 Point of view/areas of capture (Brian)
o 11:40-12:30 Source Selection (Jonathan)
o 12:30-13:00 Discussion of way forward and summary of action items
Note Takers & Raw Notes:
The following is a collation and reformatting of the raw notes.
Notes from Tuesday AM (Allyn)
Action Items marked AI
Attendees - Introductions – Stephan Wenger, John Leslie, Mark Duckworth, Roni Even, Mary Barnes, Paul K, Gonzalo, Jonathan, Steve Botzko, Spencer Dawkins, Andy Pepperell, Brian Baldino, Hadriel Kaplan, Bo, .. see roster
Agenda, see slides
Not enough review
AI: Roni --Data collaboration – what can do over RTP is ok. Maybe add something to text can use any RTP payload. See what’s happening in RTCWEB on same topic.
Mary - Won’t send use cases to publication because we might want to add something
Gonzalo – Send drafts as a bundle when ready
Ready to be reviewed
How comprehensive does it need to be? Does not need to map back everything in the solution
AI: draft authors, numbered bullets
1. Necessary to add requirement for updating properties. The term capabilities is problematic here. Add this into requirements. How it should be worded needs to be worked out
2. Adding a third dimension. Needs more discussion on the list. How it should be worded. Reqmt 1b not the right place for it. 1b is about ordering
What does depth mean in terms of images?
3. Multi-view—should be in reqmts or why is it in the document. Also in the use-cases. Add a requirement for this. Roni will draft
Are the requirements for the basic functionality or also the (possible) extensions?
Should we mark some requirement as not for immediate specification. We should document ideas that are for later.
There are 2 types of multi-view- the kind described in the use cases. And there is also 3 –D. the requirement can reference the specific use case. Multi-view is not a good term because it has multiple meanings.
We need a definition of multi-view. Want 2 cameras looking at the same object.
3-D further out. Not included now.
4. Should we add Jonathan’s case – wants to permanently view another endpoint “pin”.
Mark – use case 3.3 mentions switching video and mentions manual control.
AI: Jonathan Should add to the use case. Then drive requirement from there. Jonathan will send to mailing list.
Mark Duckworth presentation of New Stuff in draft see slides
More detailed spatial information about media captures
Additional topics –VAD, media source selection, composition and switching algs – switched and composition how to better describe, selecting algs. Between EP and middle box
Brian present Area and Point of Capture
Assign coordinate system to a room, independent of any camera and microphone placement.
X coord. Left to right, z low to high, y front to back
Camera right and camera left. Front is closer to the camera
Where is the origin? Brian says it’s the implementer’s choice, it doesn’t matter
No point in describing parts of the room that are irrelevant, but nothing prevents it
In the diagram, example with field of view 1, 0,0 is bottom right hand corner
Area of capture the segment the camera captures
Begin and end for each of the 3 coords x, y, z
What is the info in the second 2 coords useful for? How will they be used?
Stephan- very useful. May want to render a gap, depending on your application.
Make it be able to turn, allowing for curve, geometric correction is possible. This is not new or unused technology.
Roni- the gap between the cameras.. if someone stands between the 2 cameras, it should look good. Current systems do that.
Brian – enables you to get in there with a ruler, may not require this.
Does this degrade gracefully? Allows you to be as accurate as you want to be. But all systems don’t require this level of accuracy.
If you have a curved table, systems don’t care about representing a curved table. Even for 2 rows of curved tables, won’t care.
Want to enable detail, but not require it.
Bevel, actual size. We haven’t drilled down how relates to CLUE’s charter and interoperability.
Roni – renderer can render what wants. But CLUE provides info for the best quality.
Jonathan- units are arbitrary or mm? can be either
This example is real world millimeters
What is optional and what is mandatory? Who really knows this info? people who set it up. Not the people who write software. Can’t make it mandatory. Systems won’t have this info accurately.
We should be fail-soft - as information is removed, less info sent, should still have good info, just not as accurate
Roni- looking at high quality system interoperability.
Purpose of Origin – for preserving ordering
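The fail-soft idea above can be sketched in Python. This is only an illustration: the names AreaOfCapture and left_to_right are hypothetical and not taken from the framework draft, and the sketch assumes x runs left to right with units in millimeters, as in the example discussed. Only the x range is used for ordering; the y and z ranges are optional detail that a provider may omit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # (begin, end) along one axis


@dataclass
class AreaOfCapture:
    name: str
    x: Range                    # left to right; the only range this sketch relies on
    y: Optional[Range] = None   # front to back; optional detail
    z: Optional[Range] = None   # low to high; optional detail


def left_to_right(captures: List[AreaOfCapture]) -> List[str]:
    """Order captures by the midpoint of their x range."""
    ordered = sorted(captures, key=lambda c: (c.x[0] + c.x[1]) / 2)
    return [c.name for c in ordered]


captures = [
    AreaOfCapture("VC2", x=(1000, 2000)),
    AreaOfCapture("VC0", x=(-2000, -1000), y=(0, 3000), z=(0, 2500)),
    AreaOfCapture("VC1", x=(-1000, 1000)),
]
order = left_to_right(captures)
print(order)
```

Captures that carry full geometry and captures that carry only an x range sort together, so ordering still works when less information is sent.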
Discussion of Composition and switching algs. Raise questions not necessarily settle now
Booleans for switched and composed. Is it enough or not? People have suggested more info needs to be conveyed.
Roni has questions about this in the preso he will give
Jonathan – the concept that the media capture is switched, is it the right concept? It’s more that you are receiving a particular capture. Doesn’t think that switched is the right way to describe it. Switching between captures. What is implied lower down in protocol is different than this. Capture with different sources that are switched. Is it one capture where the source is switched, or are you switching between captures? How does this relate to RTP?
Roni presentation on major issues
Capture area. The real scene is curved
Concerned about the gap between monitors when the captures are displayed. Wants to have information so this is seamless
Where is the point of origin? Current draft not clear on this
What would be a good way to describe what you want?
Jonathan – doesn’t matter where you put the origin as long as you support negative numbers. Only need to know the relative position of the cameras. But don’t need the origin point.
Roni - Importance of knowing the gap between video captures. Where gaps depends on vendor. renderer can make better experience if you know
Doesn’t think millimeters should be a Boolean
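Jonathan's point that the origin does not matter as long as negative numbers are supported can be checked with a small Python sketch. The function names and the sample ranges are hypothetical; the sketch assumes each capture is described by a (begin, end) range on the x axis in millimeters.

```python
def gaps(ranges):
    """Gap between each pair of adjacent x ranges, taken left to right."""
    ordered = sorted(ranges)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


def shift(ranges, dx):
    """Move the origin by dx: every coordinate changes by the same amount."""
    return [(begin + dx, end + dx) for begin, end in ranges]


# Three captures with 50 mm gaps, origin placed at the middle capture.
ranges = [(-1550, -550), (-500, 500), (550, 1550)]
gaps_centered = gaps(ranges)
gaps_shifted = gaps(shift(ranges, 2000))
print(gaps_centered, gaps_shifted)
```

The gaps between captures, which are what Roni wants the renderer to know, come out the same for either choice of origin.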
Architectural model –
First 2 bullets - Agree with arch in the doc but it should be better explained
AI – draft authors – clarify draft, as per Roni’s comments
One issue is no difference between capture set and scene. Scene is not necessary.
Andy – thinks EP isn’t useful in the specification, makes sense to talk about it, but it doesn’t add anything to specification
Media stream providers and consumers, EP does both, rather than an element of the model.
Roni – need to define media consumer and provider.
Andy- Can have just audio or video, streaming or recording. EP just happens to do both. Or an EP could be either or the other.
Jonathan- agrees EP not important. For example, presentation isn’t connected to a room.
Discussion of what’s a CLUE endpoint. Something running CLUE.
Roni - Add an SDP param TPcap – what attributes you support. Agree to hold discussion of implementation as a separate conversation
Need to define how to define extensions – agree
AI – draft authors – clarify how to define extensions
Composed captures – Roni wants more information than just the image is composed.
Andy- this is a precomposed image. A middlebox would like to know this information so that it won’t further compose the image.
If image is precomposed of most active speakers, it would change often, and you wouldn’t want a heavyweight message mechanism, would want something more dynamic like VAD. Never thought of sending info of what is in the composition. Just whether the stream is composed or not.
Roni says it’s in RTCP today. SSRC, CSRC, not adequate.
Andy- Need a new orthogonal dynamic channel. VAD for example.
Whether capture is switched and/or precomposed is static so it is appropriate to describe in capture set
Don’t want to change every time there is a change in speaker for active speaker
Need both channels – fast changing and slower changing
Could see offering different composed streams
This is the current speaker composition, this is the current presentation composition, for example
AI – draft authors, make changes as suggested
Document is unclear on second point in the slides
Document needs to cover the third point. Should be added to framework, and maybe not mandatory
Relevant to endpoint and to MCU. Associate a media capture with endpoint
4th point – okay
Notes from Tuesday PM (Allyn)
Marshall presentation 3-D locations: a coordinate system for telepresence
Slide with 3 screens, 3 co-located cameras, Axis of symmetry of the particular unit. This gives an origin and a coordinate system.
Question – what has an axis of symmetry? Answer each system
Wants polar coordinates for what cameras are seeing
What doesn’t fit in this model? A person with a podium with camera off to the side.
If there is a podium? It breaks the symmetry
There is math, even in the simple cases, not clear we have to specify the origin. What is gained?
Origin is useful only to know where things you care about are
If it’s implementers choice, doesn’t make anything harder
A podium cannot be rendered in this particular set up.
Similarly, additional capture points inside the room
Origin important for polar coords, not so much for Cartesian
Axis of symmetry- everyone has a middle..
Multiple cameras and screens for different sitting arrangements
We need to discuss multi-row further
The finest grained info- Location of the camera and the direction it’s pointing and its fov
Should not be a function of the physical walls of the room
Andy- doesn’t replace the capture region, Cartesian positions of the cameras
Paul – preferred origin, doesn’t believe that people will agree where it is
Steve thinks it makes sense - Where rows are, as distance from the camera.
Allyn questions ability to specify rows within one camera view
Allyn – meta issue - understands Marshall’s description is correct, but is it usable in this context?
Same comment for defining origin, feels not necessary for the renderer
Hadriel – thinks don’t need to have this info to match mics to videos. Thinks can get that info specifically without having to do analysis to determine it.
Jonathan- this way of describing relative captures breaks down for synthetic captures, such as presentations. He wants a way to describe those. And wants to be able to use the same language for synthetic and real.
Marshall’s only works for physical-based captures. Jonathan doesn’t want 2 languages one for synthetic and another for physical
MCU decides who goes on which screen.. or sharing right and left monitor. This language doesn’t make sense.
Language that makes sense is 2 dimensional. Start with simple case, which doesn’t do trigonometry or geometric transforms
Gonzalo – what about zooming cameras? What about focal length?
Marshall doesn’t like this aspect of the framework. Thinks won’t work well for this case
What about when the view is changing all the time?
Stephen B. – relative adjacency wasn’t sufficient to match audio and video
Mark – It is possible to represent a curved plane as flat. The renderer needs to know same info in both cases. The 1 dimensional description will work for synthetic captures
Andy- wants to say the same thing. Wants to have 0-99 mandatory absolutely needed. Needed for segment switching. Essential.
With origin, position, focal length.. optional.
Marshall wants to build a structure that can be used for a long time, so that it will be there later when people need it.
Roni- on synthetic, presentation, and people cases. Agrees can have simple and more info. not enough text about presentation and the mcu case in the draft. Requirements different.
AI – draft authors – clarify as per Roni’s suggestion
Jonathan – doesn’t want to have 2 different languages, want the full geo to be an extension of the simpler language…
Andy VAD - A New thing in framework draft
Want middle boxes to be able to switch without having to decode audio
Determine active speaker for intra-room segment switching
VAD algs must be consistent
Don’t want to disadvantage a receiver that has only one audio stream
Details- video linear range system, middle box receiving multiple captures needs to figure out which is loudest overall and which is loudest within a particular room. Wants to know the energy level associated with that position.
Being able to determine which capture is from the active speaker. How determine which video is active.
What position(s) hold the loudest speaker as well as over all VAD. Consumer can determine which is loudest whether receives all or one pre-mixed.
Tagging active position in a room, audio energy is not the only way – for example, could have buttons on a pad to choose who is active.
Stream configure message specify VAD alg. Provider says which algs, consumer chooses which
Through an RTP header extension
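A rough Python sketch of what a middle box could do with per-position energy levels, without decoding any audio. Everything here is an assumption for illustration: the report tuples, the room and position names, and the convention that a larger number means louder. The notes do not define an energy scale or a wire format.

```python
# (room, position, energy) reports; larger energy means louder (assumed scale)
reports = [
    ("roomA", "left", 12),
    ("roomA", "center", 40),
    ("roomA", "right", 7),
    ("roomB", "left", 55),
    ("roomB", "center", 3),
]


def loudest_per_room(reports):
    """Map each room to its loudest (position, energy)."""
    best = {}
    for room, position, energy in reports:
        if room not in best or energy > best[room][1]:
            best[room] = (position, energy)
    return best


def loudest_overall(reports):
    """Return the (room, position) with the highest energy across all rooms."""
    room, position, _ = max(reports, key=lambda r: r[2])
    return (room, position)


per_room = loudest_per_room(reports)
overall = loudest_overall(reports)
print(per_room, overall)
```

The per-room result would drive intra-room segment switching, and the overall result would pick the active room.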
Security of VAD info, is there something that needs to be decoded? Does it need to be encrypted? Traffic analysis could be done.
Marshall suggests talking with Colin Perkins to get his feedback
AI – draft authors, add to draft as suggested
Roni – the number of audio and video streams is not one to one
Jonathan is currently working on header encryption and energy levels
Jonathan – don’t want anything on the ACs. But for segment switching. Want it on the VC- the energy of the audio for this VC. What do you want the semantics to be? Rather than a correlation of audio and video, he thinks what want is info about the speaker is in this camera.
Andy – issue with that if have a 2 VC case.
Thinks doesn’t work. MCU receives from many multi-video has to compare audio.
One receiver with many audios
See Jonathan’s draft for carrying energy in the header. Avtext Jonathan sent reference to the list
Brian - Didn’t want to do it in video, might want to flow control it off
Roni- agrees shouldn’t be in the video. Important to associate, but not mix by putting audio in video stream.
AI – draft authors – fill out VAD
Paul – asks about effect of different audio levels.
Stephen B thinks there is a bias.
Bo – it’s a real time position stream.
Jonathan and Andy- what about AC0, AC1, AC2 – do they have the same VAD info?
Rather than AC0 VAD just for itself, etc.
But then have a challenge with AC3- that has to combine the 3
Either the same in each or each describes itself
Paul – if replicate info in all of them.
For further discussion
Jonathan – think about. Understand case of getting soundscape right.
Gonzalo – they are setting up an SDP directorate
Messaging – Allyn
Difference in messaging type between SDP’s offer answer and CLUE’s “publish and subscribe”. One conversation in SDP, 2 in CLUE.
Way to characterize the difference is that offer answer is symmetric, and CLUE’s proposal is for asymmetric
Roni doesn’t like term “capabilities” would rather use the term “proposal”
Gonzalo- think about backwards compatibility when think whether will use SDP or not
Stephan- will have real time updates
Jonathan- advertise all the available scenes. Don’t want a full update. Want partial updates.
Partial state update rules out SDP.
Messages can happen whenever
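The difference between a full update and the partial update Jonathan asks for can be sketched in Python. The message shape ("set" and "remove" keys) and the attribute names are hypothetical, chosen only to show that a partial update touches the listed captures and leaves the rest of the advertised state alone.

```python
def apply_partial_update(state, update):
    """Return a new state with only the listed captures changed or removed."""
    new_state = {cid: dict(attrs) for cid, attrs in state.items()}
    for cid in update.get("remove", []):
        new_state.pop(cid, None)
    for cid, attrs in update.get("set", {}).items():
        new_state.setdefault(cid, {}).update(attrs)
    return new_state


state = {
    "VC0": {"switched": False, "composed": False},
    "VC1": {"switched": True, "composed": False},
}
update = {
    "set": {"VC1": {"composed": True}, "VC2": {"switched": False}},
    "remove": ["VC0"],
}
new_state = apply_partial_update(state, update)
print(new_state)
```

A full update would instead resend every capture each time anything changes, which is the cost the partial form avoids.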
Roni and Stephan want to talk about the signaling and transport tomorrow. They feel this is an important part of the framework.
Mary suggests they make slides for tomorrow
List of Framework Issues to discuss- see slides
1. Point/area of capture - Brian
2. Layout – Mark
3. Source selection - Jonathan
4. Attributes for capture sets (along with those already for media capture) – per the hierarchy each level should have attributes - Roni
5. Describe composed picture, rather than just a flag, so endpoint can decompose and put these things on several things (i.e., reuse mechanism for spatial coordinates, etc.), more like tree structure - Stephan will write up something - Stephan
6. Consumer description is not fleshed out
7. Relations to RTP – Jonathan
8. Definition of multi-view
10. Screens vs cameras – Marshall. Does the media producer need to know/do anything with screens?
11. Multiple row use case
12. VAD – input and output
Priority voting – for what to cover, choose 5.
13, 3, 2, 1 plus Stephan’s signaling and transport
Mary RTCWEB presentation
Overview – doesn’t need to interoperate with CLUE
Proposal for CLUE wrt RTCWEB
· Ensure that SDP usage is compatible and consistent to ensure that CLUE and RTCWEB do not define 2 separate ways of doing the same thing
· Evaluate usages of SDP/RTP as framework is being developed
· CLUE needs to consider RTC WEB decisions in terms of handling multi-streams-
o Multiplex over a single rtp session
o Or multiple rtp sessions
Stephan says CLUE spec doesn’t need to reference RTCWEB
Hadriel - CLUE should make sure that RTCWEB doesn’t require a symmetric model
Sending data in the media path- rtp extension headers
Browser support audio video only
Data channel will look different, will be RTCWEB specific
TCP over UDP for example
Covering of Framework Issues
Layout – Mark Duckworth
Switching.. describe a media capture as being switched. Can be from an EP or an MCU. Framework just has a Boolean. No info on an alg. for how it is being switched.
Roni – has an issue with this
What info is he looking for? How often switched, differentiation on the side of the provider. Wants the consumer to know the basis of switching, so can choose which alg. it wants.
AI – authors add in as Roni suggests
Jonathan – multipoint case. Scene switching, if have different geometries, need to know the concrete scene. Has a lot of implications. Whether have a switched capture. Or switching among captures.
Have all info
If switching, want to tell which scene you are seeing
Means need to know geometry for each room
If MCU switches and you don’t know, you can’t do meaningful adjustments. Rely on what the MCU does for you
Request the current loud speaker.
Partial update for new scene in conversation. Capture 38. At run time gets info, you are now seeing 38.
Spencer – from Booleans to enums. Is this capping innovation? If not on the list, will you know what to do? These are the 34 things you can do.. restrictive
Roni- description of options can use
How do you describe new options
A lot of algs can decide how to switch
Stephen - Separate negotiation of policy from attributes need to know
Want MCU to be able to know whether what it’s receiving is composed
AI – authors Separate policy management from attributes
Want to see the current speaker and the current preso
Jonathan – site switching. How do you ask for site switching?
might need other attributes for composition
for switching.. don’t need to describe on the forward path
Andy- should there be attributes of capture set rows?
This isn’t just for multipoint,
Consider 3-1 point to point case for switching
Mark – where is this leading? What’s the use case?
Roni- value in getting info about what the switching alg. Either by the provider who can offer different algs to the consumer.
Meta information about the streams. Getting 3 switched streams, which is which? Depending on policy, which is which?
Tied to RTP mapping
Would a light weight message be fast enough?
Ask for 3 active speaker, get 3 or 2 +1. Sending a dynamic update. But figure out how to do it. Which participant this is.
Discussion of RTP
Jonathan- How do you bind SSRC to a virtual capture?
Assumption demux on SSRC or not. Andy is assuming demuxing on something other than SSRC – what was Jonathan’s issue?
payload, CSRC. Jonathan was assuming demux on SSRC.
Distinction between real and virtual source.
Potential decryption issues in the MCU
Have identified some constraints.. switching and composition needs to be known
Need to know original source, distinction between virtual and real source. Messaging and RTP implications.
If you know the actual source, need to know what it is – roster list … way we bind?
Tricky to act at high speed..
Static pinning- and he becomes loudest speaker, want to get the data only once. Individual mute.
Someone needs to write some text.
This is intertwined with what are talking about for tomorrow
MCU wants to know if composed or original. This is a Boolean.
Auto-switch Boolean and auto-switch with policy. Haven’t totally decided if auto-switch Boolean is okay
Plans for tomorrow
Issue 13, signaling and transport, break,
Area capture- Represent things in the simple way. Agree on this. To be able to use the same language.
Notes from Tuesday PM (Roni)
3D Locations - Marshall
Capture plane is the line that goes through the camera and the origin is on the middle camera.
What about podium how do you put it in the room
Andy- is the axis per system or capture set
Marshall – per system.
Andy – what is the axis for presentation.
Steve- each capture set has its own axis
Jonathan – there is enough math in the solution, why specify the origin
Marshall – the reason to specify the origin, is that the origin will not be consistent.
Jonathan- what you care is where the cameras are. He is not sure if it is a problem if it is anywhere in the room.
Stephan – if there is no screen for the podium there is no way to render correct. Suggest to leave out the podium use case. The system on the other side will not have the monitor to display. Suggest to limit the scope. So a podium does not have a position
Steve – if we define the origin as is nothing break and have some value in the document. The concern is additional capture points in the room.
Marshall – are the other capture points important and do you need to have the relation between them
Gonzalo – polar co-ordinate requires the origin.
Marshall – important if the cameras are not in the center.
Steve – You can have a system that captures a room that is like a U with cameras on all side. It is not TP system.
Mark – is there a value to know the difference between the camera positions.
Jonathan – simple to just know the order – left to right, while the more complicated for those who want to do the geometry.
Steve- we have not done the multi row case enough.
Marshall – center of camera and FOV should be conveyed.
Andy – the area of capture is not the line of the seating but the volume.
Brian – the Cartesian coordinates can be used the same
Paul – people may choose different origin. The receiver may have to calculate the axis based on his system.
Steve – think that the idea to have the distance of the rows based on the capture line is good
Marshall – the systems may be constructed in arcs. We can allow to describe this but think it is not necessary
Allyn – what is the importance of the distance to row if one camera
Steve- helps with identifying the speaker
Allyn – not clear if the information on the area and origin is useful for the renderer.
Marshall – two camera system and the other side 3 screen. Do you want to know where to render.
Mark – wants to have eye contact.
Hadriel – the simplicity of just knowing left and right. Easy to know if the origin is in the center camera.
Jonathan – does not work for composite picture which do not have a camera location.
Gonzalo – are focal of the camera and the zoom and pan of camera.
Marshall – there is no far end camera control.
Steve – went from relative left to right to help match audio to video. Need to keep it.
Mark – if the consumer maps everything to straight axis. But if we use the curves the simple subtraction of only one axis will not work.
Marshall – there are specific cases that may be treated cameras.
Brian – need the simple linear for the simple left to right case and the polar arc case for the ones who want it.
Jonathan – Should have a similar representation and not two different ones for the 3 screen system and the presentation or composed systems.
VAD in clue - Andy
Marshall – talk with Colin of RTP guys.
Jonathan – maybe you want to say that this is the level of audio for this video capture. Or say the speaker is in this camera
Andy – if you put it on a composite video you do not know which one it is.
Roni – the number of audio captures can be different from video
Brian – the audio level should be in the audio stream.
Roni – need to be able to associate the audio to video but not by having the level in the audio stream
CLUE messaging model – Allyn
Started with two of the messages. The provider sends what it can send, the consumer selects and the provider sends and there is a similar exchange on the other side.
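The exchange described above can be sketched in Python as three steps per direction. The message type strings and function names are hypothetical, not taken from the draft; the point is only that each direction runs its own provider/consumer exchange, so a call has two of them.

```python
def advertise(captures):
    """Provider: list what it can send."""
    return {"type": "advertisement", "captures": sorted(captures)}


def configure(advertisement, wanted):
    """Consumer: select from what was advertised."""
    chosen = [c for c in advertisement["captures"] if c in wanted]
    return {"type": "configure", "captures": chosen}


def streams_to_send(advertisement, configuration):
    """Provider: send only captures that were both offered and selected."""
    offered = set(advertisement["captures"])
    return [c for c in configuration["captures"] if c in offered]


def one_direction(provider_captures, consumer_wants):
    adv = advertise(provider_captures)
    cfg = configure(adv, consumer_wants)
    return streams_to_send(adv, cfg)


# Two independent exchanges, one per direction of the call.
a_to_b = one_direction({"VC0", "VC1", "VC2", "AC0"}, {"VC1", "AC0"})
b_to_a = one_direction({"VC0", "AC0"}, {"VC0", "AC0", "VC9"})
print(a_to_b, b_to_a)
```

The two directions need not match: here A sends two of its four captures while B sends everything it has, and a request for a capture that was never advertised (VC9) is simply dropped.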
Stephan – no decision if to use SDP.
Gonzalo – need backward compatibility.
Stephan – we may need to change the mode during a call.
Jonathan – want partial state update which rules out SDP.
Paul – there will be SDP offer/answer. It is about the extra information. It is not independent from SDP.
Stephan – the messaging and SDP relation should be part of the framework.
Stephan – how do we get the CLUE information over the wire.
The composition will be deferred till Stephan provides text.
Auto switch – provide information about the auto switch algorithm
Jonathan – suggest information about what is the area covered in the switch. Are you switching between captures, provide the capture you provide will help.
The switching and composition need to be known by middle box and there need to be a way to describe the streams.
Steve proposes one Boolean – composed or not. About auto switch – we defer it until we discuss RTP relation.
Notes from Wednesday AM (Spencer)
RTP Issues - Jonathan Lennox
SSRC Multiplexing - Asserting that CLUE should use a single RTP session per media type, multiplexing sources by SSRC.
Why not per capture set? Number of capture sets could also be large and asymmetric.
SSRC should be associated with "real" resources? SSRC is the switched source, CSRC is the physical source?
Not possible for individual cameras to generate their own streams? Is that just not done? Is because of key distribution, which is a hard problem anyway.
Would you change SSRCs at the middlebox, or just switch them? Matters because of multiplexing ...
SSRC as a way of identifying who you're seeing, having a switching middlebox change that is a problem. But can't devices change their SSRCs at any time? If there's a collision, or ...
This is a layering thing - I'm interested in left screen/right screen. Relationship needs to be in the signaling layer. This is turning into a protocol discussion, but we need to handle the relationship.
The same thing is happening in RTCWeb. Jonathan - we should have these conversations in AVTCORE, not here.
Do we have to have one m-line with all the choices? If you have two m-lines with one RTP session, that's when we have problems.
Steven - binding SSRCs to captures should happen up front.
Andy thought this was orthogonal to SSRC multiplexing.
Roni - we had this conversation in RTCWeb.
Jonathan - when you send a request to see a capture, you say "add this byte as an RTP extension header" as well. You get your own demux tag and can do what you need to do based on the tag you provided. Don't need to standardize meaning, just pick the right demux tag.
Hadriel trying to avoid changing the format of RTP packets :-)
Paul - sender has to keep per-receiver RTP state because the demux byte would be different for each receiver. Issue for RTCWeb was interop with devices that don't do multiplexing.
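A small Python sketch of the demux-tag idea and of Paul's objection to it. The Sender class and its methods are hypothetical; the sketch assumes the tag is a single byte chosen by the receiver when it requests a capture, as Jonathan describes, and says nothing about how the tag would be carried in RTP.

```python
class Sender:
    """Keeps the tag each receiver asked for, per capture."""

    def __init__(self):
        self.tags = {}  # receiver -> {capture: tag}

    def request(self, receiver, capture, tag):
        if not 0 <= tag <= 255:
            raise ValueError("demux tag must fit in one byte")
        self.tags.setdefault(receiver, {})[capture] = tag

    def tag_for(self, receiver, capture):
        return self.tags[receiver][capture]


sender = Sender()
# Two receivers pick different tags for the same capture.
sender.request("receiverA", "VC1", 7)
sender.request("receiverB", "VC1", 200)

tag_a = sender.tag_for("receiverA", "VC1")
tag_b = sender.tag_for("receiverB", "VC1")
state_entries = sum(len(captures) for captures in sender.tags.values())
print(tag_a, tag_b, state_entries)
```

Because each receiver picks its own tag, the meaning of a tag never has to be standardized, but the sender ends up holding one entry per receiver per capture, which is the per-receiver state Paul points out.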
Stephen - we have a set of RTP sources that need to be grouped.
Jonathan thinks we need to know this mapping before you can display anything.
Roni - issue is when someone joins a conference. How do you get the state of the conference?
Some semantics of SDP descriptions of heavily-muxed sessions get confusing (this is a problem for AVTcore and MMUSIC, not for Clue).
We need RFC 5576 source descriptions in SDP if CLUE is encoded in SDP - less clear if it's not.
Andy and Mark - separating m-lines by roles is the wrong thing to do.
Jonathan - don't have two m-lines unless you have two physical sources - then you have to.
Hadriel - if there's a valid reason you have to do something, you have to do it - doesn't matter why.
H.281 FECC is broken - assumes you know which camera you're controlling, no way to specify which stream you're controlling in a session.
Options to Transport CLUE Messages - Stephan
Chartered to use SIP, but we're doing a three-stage handshake, not offer-answer. Would be desirable to have one exchange, but that's not compatible with the current framework draft.
Is there a conceptual difference between the initial information communicated, and the information communicated during the session lifetime? We think not.
Could piggyback on INFO, UPDATE, Re-INVITE, could do content indirection using multi-MIME body, use CLUE stream as SIP-negotiated "media" stream.
Re-INVITE is the wrong answer because current devices re-initialize EVERYTHING when they see re-INVITEs.
SIP-negotiated assumed to use UDP and MSRP. CLUE messages may not fit into one MTU. ICE-TCP could be an alternative. Proposing BFCP-like handshake.
Not constrained to use SDP, XML is "natural candidate". Gonzalo, Roni, Stephan to write -00 draft (but Gonzalo is not advocating this).
Hadriel - why are we doing this again? Take this to RAI and tell them MSRP doesn't work?
How often are these messages? every few messages, but may be bursty.
What's wrong with INFO? Maybe nothing ... the INFO problem was old-style INFO, without packages.
What's wrong with MEDIACTRL? Would be a good candidate, but not sure anyone has looked at MEDIACTRL. MEDIACTRL is TCP-based ("that won't work").
Paul - similar discussions in SIPREC. Looked at SIP (same questions, same constraints). Discarded INFO because if you could piggyback on INVITE, you could use the same mechanism on UPDATE/Re-INVITE. Decided to use SIP over TCP in restricted environment.
Stephan - don't think we should care about size of CLUE packet.
Steven - m-line approach will allow peer-to-peer, but do people need that?
Spencer - still concerned about application-by-application UDP-ization. Need to solve this via indirection. TCP over UDP or something. ("RTCWeb is going to do this anyway")
Jonathan - some of this maps to SIP SUBSCRIBE/NOTIFY operation (partial updates of a roster, etc.)
Gonzalo - need to look at burstiness.
Allyn - also did analysis of MEDIACTRL and came to same conclusions.
How big can these messages get?
Do we have a bunch of people who have a preference for SIP and a bunch of people who have a preference for media streams?
Need to know what we're sending before we can do anything.
Stephen - in some cases, we're looking at CLUE messages that are multiple KBs. Have to design for that.
Allyn - "reliable UDP"?
Spencer - we had the conversation about TCP-over-UDP in like 2007 during LEDBAT discussions and decided not to go that way. It's four years later now. How much longer are we going to go the way we're going?
Gonzalo - we should have a BOF about this.
Stephen - with 100 meeting rooms, you'll be transmitting often.
Marshall - we need to worry about congestion collapse, because we send out different CLUE messages when participants fall out of the conference :-(
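For illustration (not presented at the meeting): the message-size concern can be made concrete with a sketch of what a UDP-based transport would have to do for a multi-KB CLUE message. The 1200-byte budget and the 4-byte header (fragment index, fragment count) are assumptions chosen for this example, not anything proposed in the discussion. The sketch only detects missing fragments; retransmission, ordering guarantees and congestion control are exactly the parts a "TCP over UDP" or "reliable UDP" layer would have to supply.

```python
import struct

MAX_PAYLOAD = 1200             # assumed per-datagram budget, under a 1500-byte MTU
HEADER = struct.Struct("!HH")  # hypothetical header: fragment index, fragment count


def fragment(message, max_payload=MAX_PAYLOAD):
    """Split one message into datagram payloads that each fit the budget."""
    chunk = max_payload - HEADER.size
    parts = [message[i:i + chunk] for i in range(0, len(message), chunk)] or [b""]
    return [HEADER.pack(index, len(parts)) + part
            for index, part in enumerate(parts)]


def reassemble(datagrams):
    """Rebuild the message, or return None if any fragment is missing."""
    parts = {}
    count = None
    for datagram in datagrams:
        index, count = HEADER.unpack_from(datagram)
        parts[index] = datagram[HEADER.size:]
    if count is None or len(parts) != count:
        return None
    return b"".join(parts[i] for i in range(count))
```

With these assumed numbers, a 5000-byte message becomes five datagrams, and losing any one of them makes the whole message undeliverable until something retransmits it.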
Mark and Brian - Capture topics
No need for receiver to render as an exact duplicate of sending room - just make it useful.
Stephen - if you can do geometric correction, we need to accommodate that. But we'll discuss offline.
Marshall - this is MUST versus MAY - X positions is a MUST. We need adjacencies. Everything else is MAY.
Stephen - but don't restrict your syntax so much you can't include Y and Z - it will be difficult to add this later.
Do we want to allow both Cartesian and polar systems, but not require both of them?
Steven - the reason this works in practice is that camera angles are almost always perpendicular to the participants.
Stephen - Looking for a requirement to support curved conference tables, etc.
Andy - if you have XYZ for camera and for participants, you can figure anything out.
Roni - need to be very clear about assumptions and what happens when they don't apply.
Brian - need to enable highly accurate representation, but don't want to require that.
Brian - if you're not doing gap correction, you don't need a perfect description.
Brian and Roni will chat about what it means to have a "good rendering experience".
Hadriel - why do we have to define units for XY? why would we have anything BUT millimeters? Whenever you have options, you have interop problems. Even in synthetic scenes - synthesize XY, too.
Mark - receiver would like to know that the XY is NOT millimeters.
Jonathan/Steven - you want people to be actual size. You can blow up presentations, but not people.
Hadriel - are moveable cameras out of scope? How does that work when cameras move around?
Brian - could give view of entire potential range, could provide updates, could punt on first down.
Hadriel - are moveable tables out of scope? Having harder and harder time to think XY will ever be meaningful.
Mark - moveable cameras today have a small number of presets - you don't support all possible XY values.
Need to talk about "good experience" one of these days ... :D
Brian - want to provide something simple, allow for something more complex ...
Steven - don't negotiate simple versus complex - subset so you have something simple.
Mark - answer also depends on where camera is, and where it's pointing.
Stephen - just pick an "undefined" value - don't negotiate whether you send something or not.
Roni - various X values need to use the SAME units (so 100-wide objects are all to scale), and Y values need to use the same units as X values.
Allyn - concerned that we might build a system that's extremely correct but not usable. Don't want to put in things that no one will use. We're talking about things that implementers aren't using today, and aren't planning to use in the future.
Marshall - this is a red herring - there's no CLUE yet, right?
Stephen - but stuff can be on roadmaps we can't talk about, and we might just not know the future.
Marshall - of course no one is using this information, because there's no protocol to carry it (no CLUE yet).
Allyn - but implementers say they don't know how to use this information even if they DID have it now.
Hadriel - think of "simple" from an installer's perspective. If it's complicated to install, no one will use it. If people don't use it, that's not useless, it's dangerous. If people are going to use it, put it in.
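For illustration (not presented at the meeting): one way to read the capture discussion is a position record where X is always present, Y and Z are optional but already have a place in the syntax, and every axis uses the same unit. The class and function names below are hypothetical, millimeters follow Hadriel's suggestion, and None stands in for the "undefined" value. A polar description is converted to Cartesian instead of being carried as a second option.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CapturePosition:
    """Hypothetical capture position; all axes in millimeters."""
    x_mm: float                    # always present (the MUST in the discussion)
    y_mm: Optional[float] = None   # None means "undefined", not zero
    z_mm: Optional[float] = None


def from_polar(radius_mm, angle_deg):
    """Convert a polar (radius, angle) description to Cartesian X and Y."""
    angle = math.radians(angle_deg)
    return CapturePosition(x_mm=radius_mm * math.cos(angle),
                           y_mm=radius_mm * math.sin(angle))


def left_to_right(captures):
    """Order capture names by X only, which is enough to derive adjacency."""
    return [name for name, position in
            sorted(captures.items(), key=lambda item: item[1].x_mm)]
```

A receiver that only understands X can still order the captures; one that can do geometric correction can use Y and Z when they are defined.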
Source Selection in CLUE - Jonathan Lennox
"Static" source selection - "pinning". "the current speaker and my boss's reaction"
Senders and receivers don't want to send two copies of the boss's stream if the boss is ALSO the speaker. Do we need to specify what happens when the boss starts speaking?
Scene information from CLUE plus roster information from something like XCON - associate both? potentially a lot of information as people join/leave.
Steven - how far do we need to go in the first version of the protocol? most controls don't get used much.
Jonathan - my understanding is that this is cultural.
Steven - might be.
Jonathan - there are cases where I'm permanently looking at someone small, and switching between images that are big.
Marshall - if you don't understand information, you don't switch - that's the proposal.
Roni - this is video, right? but not always.
Allyn - need to think about interaction with rosters.
Rosters are "out of scope" - what does that mean for CLUE in practice?
What if you offer multiple views of yourself? :D
Conversely - "I never want to see/hear this person".
Paul - remember that we want to see/don't want to see captures, not people.
We're back to assuming XCON or something similar.
Jonathan - use case coverage of this is sketchy, requirements don't mention this at all.
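For illustration (not presented at the meeting): the "current speaker and my boss's reaction" case can be sketched as a selection that never requests the same source twice. The function name and the identifiers are hypothetical. What should fill the freed slot when the boss starts speaking was left open in the discussion, so this sketch simply leaves it empty.

```python
def select_sources(pinned, current_speaker, max_streams):
    """Return the source ids to request, current speaker first, no duplicates."""
    selected = []
    for source in [current_speaker, *pinned]:
        if source is not None and source not in selected:
            selected.append(source)
    return selected[:max_streams]
```

With "boss" pinned, a receiver asks for two streams while someone else is speaking and for one stream while the boss is speaking.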
Chair Discussion - Mary and Paul
Ready to adopt Framework draft? Is this a data model? and/or a framework?
Part of document is framework, some is protocol - split them? attributes are the data model.
Is there going to be an update before we adopt as clue-00? We'll look at issues before we decide.
Roni thinks there will be many changes - would like to see revisions.
Does anyone disagree with adopting now and making changes to WG draft?
Humming to adopt - strong sense of the room to adopt now.
(I stopped typing because Mary was going through the slide with our open issues list)