CLUE WG Interim Meeting (June 7-8, 2012)
========================================
Stockholm, Sweden
Hosted by Ericsson

Meeting summary by Mary Barnes (version 4, June 19, 2012)
Detailed meeting notes by Bo Burman, Keith Drage, Charles Eckel, Roni Even, Rob Hansen, Andy Pepperell, Allyn Romanow and Magnus Westerlund

Conclusions:
------------
1) There can be multiple entries for the "text" attribute in the capture scene, along with a unique "language" attribute, in the framework and data model.
2) Remove appendices from framework. A.1 is out of scope. A.2 & A.3 are related to A.1, thus out of scope. A.4 and A.5 have been superseded by individual drafts which capture the issues and solution options in more detail. Any consensus around these proposals will be considered in terms of additions/updates to the framework, as appropriate.
3) Add an optional element to the framework (and data model) for the "axis of capture" to aid in proper rendering in 3D scenarios.
4) RTP topologies: support per Magnus' presentation: p2p, distributed endpoint, and the three types of mixers.
5) Content-type: reference RFC 4796, describe and limit the values that are used (TBD as to whether it's just "main" and "slides") and specify semantics specific to CLUE. Must ensure that the element values are extensible.
6) Data model: general agreement with the basic approach, format and content of the data model.
7) Criteria: will not continue working on this. The information will be used to inform the decision when we work through the signaling solution.

Action Items:
-------------
- Magnus: send text to clarify concerns with regard to the framework and encoding groups (per slide 6 of the framework presentation)
- Roni: work with Jonathan to produce one RTP document including topologies
- Mark: update the framework based on discussions/conclusions; remove appendices per discussion
- Stephan: send text that needs to be added to the framework for item 3) above
- Charles: once the new RTP doc is available, forward to SIPREC as FYI and for any feedback
- Mary: send link to IMTC document to CLUE
- Allyn: develop and submit an initial call flow document for discussion
- Roni: submit use cases for switched capture

New Issues/Tickets:
-------------------
- FW needs more detail on switched capture (BUT this needs to wait until there is more agreement on the RTP mapping and usage).
- Content-type: decide whether we have a value in SDP and, if so, describe how things work with the value in the framework.
- Add a ticket for VAD.

CLUE Interim meeting minutes 09:00-12:00 (Thursday, June 7, 2012)
=================================================================
Notetakers: Magnus Westerlund, Keith Drage

Notes by Magnus Westerlund
--------------------------
9:10 – Framework
----------------
Mark Duckworth presenting.
Consensus on allowing multiple scene descriptions to provide alternative language versions of the description.
Question about the Content attribute: what values are allowed? All in the registry, or a limited set applicable to CLUE use cases?
Roni Even: we should describe the ones that are applicable.
Botzko: we need to describe interoperability.
Espen: we may for once have the possibility to use some values that have been difficult to use before.
Jonathan: some values don't make sense, but it depends on where we use them. If we use them in SDP there is a different backwards compatibility story compared to in the data model. For backwards compatibility we don't need to use all. We should clarify what they mean, as the RFC is vague on definitions.
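[For reference: the "content" attribute under discussion is the SDP media-level attribute defined in RFC 4796, whose registered values are "slides", "speaker", "sl", "main" and "alt". An illustrative fragment, with made-up ports and payload types, marking one video stream as main video and another as slides:

      m=video 49170 RTP/AVP 96
      a=content:main
      m=video 49172 RTP/AVP 97
      a=content:slides
]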
Roni Even: we don't get interoperability by just including the content tags.
Keith: if we define something, we are creating an alternative definition. Thus the choices are to create new values with the definitions we want, or do an update of the RFC.
Allyn: it causes confusion if you allow multiple tags that overlap; which of main and speaker should one use? CLUE doesn't need that distinction at this stage.
Putting this issue on the list of issues needing future discussion.
Magnus asked about simulcast, i.e. providing multiple encodings of a particular media capture. Magnus thinks the language is not clear.
Jonathan worried about whether this works or creates a combinatorial explosion.
Roni: thinks it works, but you need to take care when describing multiple encodings.
Roni: we need to clarify that simultaneous sets apply across all capture scenes.
Stephen B thinks we have a bit of a problem with "simultaneously possible". Mark responded that encoding groups do have simultaneous applicability.
Andy: the group constraints provide a limit over the combinations.
Charles asked: if a capture can only have audio or video, then you appear to be limited in what you can describe. To be taken up later.
Jonathan: if you have 15 different configurations, do you need 15 different encoding groups?
Andy: no, you rather use the group constraints to set the boundaries and then the consumer asks for what it wants.
Ticket #8 is an instance of Ticket #10.
Espen: given that you make an endpoint with two screens, if you don't really care it would be good to provide the consumer with the provider's preferred set.
Roni: why put effort into this? The consumer can always choose something.
Andy: we shouldn't get into the thinking that the provider knows better than the consumer. That will complicate the model. More discussion.
Switched capture example.
Keith: why is the switch a problem for the CLUE system?
Jonathan: will talk about this later.
Mary: do we need the proposal to the framework? Appears not, based on mailing list feedback.
Roni commented that you don't know how a media capture is represented at the RTP level.
Charles: thinks we should have something in the framework.
Stephen: let the RTP discussion mature a bit.
A.1 video layout arrangement.
Allyn thinks there are two separate issues in this slide.
Mark: this is within a single piece of composed media capture; there is a higher layer composition. That resolved Allyn's concern and she is fine with removing the text.
Keith: if there is only announcement configuration and no negotiation this can't happen.
The meeting agreed to remove.
Christer Holmberg: A.1 does not prevent source selection. But if A.2 and A.3 are being removed we can't do source selection. After discussion it was agreed that source selection is currently out of scope for CLUE. It is for future and wider discussion.
A.4: remove it from the framework and continue discussion in the individual draft until agreement exists in the WG to include it in the solution/framework.
A.5: Roni saying that the VAD issues and the audio rendering tag are not related. Jonathan clarified that there are two issues and both are discussed in the framework. Will create a new ticket for VAD; discussion then continued about the audio rendering tag.

Switched Attribute
------------------
Question from Roni: isn't this what Magnus asked before about simulcast?
Andy: no, that is already supported by the framework.
Paul: when you introduce presentation this gets messy. Put it in another scene.
Andy: it may still be a capture.
If one stops presenting then one would remove the capture. There may be multiple capture scenes. Need for having multi-source synch.
Charles: How do you specify N number of captures provided?
Jonathan: why does the N number of most important captures need to be tied to "switched"? Why not have a capture entry list that explicitly indicates that this is a prioritized list?
Stephen: supports Jonathan; this should be explicit rather than implicit.
Magnus asked if there isn't a number of different sets of conceptual captures.
Roni asked if you are not trying to realize a full mesh camera; nothing prevents you from doing a full mesh conference. Andy responded that there are limitations.
Jonathan thinks it is an important feature, but it is being done the wrong way. It also forces too much logic/policy into the middlebox. It is important to ensure the right balance between the middlebox doing things and providing enough information for the endpoint.
Charles: how strict is the ordering in this set?
Andy: if a provider knows that a consumer gets 4 streams, then you don't reorder the VCs within the set you provide, only replace those that are active.
Stephen Botzko thinks this is a new kind of capture type. Troubled by the consumer providing the layout to the provider.
Andy: there are differences in the layout that affect this.
Stephen: prefers that we provide the information and the consumer can act on the situation, rather than being told by the provider.
Roni Even, Andy: the case to resolve is when the most significant room changes between a 3-camera room and a one-camera room. How does the consumer compose?
Andy: based on the assumption that a consumer can decode multiple streams and scale and compose the media as desired.
No conclusions about this draft. Will be part of the later priority discussion.

Audio tag
---------
Stephen B thinks tags are probably a good thing. But for audio you probably need to be able to address individual channels.
Jonathan: the simple cases can probably be done with RTCP. For the complex cases it is insufficient.
Espen: a question of how to associate audio and video.
Andy: the consumer tells the provider how to associate this by the tag.
Roni: is the assumption that each audio has its own SSRC?
Stephen's case: the issue is that the switching in the middle is going to lose information.
Jonathan: this appears to be quite similar to the previous presentation. We need the information.
Keith: I am concerned that we can't really treat the directionality information for audio like video and cameras. Microphones are more omnidirectional and will pick up audio.
Conclusions: Interesting problem but not clear if it is the right solution. Continue discussion.

Notes by Keith Drage:
---------------------
Thursday, June 7th

09:00-09:10 Status, agenda bash (Chairs)

09:10-09:55 Framework: draft-ietf-clue-framework-05.txt
1) Overview of changes (Mark Duckworth) (15)
2) Open issues (Mark, WG) (30)
http://www.ietf.org/proceedings/interim/2012/06/07/clue/slides/slides-interim-2012-clue-2-8.pdf

Mark Duckworth presenting

Slide 3
Magnus: Do we allow multiple language attributes or just a single one?
Mark: Currently only one is allowed; however there has been a suggestion to allow multiple scene descriptions, each with a different language.
Magnus and Roni support. General conclusion of the room to adopt.
Charles: Content attribute currently depends on RFC 4796 - do we use all values or limit to only those applicable to CLUE?
Mark: No resolution to previous discussion.
Roni: Values can be used by applications. Unless you have something that describes what they mean, there will be interoperability issues.
Allyn: Just limit our usage to two tags.
Stephen: Defining interop doesn't mean limiting the usage of other values.
Espen: Support the values already there.
Jonathan: Some of the 4796 values don't make sense in CLUE.
Keith: Side issue: note that "sign language" doesn't give which version is used. Main point - either we need to update 4796 if we decide to use SDP, or, if one doesn't do this, then one is essentially defining a parallel set of values that is not 4796.
Roni: Main and speaker are being used in some systems right now.
Come back to this issue in more detail later.

Slide 6
Magnus: How to capture simultaneous transfer in multiple encodings?
Mark: Associate the media capture with a group that has more than one encoding in it.
Andy: Elaborated.
Jonathan: Possible explosion of combinations - work through a complicated case.
Roni (back on slide 5): Encoding group - clarify that the set is across all capture scenes, not one.
Stephen: Based on this structure we do have a problem if we want to indicate multiple simultaneous encodings.
Mark: Everything in the encoding group has to be simultaneous.
Andy: Encoding groups can either be specified with an individual max or say a total max is ...

Slide 7
Captured for discussion later.

Slide 8
Espen: Wants to be able to let the provider indicate the best option to choose.
Roni: ...
Mark: The first part of the first bullet the framework already allows.
Andy: Should not capture two screens out of three where the scene offers three. Also not clear that the provider should tell the consumer what to use.

Slide 9
Cover later in RTP discussion. Chair indicates nothing to add to the framework at this time.
Charles: One of the other drafts talks about this, which is Jonathan's draft.

Slide 10
Proposal to remove from appendix, as no longer an issue. Meeting agreed to remove.

Slide 11
Proposal to remove these issues. Ticket #1 has been closed. Meeting agreed to remove.

Slide 12
See the hansen draft and deal with the issue based on that draft, as it is a better explanation. Remove this issue as a result. Meeting agreed to remove.

Slide 13
See the romanow draft on audio-rendering, which is a better explanation of the problem.
Roni: Not the same problem.
Meeting agreed to remove A.5 but raise a new ticket to cover the Roni issue.

09:55-10:15 Framework: Proposals
10:15-10:30 Break (with coffee/snack)
10:30-11:30 Framework: Proposals

3) Switched Capture attribute & spatial coordinates (Andy Pepperell) (20)
draft-pepperell-clue-switched-attribute-00
http://www.ietf.org/proceedings/interim/2012/06/07/clue/slides/slides-interim-2012-clue-2-2.pdf

Andy Pepperell presenting.

Slide 4
Paul: Value judgement that 1 through 4 is better than 5.

Slide 5
Paul: When you include presentation then it gets confusing, particularly if in the same scene.

Slide 8
Jonathan (supported by Steve): This is a new attribute. Overloading "switched" is not the way to go.
Steve: Does not want to see a new advertisement every time someone joins the conference.

Slide 16
Magnus: Need to identify what the conceptual capture is. Gets the impression there are several different types: site switching, segment switching.
Roni: Looks like using the multistream work in order to achieve a meshed conference.
Andy: This is not about all the streams going to everyone.
Jonathan: Doesn't like the way the coordinates are proposed, although this is an important feature.
Charles: How strict is the ordered list?
Steve: Thinks this is another kind of capture scene entry rather than "switched". Not really enabling a full meshed conference. Is concerned about the layout. Knowing the layout is not putting enough control of the layout in the endpoint. Does not want the provider to make the decision that a 2x2 layout is required - that is an endpoint decision.
Roni: Letting the consumer create some displays can produce something bogus.

1) Audio Rendering Tag (Andy Pepperell) (20)
draft-romanow-clue-audio-rendering-tag-00
http://www.ietf.org/proceedings/interim/2012/06/07/clue/slides/slides-interim-2012-clue-2-10.pdf

Andy Pepperell presenting.
Steve: In general having RTP tags is a good idea. Two specific things to keep in mind: need to create tags for the individual channels, e.g. stereo; two screens with an audio channel tag.
Espen: How to associate audio with video? There is only implicit knowledge between audio and video.
Roni: Does each audio have its own SSRC? If so, then can map using the source name.
Andy: If you have multiple cameras, is the relation of the audio defined?

Thursday, June 7, 2012 13:00-18:00 (Notetakers: Andy, Allyn/Rob)
=================================================================

Notes by Andy Pepperell
-----------------------
Note: some gaps due to trips to the microphone.

Afternoon session

Rob: Consumer spatial information
The need for switching: a conference may have 100+ participants, 1000+ captures. Consumers generally want to receive the active speakers. It may not be possible to provide the spatial information from the originator to all receivers. Multiscreen layout concerns – switching 3 captures out to consumers could go wrong if the render order is not known. Solved by provider-side spatial information.
Roni: not sure I understand what you want to achieve here. No problem for the provider to provide co-ordinates.
Rob: the MCU ↔ MCU case is still problematic, as neither consumer nor provider is in a position to give co-ordinates to the other party.
Gyubong: How does the provider use the consumer "Area of Display" information?
Rob: if the provider knew the consumer was rendering a 2x2 layout it would know not to put the top right capture next to the bottom left (i.e. it would know they were not adjacent).
Roni: The consumer needs to convey information to the provider to get a meaningful advertisement. The provider won't send a correct advertisement unless it has layout information from the consumer. If you give the user control over layout, they will change it every second, and CLUE should cope.
Jonathan: we need to figure out whether the layout is controlling the streams being sent (e.g. groups) or the other way round, i.e. layout is determined by who's loudest.
Charles: could use chunks of ordered speakers, so the provider could advertise up to 25 loudest speaker captures organised in chunks of 3.
Keith: how much is reverse advertisement or negotiation, and can we re-use the provider advertisement on the consumer side?
Rob: we did consider various cut-down neighbor-oriented protocols, but the number of additional attributes etc. needed made it less complex to just go for a co-ordinate scheme instead.
Jonathan: really only makes sense for real displaying endpoints rather than cascaded cases. Need to consider the receiver-driven case and whether it can also apply to the MCU cascading case.
Roni: can look at what MCUs do today when forming layouts, and apply that. They look at the windows they want and 1+5 etc. layouts. It's up to them to know how to build it, and it's a similar thing we want here: right streams, right resolutions etc.
Andy: that works well until some sources have multiple captures (camera streams) which have adjacency restrictions, at which point some layouts become invalid.
Steve: Is "area of display" a number-of-screens hint, or a layout?
Rob: it's related to layouts and not necessarily physical displays.
Steve: so it applies to MCU cases too?
Rob: yes, but there are still problem cases.
Paul: in the cascaded case, I'd presume that the receiving MCU would want all raw captures, not switched captures?
Rob: problem with bandwidth and not being able to receive all the raw captures, so restricted to a loudest speaker subset.
Stephane: whiteboard drawing: a 3 camera room with the 3 cameras in a linear row, but an unusual screen arrangement on the receiving endpoint, 2 screens above a 3rd. Without render-side co-ordinates, the MCU might send out 3 captures which would be rendered wrongly.
Andy: believes it's an invalid example, because the consumer wouldn't ask for 3 captures "in a row" if it knew its 3 screens were so arranged, and in fact this is more like a 2 screen endpoint with a separate presentation screen.
Charles: imagine a vertical list of captures with the most important at the top. Could ask for, say, 10 cells in blocks of 3, with the MCU knowing not to split capture groups across rows.
Andy: would you need to be updated when, say, a 4 camera system joined?
Charles: yes.
Steve: could see how, when we ask for a capture scene, the consumer could provide a layout description. More about layouts than physical monitors.
Stephane: you might want both: information on physical characteristics of monitors, where they are, what angles they're at, etc., and perhaps also the wishful thinking of the consumer, how he would like to see things arranged.
Paul: does this have to be done at the time of selection or can it be a capability?
Stephane: I might choose to allocate different areas of my screen for video conference, and change this over time. The equivalent of "where are the screens nailed to the wall" can change dynamically, so it needs to be at selection time rather than a new capability.
Espen: given that the user chooses a type of layout, that might not be the best choice if you dial into a conference with a 3 camera endpoint. Might need to factor in some information on who's in the conference. None of the examples have covered what information you need up front to decide what layout is best.
Roni: thinking about whether it's a capability or a selection, it depends on how it's going to be used.
Paul: it seems like the proposal is to change the model in a fundamental way. Looks like we're talking about removing the capability message and moving it into the selection message.

Stephan: Axis of capture
Room layouts: 2 different room layouts: one way is with multiple cameras in the centre of the room capturing a semi-circle. The other way is with a camera attached to each screen in a straight line. If you display a picture captured from a side panel of an "ellipsoidal" room on a "linear" room side panel, the picture becomes distorted (specifically, a 20 degree angle error).
Proposal: information about the capture axis allows consumer-side geometric correction. An axis in 3D space is defined by two 3D points. We already have an (optional) point for the camera position. Add one more optional point in the camera definition, an "axis of capture point" in 3D space, solely to define the axis of capture. So as long as the point of capture is there, the addition of a single 3D co-ordinate will give the axis of capture, and the renderer can correct. Just using the center of the area of capture doesn't work – the new axis of capture point does not need to be on the plane of the area of capture.
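[Illustration, not from the draft: the proposed axis of capture is simply the line through the two optional 3D points, and a renderer can compare it with its own display axis to obtain a correction angle (e.g. the 20 degree error mentioned above). A minimal sketch, with all names made up:

      import math

      def capture_axis(point_of_capture, axis_of_capture_point):
          # Unit vector from the camera position toward the auxiliary
          # "axis of capture point"; any point on the line defines the axis.
          d = [a - p for a, p in zip(axis_of_capture_point, point_of_capture)]
          n = math.sqrt(sum(x * x for x in d))
          return [x / n for x in d]

      def angle_between(axis_a, axis_b):
          # Angle in degrees between a capture axis and a render axis
          # (both assumed to be unit vectors), which the renderer could
          # use for geometric correction.
          dot = sum(a * b for a, b in zip(axis_a, axis_b))
          return math.degrees(math.acos(max(-1.0, min(1.0, dot))))
]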
Christer: PURPOSE: Define signaling criteria, in order to determine what transport mechanism is most suitable for transporting CLUE related information. Non-CLUE specific information will be transported using existing mechanisms (SIP, SDP).
Keith: do we mean "whether the CLUE information and the media description (SDP) need to be in the same dialogue" rather than message? I don't think the advertisement needs to be a dialogue.

Allyn: Data Model
Andy did an initial version for the previous interim, but organised in terms of messages rather than data structures. Newer version, with Mary's input, organised more in terms of basic information structures. Top-level "CLUE-info" element; includes capture-description, capture-scene, simultaneous-sets, stream-description, encoding, encoding-group. Not all messages would use all elements.
Charles: in encoding-group, do you have audio or video encodings, or is it an "and"?
Allyn: this is just a structure definition; once you come to use the structure it gets concretized to inclusion of actual elements.
Paul: looks to be an example of an XML document rather than a schema for an XML document (which would be harder to read). If I were to map this to XML, there would be a top-level CLUE-info structure in each message, with each message's elements drawn from the set of defined elements. Would now need to construct the individual messages' definitions.

Roni: mapping RTP streams to CLUE media captures.
Jonathan: should be clear that SDP describes RTP sessions, not streams.
Multipoint signalling is based on a centralized conference server using one of the RFC 5117 topologies: Topo-Mixer, Topo-Video-switch-MCU or Topo-RTCP-terminating-MCU.
Magnus: you seem to be trying to define how to put media captures into SDP.
Rob: if using fixed SSRCs in SDP, middleboxes need to modify SSRC values, and so need to re-encrypt when using SRTP (rather than just re-authenticate, which is cheaper).
Jonathan: Captures from the "same room" need to be synchronized.
Andy: there could be some cases where streams (even within a "main" capture scene) might not be synchronizable. Can provide the encoding ID in an RTP header extension, or use a consumer-chosen ID here instead (to allow the consumer to put some structure on the IDs for fast filtering, and to allow encoding IDs to potentially be more verbose).
Steve: what about transcoded audio plus forwarded video?
Magnus: RTP topologies. Signaling can restrict what topologies are supported, and thus the functionality a CLUE system may have. Circular dependencies.
Mary: issues for discussion on Friday: do we need capabilities exchange?

Notes by Allyn Romanow
----------------------
Afternoon June 7

Rob, Consumer Spatial Information
His draft has a specific point and solution. Switching is needed. Don't have spatial information. Incorrect rendering can occur. If you don't have Left Right Center, you don't know the order in which to put the captures. Could split up people incorrectly. If you have switched without coords, you don't know how. Switched with coords works fine. Issues though: it increases the number of offers and the number of possible layouts.
Roni: what's the issue? If the consumer wants 12 streams laid out 3 above, rest below...
Rob: What about 2x2? We don't have the facility to allow the consumer to specify. That is what is being proposed here.
Roni thinks we should not talk about it here.
This is not composed.
Roni: not sure we should be discussing this.
Steve B – in the point-to-point case, coords are enough for rendering. With multipoint we lose this info. Wants the provider to do layout; get info to him to have him do it. Steve wants consistency: doesn't want to do things in different ways for switched captures and non-switched captures.
Stefan – generally supports Rob's proposal. Brings up cascaded MCUs vs. where there is a receiving endpoint. They don't have the same information. Makes it necessary to distinguish between an endpoint consumer and a middlebox consumer.
Andy – Consumer-side and provider-side coords. If MCU to MCU, neither side has real coords. Needs more work.
Rob – a consumer capability message could be another way to solve the problem.
Roni – makes sense to say the number of screens the consumer has. He sees the need for a capability message in order to form the provider advertisement. Layout inside a screen is not fixed in time. Would there be a later opportunity to change?
Rob – that's why he wants to provide information in the consumer request - because of real-time changes. How often does layout change? User pushing buttons.
Jonathan – which is cause and which is effect? Does the source depend on layout, or layout depend on source? Are we looking for "sender gets it right" or "receiver gets it right"?
Rob – this is "provider gets it right" and the receiver provides hints.
Charles – sketches a strips-and-chunks approach: up to N streams, ordered list, switched. Send 10 streams of chunk size 3. Put forth by the provider.
Rob – we wanted a more abstract approach. Don't think provider-driven strips would work.
Espen – what kind of info does the provider need? Consumer says 1 large and one group; wants 2 and 2 groups. What does the receiver need to know to group streams properly?
Rob – the number of attributes to describe strips is large; can figure it out using coords – no need for explicit grouping. Lets the provider know which streams should be put where. Nothing new in protocols.
Jonathan – Only makes sense for endpoints, not for cascaded cases. Doesn't want 2 solutions -- one for cascaded and one for endpoints. A receiver-based solution would do both, where the receiver sends the spatial info in real time. We should think about it. If one solution works for the cascaded case, we should do that.
Roni – what does an MCU need to do to build a layout? Find out the size of the window, the image resolution. Then it is up to them how to build it. This is a similar thing: get streams at the right resolutions, then you can build it.
Steve B – prefers spatial constraints on the ordered list on the sender side. Layout within each display? Or of the display?
Paul K – in the cascaded case, doesn't the MCU want raw captures, not switched?
Rob – bandwidth constraints may make it not possible.
Stefan example: not supported without getting info about geometry on the receiver side.
Charles – Consider a different scheme, one with a prioritized group; specify the number of screens and the layout from the list.
Rob – this assumes adjacency only.
Andy – this constrains the layouts the consumer can do. What happens when a 4 screen endpoint joins? It changes all the layouts. An MCU works this way, but it has all the knowledge; here we are one step away. Would have to have new messages to say what the maximum number is as it changes.
Roni – agrees with Stefan.
Steve B – when asking for a capture scene, optionally provide a layout. Can see how it helps.
Paul K – does this need to be done at selection time or can it be a consumer capability message before the provider advertisement?
Stefan – the placements on a monitor are not nailed down, for a big screen. Very dynamic, so it needs to be at selection time.
Rob – Needs to be based on what's available, which is learned from the provider advertisement.
Espen – what kind of information do you need up front?
Rob – this should be a subsequent discussion.
Roni – a capability or selection? Depends on how it is going to be used. He thinks it helps the provider to know what to offer.
Paul – seems like the proposal is to change the model in a fundamental way and get rid of the capability message.
Rob – he hasn't said anything about the capability message.
Paul – restricts functionality to those entries that are switched.
Keith – we are discussing capability vs. selection message, but he doesn't know what these mean. Comment – they are in the framework.
Andy – doesn't think this is changing the fundamental nature of things. Also, wrt the initial capability message, we haven't discussed it yet. We want an asynchronous message for this purpose. Selection already has the max resolution and hints for layouts and video encoding, etc. He doesn't think the proposal fundamentally changes the model or makes the case for a need for the capability message.
Mary's recap of specific points.
Allyn asks: do we have consensus that this is an important issue, even though we don't have agreement on how it should be solved?

Stefan, Task #9 Axis of camera
Slides: 2 different arrangements of cameras, screens, tables, etc. for 2 different telepresence systems. In one, it's arranged in a semicircle; in the other it is flat to the wall with the presentation screen on top. In the semicircle, the 3 cameras are in the center. In the flat arrangement, the cameras are not in the center but are far away from each other. We want to make these 2 systems interoperable. If one captures from the center and the other assumes it's a flat scenario, then the rendering is 20 degrees off the capture angle. People are distorted.
Stefan's proposal – if we know the axis of capture, i.e. the geometric relation of the cameras, we can do render-side geometric correction. Only the render side knows its own display technology. We already have the point of capture in our capture data structure. Add an optional additional point in the camera definition, an "axis of capture point" in 3D space, solely to define the axis of capture. Need an additional data point. Details need more defining.
Stefan – a function of the camera itself: pick any point on the line and it defines the line.
Keith – not sure if we need to do this.
Stefan – need to know the problem exists in order to do something about it.
Andy – works well, but there are issues with middleboxes not carrying the data through. We have to figure it out.

Signaling Criteria, Christer
Define signaling criteria in order to determine which transport mechanism is best for transporting CLUE related info. Signaling and transport criteria – what he proposed on the mailing list. Is this useful? Do we already know?
Keith – Asks whether CLUE info and the media description in SDP need to be in the same message. Why message? Does it need to be a dialogue?

Allyn, CLUE Data Model
Allyn: Andy put together an initial data model constructed in terms of messages. Mary didn't think constructing it in this fashion was optimal, and suggested instead structuring it after RFC 6501 (XCON).
Allyn: Two initial points. Firstly, the aim is to decide whether this is a sensible methodology for the model that we want to continue with.
The second was that the model includes a few new elements, which are listed in capital letters - we shouldn't consider these now, it's a distraction. Ignore for now.
Allyn: The initial description may appear to match the provider advertisement, and some may wonder where the consumer request is; remember that these aren't messages, but are the elements from which messages will later be defined.
Allyn now goes on to describe the capture description. This contains a number of elements, which aren't defined as belonging to specific messages, but are instead usable in a range of messages. The description includes a new element DERIVED, which is an evolution of the original 'composed' element. A second new element, part of 'spatial-description', is NATIVE-ASPECT-RATIO, which was not discussed.
Allyn: Any thoughts so far?
Allyn now describes the elements in the capture scene, which contains elements matching the framework. Recently 'capture-scene-text' and 'capture-scene-language' have been added. 'capture-scene-spatial-description' has also been added based on previous discussion. Next, the capture scene entry is quickly covered, which has its own element because it doesn't fit in anywhere else. Finally, a further element is discussed, together with the elements it contains. Jonathan notes that it should be in caps, as it comes from a draft and isn't part of the framework. AUDIO-RENDERING-ID is also a new element, based on the audio tagging draft.
Roni: Wouldn't it make sense to divide this structure into codec-specific and general attributes, particularly because of the max-H264-Mbps element?
Allyn: We'll need to discuss that separately, though it sounds sensible.
There was a question about encoding-id and whether it appeared in SDP - others opined that no, it was to match it to captures. Finally, the encoding group contained a set of encoding elements.
Allyn: I think we should establish if this is a good representation of the work we've been doing.
Jonathan: I haven't gone through it in detail, but capitalised sections aside it seems a faithful representation of the framework document.
Charles: I thought media-type was at the capture-scene level? Though I like it better here. Others said that this wasn't the case and that media-type must only be consistent between capture scene entries.
Mark: I agree with Jonathan that this is good. But these don't include the messages - when the messages are defined, will they be part of the data model?
Mary: No, they will be separate but will draw on the data model.
Paul: This is similar to an XML document rather than the schema for an XML document; it's easier to read but less tightly defined. My only concern is that because messages will contain some elements but not others, this is closer to a schema definition, with a separate document to define how the messages are constructed from these elements.

Roni, Mapping RTP Streams
Is there the same or a different SSRC when zooming? In usual conferencing, not in CLUE? Roni says the same SSRC.
Comment - if the stream continues, not if there are stops and starts.
Impossible to distinguish between a single device with zoom in and out, and two devices, one zoomed in and the other zoomed out. A given SSRC can provide one or more video captures over time; fits with the switched model.
Jonathan - We are multiplexing SSRCs, single RTP session. Need for multiple RTP sessions sometimes - for backward compatibility, decomposed hardware.
Andy - wouldn't backward compatibility be best with a few m-lines, not introducing extra m-lines?
Roni wants 1 m-line per simultaneous capture.
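[Illustration, not from the notes: the two SDP shapes being argued over, with made-up ports and payload types.

   Single RTP session per media type, all simultaneous video captures
   multiplexed by SSRC on one m-line (Jonathan's position):

      m=video 49170 RTP/AVP 96
      a=rtpmap:96 H264/90000

   One m-line per simultaneous capture (Roni's position):

      m=video 49170 RTP/AVP 96
      m=video 49172 RTP/AVP 96
      m=video 49174 RTP/AVP 96
]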
Jonathan, RTP Usage for CLUE - what's new in the draft since Paris
Agenda:
• Added a new requirement - the need to synchronize even for switched captures
• Description of his architecture
• Proposed architecture
Media requirement #12 - need for synchronizing. Correlate advertisement and requests ... Multiple transport flows. The coupling between sources and captures for switched captures is loose - loose and dynamic. Single RTP session on a single UDP transport per media type; source multiplexing. The MCU is a translator for static and switched captures, and an RTP mixer for locally generated composited captures.
Steve - if audio is mixed but video is switched, this case needs to be described and examined.
Switched - means provider chosen, not necessarily forwarded rather than coded.

RTP Topologies, Magnus
Differences between topologies and issues. Signaling, topologies and functionality all interrelate: functionality -> signaling -> topologies -> what RTP functionality a given topology enables. Derive correct requirements for signaling.
Evaluation criteria:
• Security - key management, who has the keys, source authentication, end-to-end verifiability, trust in the central node
• Congestion control - multi-hop, one or multiple receivers of the same RTP stream, transcoding, media aggregate adjustments
• Bandwidth consumption
• Media quality
• Distribution of complexity
Tradeoffs.
Topologies outline - list:
• Point to point
• Distributed end-point
• Mesh multi-unicast
• Media mixer - uses its own SSRC
• Media switching mixer - this is what Polycom does. The handling of SSRCs varies; that is why it's relevant (see the sketch at the end of this section).
• Source projection mixer - this is what Vidyo does.
• Relay (transport translator, RFC 5117). With mixers you trust the middlebox; here all are in a common session, and anyone can claim to be another source. The relay could do some basic checking, but could do more; TESLA (RFC 4082) or similar is needed. Won't see media content. There is a possibility of congestion: capacity has to be shared on all paths and has to be explicitly managed. Complexity is all in the endpoints. Not negotiable by offer/answer.
• Selective forwarding switch. Not supported today by RTP. The switch turns individual sources on and off based on policy. Congestion detection is confused by disappearing and reappearing sources. If we want to do this, we need to extend RTP. Heard of people wanting to implement this. Does not need keys, but optimizes content.
• End-point forwarding. Not for CLUE; for rtcweb.
• Any source multicast
• Source specific multicast - lecture hall use case from the use case doc, maybe; more work needed.
Do we need to select supported topologies? Does CLUE signaling need to take all into consideration?
Jonathan - some of the CLUE assumptions of how to use an MCU rule out some of the topologies.
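[Illustration, not from the presentation: a toy sketch of the SSRC handling difference between the two switching mixer types listed above; all names are made up.

      class RtpPacket:
          def __init__(self, ssrc, payload):
              self.ssrc = ssrc          # RTP synchronization source id
              self.payload = payload

      def media_switching_mixer(pkt, slot_ssrc):
          # The mixer owns the outgoing SSRC for each "slot"; whichever
          # source is currently switched in is rewritten to that SSRC, so
          # the receiver sees a constant SSRC whose content changes.
          pkt.ssrc = slot_ssrc
          return pkt

      def source_projection_mixer(pkt):
          # Each selected source is projected toward the receiver with
          # its original SSRC preserved, so sources stay individually
          # identifiable end to end.
          return pkt
]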
Summary and plans for tomorrow (Mary's notes)
Do signaling and call flow tomorrow.
Framework:
1. Do we need a consumer capability message?
2. Content type - what should we do? Options.
3. Relationship between capture scene entries (p.13 framework doc)
4. Ticket #8. How to differentiate between different capture scenes
5. Switched capture - discuss solution options
6. Consumer layout - Rob
Data Model:
1. Evaluation of criteria against data elements
2. Differentiate encoding parameters from codec-specific parameters?
3. Do we agree with the basic approach, format and content of the data model?
RTP:
1. What topologies should we support?
Mary will email the list, wants answers by 6 am. Start at 8:45.

Friday, June 8, 2012 9am-noon (Notetakers: Charles Eckel, Roni Even)
======================================================================

Notes by Charles Eckel
----------------------
9am start of meeting

Agenda Bash
Mary: Should we have a discussion on call flows?
Discussion: We are not ready/prepared to have a useful discussion.
Action Item: Allyn to prepare a draft to guide discussion at a future meeting.

Framework (FW):
1) Do we need a consumer capability message?

           Media Consumer                         Media Provider
           --------------                         --------------
                 |                                     |
                 |----- Consumer Capability ---------->|
                 |                                     |
                 |                                     |
                 |<---- Capture advertisement ---------|
                 |                                     |
                 |                                     |
                 |------ Configure encodings --------->|
                 |                                     |

Discussion regarding the historical purpose of, and current thoughts on the value of, the consumer capability message.
How many people think we need the message, as currently defined in the framework: 0
How many people think we should remove it: 13
How many people think we need more information: 7
Action Item: Take to the list to validate removal.

5) Switched capture: discuss proposed solution options, e.g., the suggestion to use a new attribute for ordering versus overloading attributes, and encoding groups.

6) Consumer layout and capture selection (i.e., discuss Rob's proposal)
Andy's proposal (summary): based on overloading the switched attribute, plus the absence of any spatial relationship of media captures, to mean that a set of 'n' media captures is something that can be rendered in a useful fashion even if only a subset of the media captures within the capture scene is received.
Rob and Andy agree that this and Rob's layout draft both address the same problem, and both propose similar solutions. The consumer has the ability to convey its layout preferences such that the provider can try to align what it sends.
Overloading switched this way is problematic because it is very possible to have a capture scene with a combination of media captures both with and without spatial information.
Paul and Steve: this is a model change, from the provider advertising and the consumer choosing, to the consumer advertising and the provider filling. This adds flexibility at the expense of consistency.
RTP/RTCP can be used to convey the actual spatial information, outside of the provider advertisement or consumer capabilities message. However, this requires waiting some amount of time to learn this information before making video visible; else one may need to relocate or change the layout. The consumer layout coordinate is an attempt to minimize or remove this need while facilitating rendering video in real time.
Action Item: Andy and/or Rob to update drafts to include use case/requirements/problem statement to help clarify the problem they are trying to solve; then have a discussion of the solution.

4) Ticket #8: How do consumers differentiate between multiple capture scenes?
Paul sent an e-mail on this previously; no response. Discussion raised more problems, no solutions. Is user selection based on a textual description sufficient? No.
Action Item: request group to review Paul's e-mail and provide comments.

Lunch at noon. End of notes.
Notes by Roni Even
------------------
Agenda bashing
Should we discuss call flow?
Christer - are we ready for this work? It looks like people are not sure how to progress.
Allyn volunteered to write a draft for the next meeting with a call flow proposal.

Do we need a consumer capability message? Discussed at the microphone.
Paul: Trying to reduce the set of configurations in the advertisements.
Roni: we will need a consumer request message if no capability.
Jonathan: need some basic capabilities, like "I can do CLUE, CLUE version XX".
Keith: You can start with either of the messages.
Andy: Initial thought was for the consumer to say which attributes it supports.
Summary: some people think that we do not need the consumer capability. We need to have flexibility in the advertisement and configuration. Sense of the room: remove the consumer capability message from the framework - 13; leave it - 0; do not think we can decide now - 6.

Switched capture:
Andy / Mark discussion on the spatial relation, similar to what is suggested in Rob's draft, which describes the consumer layout.
Andy: the distinction is whether the provider provides based on the offered layout (Rob's approach) - which may cause the provider, for example when the layout is 2 by 2, not to send a 3-camera view that cannot be divided - while Andy suggests that the consumer chooses the views and builds the layout based on the priority from the provider.
The discussion is about what information should be conveyed, who decides what to send, and how you know what happens when a switch occurs and respond to it fast enough. We need to have the application usage agreed and then the optional solution. There are Rob's and Andy's drafts. Roni proposed another way that is based on source selection. Roni to post use case(s).

Ticket #8
Paul - example of different meeting rooms. Roni - example of medical; Rob - example of security.
Question: Is user selection based on the text description enough to address the topic? Need use cases.

Friday, June 8th, 2012 - 13:00-16:00 (Notetakers: Bo Burman, Rob Hansen)
=========================================================================

Notes by Bo Burman
------------------
Issue prio 4: RTP Topologies
The topology media mixer is supported. Are the new topologies, media switching mixer and source projection mixer, supported?
Jonathan: Support all these models if you want to support certain functionalities.
Roni: Support at the RTP layer is different than at the application layer.
Magnus: Think we need to support all three. The main difference between media switching and source projection is what you have to do to support identity. You need additional information about identity and capture in both cases.
Roni: How to handle RTCP?
Magnus: The mixer will have a role where it has to manipulate RTCP information. It is partly implementation choices.
Roni: Will need to describe RTCP behavior.
Magnus: Yes. There is a congestion control workshop, including a BoF, in Vancouver.
Jonathan: There are cascading issues and all CLUE-enabled boxes will have to work together. The topologies that CLUE decides to support will not only have to work independently, but will also need to work together. The only interesting case is the switched capture.
Jonathan: The fixed case collapses to a switched capture which never switches.
Paul: Also a media mixer that does not mix or transcode.
Magnus: Andy or Rob made the point yesterday that it will need to support different topologies simultaneously. They degenerate together, which is one of the reasons that we need to support all of them.
Paul: Does it matter to the receiver which model the sender uses?
Magnus: Slightly. At least the number of simultaneously active SSRCs differs.
Mary: We need to reflect this in a document. Roni and Jonathan will work together and update the existing document. Magnus will contribute by reviewing.
Paul: Do we need different RTP extensions for the different models to map RTP to media captures? That should be investigated and go into the document.
Jonathan: We also need to ensure that the signaling supports multiple transports, like for the distributed end-point or for different QoS.
Magnus: We seem to be in agreement on distributed end-point. What about mesh? Are there CLUE information problems there? Think that we need to discuss it.
Jonathan: Don't think there will be CLUE problems. We may have to discuss it.
Keith: So far the charter excluded mesh. My assumption is that we should exclude it. My understanding is that all work made in DCON excluded mesh.
Gonzalo: There was not enough energy in the DCON BoF to start mesh work.
Paul: Consensus call: humming unanimously opposed taking up support for mesh in CLUE.
Mary: If someone thinks we should support it, they need to bring a draft.
Magnus: I would like to conclude that the type of Relay described in the RTP Topologies presentation is not in scope.
Charles: The end-point forwarding reminds me of siprec. Will look into that.
Jonathan: We probably want to consider how we siprec CLUE.
Mary: We could put an issue on the tracker.
Keith: Suggest a paragraph in the framework that siprec can start to work with, since it is not in the CLUE charter.
Mary: We have no use case and nothing in the charter. Siprec likely does not have it in their charter either.
Paul: At some level siprec could already deal with CLUE RTP data. It has an extensible mechanism to deal with metadata.
Roni: siprec was motivated by people doing recording systems, not end-points.
Charles: some of the CLUE complexity does not apply to siprec. Regarding RTP models, the things discussed here will be a good starting point for siprec, probably based on an updated CLUE RTP draft.
Keith: I believe we should have text in the framework.
Mary: Disagree. When we have a draft we could send that FYI to siprec.
Jonathan: One endpoint in CLUE is a recording device.
Mary: When we have call signaling and RTP outlined for CLUE we could send them FYI to siprec.
Magnus: What about multicast?
Paul: I don't know anyone that made multicast work with SIP.
Keith: It is described in H.361, but was never updated to really make it work.
Magnus: Summary: p2p, distributed end-point, and all three mixer models should be supported in CLUE.

Issue prio 5: Content Type
 *   Borrow 4796 and refine semantics in CLUE
 *   Update 4796 with additional semantics to ensure interop
 *   Reference the IMTC document, if published as an informational / AD-sponsored document
 *   Something new
Charles: If it was an accessible IMTC document, would that be sufficient? It is a separate issue.
Mary: Yes. It would be good if Charles could look into that.
Roni: How do we define values? If separate from 4796, we can define other values. Do we signal the information in two places, in SIP and elsewhere?
Keith: What do we need to have the same values for?
Jonathan: The IMTC document discusses procedures to handle fallback. The one benefit it would have is for CLUE to be backwards compatible, which is somewhat different from defining values.
Steve: The IMTC procedure is a legacy interop procedure.
Paul: If that is a requirement, we need to write it down, explicitly.
Mark: Is there yet another issue with the IMTC document, using BFCP? Do we need BFCP floor control, even when not talking with another CLUE end-point? Seems to be agreement.
Keith: Want a link to the document (Charles posted it to the mailing list). If we need new semantics or new values for content type, that should be in 4796bis, not in CLUE.
Paul: Whatever you call it, is there an implication to use floor control for content that has a certain type, but not for other types?
Charles: Don't think so. If people see use for new values, they are welcome to define them.
Jonathan: I think there is an issue with BFCP. Don't know how to say that in CLUE, since the CLUE side is all provider side and consumer side. Don't know how to hook a BFCP floor to a capture in CLUE.
Keith: BFCP is getting special roles. What does that imply regarding media from that end-point? It has to be defined per media type.
Charles: Talking about BFCP is premature at this point in time, even before we have a data model.
Keith: We can progress CLUE signaling semantics independently from what is available in SDP today.
Jonathan: Don't know how to fit use of BFCP into our framework. That is a problem. The IMTC document has nothing to do with this. If we define a structure where this is impossible, that would be a problem.
Mary: Floor control is considered out of scope for CLUE.
Roni: That was for conference control, not for handling presentations.
Allyn: Agree with Jonathan that our framework should allow using BFCP. Propose putting it as a task item. Want to use only "main" and "slides", "sl" and extensions (not "speaker"), a subset of 4796.
Keith: Valid to look into how to support existing floor control in CLUE, not to define our own floor control.
Paul: Ask Jonathan to write something about the BFCP concerns to the mailing list.
Mary: Reference 4796, describe and limit the values that are used and define semantics for CLUE.
Keith: The only way of doing this is to update 4796.
Roni: That CLUE defines more precise semantics does not require changing anything in 4796.
Mary: a) reference, b) update, c) something new (don't reference). Raise of hands: 11 a, 1 b, 1 c. Majority support option a.
Roni: Do we have the value in the SDP?
Keith: Think we should start fresh regarding the content attribute for CLUE.
Mary: People don't want "alt" and "speaker", want "main" and "slides", and are mostly neutral on "sl" (sign language). Only use values "main" and "slides": 8 raised hands. We concluded that we're going to reference 4796, but we don't have full consensus and will take it to the list.

Issue prio 6: Do we agree with the basic approach, format (structure) and content of the data model?
Mary: Is this a good start? 12 raised hands.

Issue prio 7: Relation between capture scene entries
Charles: How is the capture scene entry represented, especially in relation to a capture scene?
Allyn: See the draft.

Issue prio 8: Evaluation of criteria against data elements
Christer: Is this data model to be seen as a single entity, or should we start to map it to various protocols?
Mary: Will not continue working on this. We will also record criteria in the framework and in the data model when we work through the signaling solution.

Issue prio 9: Relationships between capture scene entries
[some confusion in which order we discuss issues here!]
Charles: Some media captures could be more useful than others; how can the provider announce that? See mail.
Mark: Think AC0,1,2 could be more useful if we pursue the audio tag draft.
Andy: Why not use AC0-2 if you have those rendering capabilities?
Choosing VC3 or VC4 could be user-based, depending on what he/she wants. The choice of AC3 may be bandwidth.
Roni: VC4 can be identified as being the full room from the spatial information. All combinations may be useful.
Jonathan: Agree with Charles. The consumer needs sufficient information.
Steve: You can distinguish since you have the spatial information. VC3 and AC0-2 can be good for a mobile or limited device.
Charles: Saw it as a problem. Some people see it as a problem, some not. Need to think about it.
Rob: Maybe VC3 is less appropriate for spatial audio than VC4, since VC3 is switched, meaning that sometimes the spatial audio will come from the left, sometimes the right, but the video will always be front. We would need to express that we get a segment switched into a single video.
Roni: Renderers are allowed to use the information differently.

Conclusions (notes outside of conclusions as presented to the meeting)
Roni: Move the switched capture priority conclusions to below call flows.
Mary: Update milestones to reflect that use cases and requirements were decided to be left open until we have a framework that is mature enough.

Notes by Rob Hansen (13:00-16:00, June 8, 2012)
------------------------------------------------
1pm: Meeting restarts

Topic: What RTP topologies should we support?
Identified three topologies as in use:
- media mixer
- media switching mixer
- source projection mixer
No one initially felt any of these three could be discarded as not desirable to support, nor was the RTCP behavior for all the topologies fully defined.
Jonathan brought up that cascading was an issue, and that it was important that CLUE middleboxes with different topologies should work together - there was general agreement on this.
Paul brought up the question of whether a given topology would be used for different use-cases in CLUE, or whether all of CLUE should function with each topology. Generally it was felt that the latter was the case, though. It was proposed that the media mixer topology had a strong connection with offering composed captures in CLUE and was the only sensible way to do it, so the difficult case was that switching could be done either by media switching or source projection.
Magnus reminded people that Andy had brought up the point that a consumer might receive packets from multiple topologies (e.g., both switched and composed mixers). People opined that this was true, but that hopefully it didn't make sense for a single middlebox to combine media switching and source projection mixer modes.
Paul wanted to know if the consumer had to operate differently to cope with different topologies - Magnus said that there would be slight differences based on source identification.
Mary asked if all of this was documented - Roni said no, but that he and Jonathan would work on a single document to support those topologies, potentially with Magnus and Allyn's help.
There was consensus that we would need to support the three topologies initially stated, and that the document would examine the mechanisms for doing this, along with ensuring that the methodology would work for scenarios such as decomposed captures.
Magnus asked if there was interest in supporting a multi-unicast mesh topology. There was general agreement that this would require little change at the RTP level. There was debate about what this would mean at the SIP level, and Keith pointed out that within the IETF at present mesh topologies were out of scope. A consensus call was made, and there was unanimous consensus that this was not something we should pursue.
The chairs stated that this conclusion would be put on the list, and anyone who disagreed should also be willing to do some of the work on this issue.
Charles believed that the end-point forwarding topology shouldn't be immediately dismissed, and that there was a larger question of how SIPREC and CLUE would interact. Mary pointed out that there was no current use-case for recording, and questioned whether there was a requirement to add one. Roni suggested that SIPREC was designed to be transparent at the SIP level, and that it was their responsibility to add support for CLUE if they felt it was needed. There was discussion of the next step to take; Mary proposed that once we had made progress this be sent to SIPREC for review, to see if the RTP and signalling would fit well with SIPREC.
Multicast was proposed and there was a general feeling that no one wanted to explore this.
As such, the valid topologies for consideration were concluded to be point-to-point, distributed endpoint and all three mixer models.

1:50pm - Topic: content-type
There was discussion on the approaches for making changes to content-type. Options included defining semantics for RFC 4796 for CLUE, updating RFC 4796 with semantics, finding a way to publish the updated IMTC document on the issue, or doing something new entirely. There was discussion of whether there was actually value in using RFC 4796; for fields in CLUE there's a less strong requirement to reuse RFC 4796. It was concluded that while we have a general requirement to interoperate with current SIP devices, it would be valuable to explicitly write down that we want to ensure that existing methods for content in SIP should continue to work when connected to a CLUE device. Charles said that he would look at making the IMTC document publicly available.
Paul asked if there should be a requirement for media types labelled as 'presentation' to have BFCP linkages. This led into a discussion of how BFCP and CLUE would interact, though there were no firm conclusions and it was felt that it was too early to come to a conclusion. Jonathan was concerned that he couldn't see any way to attach BFCP floors to the CLUE framework, and that before we got too much further we should ensure that we weren't designing an architecture that was incompatible with BFCP floors. It was felt that those with time and interest should look at how BFCP and CLUE would interact.
Returning to RFC 4796, there was procedural debate about the best way to reference RFC 4796, if we did so, within the IETF. A hum was taken with three options:
a) reference RFC 4796, describe and limit the values that are used & specify semantics specific to CLUE; must ensure that the element values are extensible
b) update RFC 4796 with semantics for the purposes of CLUE
c) something new (values and semantics) (don't reference RFC 4796)
There was strong, but non-unanimous, consensus for (a). Having done so, there was a consensus call on whether we should limit the RFC 4796 values to 'main' and 'slides'. Those who expressed opinions agreed, and no one actively disagreed, but some in the room were concerned that there wasn't yet sufficient information, and as such there wasn't sufficient support for it to be considered that there was consensus.

2:40 - break

2:55 - Topic: the approach, format and content being taken with the data model.
Allyn provided a bit of clarification and then there was a call for consensus on whether people thought the current document was a good starting point.
There was consensus that this was a good approach, though the consensus was not strong enough for this to be made into a CLUE document without further iterations.
It was agreed that the requirements criteria would not be folded into the data model, as there isn't good overlap between them. It was agreed that creating the requirements criteria was useful and that they would serve as a good reference when making protocol decisions, and could help to resolve disputes.

3:10 - Topic: Charles' query on selecting capture scene entries.
Charles questioned whether there was enough information at present to choose between capture scene entries, and if so, what the criteria for selection would be. Specifically, it was a question of how you would differentiate between a composed view giving a view of the entire room, and a switched capture showing one-third of the room at any given time; in the first case it makes sense to have spatial audio left-centre-right, while for the latter it may make more sense to centre everything to mono.

3:30 - Wrap up
Mary presented a set of conclusions, action items and new issues/tickets. A list giving the way forward for documents was presented.
There was a brief discussion of SCTP over UDP, as there was some concern that it had been discussed but not documented anywhere. The discussion resolved that there was not yet a conclusion that there would be a separate data channel, but that if we did use a separate data channel it would be a strong candidate.
The way forward for the documents was:
Framework: Mark to make updates
Data model: Allyn to make further progress
RTP: Roni and Jonathan to work together and include topologies (Magnus to review)
Call Flows: Allyn will put together an initial call flow document
Switched Capture: need to further discuss this
Audio Rendering tag: more discussion needed