Audio/Video Transport Core Maintenance (avtcore) Working Group
===============================================================
CHAIRS: Jonathan Lennox
        Bernard Aboba

IETF 110 Minutes
Thursday, March 11, 2021
04:00 - 06:00 Pacific Time
Session I, Room 1

IETF 110 Info: https://www.ietf.org/how/meetings/110/
Meeting URL: https://gce.conf.meetecho.com/conference/?group=avtcore
Etherpad: https://codimd.ietf.org/notes-ietf-110-avtcore
Slides: https://docs.google.com/presentation/d/1eEKqUS1YkX-VTVbXRPuVify1ZmCsjA6B-1X13plaPTk/

-------------------------------------------------

Agenda

1. Note Well, Note Takers, Agenda Bashing, Draft Status (Chairs, 10 min)
2. RTP Payload Format for VP9 Video (Jonathan Lennox, 10 min)
   https://datatracker.ietf.org/doc/html/draft-ietf-payload-vp9
3. RTP Payload Format for ISO/IEC 21122 (JPEG XS) (Tim Bruylants, 5 min)
   https://datatracker.ietf.org/doc/html/draft-ietf-payload-rtp-jpegxs
4. Completely Encrypting RTP Header Extensions and Contributing Sources (Cryptex) (Justin Uberti, 5 min)
   https://datatracker.ietf.org/doc/html/draft-ietf-avtcore-cryptex
5. RTP Payload for EVC (Stefan Wenger, 10 min)
   https://datatracker.ietf.org/doc/html/draft-ietf-avtcore-rtp-evc
6. RTP Payload for VVC (Shuai Zhao, 10 min)
   https://datatracker.ietf.org/doc/html/draft-ietf-avtcore-rtp-vvc
7. SFrame RTP Encapsulation (Dr. Alexandre Gouaillard, Sergio Garcia Murillo & Youenn Fablet, 60 min)
   https://datatracker.ietf.org/doc/html/draft-gouaillard-avtcore-codec-agn-rtp-payload
8. Wrapup and Next Steps (Chairs, 10 min)

-------------------------------------------------

1. Chair Items

- framemarking
  Mo: Just published an update with VP8 and VP9 information. The VP9 text is the section taken out of the VP9 RTP payload spec. The VP8 text comes from the discussion (relating to rewriting of the PictureID and TL0PICIDX), even though I could not find confirmation that this is an actual issue. The VP8 question will be taken to the mailing list.
- Expired draft: tetra
  No objection to expiration.
2. VP9 RTP Payload

Jonathan: Proposal to send version 12 for publication?
Mo: There might be a feature not carried over from VP8 that might make framemarking/DD difficult. It might be an oversight. Shall we address it before we send it for publication?
Jonathan: I will take a look, hoping we will not have to dig into the bitstream, given the wide deployment.
Mo: The question is: is it a voluntary omission, or a mistake?
Jonathan: It looked like we did not use it [in production], so we might have overlooked it. But it is worth asking on the list. Bernard, we do have IPR disclosures, right?
Bernard: Yes. I sent the list of IPR disclosures to the [mailing] list. There was no reaction. So the question is: does anyone object to sending the VP9 RTP Payload draft for publication, knowing the IPR declarations?
No objection.
Jonathan: Finally, an SDP question from Christer: how do browsers handle max-fr and max-fs in practice, especially when the real value is above the max?
Justin Uberti: SHOULD is the preferred way.
Jonathan: That was my preference too.
Decision: make it a SHOULD, then publish draft 12.

3. JPEG XS

Draft 09 should have addressed all the points. Question to the chairs: what is/are the next step(s)?
Bernard: The next step should be chair review and a publication request.
Decision: chair review.

4. Completely Encrypting RTP Header Extensions and Contributing Sources (Cryptex) (Justin Uberti, 5 min)

Juberti: The new version is 01. It changes the name of the SDP parameter used to negotiate the mechanism (a=cryptex), plus editorial cleanups. Next step: test vectors, and then we will be ready for WGLC. Sergio Garcia Murillo is working on test vectors; they should be ready for the next meeting at the latest.
Bernard: Are we ready for implementation anytime soon?
Juberti: I don't know the time frame in Chrome. I will discuss with the team, and it should come out in the coming months.
Sergio: We already have a small vector. Not a lot of work left. Jonathan has something as well.
Jonathan: Yes.
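The cryptex negotiation discussed above is a simple declarative SDP attribute. As a hedged sketch, a media section of an offer might look roughly like this (the port, payload type, and extmap line are illustrative assumptions; a=cryptex is the parameter name per the -01 draft):

```
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
a=extmap:1 urn:ietf:params:rtp-hdrext:sdes:mid
a=cryptex
```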
5. RTP Payload for EVC (Stefan Wenger, 10 min)

Shuai Zhao (SZ): Quick update on the EVC draft. The current draft is 01, based on the VVC work.

6. RTP Payload for VVC (Shuai Zhao, 10 min)

Shuai Zhao (SZ): Quick update on the VVC draft. The current draft is 08. Most of the 24 comments have been addressed; 13 comments were resolved directly as suggested by the editors. Note that the informative note for the M bit and some optional SDP parameters were removed. The draft focuses on SRST for SVC, with other simplifications such as the removal of the sections on SLI and RPSI feedback (seldom implemented nowadays); there was almost no usage of those with HEVC. Proposed text for the reserved 'R' bit in the Fragmentation Unit header ... Next: 4 remaining editor notes and a big SDP O/A section (not expected to be hard). ETA for WGLC: June 2021.
Bernard: Thanks for removing unneeded sections and making the spec more readable.
Jonathan: On the 'R' bit: how do you use it in the SVC context? How does this compare to the Marker bit? Will take it to the list.

7. SFrame RTP Encapsulation (Dr. Alexandre Gouaillard, Sergio Garcia Murillo & Youenn Fablet, 60 min)

Jonathan: Big topic; we're ahead of time, which is probably for the better.
Bernard: We will hold the queue until slide 34.
Youenn: Intro: why a codec-agnostic packetizer? We want to enable E2EE. It is currently almost possible in browsers/WebRTC, but in a hacky way. We would like to standardize the approach and make sure it is as compliant with existing specs (RTP/RTCP) as possible. There has been lots of interest, and lots of questions about it, in previous AVTCORE and SFrame meetings.
Youenn: How does it work with e.g. VP8 or H.264, without SVC? Raw Frame => Metadata + Encoded Frame => Metadata + Transformed/Encrypted Frame => RTP packets (header + encrypted payload). The metadata attached to the encoded frame is kept for the transformed frame.
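Youenn's non-SVC pipeline can be sketched in a few lines of Python. Everything here (the function names, the XOR stand-in for the SFrame transform, the 1200-byte MTU) is a hypothetical illustration, not an API from the draft:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EncodedFrame:
    metadata: dict          # kept unchanged across the transform
    payload: bytes

def sframe_transform(frame: EncodedFrame, key: bytes) -> EncodedFrame:
    """Stand-in for the SFrame encrypt step. XOR is NOT encryption;
    it only marks the payload as 'transformed' for this sketch."""
    scrambled = bytes(b ^ key[i % len(key)] for i, b in enumerate(frame.payload))
    return EncodedFrame(frame.metadata, scrambled)   # metadata carried over

def packetize(frame: EncodedFrame, mtu: int = 1200) -> List[bytes]:
    # Transform-specific packetizer: split the opaque encrypted payload
    # into MTU-sized RTP payload chunks without parsing the codec.
    return [frame.payload[off:off + mtu]
            for off in range(0, len(frame.payload), mtu)]
```

The point of the sketch is that the packetizer never looks inside the encrypted bytes: it needs only the length and the carried-over metadata.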
Youenn: Now, more interestingly, the VP9/AV1 SVC (layered) use cases: here the encoded frame is a set of subframes, and each subframe is then transformed/encrypted as above. The subframes should share the same timestamp and otherwise maintain all the properties expected from a layered bitstream.
Youenn: The 'transform' input being the output of an encoder, it is obviously codec-specific, even though the SFrame transform is less sensitive to that. However, post-'transform', the packetizer itself is 'transform'-specific and not codec-specific.
Youenn: We tried to gather some more detailed specifications in response to the demand for clarity during the SFrame WG meeting.
Colin: "Packetizer" there is wrong; the encoder has some kind of packetization (NALU/OBU). You are talking about fragmentation, which is different from packetization. I'm worried about duplicated information that would not match (in the encrypted payload and in the RTP extension).
Youenn: On redundancy of information: there is already redundant data anyway. Plus, some data will be used (and dropped) by the SFU, while the other copy will go all the way to the remote end. By the way, we should be careful about what we keep open for the SFU to see, because it has security implications.
Colin: Why do you put metadata in the payload, rather than in an RTP header extension?
Sergio: We don't. Metadata is placed in an RTP header extension.
Sergio/Youenn: I think we agree.
Cullen: What parts of this are really codec-agnostic?
Bernard: This representation mirrors the implementation of Insertable Streams, which provides access to the payload but not to the RTP header and extensions. But even in Chrome you also have WebCodecs and other APIs, which permit the application to have complete access, so it can fill out its own (QUIC) datagrams. In that situation, the application will do packetization as well as the transform. I'd also like to follow up on Cullen's question, relating to what is application-specific versus codec-agnostic. The metadata might be the same for every codec, but does this imply that every codec will use the same RTP header extension?
For example, we have heard that framemarking works fine with H.264/AVC and temporal scalability, so can an application decide to use that? For Insertable Streams, the answer is "no", but for the WebCodecs architecture, the answer is "yes". As Stefan noted, we've developed at least 3 RTP header extensions for forwarding so far, and we might have more. The metadata that we define today might not be sufficient for some future codec. So I think you need to define clearly what "codec agnostic" means, and whether you are creating requirements that you don't need to impose upon yourselves. You should think about whether you want to be truly "codec agnostic", or just have a process that would be identical for all codecs, but with different blocks: some per codec, some per application, some per transform.
Youenn: Currently we believe it would be the latter. Ideally it would be the former, but that might be too hard, and currently H.264 and VP8 are working just fine today with normal framemarking. I think it is orthogonal to the problem of packetizing encrypted content, which has its own RTP header extension.
Jonathan: The decoder half of the picture might be more complex than the encoder half. In particular, you need to be able to do things like figure out whether you are missing a packet that you need to recover. This is where the metadata comes in, providing information such as the other frames that a frame depends on, whether the frame is discardable, what layers are required for a given decode target, etc.
Youenn: That is true. We have thought about the decoder, but we thought the encoder was a good place to start the discussion.
Sergio: For AV1 it's made easier by the Dependency Descriptor, which provides all the needed information.
Jonathan: Reassembly within a frame is easy: you NACK or use FEC. It's cross-frame that is hard.
Sergio: Well, the AV1 Dependency Descriptor gives you exactly that.
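Jonathan's point about the decoder/SFU side (knowing a frame's dependencies, whether it is discardable, and which decode targets it belongs to) can be illustrated with a small hedged sketch. The field names here are invented, loosely modeled on Dependency Descriptor concepts, and not taken from any draft:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class FrameMeta:
    frame_id: int
    depends_on: List[int] = field(default_factory=list)   # frames this frame needs
    discardable: bool = False                             # nothing later depends on it
    decode_targets: List[int] = field(default_factory=list)  # targets it belongs to

def can_forward(meta: FrameMeta, target: int, received: Set[int]) -> bool:
    """SFU decision without parsing the (encrypted) payload: forward a
    frame to a decode target only if the frame belongs to that target
    and every frame it depends on has already been forwarded."""
    return target in meta.decode_targets and all(d in received for d in meta.depends_on)
```

This is exactly the kind of decision an SFU must make from header-extension metadata alone once the payload is end-to-end encrypted.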
Bernard: Hold on. Are we going to require that all codecs implement support for the Dependency Descriptor? The DD was created to support the complex spatial modes of AV1 and VP9. But for VP8 and H.264/AVC with temporal scalability, that might be overkill. What happens if you use framemarking instead for those codecs? Or what happens if the DD is insufficient for some future codec (e.g. VVC/EVC)? And of course the DD is only used for video, not audio, right? So Jonathan's question is still valid.
Jonathan: The encoder decides how to packetize frames (NALU/OBU/...); how does that impact the 'transform' step?
Colin: This does not feel like it is really codec-agnostic, given that the only examples we saw were limited to a very small number of codecs used in WebRTC (e.g. VP8, VP9, H.264, AV1).
Youenn: There is more than meets the eye. We did not list in the examples every codec we can handle even today.
Colin: Not all applications that use RTP use WebRTC.
Sergio: Right, but it is not aimed at replacing RTP; it is an extension, and it is opt-in.
Youenn: Just like RTP header extensions, which do not make any sense outside a certain scope.
Colin: If you want to define a codec-agnostic payload, it needs to be codec-agnostic.
...
Bernard: I suggest that you need to better define the meaning of "codec agnostic". Are you saying that the transform is codec-agnostic, the same way that SRTP applies to all codecs? That I can understand.
Colin: In order to depacketize this, eventually you need to know what the underlying codec is.
Youenn: While we use the same payload type for all transformed packets, there is an "APT" RTP header extension that tells you the actual payload type inside the transform.
Youenn: We do not want to have an h264_encrypted payload type, because if we did, we would double the number of payload types consumed. WebRTC is already using 30 payload types; that would increase it to 60. So we use a single payload type for the encrypted payload, plus a payload type for the underlying codec (e.g. H.264/AVC).
Stefan: Too many oversights here.
- The design is not codec-agnostic. It is limited to a certain number of codecs. For example, for AV1 you need the DD. There are tons of codecs which won't support the DD, or for which the DD may not be sufficient; VVC and EVC are examples.
- I also understand that you want a middlebox to be able to deal with many codecs without needing code to parse each payload. I would suggest that you don't call it "generic", and that you list the codecs it is known to work with, based on implementation experience.
- Remember that the vast majority of interactive media today, even the vast majority of RTP, is not WebRTC.
Youenn: Agreed; trying to be fully codec-agnostic, especially for the SFU, might be too hard. We are trying to move the information needed by the SFU from the RTP payload to RTP header extensions. Hopefully, while different, the variations would be small, and that would be progress. I agree that we need to remove "agnostic".
Sergio: I am curious about the VVC <==> DD match. I would love to see what is not covered, for example.
Stefan: OK, I could do it. However, I will only do it on a best-effort basis - the burden is on you to show that your solution is "agnostic" (applies to all codecs), not on me to prove that it isn't.
Juberti: It is a complicated problem to come up with a common metadata representation for all codecs. It would make SFU work easier, especially in the case of E2EE. We will find a way to cleanly separate RTP payload and RTP header information, while remaining fully compliant with RTP. I think for the set of codecs that are of interest, we are not far off.
Sergio: I agree.
Colin: I'm not sure I agree that we can build a generic metadata format. We need to know at least when units (NALU/OBU) start and end, so you don't fragment them. You are trying to throw the architecture away.
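Youenn's single-payload-type scheme, with an "APT"-style header extension carrying the inner payload type, can be sketched as follows. All numbers and names here are illustrative assumptions, not values from the draft:

```python
from typing import Optional

# Hypothetical single outer PT shared by every transformed stream.
ENCRYPTED_PT = 100

# Ordinary negotiated mapping, e.g. from SDP a=rtpmap lines (illustrative PTs).
INNER_CODECS = {96: "VP8", 98: "VP9", 102: "H264"}

def codec_of(outer_pt: int, apt_ext: Optional[int]) -> str:
    """Return the codec of a received packet.

    With one shared outer PT, the receiver consults the APT header
    extension instead of allocating an *_encrypted PT per codec, so
    the payload type space is not doubled (30 -> 60)."""
    if outer_pt != ENCRYPTED_PT:
        return INNER_CODECS[outer_pt]        # legacy, untransformed path
    if apt_ext is None:
        raise ValueError("transformed packet missing APT extension")
    return INNER_CODECS[apt_ext]
```

The design choice mirrors RTX, where an "apt" parameter also points from one payload type to another rather than duplicating the codec table.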
Juberti: Every payload format is doing this. It feels strange to have to redefine the exact same thing every time we define a new payload format. There is a common part, and I think that is all there is here: the same thing we do every single time, and a way that would allow us not to repeat it again.
Colin: So propose something that fixes it.
Juberti: This is what we have here.
Cullen: This is not what you have here. Aligning terms: there is taking the bitstream out of the encoder and breaking it up (packetization into NALU/OBU). Then there is another step which takes things that are bigger than an MTU and slices them into RTP packets, and this is what the authors here call packetization. Bitstream packetization is by definition codec-agnostic. Then we have metadata. 'audio level' is already a great example of something that is generic to most audio codecs, and that is made to be used by SFUs. I do not think there was ever a commitment that audio level would work for all future audio codecs [note from Jonathan: it does not work with MIDI, and that's fine], nor that future audio codecs will not need extra information. This, here, is exactly the same kind of proposal for video.
Cullen: So here, we want to say that [codec] was [transform]-encoded: VP9 was SFrame-encoded. The next question is whether we apply it to something bigger than an RTP packet (SFrame) or smaller (PERC-like). My opinion is that we should apply transforms per packet, not per frame. I believe it makes recovery of lost packets easier, and I do not believe the gain in bandwidth overhead is significant.
Juberti: I agree with a lot of what was just said. Let's agree on what "packetization" means. I do believe the bandwidth gains are not negligible, but that is not the driving reason. Encryption of content and encryption of packets lie in two different layers, and we should keep it that way.
Jonathan & Bernard: Let's resume the presentation.
- RTP packetization
- SDP parameters and negotiation
Colin: Payload types are made exactly for that; why are you multiplexing?
Sergio: To avoid doubling the number of payload types in use from 30 to 60.
Youenn: The draft presents alternatives; we are certainly interested in getting feedback there. We are presenting this one today, even if it is a compromise, and we put the other approaches in the draft.
Colin: Payload types are supposed to indicate the media type, and here you are pushing an application/transform PT.
Sergio: We replicated the mechanism from RTX. We propose to add an RTP header extension to indicate the media type inside the transformed payload. There is also frame metadata that an SFU would need. This is just a proposal: we think we should send the information needed for forwarding, and for error correction / congestion control. For audio the list is a little bit different. There are several possible solutions; we are looking for feedback.
Mo (note from chat): It's exactly like FlexFEC, so there is a precedent.
Sergio: Presentation of how RTX, FEC and RED would work in the scope of the proposal.

8. Wrapup and Next Steps (Chairs, 10 min)

Bernard: Next step? Schedule an interim meeting?
Jonathan: Some sort of interim is needed. An interim would be ideal for focus and convergence.
Youenn: In the meantime, I would love it if people could provide feedback about the draft on the mailing list.
Bernard: We have reached the end of our allotted time.
Mo: I think we are more in agreement than we think. I think we need to all agree that new codecs are rich ...
Bernard: Thank you everybody! Meeting adjourned.