
Telechat Review of draft-ietf-avtcore-rtp-v3c-14
review-ietf-avtcore-rtp-v3c-14-tsvart-telechat-westerlund-2026-01-19-00

Request Review of draft-ietf-avtcore-rtp-v3c
Requested revision No specific revision (document currently at 17)
Type Telechat Review
Team Transport Area Review Team (tsvart)
Deadline 2026-01-20
Requested 2026-01-05
Authors Lauri Ilola , Lukasz Kondrad
I-D last updated 2026-02-16 (Latest revision 2026-02-11)
Completed reviews Genart IETF Last Call review of -12 by Lars Eggert (diff)
Secdir IETF Last Call review of -12 by Carl Wallace (diff)
Tsvart IETF Last Call review of -12 by Magnus Westerlund (diff)
Tsvart Telechat review of -14 by Magnus Westerlund (diff)
Assignment Reviewer Magnus Westerlund
State Completed
Request Telechat review on draft-ietf-avtcore-rtp-v3c by Transport Area Review Team Assigned
Posted at https://mailarchive.ietf.org/arch/msg/tsv-art/SpTKiKpoj99A0j6wIpZ3GXiYhPs/
Reviewed revision 14 (document currently at 17)
Result Ready w/issues
Completed 2026-01-19
Hi,

This is a follow up TSV-ART review.

Summary rating: Ready with Issues.

Sorry, I missed this email before Christmas. I just took another look at the
updated draft -14. It does address some of the issues that I raised. However,
I think the following issues should be addressed before publication.

  1.
Security considerations: As you note in your reply, in the worst case an
attacker who manages to instruct the endpoint to combine the wrong streams
could, at a minimum, cause incorrect output; it may also crash the decoder.
Thus, I think the security considerations need a bit more than the standard
boilerplate here and should be explicit about the need to be able to trust the
signalling for which streams to combine into one output, as well as the need
for source authentication on all the streams that are used as input when
decoding.
  2.
When it comes to the congestion control considerations, I would note that here
too there appears to be a need for some discussion of considering the full
aggregate when adapting the bit-rates, to ensure that the adaptation results
in a proportional quality degradation and not a much larger one due to
adapting the wrong stream.
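To make the aggregate-adaptation point concrete, here is a minimal sketch in
Python (not from the draft; the stream names and bit-rates are purely
illustrative) of scaling all V3C component streams proportionally so the
aggregate meets a congestion-controlled target, rather than cutting a single
stream:

```python
def adapt_bitrates(current, target_aggregate):
    """Scale each component stream's bit-rate proportionally so the
    aggregate meets the congestion-controlled target.

    This is an illustrative sketch only; a real sender would also apply
    per-stream floors (e.g. the atlas stream may be near-incompressible)
    and application-specific priorities.
    """
    total = sum(current.values())
    if total <= target_aggregate:
        # Already within budget; nothing to do.
        return dict(current)
    scale = target_aggregate / total
    return {name: rate * scale for name, rate in current.items()}


# Hypothetical component streams of one V3C representation (bits/s).
streams = {"atlas": 200_000, "geometry": 2_000_000, "texture": 6_000_000}
adapted = adapt_bitrates(streams, target_aggregate=4_100_000)
```

The point of the sketch is only that the adaptation decision is taken over
the full set of streams that form one representation, so no single component
absorbs the whole reduction.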

When it comes to the general multiplexing description, including using SSRC
grouping, I think addressing this would have a significant impact on the time
to publish this document. Whether that is worth it depends highly on whether
there exists usage that would benefit from that description. However, as you
have a BUNDLE case, it might be fine for the initial usage. Thus, it might be
simpler to just go ahead for now; if actual deployments run into use cases
that need to express things more clearly, for example multiple sources per
direction in the same set of RTP session(s), then extensions may be warranted.

I still think this draft could have benefited from an additional architectural
section between Sections 4 and 5 that discussed how RTP sessions vs. streams
(SSRCs) are best used for some use cases. I think that would have simplified
the rest of the description.

Cheers

Magnus

From: Lauri Ilola (Nokia) <lauri.ilola@nokia.com>
Date: Tuesday, 9 December 2025 at 14:42
To: Magnus Westerlund <magnus.westerlund@ericsson.com>, tsv-art@ietf.org
<tsv-art@ietf.org> Cc: avt@ietf.org <avt@ietf.org>,
draft-ietf-avtcore-rtp-v3c.all@ietf.org
<draft-ietf-avtcore-rtp-v3c.all@ietf.org>, last-call@ietf.org
<last-call@ietf.org> Subject: RE: draft-ietf-avtcore-rtp-v3c-12 ietf last call
Tsvart review


Hello Magnus!

Thanks for the thorough feedback. Let me try to address these over email here.
I've implemented your suggestions below, except for a few points where I
wanted to ask for clarification.

Regarding Section 9.3.

You are correct that there are multiple ways of transmitting the atlas data
and the video data. V3C has a concept that allows packing multiple video
components in the same video frame, so you can end up needing only one video
stream. Together with the video you'll need the atlas stream in case the atlas
data is dynamic. Alternatively, the draft allows sending atlas data as part of
the SDP, if it doesn't change over the session; that is the scenario that
allows you to stream only one video for the volumetric experience. I'll try to
clarify these two methods more clearly in the draft to avoid any confusion on
the reader's part.

Your point on ssrc-group is also well made; it could be yet another way of
grouping the different components. Would it make sense to add it as an
additional grouping method under the clarified section on grouping V3C
components? It would probably just need a new <semantics> parameter to clarify
the nature of the grouping, correct?
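For illustration, such a grouping might look like the following hypothetical
SDP fragment. Everything here is made up for the sake of the example: the
"V3C" semantics value is not registered anywhere (RFC 5576 defines the
a=ssrc-group framework but only semantics such as FID and FEC), and the
payload types, SSRC values, and cnames are invented:

```
m=video 40000 RTP/AVP 96 97
a=rtpmap:96 v3c/90000
a=rtpmap:97 H265/90000
a=ssrc-group:V3C 1111 2222 3333
a=ssrc:1111 cname:atlas@example.com
a=ssrc:2222 cname:geometry@example.com
a=ssrc:3333 cname:texture@example.com
```

The idea would be that the three SSRCs listed after the hypothetical "V3C"
semantics tag together form one volumetric representation within a single
media line.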

Regarding Section 8.

> Because the full media representation when using V3C depends on having both
the atlas and the component video streams, the response to congestion control
limitations is far from trivial. I think the implementer needs some
clarification here on how to behave when forced to reduce the aggregate
bandwidth and how to consider inter-stream prioritization. This issue is
clearly different from what scalable video codecs encounter when bandwidth
limited, where it is usually clear how to reduce the bit-rate.

This is an astute observation. You are absolutely correct that this is far
from trivial and could be something that sets one implementation apart from
another. Many services and receivers may have different opinions on how the
adaptation should be performed, depending on the available hardware and
processing at hand. Specifying it here could be rather limiting, and as such
we propose to follow the bare minimum methods as written in the draft. I don't
believe proper adaptation is as simple as defining media stream priorities,
but some streams are for sure more important than others. For example, for one
application it may be absolutely ok to drop color or texture information and
stream only black and white data as a method of adaptation. Another
application may prefer to increase the noise in the rendering, by dropping
occupancy information and trying to derive occupancy from the depth and color
videos. Do you consider this a blocker if we don't definitively fix the
adaptation in the specification?

Regarding Section 11.

> I think this format needs an additional security consideration due to the
grouping: for correct decoding, the signalling system needs to correctly
indicate the combination of the V3C atlas stream and the component streams. If
an attacker is able to manipulate this information, the sender's intention
will not be represented.

This would mean that an attacker, if able to manipulate the SDP, would be able
to direct atlas data to a video decoder and vice versa, or that video codec
components would be reconstructed incorrectly. This would likely cause the
decoder to crash. Similar problems would occur if a video and audio streaming
session were attacked and the bitstreams directed to incorrect decoders. This
sounds like something that should have a default mechanism to protect against
this kind of attack. Do you know if there is a standard that addresses this?

> If I manipulate the atlas information, can I significantly increase the
decoding cost? For example, forcing orders of magnitude more iterations over
the underlying component video stream data to create the volumetric
representation?

Manipulation of the atlas data would likely cause mis-indexing of video
textures and result in crashing the decoder. How decoders handle falsified
atlas data is very much left to the decoder implementation. Smart
implementations would have means of detecting such manipulation (for example,
counting how many texel read operations are made per pixel), but less
sophisticated decoders could end up in infinite loops if not careful. I'm
unsure how this sort of attack could be prevented other than by urging care
from decoder implementers. Would it be sufficient to add a note urging such
care?

Thanks again for the constructive suggestions. Looking forward to your reply.

Kind regards,
-Lauri

-----Original Message-----
From: Magnus Westerlund via Datatracker <noreply@ietf.org>
Sent: Tuesday, October 28, 2025 3:50 PM
To: tsv-art@ietf.org
Cc: avt@ietf.org; draft-ietf-avtcore-rtp-v3c.all@ietf.org; last-call@ietf.org
Subject: draft-ietf-avtcore-rtp-v3c-12 ietf last call Tsvart review


Document: draft-ietf-avtcore-rtp-v3c
Title: RTP Payload Format for Visual Volumetric Video-based Coding (V3C)
Reviewer: Magnus Westerlund
Review result: Almost Ready

This document has been reviewed as part of the transport area review team's
ongoing effort to review key IETF documents. These comments were written
primarily for the transport area directors, but are copied to the document's
authors and WG to allow them to address any issues raised and also to the IETF
discussion list for information.

When done at the time of IETF Last Call, the authors should consider this
review as part of the last-call comments they receive. Please always CC
tsv-art@ietf.org if you reply to or forward this review.

High level issue:

I think this document is not clear enough on the different alternatives that
are actually supported for transmitting the atlas data and the component video
data.

Section 4.1 gives the impression that one can combine all data needed for one
V3C representation into a single video stream, i.e. one sent over a single RTP
SSRC.

Section 9.2 instead talks about having a separate V3C stream with the atlas
data, and then component video streams over other RTP streams (SSRCs).

For the latter there exists a plethora of possible multiplexing models with
what is being defined in Sections 9.2-9.4. With the defined grouping of V3C
one can clearly do both RTP-session-based multiplexing as well as bundled. The
examples in Section 9.3 appear to indicate that one needs unique media lines
in SDP per complete V3C representation, and that one can't set up one media
line per type and simply use multiple SSRCs in each, with one complete set
across the media lines generating one media representation. Or even just
establish one payload type per type and then use RFC 5576 ssrc-group to
indicate the set of SSRCs that are part of one representation. Wouldn't it
make sense to have an ssrc-group semantics for V3C?

Having read the document, I think there is a need for a dedicated section that
defines which combinations are possible and what support external to RTP/RTCP
these need for providing the grouping.

Can you confirm that you have not identified any way of using existing
RTP/RTCP mechanisms to identify the set of SSRCs that are part of one
representation?

Another significant issue is the one for Section 8, regarding bit-rate
adaptation for this payload format and its component streams.

Section 7.1:

Published specification: Please refer to [ISO.IEC.23090-5]

I think this needs to indicate the RFC that defines the RTP payload format, as
that is the specification for which the media type is being registered.

Restrictions on usage: N/A

I think the recommended text from RFC 8088 for this field still applies:

This media type depends on RTP framing and, hence, is only defined
      for transfer via RTP [RFC3550].  Transport within other framing
      protocols is not defined at this time.

Section 8:

Because the full media representation when using V3C depends on having both
the atlas and the component video streams, the response to congestion control
limitations is far from trivial. I think the implementer needs some
clarification here on how to behave when forced to reduce the aggregate
bandwidth and how to consider inter-stream prioritization. This issue is
clearly different from what scalable video codecs encounter when bandwidth
limited, where it is usually clear how to reduce the bit-rate.

Section 9.

Please add a reference to RFC 8866 in the first sentence.

Section 9.1:

I would recommend being clear that "byte-string" uses the definition that
exists in RFC 8866.

Section 11:

I think this format needs an additional security consideration due to the
grouping: for correct decoding, the signalling system needs to correctly
indicate the combination of the V3C atlas stream and the component streams. If
an attacker is able to manipulate this information, the sender's intention
will not be represented.

Secondly:

This RTP payload format and its media decoder do not exhibit any significant
non-uniformity in the receiver-side computational complexity for packet
processing, and thus are unlikely to pose a denial-of-service threat due to the
receipt of pathological data. Nor does the RTP payload format contain any
active content.

If I manipulate the atlas information, can I significantly increase the
decoding cost? For example, forcing orders of magnitude more iterations over
the underlying component video stream data to create the volumetric
representation?