Summary: Has 2 DISCUSSes. Has enough positions to pass once DISCUSS positions are resolved.
Thanks for the work everyone put into this document. I think it's not quite ready to publish, due to one ambiguity, one critical missing feature, and the lack of guidance around fragmentation. I also have two comments that I consider very important, although they don't quite rise to the level of blocking publication. As always, it's possible that my DISCUSS points are off-base, and I'd be happy to be corrected if I've misunderstood anything here.

---------------------------------------------------------------------------

§4.1:

>  When the document spans more than one RTP packet, the entire
>  document is obtained by concatenating User Data Words from each
>  contributing packet in ascending order of Sequence Number.

This is underspecified, in that it doesn't make clear whether it would be valid to split a single UTF-8 or UTF-16 character between RTP packets, and it is nearly certain that different implementations will make different assumptions on this point, leading to interop failures. For example, the UTF-8 encoding of '¢' is 0xC2 0xA2. Would it be valid to place the 0xC2 in one packet and the 0xA2 in a subsequent packet? Without specifying this, it is quite likely that some implementations will use, e.g., UTF-8 strings to accumulate the contents of RTP packets; and most such libraries will emit errors or exhibit unexpected behavior if units of less than a character are added at any time. (The same point holds for splitting the two bytes of a UTF-16 code unit across packets.)

I don't think it much matters which choice you make (explicitly allowing or explicitly forbidding splitting characters between packets), but it does need to be explicit. I have a slight personal preference for requiring that characters cannot be split (both for ease of implementation on the receiving end and to more smoothly handle missing data due to extended packet loss), but I leave it to the authors and working group to decide.
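The hazard can be demonstrated in a few lines of Python; the packet boundary here is hypothetical, chosen deliberately to land mid-character:

```python
# A receiver that decodes each RTP packet's User Data Words as UTF-8
# independently will fail if a multi-byte character is split across
# packets -- the interop hazard described above.
doc = "price: ¢5"                 # '¢' encodes as the two bytes 0xC2 0xA2
data = doc.encode("utf-8")

split = data.index(b"\xc2") + 1   # hypothetical packet boundary mid-character
pkt1, pkt2 = data[:split], data[split:]

try:
    pkt1.decode("utf-8")          # strict decoding rejects the dangling lead byte
except UnicodeDecodeError as e:
    print("per-packet decode fails:", e.reason)

# Concatenating the raw bytes of all packets first, then decoding once,
# works regardless of where the boundary fell:
assert (pkt1 + pkt2).decode("utf-8") == doc
```

Either behavior can be made interoperable; the point is only that the draft has to pick one.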
---------------------------------------------------------------------------

Unlike other definitions for conveying non-loss-resilient data in RTP streams, this document defines no mechanism to deal with packet loss. This makes it unusable on the public Internet, where packet loss is an inevitable feature of the network. The existing text-in-RTP specifications define procedures to deal with such loss (see, e.g., RFC 4103 section 4 and RFC 4396 section 5).

---------------------------------------------------------------------------

This format is unique among RTP text formats in that it is designed to send monolithic documents that may stretch into the multiple-kilobyte range. While fragmentation is mentioned as a possibility, the document provides no implementation guidance about when to fragment documents and what size each fragment should be. RFC 4396 section 4.4 is an example of the kind of information I would expect to see in a document like this, with emphasis on the fact that TTML documents will frequently exceed the path MTU for a typical network connection.
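A sketch of the kind of fragmentation guidance being asked for, under the assumption that characters must not be split; the function name and payload budget are illustrative, not from the draft:

```python
# Fill each packet up to a payload budget derived from the path MTU,
# backing off so that no UTF-8 character straddles a packet boundary.
def fragment(document: bytes, max_payload: int) -> list:
    assert max_payload >= 4          # a UTF-8 character is at most 4 bytes
    frags = []
    i = 0
    while i < len(document):
        end = min(i + max_payload, len(document))
        # Back off while 'end' points into the middle of a multi-byte
        # character (UTF-8 continuation bytes match 0b10xxxxxx).
        while end < len(document) and document[end] & 0xC0 == 0x80:
            end -= 1
        frags.append(document[i:end])
        i = end
    return frags

frags = fragment("naïve café".encode("utf-8"), max_payload=4)
assert b"".join(frags) == "naïve café".encode("utf-8")
assert all(f.decode("utf-8") for f in frags)  # every fragment decodes alone
```

Text like this (plus a sentence on choosing the budget from the PMTU minus RTP/UDP/IP overhead) is roughly what RFC 4396 section 4.4 provides for 3GPP timed text.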
§1:

>  TTML (Timed Text Markup Language)[TTML2] is a media type for
>  describing timed text such as closed captions (also known as
>  subtitles) in television workflows or broadcasts as XML.

Although superficially similar, there are important distinctions between subtitles (intended to help a hearing audience exclusively with spoken dialog, typically because the audio is in a different language or otherwise difficult to understand) and closed captions (intended to aid deaf or hard-of-hearing viewers by providing a direct, word-for-word transcription of dialog as well as descriptions of all other audio present). Calling one "also known as" the other is incorrect. I suggest rephrasing as:

   TTML (Timed Text Markup Language)[TTML2] is a media type for
   describing timed text such as closed captions and subtitles in
   television workflows or broadcasts as XML.

---------------------------------------------------------------------------

§184.108.40.206:

>  The TTML document instance MUST use the "media" value of the
>  "ttp:timeBase" parameter attribute on the root element.

This statement assumes that the "http://www.w3.org/ns/ttml#parameter" namespace MUST be mapped to the "ttp" prefix, which is both bad form and probably not what is intended. I suggest rephrasing as:

   The TTML document instance MUST include a "timeBase" attribute from
   the "http://www.w3.org/ns/ttml#parameter" namespace on the root
   element, containing the value "media".
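The prefix point can be illustrated with any namespace-aware XML parser: only the namespace URI is significant, so a document binding the parameter namespace to a prefix other than "ttp" is equally conformant:

```python
# XML namespace prefixes are arbitrary; what matters is the URI they map
# to. Here the TTML parameter namespace is bound to "p" instead of "ttp".
import xml.etree.ElementTree as ET

TTP = "http://www.w3.org/ns/ttml#parameter"
doc = ('<tt xmlns="http://www.w3.org/ns/ttml" '
       'xmlns:p="http://www.w3.org/ns/ttml#parameter" '
       'p:timeBase="media"/>')

root = ET.fromstring(doc)
# A namespace-aware parser exposes the attribute by URI, not by prefix:
assert root.get("{%s}timeBase" % TTP) == "media"
```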
James and WG, I have a couple of issues I want your feedback on regarding whether they should be corrected before proceeding to publication. Note they are for discussion; in cases where things have already been discussed and there is consensus, please reference that so that I can take it into consideration when we resolve these.

1. Section 4.1:

>  Timestamp: The RTP Timestamp encodes the time of the text in the
>  packet.

As timed text is a medium that has duration, from a start time to an end time, and the RTP timestamp is a single time tick in the chosen clock resolution, the above text is not clear. I would think the start time of the document would be the most useful to include? I think the text in 220.127.116.11 combined with the above attempts to imply that the RTP timestamp will be the zero reference for the time-expression? I think this needs a bit more clarification. Not having studied TTML2/1 in detail, I might be missing important details, but some more information on how the document's media timebase timeline connects to the RTP timestamp appears necessary.

2. A Discuss Discuss: Timed text is directly associated with one or more video and audio streams and requires synchronization with those other media streams to function correctly. This leads to two questions. First, is application/ttml+xml actually the right top-level media type? If using SDP, that forces one to use a different RTP session unless one has BUNDLE. Many media types with this property of being associated with some other media type have registered media types in all relevant top-level media types. Secondly, this payload format may need some references to mechanisms in RTP and signalling whose purpose is to associate media streams. I also assume we have the interesting localization cases where different languages have different timelines for the text and for how long it is shown, as there are different traditions in different countries and languages for how one makes subtitles.
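One possible reading of point 1, sketched purely as an illustration (the 90 kHz clock rate and the function name are assumptions, not anything the draft specifies): the RTP timestamp carries the document's epoch, and TTML media-time expressions are offsets from it.

```python
# Under this reading, mapping a TTML 'begin' time-expression onto the
# RTP timestamp line is simple arithmetic modulo 2^32.
CLOCK_RATE = 90_000  # assumed clock rate, not mandated by the draft

def media_time_to_rtp(epoch_rtp_ts: int, begin_seconds: float) -> int:
    """Map seconds since the document epoch onto the 32-bit RTP
    timestamp line anchored at epoch_rtp_ts."""
    return (epoch_rtp_ts + round(begin_seconds * CLOCK_RATE)) & 0xFFFFFFFF

# A cue with begin="2.5s" in a document whose epoch maps to RTP
# timestamp 1_000_000:
assert media_time_to_rtp(1_000_000, 2.5) == 1_225_000
```

Whether this is in fact the intended mapping is exactly what the clarification request above is asking the authors to pin down.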
This may also point to the need for discussing the pick-one-out-of-n mechanism that a manifest may need.

3. Section 7.1:

>  It may be appropriate to use the same Synchronization Source and
>  Clock Rate as the related media.

Using the same SSRC as another media stream in the same RTP session is a no-no. If you meant using multiple RTP sessions and associating them by the same SSRC in different sessions, yes, that works, but it is not recommended. This points to the need for a clearer discussion of how to achieve linkage, and of the reasons why using the same RTP timestamp may or may not be useful.

4. Fragmentation: I think the fragmentation of a TTML document across multiple RTP payloads is insufficiently described. I have the impression that it is hard to do anything more clever than to fill each RTP payload to the MTU limitation and send the payloads out in sequence. However, I think a firm requirement is needed that the packets of a single document use consecutive RTP sequence numbers. Also, the reassembly process appears to have two parts for detecting what belongs together: the same timestamp, and the last packet of a document having the marker bit set. That way a receiver that loses the last packet of the previous document can still know that it has received everything for the following document. However, if the losses are multiple, inspection of the reassembled document will be necessary to determine whether the correct beginning is present. I have the impression that a proper section discussing these matters of fragmentation and reassembly is necessary for good interoperability and function.
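The reassembly rules suggested in point 4 could be sketched as follows (the Packet structure is a stand-in, and sequence-number wraparound is ignored for brevity):

```python
# Packets with the same RTP timestamp belong to one document; the marker
# bit flags the last packet; completeness additionally requires
# consecutive sequence numbers.
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int        # RTP sequence number (wraparound ignored here)
    ts: int         # RTP timestamp
    marker: bool    # set on the last packet of a document
    payload: bytes  # User Data Words

def reassemble(packets):
    """Return the document bytes, or None if any fragment is missing."""
    pkts = sorted(packets, key=lambda p: p.seq)
    if not pkts or not pkts[-1].marker:
        return None                       # last packet lost or not yet seen
    if [p.seq for p in pkts] != list(range(pkts[0].seq, pkts[0].seq + len(pkts))):
        return None                       # gap in sequence numbers
    if any(p.ts != pkts[0].ts for p in pkts):
        return None                       # fragments of different documents
    # Note: a lost FIRST fragment is not detectable here; only inspecting
    # the reassembled XML reveals a missing document start, as noted above.
    return b"".join(p.payload for p in pkts)

doc = reassemble([Packet(12, 9000, True, b"</tt>"),
                  Packet(11, 9000, False, b"<tt>")])
assert doc == b"<tt></tt>"
```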
A. Section 6: To my understanding, a TTML document basically cannot be encoded better. A poor generator can create unnecessarily verbose XML that could be shorter, but there is no possibility here to trade off media quality for lower bit-rate. I think that should be made more explicit in Section 6.

B. Section 7: Wouldn't using 90 kHz be the better default? 1 kHz is the minimum at which RTCP reporting will work decently. However, if the timed text is primarily going to be synchronized with video, 90 kHz does ensure that (sub-)frame-precise timing can be expressed. I don't see any need for raster-line-specific timing for timed text, so the SMPTE 27 MHz clock is not needed. And using a non-default rate for subtitling radio etc. appears fine.

C. Repair operations and relation to documents: Based on the basic properties of TTML documents, I think the repair operations should target single documents, as there are likely seconds between documents, while the fragments of one document will be sent within a rather short interval. That recommendation would be good to include.
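The arithmetic behind the 90 kHz suggestion in point B: at 29.97 fps (30000/1001), a 90 kHz clock represents one frame duration exactly, while a 1 kHz clock cannot.

```python
# One frame at 29.97 fps lasts 1001/30000 seconds. Exact rational
# arithmetic shows which clock rates represent it without rounding.
from fractions import Fraction

frame = Fraction(1001, 30000)      # seconds per frame at 29.97 fps
ticks_90k = frame * 90_000         # exactly 3003 ticks
ticks_1k = frame * 1_000           # 1001/30 ticks -- not an integer

assert ticks_90k == 3003
assert ticks_1k.denominator != 1   # rounding error accumulates per frame
```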
I would recommend starting some new top-level sections within what is currently Section 4.2, rather than going down to six levels of subsections (18.104.22.168.1.2), which can get confusing when other people are citing parts of this document. Please respond to the Gen-ART review.
Thanks for this clear and well-written document!

Section 2

>  The term "word" refers to byte aligned or 32-bit aligned words of
>  data in a computing sense and not to refer to linguistic words that
>  might appear in the transported text.

Either byte-aligned or 4-byte-aligned, as opposed to aligned to one of those and a multiple of the other in length?

Section 4

I find myself feeling like I would benefit from a brief discussion of the relationship between documents and the RTP stream before getting into the details of the payload format (e.g., "one document per subtitle", "many documents per stream but each document contains some minutes of data", or "totally up to the profile in use"). Even having finished the I-D I'm still wondering: it's clear that we only have a single TTML stream in a given RTP stream, that a given RTP packet has (part of) a TTML document in the epoch of the timestamp of the RTP packet, and that I can only have one document active at a time. On the flip side, different documents must belong to different epochs. So it seems that I could either make large documents stuck on a single timestamp, or small documents with (relatively) rapidly advancing timestamps, regardless of how I need to actually split the TTML content into packets in order to meet MTU requirements (and possibly packet-pacing ones). Given that this is RTP and we're used to ignoring things with old timestamps, I mostly expect the latter to be more common, but would appreciate some guidance in the document. This seems to roughly be Adam's third Discuss point.

Section 22.214.171.124

>  If the TTML document payload is assessed to be invalid then it MUST
>  be discarded. When processing a valid document, the following
>  requirements apply.

Does this imply that I have to wait for the entire document to arrive before I start processing it?

>  Each TTML document becomes active at the epoch E. E MUST be set to

nit: I suggest s/the/its/, since there is not a global distinguished epoch.
Most of the security considerations I can think of apply more to the TTML format itself than to the RTP payload. I might include a short note that the text contents are meant to be interpreted by a human, and that content from untrusted sources should be viewed with appropriate levels of skepticism.
Thank you for writing this -- I found it interesting and useful.
Small comment on Sec 4.1 - maybe:

OLD:

   These bits are reserved for future use and MUST be set to 0x0.

NEW:

   These bits are reserved for future use; they MUST be set to 0x0 by
   the sender and MUST be ignored by the receiver.
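A receiver following the NEW text masks the reserved bits off rather than validating them; the bit positions below are purely hypothetical, not taken from the draft:

```python
# "Ignore on receipt" in practice: strip the reserved bits instead of
# rejecting packets that have them set, so a future sender can use them
# without breaking deployed receivers. Assume (hypothetically) that the
# reserved field occupies the low 4 bits of the first header byte.
RESERVED_MASK = 0x0F  # hypothetical position of the reserved bits

def parse_flags(first_byte: int) -> int:
    # Keep only the defined flag bits; silently discard the reserved ones.
    return first_byte & ~RESERVED_MASK & 0xFF

assert parse_flags(0b1010_1111) == 0b1010_0000
```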
I agree with Adam’s DISCUSS.
Thank you for the work done in this document. The unusual wording of 'RTP carriage' in section 4.2.1 is interesting. -éric