Summary: Has 6 DISCUSSes. Needs 5 more YES or NO OBJECTION positions to pass.
I support Roman's DISCUSS. I'm also unclear on the over-arching recommendation this document is making for securely deploying this protocol. Given that the protocol itself is insecure, I would have expected some normative requirement for correcting that (e.g., Minimally, Babel deployments MUST be secured using a lower-layer security mechanism, Babel over DTLS, or HMAC-based authentication.) This still would not bring it into line with BCP 61 Section 7, but perhaps there is some argument for making an exception for this protocol.
I support Suresh's DISCUSS. An explanation of why this document obsoletes RFC 6126 and RFC 7557 needs to appear in the introduction of this document. Section 3.2.3: It's a bit odd that the Multicast Hello is introduced here but the difference between the two kinds of hellos is not explained until Section 3.4.1. It makes me wonder if 3.2 should come after 3.4. Section 3.6: s/is not left-distributive Section 3.5.2/is not left-distributive (Section 3.5.2)/ Appendix C: This section should be in the body of the document.
Security Considerations. While the high level statement of Babel being “an insecure protocol” is accurate and clear, precisely enumerating the threats is needed to motivate the selection of the appropriate mitigations.
(1) Per “Any attacker can misdirect data traffic by advertising routes with a low metric or a high seqno.”:
-- Can the "any" of the attacker be scoped any more?
-- Explain why this is possible: because Babel peers are not authenticated and Babel messages aren’t integrity/replay protected
-- Discuss the impact of this misdirection: denial of service (dropping the traffic and against a given target), eavesdropping, or allowing for the possibility of traffic modification (depending on upper level security mechanisms). RFC4593 covers a number of them
-- Note that because Babel messages aren’t encrypted any on-path attacker can gather the routing topology
(2) The rest of this paragraph describes the security properties conveyed by link-layer security, IPSec, BABEL-HMAC and BABEL-TLS. They all make sense. Please be explicit that IPSec or BABEL-TLS address all of the above described attacks. BABEL-HMAC addresses only some.
(3) Per “HMAC is simpler and does not depend on DTLS, and therefore its use is RECOMMENDED whenever both mechanisms are applicable”, can you explain this recommendation and the circumstances where “both mechanisms are applicable”? If one wants to ensure confidentiality, it can’t be realized with HMAC; they aren’t equal.
(4) Per “The privacy issues that this causes can be mitigated somewhat by using randomly chosen router-ids and randomly chosen IP addresses, and changing them periodically”, whose IP address should be randomly chosen: the Babel node’s or the mobile device’s?
In other sections:
(5) Appendix C: Per the last paragraph, “The packet trailer is intended to carry cryptographic signatures …”, to what security mechanism is that referring? Where is that defined?
(6) Appendix D: Is the stub implementation guidance normative?
If so, will it satisfy all of the RFC2119 language in this document? (7) Appendix E. Please explicitly state that the sample implementation is non-normative.
(8) Section 1.1. What is a “network diameter”? Calculated how?
(9) Section 3.6. Recommend avoiding the phrase “protocol’s correctness”
(10) Section 3.7.2. Per the guidance to send updates with acknowledgement requests to a small, but not a large, number of neighbors: is there guidance to provide on what constitutes a large number?
(11) Section 126.96.36.199. Is there any guidance on what a “small number of multicast” requests constitutes?
(12) Section 4. Per “Both the source and destination UDP port are set to a well-known port number”, the same one?
(13) Section 4.2. What is the “carefully chosen” rationale for the magic number being 42 (unless this is a Hitchhiker’s Guide reference)?
(14) Section 4.6.4. What are the properties needed for this nonce?
(15) Section 6. Per the concern that Babel packets might escape into the wild and “No such natural protection exists when Babel packets are carried over IPv4”, doesn’t setting the TTL=1 per Section 4 help?
(16) Appendix D: Per “Nonetheless, in some very constrained environments, such as … abacuses”, what does it mean to implement Babel on an analog device?
(17) Did the WG consider renaming the title of this draft “Babel Routing Protocol v2” (as this is a distinct and new protocol)?
(18) Editorial nits:
-- Section 1.1. Editorial. s/Babel never/Babel does not/
-- Section 1.2. Editorial. s/Babel does impose/Babel imposes/
-- Section 2. Editorial. s/venerable RIP/RIP/
-- Section 2.3. Editorial. s/It is well known that a/A/
-- Section 4.1.2. Typo. s/ones ones/ones/
I don't think that all of the arithmetic specified in Section 3.2.1 is well defined. Specifically, the formulations involving bitwise AND assume that the input to the bitwise AND is nonnegative, which does not seem to be implied by the other stated constraints. (For example, an "integer n" may well be negative.) Some discussion of the representation of negative integers would then be needed, and then whether the mathematical operation is performed in an abstract infinite-precision machine or in a realizable approximation, etc. It might be simpler to just use the modular arithmetic flavor and avoid any of the issues that can arise when providing two alternative definitions that are intended to be equivalent (since there is always a risk of edge cases).
Section 3.5.2 needs to explicitly say that the c and m arguments to M() are the local link cost and the advertised metric, e.g., "the function M(c, m) used for computing a metric from a locally computed link cost c and the metric m advertised by a neighbor".
Section 188.8.131.52 notes that "[d]ue to duplicate suppression, only a small number of such requests will actually reach the source." (for seqno requests intending to avoid starvation). But Section 184.108.40.206 only has a SHOULD-level requirement to suppress duplicate seqno requests, so I think there is an internal inconsistency.
I think we may need to have a discussion about the feasibility of multicast acknowledgment requests with only a 16-bit nonce. With random assignment of nonces the risk of birthday collisions becomes uncomfortably large, and non-random assignments are likely to have worse pathologies. (A pointer to a previous discussion of this topic would, of course, short-circuit a lot of it if not all of it.) Are we willing to make hard assumptions about the maximum size of a multicast domain and the risk of collision we are willing to accept?
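To put rough numbers on the birthday-collision concern, here is a minimal sketch. It assumes nonces are drawn uniformly at random from the 16-bit space and that the relevant population is the number of acknowledgment requests outstanding at the same time in one multicast domain; both are assumptions made for illustration, and the function name is hypothetical.

```python
import math

NONCE_BITS = 16  # nonce width under discussion


def collision_probability(outstanding: int, bits: int = NONCE_BITS) -> float:
    """Probability that at least two of `outstanding` uniformly random
    nonces coincide (the classic birthday bound)."""
    space = 2 ** bits
    if outstanding > space:
        return 1.0  # pigeonhole: a collision is certain
    # P(all distinct) = prod_{i=0}^{n-1} (1 - i/space), summed in log space.
    log_p_unique = sum(math.log1p(-i / space) for i in range(outstanding))
    return 1.0 - math.exp(log_p_unique)


if __name__ == "__main__":
    for n in (10, 50, 100, 300, 1000):
        p = collision_probability(n)
        print(f"{n:5d} outstanding requests -> P(collision) = {p:.4f}")
```

Under these assumptions the collision probability reaches roughly one half at around 300 simultaneously outstanding nonces, which is the kind of bound a "maximum size of a multicast domain" assumption would have to be weighed against.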
The discussion in Section 4.6.9 of computing the prefix from an Update message (and parser state) seems a little underspecified when the prefix length is not a multiple of 8 bits. (Additionally, "Plen" is not described as measuring bits, explicitly, for any of the PDU descriptions that I remember.) Specifically, the "Prefix" description does not mention that any trailing bits must be set to zero, but the subsequent discussion about the prefix is "computed as follows" refers to assembling the prefix as a collection of octets, including trailing zero octets, implying that the computed prefix is the full length of the address type. I appreciate that we have some discussion in Section 4.5 about the need for a stateful parser for the babel packet body; this seems like one of the riskiest areas of the protocol from the implementation perspective. However, I think it would be even more helpful to explicitly call out what pieces of state are needed, what protocol elements affect the state, and what ordering requirements (or non-requirements) there are for the interactions between the different protocol elements that affect parser state. Can we have a discussion about whether it's appropriate to add some text along these lines?
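The prefix question can be made concrete with a small sketch. It assumes Plen counts bits, that the Prefix field carries ceil(Plen/8) - Omitted octets, and that a receiver clears any bits beyond Plen; that last behaviour is exactly what the draft leaves unstated, so it is shown here as one possible choice, not as the specified one. The function name and the AE table are illustrative only.

```python
ADDRESS_OCTETS = {1: 4, 2: 16}  # AE 1 = IPv4, AE 2 = IPv6 (assumed subset)


def reconstruct_prefix(ae: int, plen: int, omitted: int,
                       prefix_field: bytes, default_prefix: bytes = b"") -> bytes:
    """Rebuild a full-length prefix from an Update's fields and parser state."""
    size = ADDRESS_OCTETS[ae]
    if not 0 <= plen <= size * 8:
        raise ValueError("Plen out of range for this address encoding")
    needed = (plen + 7) // 8 - omitted  # octets expected on the wire
    if needed < 0 or len(prefix_field) != needed:
        raise ValueError("Prefix field length inconsistent with Plen/Omitted")
    if omitted > len(default_prefix):
        raise ValueError("no default prefix covering the omitted octets")
    # Leading octets come from parser state, the rest from the wire.
    octets = bytearray(default_prefix[:omitted] + prefix_field)
    octets.extend(bytes(size - len(octets)))  # trailing zero octets
    spare_bits = -plen % 8  # bits past Plen in the last significant octet
    if spare_bits:
        # Assumption: trailing bits are cleared rather than rejected.
        octets[plen // 8] &= (0xFF << spare_bits) & 0xFF
    return bytes(octets)


if __name__ == "__main__":
    # A /20 whose third octet arrives with stray low bits set.
    print(reconstruct_prefix(1, 20, 0, bytes([10, 1, 0xFF])).hex())
```

Whether a receiver should mask, reject, or preserve the stray bits is the design decision the text would need to state; the sketch only shows that the choice is observable.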
Should there be a "changes since RFC 6126" section that is retained in the published RFC? (I assume that Appendix F is going to be dropped.) The secdir review has some good thoughts (e.g., tracking "link-local" IPv4 addresses, discussion of non-protection from hostile insiders), but I don't see a response to it. We use the phrase "a small multiple of" a few times, but I don't remember seeing any concrete guidance for what factor to use. Is it intended to be closer to 1.1 or to 4? In a related vein, there are many places in the document where the precise details of processing are left intentionally underspecified (e.g., computing a link's cost). I understand that due to the protocol guarantees the needed routing will still be achieved even if nodes use different parameters and algorithms in these cases, but do we expect the details to be chosen on a per-implementation basis, or in profile documents, or even left up to operator configuration on a per-node basis? Section 1 The introduction should mention obsoleting 6126 and 7557, in addition to doing so in the abstract. Section 1.1 Finally, Babel is a hybrid routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4 and IPv6), whichever protocol the Babel packets are themselves being carried over. nit: I think "regardless of which" is better than "whichever", since the latter might erroneously imply that there is a consistency requirement of the carried routes and the carrying protocol. Section 1.2 Second, unless the optional algorithm described in Section 3.5.5 is implemented, Babel does impose a hold time when a prefix is Similarly to my comment on the applicability doc, I'm not sure if there's one or two things in Section 3.5.5 that would match this description. Section 2 Conceptually, Bellman-Ford is executed in parallel for every source of routing information (destination of data traffic). 
In the following discussion, we fix a source S; the reader will recall that the same algorithm is executed for all sources.
Just to check my understanding: this "source S" is a source of routing information, not a source of data-plane traffic being routed?
Section 2.4 Is there a reference for AODV? To show that this feasibility condition still guarantees loop-freedom, recall that at the time when A accepts an update from B, the metric D(B) announced by B is no smaller than FD(B); since it is smaller than FD(A), at that point in time FD(B) < FD(A). Since this property is preserved when A sends updates, it remains true at all times, which ensures that the forwarding graph has no loops.
I'm trying to walk through this and missing a step or two. "the metric D(B) announced by B is no smaller than FD(B)" is pretty clear, since FD(B) is just the minimum value of D(B) over time thus far. But I'm not sure I follow how A can preserve the property FD(B) < FD(A) when A sends updates. Clearly FD(B(T')) <= FD(B(T0)) for any time T' after T0, but suppose FD(B) remains constant but A is off interacting with some other node C and finds a great path via C, which correspondingly causes D(A) to reduce. Can I get into a situation where D(A) < FD(B) <= D(A) + C(A,B) (and thus, the subsequent FD(A) < FD(B) <= FD(A) + C(A,B)) if A does not interact with B during that time?
Section 2.5 Using the minuscule and majuscule forms of the same letter to mean different things (e.g., source S and sequence number s) is something of a readability anti-pattern.
Section 3.2.6 It would probably be helpful to readers to note that "neighbor that advertised" and "next-hop" can be different due to being different address families. (For the same address family, they are generally going to be the same, modulo weird network-layer technologies, right?)
Section 3.5.1 (side note: I got a bit confused reading this section and had to go double-check several definitions, due to the qualitative difference between the "metric" and "metric'" under comparison. Namely, the "metric" is for the path from neighbor to S, but the "metric'" is for the path from the current node to S, and so in some sense they are "measuring different things". Perhaps using "FD" instead of "metric'" would help disambiguate. I understand that this is a fairly common pattern for routing protocols, though, so don't necessarily expect any change to the text.) router-id. Feasibility distances are maintained in the source table, the exact procedure is given in Section 3.7.3. nit: this is a comma splice.
Section 3.5.2 Note that while strict monotonicity is essential to the integrity of the network (persistent routing loops may arise if it is not satisfied), left distributivity is not: if it is not satisfied, Babel will still converge to a loop-free configuration, but might not reach a global optimum (in fact, a global optimum may not even exist). I might even go so far as to say that a global optimum "will likely not exist", though this is fairly qualitative/intuitive since we don't define a configuration space or metric over it in which to evaluate the probability.
Section 3.5.4 We don't seem to use the "link cost value equal to cost" anywhere in this section, so maybe it is superfluous. If such an entry exists: o if the entry is currently selected, the update is unfeasible, and the router-id of the update is equal to the router-id of the entry, then the update MAY be ignored; I guess the idea is that we can keep the old one around until it would time out, since the initial timeout value for it means it should still be workable until our timer expires, but it's only a MAY in case we want to be more proactive about noticing that the advertised metric is now unfeasible? It might be worth saying a bit about when we might/might not want to heed the MAY.
Section 3.5.5 o sending a retraction with an acknowledgment request (Section 3.3) to every reachable neighbour that has not explicitly retracted prefix P and waiting for all acknowledgments. nit(?): I'd suggest a comma before "and waiting for all acknowledgments", since that's the final gating factor to achieve the goal. The former option is simpler and ensures that at that point, any routes for prefix P pointing at the current node have expired. However, since the expiry time can be as high as a few minutes, doing that prevents automatic aggregation by creating spurious black-holes for aggregated routes. The latter option is RECOMMENDED as it dramatically reduces the time for which a prefix is unreachable in the presence of aggregated routes. nit: I don't think this "prevents automatic aggregation" at a technical level, but rather that it "makes automatic aggregation rather unusable in practice" since if automatic aggregation is used, any route retraction will result in a spurious blackhole for the (minutes) expiry time, which is unacceptable for most environments. Section 3.7 Additionally, in order to ensure that any black-holes are reliably cleared in a timely manner, a Babel node sends retractions (updates with an infinite metric) for any recently retracted prefixes. Is the sending of retractions the one described by the SHOULDs in 3.7.2? If so, I'm not sure that "a Babel node sends retractions for any recently retracted prefixes" is quite accurate (since SHOULD is not a mandatory requirement); "can send" or "will generally send" might be better. Section 3.7.1 Every Babel speaker periodically advertises all of its selected routes on all of its interfaces, including any recently retracted routes. Since Babel doesn't suffer from routing loops (there is no "counting to infinity") and relies heavily on triggered updates (Section 3.7.2), this full dump only needs to happen infrequently. 
Part of the need for the full dump stems from the potential for unreliable links, right? Do we want to mention that relationship here, (and that if there are particularly unreliable links the frequency may need to be more often)?
Section 220.127.116.11 We haven't introduced "hop count" yet and just mention it in passing here as "[if the] hop count is 2 or more". Intuitively, it seems like the router should send an update if the router-ids match and the requested seqno is equal to the route entry's seqno, but I don't see this case covered in the current text. o otherwise, if the node has one or more (not necessarily feasible) routes to the requested prefix with a next hop that is not the nit: I think the parenthetical can just be "not feasible", as any feasible routes in question would have matched the previous bullet point. neighbours. However, if a seqno request is resent by its originator, the subsequent copies MAY be forwarded to a different neighbour than the initial one. Is MAY the appropriate level of strength? Trying the same neighbor would be effective if the original was unsuccessful due to packet loss, but is it possible for a routing pathology to occur that directs the request in the "wrong direction" with respect to a link or node failure?
Section 18.104.22.168 Is it worth giving some informal guidance about not sending multicast wildcard requests if a node observes others doing the same around the same time (or similar) to avoid the "serious congestion" issues?
Section 4.2 A Babel packet consists of a 4-octet header, followed by a sequence of TLVs (the packet body), optionally followed by a second sequence of TLVs (the packet trailer). Without mention of the 'body length' field here, a reader might be confused at what distinguishes the body TLVs from the trailer TLVs. The packet body and trailer are both sequences of TLVs.
The packet body is the normal place to store TLVs; the packet trailer only contains specialised TLVs that do not need to be protected by cryptographic security mechanisms. I think we need a more explicit statement that the body structure is subject to change when security mechanisms are in use, to allow for potential confidentiality-protecting cryptographic mechanisms.
Section 4.3 Length is still in octets, right?
Section 4.4 Every TLV carries an explicit length in its header; however, most TLVs are self-terminating, in the sense that it is possible to determine the length of the body without reference to the explicit Length field. If a TLV has a self-terminating format, then it MAY allow a sequence of sub-TLVs to follow the body. This seems like a statement of fact, for which a lowercase "may" is perfectly adequate. Sub-TLVs have the same structure as TLVs. With the exception of PAD1, all TLVs have the following structure: I was going to complain that it's somewhat unfortunate to use the same name for a thing that's a TLV and a thing that's a sub-TLV, even if they have identical encodings. But then I noticed that in this (sub-TLV) section we spell it "PAD1" and in the previous (TLV) section we spell it "Pad1", which are different. On the gripping hand, Sections 4.6.1 and 4.7.1 both spell it "Pad1", which are the same. So a little bit of effort rationalizing things would go a long way. The most-significant bit of the sub-TLV, called the mandatory bit, Just to be clear: this is the MSB of the 'type' octet? Also, for similar features in other protocols I've suggested the clarifying language of "comprehension-mandatory" which seems to more accurately reflect the corresponding behavior.
Section 4.5 Since the parser state is separate from the bulk of Babel's state, and since for correct parsing it must be identical across implementations, it is updated before checking for mandatory TLVs: nit: "mandatory sub-TLVs" (right?)
Section 4.6.2 MBZ Set to 0 on transmission.
Is it legal for a receiver to check and abort if any bits are nonzero?
Section 4.6.3 Sixteen bits of nonce does not provide much unguessability (I note that LISP's rfc6830bis is recommending that their 24-bit nonce echo functionality not be relied on for return-routability checks over the public Internet). However, since these acknowledgment exchanges are only between direct neighbors, it seems that they are only needed for correlating responses to requests and not for unguessability. (In this case it seems a sequence number would work just as well as a random number, and we might want to discourage random assignment in the text to avoid the risk of birthday collisions.) On the other hand, multicast acknowledgment requests could be problematic (and especially so when sequential nonces are used), and if they are intended to be allowed then we may need to consider using a larger and random nonce.
Section 4.6.6 I'm getting some severe cognitive dissonance between the "Rxcost" field and the "carrying a link's transmission cost" statement. Also, in Rxcost The rxcost according to the sending node of the interface whose address is specified in the Address field. The value FFFF hexadecimal (infinity) indicates that this interface is unreachable. if I insert commas to get "The rxcost, according to the sending node [of the TLV], of the interface whose address is specified in the Address field", does that preserve the intended meaning? nit/aside: It also feels like there's a bit of a mismatch here, in that the "rxcost of the interface" probably means the local interface (from the perspective of the sender), but that interface is being identified by the *remote* address (again, from the perspective of the sender of the TLV). So maybe "whose remote address" could resolve the mismatch I'm perceiving? (Or maybe I'm completely misunderstanding, of course.)
Interval An upper bound, expressed in centiseconds, on the time after which the sending node will send a new IHU; this MUST NOT be 0. [...] To check my understanding: are the IHUs conceptually a reply to Hellos, such that if the Hellos stopped arriving then the peer would stop sending IHUs in response? I understand that their intervals are set completely independently, so there is not a direct causal relationship, but I'm trying to check whether the quoted sentence is a strict commitment by the sender of the IHU or could be rescinded due to external events. Section 4.6.9 If the Metric field is finite, the router-id of the originating node for this announcement is taken from the prefix advertised by this Update if the Router-Id flag is set, computed as described above. Otherwise, it is taken either from the preceding Router-Id packet, or the preceding Update packet with the Router-Id flag set, whichever comes last, even if that TLV is otherwise ignored due to an unknown mandatory sub-TLV. Both cases of "packet" here should be "TLV", right? Otherwise we have to scope what set of previous packets are applicable to this route (since we get a lot of packets, for a lot of different routes). Section 5 "Specification Required" also requires Expert Review. What guidance can we provide to the experts for making registration decisions? Section 6 It's a little disappointing that we provide four different PDUs for padding but then have no discussion of privacy considerations related to (potentially encrypted) packet length, and when (else) one might want to pad, and what padding policy might look like. I understand that padding policy remains something of an open research question, but even acknowledging that can still be useful. Is an attacker's capability limited to misdirecting traffic? Can it cause traffic to be blackholed or cause routing loops by falsifying protocol data either modified in transit or originating false data? 
What are the effects of an attacker completely or selectively dropping protocol data? In essence, please flesh out "completely insecure" with a bit more detail. The information that a Babel node announces to the whole routing domain is often sufficient to determine a mobile node's physical location with reasonable precision. The privacy issues that this causes can be mitigated somewhat by using randomly chosen router-ids and randomly chosen IP addresses, and changing them periodically. "periodically" may not be the best advice; coupling such changes to mobility events is likely to be more effective at preserving privacy. (QUIC has discussed related topics quite extensively, though there's enough traffic in the archives that I can neither point you at a specific thread nor recommend searching for it.)
Section 8.2 I think at least BABEL-HMAC needs to be normative, since it is RECOMMENDED.
Section A.1 If we're talking about "appending bits" to the history fields, maybe describing them as fixed-length queues or something makes more sense than vectors. If the field is maintained in a 16-bit integer, what is done for the previously erased bits when we "undo history"? Whenever either Hello timer associated to a neighbour expires, the local node adds a 0 bit to this neighbour's Hello history, and We keep two hello histories; we should clarify that the one in question is the one corresponding to the timer that expired.
Section A.2.2 I don't understand the origin of the '256' in the MIN(1, 256/txcost) formula (described as a probability estimate). I think a lot more work is needed to convince me that the two given formulae for "cost" are equivalent (especially given that 'rxcost' only appears once in the entire section, in the second formula).
Section A.3.2 Is k "allowed to" (I know this section is just informative) vary on non-external data, such as the route or link in question?
Appendix C I could see this content in the main body of the document.
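The Section A.1 question about erased bits can be seen in a small sketch of the history as a fixed-length queue held in a 16-bit integer. The bit ordering (newest Hello in the least-significant bit) and the class and method names are assumptions for illustration; the draft's own representation may differ.

```python
HISTORY_BITS = 16
HISTORY_MASK = (1 << HISTORY_BITS) - 1


class HelloHistory:
    """Fixed-length queue of Hello outcomes stored in a 16-bit integer.

    Assumed convention: the newest outcome sits in the least-significant bit.
    """

    def __init__(self) -> None:
        self.bits = 0

    def append(self, received: bool) -> None:
        # Shifting left drops the oldest outcome once 16 are stored.
        self.bits = ((self.bits << 1) | int(received)) & HISTORY_MASK

    def undo(self, count: int) -> None:
        # "Undo history": discard the newest `count` outcomes. The oldest
        # positions are refilled with 0, since the erased bits are gone.
        self.bits >>= count


if __name__ == "__main__":
    h = HelloHistory()
    for _ in range(HISTORY_BITS):
        h.append(True)       # sixteen received Hellos
    h.append(False)          # one missed Hello pushes out the oldest 1
    h.undo(1)                # removing it does not bring that 1 back
    print(f"{h.bits:016b}")
```

In this sketch the undone history ends with a 0 in the oldest position even though every recorded Hello was received, which is the ambiguity the comment asks the text to resolve.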
Thanks for your work on this well written document. Most of the issues I found have been covered in the ballot positions of my esteemed colleagues. I did have one major concern that I would like to see addressed though. This is in regard to backward compatibility with RFC6126 implementations. Due to the addition of the mandatory bit and the processing associated with it, I would think that the new implementations will not be able to properly interoperate with the existing RFC6126 implementations. Is my understanding correct? If so, I would like to see some text explaining what is the expected behavior when deploying into legacy environments. If not, I would greatly appreciate an explanation and I will clear.
* Appendix F I think a consolidated change log from RFC6126 would be more helpful in the finished RFC for existing implementers.
(Sorry I forgot two points about the appendix; see one in the discuss section and one in the comment section) I have a couple of points that need addressing before this document can move forward. Most of them should be straightforward to address. My main point is about network load. Thanks for discussing network load and correctly adding some warnings at the right places; however, for a PS track document I would like to see more than this. Usually it's good to provide default values where suitable (as this often is what people will then pick if there is no good reason to diverge) and, more importantly, I would really like to see min/max values. Note that RFC8085 recommends a minimal interval of 3 seconds, which probably is also a good hard boundary here. More concretely I think there are these cases that need more guidance:
- Section 3.7.2. (Triggered Updates) advises to send a message multiple times for redundancy in case of loss. 5 and 2 are mentioned as example values. Please provide a normative default value and a normative maximum value here. Moreover the spec should also require to pace out these messages and avoid "tail loss" by overloading the local queue. (See also section 22.214.171.124)
- Section 126.96.36.199. (Route Requests) says: "Full route dumps MAY be rate-limited, especially if they are sent over multicast." I think this should at least be a SHOULD. Please also provide further guidance about how to appropriately rate limit, and think about other cases where a recommendation to implement rate-limiting could make sense.
- In section 4.1.1 the update interval needs a lower limit (e.g. 3 seconds) and a recommended default value would be good as well (Note that there are other parts in section 3 where the update value is discussed as well).
- Section 188.8.131.52. mentions network load when requests are sent to all neighbours after reboot. Please provide more guidance about how to pace out these requests.
- Section 184.108.40.206.
(Seqno Requests) discusses hop count values but could maybe also give more concrete guidance. I would assume that the hop count value of the current active route is usually known. Maybe that knowledge could be used to pick an appropriate value?
Three other smaller discuss points/questions/comments:
1) Sec 4.6.8. (Next Hop): If I interpret this correctly, address compression is allowed for the next hop field and therefore this TLV would actually not be self-terminating. What am I missing?
2) This document needs to specify a registration policy also for each of the already existing registries given this document obsoletes RFC7557.
3) Appendix D (Stub Implementations) contains normative language and therefore should probably be moved into the body of the draft.
Other comments:
1) While this point might not rise to discuss level, it would probably also be good to provide more concrete advice on how to implement jitter: Sec 3.1.: “ A moderate amount of jitter may be applied to packets sent by a Babel speaker: outgoing TLVs are buffered and SHOULD be sent with a small random delay.” Sec 4: “a Babel node SHOULD buffer every TLV and delay sending a packet by a small, randomly chosen delay [JITTER].”
2) Sec 4.1.2. (Router-Id) should probably state again that the router-id is assumed to be unique within a domain.
3) Sec 4: “The most-significant bit of the sub-TLV, called the mandatory bit, indicates how to handle unknown sub-TLVs.” I would recommend also indicating this bit in the image.
4) Sec 4.4: “If a TLV has a self-terminating format, then it MAY allow a sequence of sub-TLVs to follow the body.” Initially I wasn’t quite sure what you wanted to say here. I guess you are saying that the length would indicate a larger value than needed for the body and therefore a sub-TLV might be present? I recommend clarifying this here a bit.
5) I recommend moving Appendix C (Considerations for protocol extensions) into the body of the document.
I really enjoyed reading this document! Thank you for the work and time that has gone into it. However, I don't think that this specification is ready to be published as a Proposed Standard. In general, I don't think that the document is clear or specific enough to be considered in the Standards Track -- that is the main reason for this DISCUSS.
(A) Clear Defaults and Operational Guidance While I appreciate Babel's flexibility in terms of the ability to use different strategies, I believe that both defaults and clear guidance should be provided. Given that "not all...strategies will give good results" and that in most cases these are listed as possible choices, I don't think that this document "has resolved known design choices" [BCP9/rfc7127]. The cost/metric computation and route selection especially concern me because I believe that a robust/clear specification is at the heart of any routing protocol. In general what I am looking for to resolve this part of the DISCUSS are two items:
(A1) Clear defaults. For example, Appendix B talks about constants/default values. I would assume, given the existing experience, that the values there are probably sensible defaults. Is that not the case?
(A2) Operational Considerations. Given that Babel can be (and is) used in different environments, I would like to see guidance to operators as they deploy the protocol in their networks. An example of the type of discussion I would like to see expanded is: "a mobile node that is low on battery may choose to use larger time constants (hello and update intervals, etc.) than a node that has access to wall power" (§1.1). Consider §2 in rfc5706 (Operational Considerations - How Will the New Protocol Fit into the Current Environment?). I believe that both items are important, especially in a protocol as flexible as Babel. Some of this guidance could have been included in draft-ietf-babel-applicability -- but this information is not there either.
(B) Error Handling Many sections of the document describe functionality, or even Normatively mandate it, but there is no discussion about Error Handling.
(B1) Router-Id Setting §4.5: o the current router-id; this is undefined at the start of the packet, and is updated by each Router-ID TLV (Section 4.6.7) and by each Update TLV with Router-Id flag set. It took me some time to figure out the reason for being able to carry the router-id in two different places inside the same packet, which is my interpretation of the "and" above. Let me see if I understood: a packet can carry multiple updates...updates contain routes that were either originated by the local node, OR, learned from other routers...the router-id matches the originator... So...if a packet carries multiple updates, some locally originated and some learned, then it is possible for the packet to first include (for example) a Router-ID TLV (indicating router-id_A), followed by some Update TLVs (without the R-bit set), and then some other Update TLVs (with the R-bit set)... Did I understand correctly? If so, I think there are significant pieces of this operation that are not clearly specified in the document. There is mention of the effect of the Router-ID TLV (or the Update TLV w/R=1) on subsequent Update TLVs...there is a very subtle hint (for my taste) in §4.5 (Parser state) about the state learned for each packet from those TLVs...but there is no explicit text that talks about the need for strict ordering when sending and later when processing...it is all simply implied. What should happen if no Router-Id has been defined? For example, an Update (R = 0) is received but no Router-ID TLV is present... What if the Router-ID TLV is present, but *after* the Update? There are many possible combinations...
(B2) Default Prefix Similar comments as above... "P (Prefix) flag...establishes a new default prefix for subsequent Update TLVs with a matching address encoding within the same packet" (§4.6.9).
What if an update with an AE that allows compression is received *before* the one that sets the new default prefix?

(B3) Next Hop

§4.6.9:

   The next-hop address for this update is taken from the last preceding Next Hop TLV with a matching address family (IPv4 or IPv6) in the same packet even if it was otherwise ignored due to an unknown mandatory sub-TLV; if no such TLV exists, it is taken from the network-layer source address of this packet.

What if the Next Hop TLV doesn't exist and the network-layer source address doesn't correspond to the address family in the Update? For example, let's say IPv6 is used as the network-layer protocol and the Update contains IPv4 prefixes...

(B4) For the Normative behavior listed here (I may have missed other instances), I have basically the same question: what should a receiver do if it is not the case?

- §3.8.1.2: "A node MUST NOT increase its sequence number by more than 1 in response to a seqno request."
- §4: "A Babel packet MUST be sent as the body of a UDP datagram, with network-layer hop count set to 1..."
- §4.6.9: "If the metric is finite, AE MUST NOT be 0. If the metric is infinite and AE is 0, Plen and Omitted MUST both be 0."
- §4.6.10: "...if AE is 0 (in which case Plen MUST be 0 and Prefix is of length 0)."
- §4.6.10/§4.6.11: Is AE 3 a valid value in a request? I assume it isn't. What should a receiver do if AE = 3?

(C) Mandatory Bit

§4.4: "The most-significant bit of the sub-TLV, called the mandatory bit..." The most significant bit of which part of the sub-TLV? As written, that bit would be the first one in the Type, which corresponds to the text in the IANA section. Please be specific.

In the IANA considerations section, please include the whole registry in the table to avoid confusion. Note that because of the mandatory bit, the 128-239 range should be Reserved...but it is currently marked as Unassigned. Even worse, value 128 is assigned already [draft-ietf-babel-source-specific].
The impact may not be too bad because I doubt that Pad1 would need to be mandatory, but it at least causes confusion and inconsistency, and (as currently specified) there would be no way to differentiate between Pad1 and the Source Prefix sub-TLV.
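
The collision described in (C) can be illustrated with a minimal sketch. It assumes the reading questioned above, where the most-significant bit of the one-octet Type is treated as a flag separate from a 7-bit type code. The function name `split_type` and the constants are hypothetical and are not taken from the draft.

```python
MANDATORY_BIT = 0x80  # most-significant bit of the one-octet sub-TLV Type


def split_type(type_octet):
    """Split a sub-TLV Type octet into (mandatory flag, 7-bit code).

    Assumption: this models the interpretation questioned in the review,
    not necessarily the one the draft intends.
    """
    if not 0 <= type_octet <= 0xFF:
        raise ValueError("sub-TLV Type is a single octet")
    return bool(type_octet & MANDATORY_BIT), type_octet & 0x7F


PAD1 = 0             # Pad1 sub-TLV
SOURCE_PREFIX = 128  # value the review notes is already assigned

pad1 = split_type(PAD1)
source_prefix = split_type(SOURCE_PREFIX)

# Under this reading both values carry the same 7-bit code (0).
collide = pad1[1] == source_prefix[1]
```

Under that reading the two sub-TLVs differ only in the flag, which is the ambiguity the text above asks the document to resolve. Treating the whole octet as the type, with 128-255 as the mandatory half of the space, would avoid it.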
(a) §3.1 introduces the term "urgent TLVs".

(1) It might be a good idea to explicitly list which TLVs these are. There are some references in subsequent sentences, which may or may not be enough for most readers.

(2) There are some Normative actions applied to them, for example "MUST be sent in a timely manner". While the intent may be ok, the Normative enforcement of "in a timely manner" is not clear at all -- how do you comply with that? Appendix B says:

   The amount of jitter applied to a packet depends on whether it contains any urgent TLVs or not (Section 3.1). Urgent triggered updates and urgent requests are delayed by no more than 200ms; acknowledgments, by no more than the associated deadline; and other TLVs by no more than one-half the Multicast Hello interval.

I think it would help if this text were moved to §3.1 to make it explicitly clear what a timely delay is...and the text were changed to (something like) "MUST NOT be delayed more than 200ms".

(b) §3.2.2: "SHOULD NOT increment its sequence number (seqno) spontaneously" When is it ok to increase the seqno spontaneously? IOW, why not use MUST NOT? I think it would be better if there was a clear indication of when the seqno is increased. Scanning the rest of the document, it seems that those indications are in place.

(c) There seems to be no specific explanation of how the timers are handled, what happens when they expire, etc. For example, §3.2.4 includes this text:

   There are three timers associated with each neighbour entry -- the multicast hello timer, which is initialised from the interval value carried by scheduled Multicast Hello TLVs, the unicast hello timer, which is initialised from the interval value carried by scheduled Unicast Hello TLVs, and the IHU timer, which is initialised to a small multiple of the interval carried in IHU TLVs.

But there is no explanation (that I could find) about how to manage those timers.
The only place where hello timers are mentioned is in Appendix A.1...but that is just an example.

(d) §3.7.2 includes two instances of "SHOULD make a reasonable attempt at ensuring that all [reachable] neighbours receive this update/retraction". What does making "a reasonable attempt" mean? How can that be Normatively enforced?

(e) §3.7.2

   Finally, a node MAY send a triggered update when the metric for a given prefix changes in a significant manner, due to a received update, because a link's cost has changed, or because a different next hop has been selected. A node SHOULD NOT send triggered updates for other reasons, such as when there is a minor fluctuation in a route's metric, when the selected next hop changes, or to propagate a new sequence number (except to satisfy a request, as specified in Section 3.8).

How much is "a significant manner"? What about "a minor fluctuation"? Are the modifiers (next hop change, for example) the only conditions to take into account, or are they just examples of when these significant/minor changes may occur? How can these terms be Normatively enforced?

(f) §3.8.1.1: "When a node receives a wildcard route request, it SHOULD send a full route table dump." When is it ok to not send a full table dump? IOW, why is MUST not used?

(g) §3.8.2.1: "a node SHOULD repeat such a request a small number of times if no route becomes feasible within a short time." What do "a small number of times" and "within a short time" mean? How can that be Normatively enforced? Please be specific.

(h) §4.6.9: "Omitted...that should be taken from a preceding Update TLV in the same address family with the Prefix flag set." What if that Update TLV is not in the packet?

(i) Security Considerations

The initial vulnerability listed ("attacker can misdirect data traffic by advertising routes with a low metric or a high seqno") is only one of several actions an attacker can take.
More importantly, if the attacker happens to be in control of an authenticated node, then the mitigation proposed doesn't help. This type of rogue node can, for example, set the mandatory bit in an unknown TLV (as in completely made up!) to cause whole TLVs to be ignored, resulting in loss of routes, etc. I am not sure what can be done to mitigate this type of vulnerability...but I think it is important that it is at least called out.

(j) Are the appendices intended to be Normative or not? I'm assuming the answer is no...but I can base that only on the references in the text to Appendix A.*, pointing to them as examples. What about the others? They are not even referenced in the text. Some comments:

- Appendix B talks about constants/default values. See my DISCUSS comments above.
- Appendix C "is intended to guide designers of protocol extensions in chosing a particular encoding." I think it is valuable information. It would be very nice if there was a reference (or perhaps several, from where the different extensibility methods are presented) in the main body of the specification. I can see how this is an informative section.
- Appendix D defines a "stub implementation". This is also valuable information. But...there's no reference from the text, and Normative language is used... Why is this type of implementation (which I would think might be relatively common) not normative?
- Appendix E simply points to the sample implementation. Personally, I would prefer to see an rfc7942 section instead -- it would have been nice to also mention other implementations.

(k) "The length of..." is used everywhere in the document, but no units are mentioned. Some seem to obviously be in octets, but others could easily be in bits...

(l) §4: s/SHOULD attempt to maximise the size of the packets/SHOULD maximise the size of the packets

(m) §4.1.3: The description of AE 1 and 2 says that "Compression is allowed." -- but it looks like the only place where it can happen is in an Update.
It might be nice to indicate that...and avoid indicating that compression is not allowed where it can't be done anyway.

(n) rfc8126 should be a Normative reference.

(o) Please include Informative references to rfc6126 and rfc7557.

(p) s/Bellman-Ford protocol/Bellman-Ford algorithm

(q) §2.4: Include an Informative reference to AODV (rfc3561).

(r) §2.4: "if A has selected B as its successor" This is the only place where "successor" is used. For clarity, perhaps use a different word/description.
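
Comment (h), like DISCUSS items (B1) and (B2), turns on per-packet parser state that depends on TLV order. A minimal sketch of the router-id case follows. The dict-based TLV representation and the function `process_packet` are hypothetical, and the choice to record and skip an Update that arrives while the router-id is still undefined is an assumption made for illustration; it is exactly the behaviour the questions above ask the document to specify.

```python
def process_packet(tlvs):
    """Walk a packet's TLVs in order, tracking the current router-id.

    Each TLV is a hypothetical dict. The "router_id" key on an Update
    stands in for the value that an Update with the Router-Id flag set
    would establish. Returns (accepted, problems).
    """
    router_id = None  # undefined at the start of every packet (quoted 4.5 text)
    accepted = []     # (router_id, prefix) pairs that could be attributed
    problems = []     # indexes of Updates seen with no router-id defined
    for index, tlv in enumerate(tlvs):
        if tlv["kind"] == "router-id":
            router_id = tlv["router_id"]
        elif tlv["kind"] == "update":
            if tlv.get("router_id_flag"):
                router_id = tlv["router_id"]
            if router_id is None:
                # Assumption: record and skip. The draft text under review
                # does not say what a receiver should do here.
                problems.append(index)
                continue
            accepted.append((router_id, tlv["prefix"]))
    return accepted, problems


packet = [
    {"kind": "update", "prefix": "2001:db8:1::/48"},  # no router-id yet
    {"kind": "router-id", "router_id": "A"},
    {"kind": "update", "prefix": "2001:db8:2::/48"},  # attributed to A
    {"kind": "update", "prefix": "2001:db8:3::/48",
     "router_id_flag": True, "router_id": "B"},       # switches to B
    {"kind": "update", "prefix": "2001:db8:4::/48"},  # attributed to B
]
accepted, problems = process_packet(packet)
```

The same pattern applies to the default prefix and to the next hop: each is state set by an earlier TLV, so a receiver needs a defined action when the TLV that sets it is missing or arrives late.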
I support the DISCUSS points by Ben and Álvaro.
Dear authors,

Thank you for the work put into this document and its companion documents. The text is usually clear and explanations are concise and easy to understand. I have nevertheless some COMMENTs and a few NITs.

Regards,
-éric

== COMMENTS ==

-- Section 1.1 --
In the properties bullet points, is "its diameter" about the loop diameter or the network diameter? I suspect the latter but this is ambiguous IMHO.

-- Section 3 --
It is unclear from reading "The protocol encoding is slightly more compact when router-ids are assigned in the same manner as the IPv6 layer assigns host IDs." whether EUI-64 is referred to here. I also fail to see in section 4.6.7 (router-id TLV) what the encoding benefit is.

-- Section 3.1 --
Should there be a mention of the maximum UDP datagram size, and some words on layer-3 fragmentation? I understand that section 4 has a section on this, so perhaps refer to that section here for completeness?

-- Sections 3.2.3 and 3.2.4 --
Does a dual-stack host have to send 2 hellos? One on each protocol stack (v6 or v4)? Unclear from the explanation.

-- Section 3.4.2 --
It is unclear to me whether a link to a neighbor (router-id) can have a different cost depending on whether v6 or v4 is used.

-- Section 4 --
Is there a reason why the well-known ports and multicast group addresses are not spelled out in this section? They only appear in the IANA considerations section. Also, should the hello be sent over v6 _AND_ v4?

-- Section 4.2 --
Not a real comment, just an appreciation of your humor: "The arbitrary but carefully chosen value 42" ;-) you made my Monday morning!

-- Section 4.4 --
Is there any reason why the 'mandatory bit' does not appear in the packet structure?

-- Section 4.6.2 --
Is there any reason why "MBZ" is not expanded? Must Be Zero?

-- Section 4.6.3 --
I wonder how the receiver could estimate the propagation time (in each direction BTW) + queuing time + whatever delay...
-- Section 4.6.7 --
Should the router-id field length be repeated here as well?

== NITS ==

-- Section 1.1 --
Rather than using 'naive' to describe RIP, let's rather use 'trivial' or 'simple' ;-)
s/the routers involved participate/the involved routers participate/ ?

-- Section 3.2.6 --
Suggest to use '0xffff' rather than 'FFFF' and be consistent in the use of lowercase / uppercase for hexadecimal numbers.

-- Section 4.5 --
Parsing "since for correct parsing it must be identical across implementations" is not easy... a comma would be welcome.

-- Section 4.6.3 --
s/receiver send/receiver sends/