Ballot deferred by Mirja Kühlewind on 2017-05-08.
Summary: Has 3 DISCUSSes. Has enough positions to pass once DISCUSS positions are resolved.
1) Use of transport protocols is not sufficiently defined.
Especially the following text in section 3.5.3 seems not to reflect later assumptions correctly; it seems to be assumed that TCP is used for all messages other than discovery, and that reliable transport is therefore provided for these messages (see sections 3.5.5, 3.8.4 and 3.8.5):
"All other GRASP messages are unicast and could in principle run over any transport protocol. An implementation MUST support use of TCP. It MAY support use of another transport protocol but the details are out of scope for this specification. However, GRASP itself does not provide for error detection or retransmission. Use of an unreliable transport protocol is therefore NOT RECOMMENDED."
In general the usage of the transport protocols is not well enough specified; see also Spencer's comments and this part of Martin's tsv-art review (Thanks!):
"* Usage of UDP: This document is not discussing any of the aspects in RFC 8085. Every usage of UDP is required by IETF consensus to review RFC 8085 and to address at least the applicable subset of issues listed in RFC 8085 (or the predecessor RFC 5405).
* Starting with UDP and switching to TCP for the data transfer looks like the right do. However, UDP should be really only used to discover other devices, but not piggy back further protocol mechanics. However, this document is not really specific on how to make use of TCP, for instance, how long are TCP connections kept open or closed down after a protocol exchange (persistent vs temporary connections). What happens if a TCP connection is shutdown by one end or is forcefully closed, e.g., by a reset?"
I would recommend, as assumed in the rest of the document, updating section 3.5.3 to use UDP only for the initial discovery message, to open a TCP connection for the discovery response, and to require that all other messages be sent over TCP (also removing any option to use another reliable transport, because TCP seems to be the right choice here). Further, additional guidance is needed on when to open and close a TCP connection (or keep it alive for later use) and on what to do if the connection is interrupted.
2) Time-out handling
section 126.96.36.199: "Since the relay device is unaware of the timeout set by the original initiator it SHOULD set a timeout at least equal to GRASP_DEF_TIMEOUT milliseconds."
Should a relay really maintain its own time-out? Wouldn't it be sufficient to just relay again if another discovery message is received? Otherwise this can lead to amplification: the relay sends another message when its own time-out expires, and again when it receives another discovery message triggered by the time-out of the originating peer.
Further, in relation to the point above, this should be more specific:
section 188.8.131.52: "Also, it MUST limit the total rate at which it relays discovery messages to a reasonable value, in order to mitigate possible denial of service attacks. "
3) Version and extensibility
section 184.108.40.206: "A possible future extension is to allow multiple objectives in rapid mode for greater efficiency."
How can this extension be defined if there is no version mechanism?
Other mostly editorial comments:
- ASA needs to be spelled out in the intro.
- I would recommend moving sections 2 and 3.3 into the appendix.
- section 220.127.116.11: "A neighbor with multiple interfaces will respond with a cached discovery response if any." "Cached response" is only explained in the next section and is not clear in this paragraph.
- section 18.104.22.168: "After a GRASP device successfully discovers a locator for a Discovery Responder supporting a specific objective, it MUST cache this information, including the interface index via which it was discovered. This cache record MAY be used for future negotiation or synchronization, and the locator SHOULD be passed on when appropriate as a Divert option to another Discovery Initiator." Not sure why the first is a MUST and the latter is a SHOULD. I guess a SHOULD for caching would be sufficient.
- section 3.8.6: "If a node receives a Request message for an objective for which no ASA is currently listening, it MUST immediately close the relevant socket to indicate this to the initiator." How is that indicated? This should really be clarified further.
- Also section 3.8.6: "In case of a clash, it MUST discard the Request message, in which case the initiator will detect a timeout." Why don't you send an error message instead? How does the initiator know that it should retry (assuming there is a TCP connection underneath that provides reliable transport)?
- Also section 3.8.9: "If not, the initiator MUST abandon or restart the negotiation procedure, to avoid an indefinite wait." How does the initiator decide between abandoning and restarting? Needs clarification!
- Could be useful to include an optional reasoning field in the Invalid Message and make copying the received message up to the maximum message size of this message a SHOULD (section 3.8.12).
- Not sure I fully understand the purpose of the No Operation Message (section 3.8.13). If you just want to open a socket for probing, you perform a TCP handshake and send a RST right after. No need for further application layer interactions. And should there also be an optional reasoning phrase?
- Not sure why the objectives flag is needed. I assume that unknown objectives are ignored anyway, and if an objective is known the receiver should know whether that objective is valid for the respective message type (section 3.10.2).
- section 3.10.4: "An issue requiring particular attention is that GRASP itself is a stateless protocol." It's not. It caches information and needs to remember previously sent messages in order to reply correctly.
- section 5: "Generally speaking, no personal information is expected to be involved in the signaling protocol, so there should be no direct impact on personal privacy." I don't think this is true, because the protocol is so generic that you cannot say anything about the services it is used for.
Please see also further comments from Martin's tsv-art review (Thanks again!)!
I have a small list of issues that I would like to discuss before recommending approval of this document:
1) The first reference to UTF-8 needs a Normative reference to RFC 3629.
2) In Section 3.10.1, you say:
The names of generic objectives MUST NOT include a colon (":") and MUST be registered with IANA (Section 7).
In Section 7 you only say:
GRASP Objective Names Table. The values in this table are UTF-8 strings. Future values MUST be assigned using the Specification Required policy defined by [RFC5226].
IANA is not going to review section 3.10.1 and there is no back reference in Section 7. IANA needs to know that values with ":" are not to be registered.
Martin's ART Review comments seem to be addressed (other than some possible cleanup of text about TLS use). As a general comment, the document has several SHOULD/MUST level requirements which are sometimes addressed at people deploying the protocol, sometimes at UI designers and sometimes at designers of new objectives. I generally don't mind, but the document doesn't always make it clear what is the intended audience for different requirements.
Other smaller things:
"Fully Qualified Domain Name" probably needs a Normative Reference.
22.214.171.124. Discovery Procedures
In 6th para:
The cache mechanism MUST include a lifetime for each entry. The lifetime is derived from a time-to-live (ttl) parameter in each Discovery Response message. Cached entries MUST be ignored or deleted after their lifetime expires. In some environments, unplanned address renumbering might occur. In such cases, the lifetime SHOULD be short compared to the typical address lifetime and a mechanism to flush the discovery cache MUST be implemented.
How can the discovery cache be flushed?
126.96.36.199. Locator URI option
In fragmentary CDDL, the URI option follows the pattern:
uri-locator = [O_URI_LOCATOR, text]
I suggest inclusion of optional transport protocol here to match other locators and to follow best practices for not encoding transport information in URIs.
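To make the cache question concrete, here is a minimal sketch of what a lifetime-bound discovery cache with an explicit flush could look like. It is only an illustration under stated assumptions: the class and method names (`DiscoveryCache`, `store`, `lookup`, `flush`), the string locator, and the injected clock are all hypothetical and are not taken from the draft.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    locator: str           # hypothetical: locator kept as a plain string
    interface_index: int   # interface on which the response was received
    expires_at: float      # clock reading after which the entry is stale


class DiscoveryCache:
    """Hypothetical discovery cache keyed by objective name."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, CacheEntry] = {}

    def store(self, objective: str, locator: str,
              interface_index: int, ttl_ms: int) -> None:
        # The lifetime is derived from the ttl (milliseconds) of the response.
        expires_at = self._clock() + ttl_ms / 1000.0
        self._entries[objective] = CacheEntry(locator, interface_index, expires_at)

    def lookup(self, objective: str) -> Optional[CacheEntry]:
        entry = self._entries.get(objective)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            # Expired entries are deleted rather than returned.
            del self._entries[objective]
            return None
        return entry

    def flush(self) -> None:
        # One possible "flush" mechanism: drop everything, e.g. on renumbering.
        self._entries.clear()
```

In this sketch the flush is a local operation that some management action or renumbering event would have to trigger; the draft would still need to say what that trigger is.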
ISSUE 1
The security situation here is pretty unspecified, in at least two respects:
1. In terms of communication security, you seem to have two modes: (a) Punt it to ACP (b) Use TLS as specified in S 188.8.131.52. I'm not reviewing ACP here (though I have some comments on that too), but S 184.108.40.206 doesn't, for instance, explain how to do certificate validation, which it clearly needs to do. Finally, I don't understand the security story for the multicast packets. This is especially relevant for Rapid mode, where you are attaching real work to these multicast packets.
2. I didn't find the security model very clear. As I understand things, basically anyone on the network who has ACP credentials is trusted to engage in negotiation with you, so, for instance, if you want to get parameter X, then you basically just trust whoever on the network offers you X. Is that correct? That seems like it needs to be very explicitly called out. And if that's not true, then I don't understand the spec.
ISSUE 2
This document seems like it provides incomplete guidance on how to actually implement it. For instance:
discovery messages to a reasonable value, in order to mitigate possible denial of service attacks. It MUST cache the Session ID value and initiator address of each relayed Discovery message until
What's "reasonable"?
ISSUE 3
I don't think I understand how the transition from UDP multicast to TCP/TLS unicast works. Maybe I'm just misreading the spec, so could you point me to the section that describes this?
Finally, I don't see a spec for how you map CBOR onto the wire. Do you just shove them on? Something else?
I see that Martin Thomson raised a number of these issues in his review in more detail.
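As an illustration of what a concrete answer to "What's reasonable?" might look like, the sketch below bounds the relay rate with a token bucket. This is one common technique, not something the draft specifies; the class name `RelayRateLimiter`, its parameters, and the numbers in the usage note are hypothetical.

```python
import time
from typing import Callable


class RelayRateLimiter:
    """Hypothetical token bucket limiting how often discovery is relayed."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s          # sustained relays allowed per second
        self._capacity = float(burst)    # maximum back-to-back relays
        self._tokens = float(burst)      # bucket starts full
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        """Return True if one more Discovery message may be relayed now."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A relay would call `allow()` before forwarding each Discovery message and drop the message when it returns False. With, say, `RelayRateLimiter(rate_per_s=2.0, burst=3)` the sustained rate and the burst size become explicit numbers that a specification could state instead of "reasonable".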
S 220.127.116.11.
After a GRASP device successfully discovers a locator for a Discovery Responder supporting a specific objective, it MUST cache this information, including the interface index via which it was discovered. This cache record MAY be used for future negotiation or synchronization, and the locator SHOULD be passed on when appropriate as a Divert option to another Discovery Initiator.
What's an "interface index"?
S 18.104.22.168.
Since the relay device is unaware of the timeout set by the original initiator it SHOULD set a timeout at least equal to GRASP_DEF_TIMEOUT milliseconds.
I'm not sure I'm following here. Does the relay instance retransmit with its own timeout?
It MUST cache the Session ID value and initiator address of each relayed Discovery message until any Discovery Responses have arrived or the discovery process has timed out.
How does this behave if the original initiator's timeout is longer than GRASP_DEF_TIMEOUT?
S 3.5.5.
A negotiation procedure concerns one objective and one counterpart. Both the initiator and the counterpart may take part in simultaneous negotiations with various other ASAs, or in simultaneous negotiations about different objectives. Thus, GRASP is expected to be used in a multi-threaded mode. Certain negotiation objectives may have restrictions on multi-threading, for example to avoid over-allocating resources.
"multi-threaded" is an odd word here. I assume you mean that you are doing multiple things at once, but you might actually write the system using non-multi-threaded techniques.
S 3.7.
You seem to be going to a lot of trouble to deal with session ID collisions. Why don't you just make session IDs 128-bit random values, and then you won't have to worry about collisions?
The Session ID SHOULD have a very low collision rate locally. It MUST be generated by a pseudo-random algorithm using a locally generated seed which is unlikely to be used by any other device in the same network [RFC4086].
Why don't you just require a cryptographically secure PRNG? That will be required to implement the rest of this protocol.
S 3.8.2.
You seem to introduce a normative dependency on CDDL here. I see that it's in your changelog here, but what are your intentions about this document, given that CDDL seems to not even be a WG document?
S 3.8.5.
It MUST contain a time-to-live (ttl) for the validity of the response, given as a positive integer value in milliseconds. Zero is treated as the default value GRASP_DEF_TIMEOUT (Section 3.6).
Why do this, rather than just forbidding 0?
S 3.8.6.
If a node receives a Request message for an objective for which no ASA is currently listening, it MUST immediately close the relevant socket to indicate this to the initiator. This is to avoid unnecessary timeouts if, for example, an ASA exits prematurely but the GRASP core is listening on its behalf.
This is not secure. You need a secure indication of non-knowledge, not a transport-level close.
S 22.214.171.124.
What are the semantics of a Divert URI? What do I do with the path part?
S 3.10.4.
The semantics of "dry run" seem pretty unclear. Is it just "tell me if you would be sad about doing this"?
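To show how little machinery the 128-bit suggestion in S 3.7 needs, here is a small sketch using an operating-system CSPRNG. The 128-bit width is the size proposed in this comment, not the draft's Session ID size, and the names `SESSION_ID_BITS` and `new_session_id` are hypothetical.

```python
import secrets

# Assumption: 128 bits, as suggested in the comment above (not the draft's size).
SESSION_ID_BITS = 128


def new_session_id() -> int:
    """Return a fresh Session ID drawn from the OS CSPRNG."""
    # secrets.randbits(k) returns a non-negative int with k random bits.
    return secrets.randbits(SESSION_ID_BITS)
```

With identifiers of this size drawn from a CSPRNG, accidental collisions are negligible in practice, so the collision-handling rules could plausibly be dropped rather than specified.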
The comparison text to routing protocols is outdated, as it ignores TE, which can support any link/node attribute desired (bandwidth, availability, latency, etc.), discovery, bidirectional negotiation for use, and autoconfiguration (e.g. RFC 5340). When first discussing automatic networks, it may have been useful to compare with routing, as at a very high level view it may look similar, but I think it is no longer relevant, and it is very confusing for a routing person. Instead of an "I'm more complex than you" approach, I suggest removing these paragraphs; a few minor edits will fix this.
Suggest removing the first paragraph of Section 2.2. Or edit:
1. Links are no longer simple: s/"consider simple link"/"consider link"/
2. Delete from "nodes need a consistent, although partial, view of the network topology in order for the routing algorithm to converge. Also, routing is mainly based on simple information synchronization between peers, rather than on bi-directional negotiation."
I think what you want to imply by "partial" is for a protocol instance/region. But there is support today for multi-layer and multi-region networks. And convergence scale is an implementation matter. But none of this is relevant to anima, so it is best to delete rather than try to fix.
Appendix E
Remove the paragraph on routing or preface it with "Early routing protocols..". And the paragraph on RSVP is really not relevant for this comparison, unless you want to edit it, as RSVP-TE does do "discovery".
Substantive: -126.96.36.199: "Messages MUST be authenticated and encryption MUST be implemented." Should the latter be "... MUST be used"? It seems odd for authentication to be MUST use, but crypto to only be MTI. -188.8.131.52: "An exponential backoff SHOULD be used for subsequent repetitions, to limit the load during busy periods." Why not MUST? Also, is there a retry limit? (Comment applies to the other sections that mention retries with exponential backoff) -184.108.40.206: "To ensure that flooding does not result in a loop, the originator of the Flood Synchronization message MUST set the loop count in the objectives to a suitable value " I assume this is true for discovery and negotiation as well? I don't think it was mentioned in those sections (although I think I saw a related mention in the message format sections.) - 3.10.5: "SHOULD NOT be used in unmanaged networks such as home networks." Why not MUST? -5, Privacy and Confidentiality: Did people consider IP Addresses and other potentially persistent identifiers as impacting privacy? -7, Grasp Message and Options table: Why "Standards Action"? Would you expect some harm to be done if this were only Spec Required? Editorial: - Is section 2 expected to be useful to implementers once this is published as an RFC? Unless there's a reason otherwise, I would suggest moving this to an appendix, or even removing it entirely. As it is, you have to wade through an unusual amount of front material before you get to the meat of the protocol. - Along the lines of the previous comment, I found the organization a bit hard to follow. I didn't find actual protocol details until around page 21. Procedures are split (and sometimes repeated) between the procedure sections and the message format sections. I think that will make this more difficult and error prone than necessary for implementors to read and reference. 
I fear readers will read one section and think they understand the procedures, and miss a requirement in the other. - 220.127.116.11: First bullet: Please consider a "MUST NOT" construction. "MUST only" can be ambiguous. It would be helpful to explain why the loop count must not be more than one. I can infer that from the later sections on relays, but it was not obvious when reading this section. And unless I missed something, there's no text that puts the two ideas together. - 18.104.22.168: This section seems redundant to the similar sections under negotiation. Since those sections have more information, would it make sense to consolidate them there?
In this text, T6. The protocol must be capable of supporting multiple simultaneous operations with one or more peers, especially when wait states occur. I understand every word, but I'm not sure what this requires the protocol to do. Are you asking that the protocol be non-blocking? But that's a guess. In this text, A GRASP implementation will be part of the Autonomic Networking Infrastructure in an autonomic node, which must also provide an appropriate security environment. In accordance with [I-D.ietf-anima-reference-model], this SHOULD be the Autonomic Control Plane (ACP) [I-D.ietf-anima-autonomic-control-plane]. I wonder what happens if the security environment isn't the ACP. Is that obvious? In this text, An implementation MUST support use of TCP. It MAY support use of another transport protocol. However, GRASP itself does not provide for error detection or retransmission. Use of an unreliable transport protocol is therefore NOT RECOMMENDED. just to educate me, is the strategy here, that (for instance) if synchronization fails over an unreliable transport protocol, that eventually it will be attempted again, just because the two ACAs know they aren't synchronized? I'm really confused by this text. Nevertheless, when running within a secure ACP on reliable infrastructure, UDP MAY be used for unicast messages not exceeding the minimum IPv6 path MTU; however, TCP MUST be used for longer messages. In other words, IPv6 fragmentation is avoided. If a node receives a UDP message but the reply is too long, it MUST open a TCP connection to the peer for the reply. Note that when the network is under heavy load or in a fault condition, UDP might become unreliable. Since this is when autonomic functions are most necessary, automatic fallback to TCP MUST be implemented. The simplest implementation is therefore to use only TCP. We've been having quite the discussion about how well Path MTU Discovery works, even in IPv6. 
Because GRASP could be running over virtual interfaces, I suspect there's a chance that you'll be running in a tunnel that will give you a Path MTU that's smaller than the IPv6 minimum. But ignoring that for now ... IIRC, we've had poor experiences with protocols that are expected to switch from UDP transport to TCP transport in the middle of a request/response pair. But, setting THAT aside for now ... This text correctly points out that UDP transport is most likely to fail under heavy network load or in a fault condition, when autonomic functions are most necessary. If TCP is mandatory to implement, and implementations will need to switch from UDP to TCP at the most awkward times, and that's been a problem area for other protocols in the past, why not just require TCP in the first place? I see that the UDP/TCP question was listed as an open issue before it was closed, so I'm not balloting Discuss, because I assume I'm missing something that people will help me understand, but I thought about it for a while ... Thanks for this text, If no discovery response is received within a reasonable timeout (default GRASP_DEF_TIMEOUT milliseconds, Section 3.6), the Discovery message MAY be repeated, with a newly generated Session ID (Section 3.7). An exponential backoff SHOULD be used for subsequent repetitions, to limit the load during busy periods. Frequent repetition might be symptomatic of a denial of service attack. and especially for the warning about DoS attacks. I found Appendix D and E useful. Thanks for including both of them.
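For concreteness, the repetition behaviour quoted above (repeat after a timeout, with exponential backoff) can be sketched as a schedule of waiting times. This is an illustration only: the function name `backoff_schedule`, the attempt limit, and the cap are hypothetical, since the quoted text gives neither a retry limit nor a maximum delay.

```python
from typing import List


def backoff_schedule(base_timeout_ms: int, max_attempts: int,
                     cap_ms: int) -> List[int]:
    """Return the wait (ms) before each successive Discovery repetition.

    base_timeout_ms stands in for GRASP_DEF_TIMEOUT; max_attempts and
    cap_ms are assumptions, not values taken from the draft.
    """
    delays: List[int] = []
    delay = base_timeout_ms
    for _ in range(max_attempts):
        delays.append(min(delay, cap_ms))  # never wait longer than the cap
        delay *= 2                         # double the wait after each attempt
    return delays
```

Each repetition would also use a newly generated Session ID, as the quoted text requires; the schedule only covers the timing.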
Firstly, thank you for addressing Joel's OpsDir review. As others have noted, this is a long document :-) I think that, in spite of this, it is very well written.... These comments were written against v-11, but I think they are still applicable to -12.
Section 2.1, D1: "... the protocol can represent and discover any kind of technical objective ..."
While the document *does* say that readers should be familiar with RFC7575, RFC7576, and I-D.ietf-anima-reference-model, I think it would still be helpful to (briefly) describe an objective here, or simply mention that "technical objective" is a term of art and point to the Terminology section (or Sec. 3.10). When I initially read this it sounded incredibly broad; once I found the Terminology section it all made more sense...
S2.2. Requirements for Synchronization and Negotiation Capability
"SN5. ... It follows that the protocol’s resource requirements must be appropriate for any device that would otherwise need human intervention."
I found this sentence confusing / hard to parse. I *think* that you are saying that the protocol should not require so many resources that it cannot be deployed on devices (and so humans would still need to manually manage them)? If so, I think that this could be clearer, but, unfortunately, I cannot provide better text...
3.2. High Level Deployment Model
"A more common model is expected to be a multi-purpose device capable of containing several ASAs."
I'm sure you are right... but for a reader new to the topic this is not obvious (nor clear) - would it be possible to provide some examples of such devices (or a brief description of why a more common model would have several ASAs)? E.g: "multi-purpose device capable of containing several ASAs (such as a router or large switch)" (or whatever...)
"..it is essential that every implementation is as robust as possible." -- this sounds suspiciously like "Don't write bad code...". What is the purpose of this statement?
Do you think that it will somehow make people write better / more robust code? If so, shouldn't this be in our standard boilerplate? This whole paragraph feels like it is not actionable / is something that all code for all implementations of everything should follow... (I have a horrible feeling that I'm heading off on a soapbox rant / that this is a pet-peeve...)
Thanks for addressing the SecDir review, as well as Ben's questions on the WG decisions for authentication & encryption and Spencer's on running in a secure ACP. Clarifying the text for the latter would be helpful.
The document includes a couple of instances of "reasonable" in normative statements (e.g., "reasonable timeout"). I would strongly recommend having specific recommendations in the document where this happens. The CBOR definition has constants for IP_PROTO_TCP and IP_PROTO_UDP, but no way to register additional values with IANA. This does not seem future-proof. Section 3.8.4 talks about behavior when a node has a "globally unique address," but provides no guidance for detecting this. Are nodes expected to check for link-local, zeroconf, RFC 1918, and RFC 6598 addresses? Any others?
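To illustrate the kind of check the "globally unique address" question implies, here is a sketch that rejects the ranges named above. The list is an assumption, not guidance from the draft: it covers IPv4 and IPv6 link-local, RFC 1918, and RFC 6598 space, and adds IPv6 unique local addresses (fc00::/7) as one possible answer to "any others?". The function name `looks_globally_unique` is hypothetical.

```python
import ipaddress

# Assumed list of non-global ranges; a real specification would have to
# state the authoritative set (other special-purpose ranges are not covered).
_NON_GLOBAL_NETWORKS = [
    ipaddress.ip_network(prefix)
    for prefix in (
        "169.254.0.0/16",   # IPv4 link-local (zeroconf)
        "10.0.0.0/8",       # RFC 1918
        "172.16.0.0/12",    # RFC 1918
        "192.168.0.0/16",   # RFC 1918
        "100.64.0.0/10",    # RFC 6598 shared address space
        "fe80::/10",        # IPv6 link-local
        "fc00::/7",         # IPv6 unique local (assumption, not in the comment)
    )
]


def looks_globally_unique(address: str) -> bool:
    """Best-effort check that an address is outside the listed ranges."""
    addr = ipaddress.ip_address(address)
    if addr.is_loopback or addr.is_unspecified or addr.is_multicast:
        return False
    return not any(
        addr.version == net.version and addr in net
        for net in _NON_GLOBAL_NETWORKS
    )
```

Even this short list shows that the answer is a policy decision; a node cannot infer "globally unique" from the address alone without the specification naming the ranges to exclude.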