Ballot for draft-ietf-anima-grasp

Comment (2017-05-22 for -12) Unknown

Firstly, thank you for addressing Joel's OpsDir review.
As others have noted, this is a long document :-) I think that, in spite of this, it is very well written....

These comments were written against -11, but I think they are still applicable to -12.

Section 2.1, D1:
"... the protocol can represent and discover any kind of
technical objective ..." While the document *does* say that readers should be familiar with RFC7575, RFC7576, and I-D.ietf-anima-reference-model, I think it would still be helpful to (briefly) describe an objective here, or simply mention that "technical objective" is a term of art and point to the Terminology section (or Sec. 3.10). When I initially read this it sounded incredibly broad; once I found the Terminology section it all made more sense...

S2.2. Requirements for Synchronization and Negotiation Capability

"SN5.
...
It follows that the protocol’s resource requirements must be appropriate for any device that would otherwise need human intervention."

I found this sentence confusing / hard to parse. I *think* that you are saying that the protocol should not require so many resources that it cannot be deployed on devices (and so humans would still need to manually manage them)?
If so, I think that this could be clearer, but, unfortunately I cannot provide better text...

3.2. High Level Deployment Model
"A more common model is expected to be a multi-purpose device capable of containing several ASAs."
I'm sure you are right... but for a reader new to the topic this is not obvious (nor clear) - would it be possible to provide some examples of such devices (or a brief description of why a more common model would have several ASAs)? E.g.: "multi-purpose device capable of containing several ASAs (such as a router or large switch)" (or whatever...)

"..it is essential that every implementation is as robust as possible."
-- this sounds suspiciously like "Don't write bad code...". What is the purpose of this statement? Do you think that it will somehow make people write better / more robust code? If so, shouldn't this be in our standard boilerplate? This whole paragraph feels like it is not actionable / is something that all code for all implementations of everything should follow... (I have a horrible feeling that I'm heading off on a soapbox rant / that this is a pet-peeve...)

Yes (for -11) Unknown

No Objection (2017-05-22 for -12) Unknown

The document includes a couple of instances of "reasonable" in normative statements (e.g., "reasonable timeout"). I would strongly recommend having specific recommendations in the document where this happens.

The CBOR definition has constants for IP_PROTO_TCP and IP_PROTO_UDP, but no way to register additional values with IANA. This does not seem future-proof.

Section 3.8.4 talks about behavior when a node has a "globally unique address," but provides no guidance for detecting this. Are nodes expected to check for link-local, zeroconf, RFC 1918, and RFC 6598 addresses? Any others?
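
To make the question concrete, here is a minimal sketch of what such a check might look like. The function name and the list of ranges are hypothetical; the draft defines neither. The list simply collects the ranges named above, plus IPv6 unique local addresses as one candidate answer to "any others?".

```python
import ipaddress

# Hypothetical list of ranges a node might treat as "not globally unique".
# The draft gives no such list; these are the ranges raised in the comment,
# plus IPv6 ULAs as an assumed extra candidate.
NON_GLOBAL_NETWORKS = [
    ipaddress.ip_network("169.254.0.0/16"),  # IPv4 link-local (zeroconf)
    ipaddress.ip_network("fe80::/10"),       # IPv6 link-local
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("100.64.0.0/10"),   # RFC 6598 shared address space
    ipaddress.ip_network("fc00::/7"),        # IPv6 unique local addresses
]


def looks_globally_unique(text: str) -> bool:
    """Return True if the address falls outside every range listed above."""
    addr = ipaddress.ip_address(text)
    if addr.is_loopback or addr.is_unspecified or addr.is_multicast:
        return False
    # Compare only against networks of the same IP version.
    return not any(
        addr.version == net.version and addr in net
        for net in NON_GLOBAL_NETWORKS
    )
```

Even this short list involves judgment calls, which is why explicit guidance in the document would help.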

No Objection (2017-06-05 for -13) Unknown

Thank you for addressing my DISCUSS and comments.

Martin's ART Review comments seem to be addressed (other than some possible cleanup of text about TLS use).

As a general comment, the document has several SHOULD/MUST level requirements which are sometimes addressed at people deploying the protocol, sometimes at UI designers and sometimes at designers of new objectives. I generally don't mind, but the document doesn't always make it clear what is the intended audience for different requirements.

No Objection (for -12) Unknown

No Objection (2017-05-22 for -12) Unknown

Substantive:

-3.5.2.1: "Messages MUST be authenticated and encryption MUST be
implemented."
Should the latter be "... MUST be used"? It seems odd for authentication to be MUST use, but crypto to only be MTI.

-3.5.4.3: "An exponential backoff SHOULD be used for subsequent
repetitions, to limit the load during busy periods."
Why not MUST? Also, is there a retry limit? (Comment applies to the other sections that mention retries with exponential backoff)
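
For illustration only, a bounded exponential backoff could be as small as the sketch below. The retry limit and the cap are hypothetical parameters; the draft text quoted above specifies neither, which is the point of the question.

```python
from typing import List


def backoff_schedule(base_ms: int, max_retries: int, cap_ms: int) -> List[int]:
    """Delays (in ms) before each retry: doubling from base_ms, capped at cap_ms.

    max_retries and cap_ms are assumptions for this sketch, not draft values.
    """
    return [min(base_ms * 2 ** attempt, cap_ms) for attempt in range(max_retries)]


# Example: five retries starting at 1 s, never waiting longer than 10 s.
print(backoff_schedule(1000, 5, 10000))  # [1000, 2000, 4000, 8000, 10000]
```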

-3.5.6.2: "To ensure that flooding does not result in a loop, the originator of
the Flood Synchronization message MUST set the loop count in the
objectives to a suitable value "
I assume this is true for discovery and negotiation as well? I don't think it was mentioned in those sections (although I think I saw a related mention in the message format sections.)

- 3.10.5: "SHOULD NOT be used in
unmanaged networks such as home networks."
Why not MUST?

-5, Privacy and Confidentiality: Did people consider IP Addresses and other potentially persistent identifiers as impacting privacy?

-7, GRASP Message and Options table: Why "Standards Action"? Would you expect some harm to be done if this were only Spec Required?

Editorial:

- Is section 2 expected to be useful to implementers once this is published as an RFC? Unless there's a reason otherwise, I would suggest moving this to an appendix, or even removing it entirely. As it is, you have to wade through an unusual amount of front material before you get to the meat of the protocol.

- Along the lines of the previous comment, I found the organization a bit hard to follow. I didn't find actual protocol details until around page 21. Procedures are split (and sometimes repeated) between the procedure sections and the message format sections. I think that will make this more difficult and error prone than necessary for implementors to read and reference. I fear readers will read one section and think they understand the procedures, and miss a requirement in the other.

- 3.5.2.2: First bullet:
Please consider a "MUST NOT" construction. "MUST only" can be ambiguous.
It would be helpful to explain why the loop count must not be more than one. I can infer that from the later sections on relays, but it was not obvious when reading this section. And unless I missed something, there's no text that puts the two ideas together.

- 3.5.4.5: This section seems redundant to the similar sections under negotiation. Since those sections have more information, would it make sense to consolidate them there?

No Objection (2017-05-23 for -12) Unknown

The comparison text to routing protocols is outdated, as it ignores TE
which can support any link/node attribute desired (bandwidth,
availability, latency, etc.), discovery, bidirectional negotiation for use, and
autoconfiguration (e.g. RFC 5340). When first discussing automatic
networks, it may have been useful to compare with routing, as at a
very high level view, it may look similar, but I think it is no longer
relevant, and very confusing for a routing person. Suggest instead of a
"I'm more complex than you" approach, remove these paragraphs.

A few minor edits will fix this.

Suggest to remove the first paragraph of Section 2.2. Or edit:
1. links are no longer simple: "consider simple link"/s/"consider link"
2. Delete from "nodes need a consistent, although partial, view of the
network topology in order for the routing algorithm to converge.  Also,
routing is mainly based on simple information synchronization between
peers, rather than on bi-directional negotiation." I think what you want to
imply by "partial" is for a protocol instance/region. But there is support today
for multi-layer and multi-region networks. And convergence scale is
implementation-specific. But none of this is relevant to anima, so it is best to
delete rather than try to fix.

Appendix E
Remove the paragraph on routing or preface with "Early routing protocols.."
And the paragraph on RSVP is really not relevant for this comparison, unless
you want to edit it, as RSVP-TE does do "discovery".

No Objection (2017-05-24 for -14) Unknown

S 3.5.4.3.
   After a GRASP device successfully discovers a locator for a Discovery
   Responder supporting a specific objective, it MUST cache this
   information, including the interface index via which it was
   discovered.  This cache record MAY be used for future negotiation or
   synchronization, and the locator SHOULD be passed on when appropriate
   as a Divert option to another Discovery Initiator.

What's an "interface index"?


S 3.5.4.4.
   Since the relay device is unaware of the timeout set by the original
   initiator it SHOULD set a timeout at least equal to GRASP_DEF_TIMEOUT
   milliseconds.

I'm not sure I'm following here. Does the relay instance retransmit
with its own timeout?


   It MUST cache the Session ID
   value and initiator address of each relayed Discovery message until
   any Discovery Responses have arrived or the discovery process has
   timed out.

How does this behave if the original initiator's timeout is
longer than GRASP_DEF_TIMEOUT?


S 3.5.5.
   A negotiation procedure concerns one objective and one counterpart.
   Both the initiator and the counterpart may take part in simultaneous
   negotiations with various other ASAs, or in simultaneous negotiations
   about different objectives.  Thus, GRASP is expected to be used in a
   multi-threaded mode.  Certain negotiation objectives may have
   restrictions on multi-threading, for example to avoid over-allocating
   resources.

"multi-threaded" is an odd word here. I assume you mean that you
are doing multiple things at once, but you might actually write
the system using non-multi-threaded techniques.
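
As a small illustration of that point, the sketch below runs two negotiations concurrently on a single thread using an event loop. The objective and peer names are made up, and the sleep is only a stand-in for a wait state such as waiting for the counterpart's next message.

```python
import asyncio
from typing import List


async def negotiate(objective: str, peer: str, wait_s: float) -> str:
    # Stand-in for a wait state; a real negotiation would await network I/O.
    await asyncio.sleep(wait_s)
    return f"{objective}@{peer}: done"


async def run_all() -> List[str]:
    # Both negotiations make progress concurrently without any threads.
    return await asyncio.gather(
        negotiate("objective-a", "peer-1", 0.02),
        negotiate("objective-b", "peer-2", 0.01),
    )


results = asyncio.run(run_all())
print(results)
```

A term such as "concurrent" would cover both this style and a threaded one.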


S 3.7.
You seem to be going to a lot of trouble to deal with session
ID collisions. Why don't you just make session IDs 128-bit
random values? Then you won't have to worry about
collisions.

  The Session ID SHOULD have a very low collision rate locally.  It
   MUST be generated by a pseudo-random algorithm using a locally
   generated seed which is unlikely to be used by any other device in
   the same network [RFC4086].

Why don't you just require a cryptographically secure PRNG?
That will be required to implement the rest of this protocol anyway.
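
A minimal sketch of that suggestion, using the operating system's cryptographically secure generator. The 128-bit default is the width proposed in this comment, not a value taken from the draft, and the function name is hypothetical.

```python
import secrets


def new_session_id(bits: int = 128) -> int:
    """Return a random Session ID drawn from the OS CSPRNG.

    The default width of 128 bits is the reviewer's proposal, not the draft's.
    """
    return secrets.randbits(bits)


session_id = new_session_id()
```

With IDs this wide, accidental collisions are negligible and no seed management is needed.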


S 3.8.2.
You seem to introduce a normative dependency on CDDL here.
I see that it's in your changelog here, but what are
your intentions about this document, given that CDDL seems
to not even be a WG document?


S 3.8.5.
      It MUST contain a time-to-live (ttl) for the validity of the
      response, given as a positive integer value in milliseconds.  Zero
      is treated as the default value GRASP_DEF_TIMEOUT (Section 3.6).

Why do this, rather than just forbidding 0?
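
The two options can be put side by side, as a sketch only. The function names are hypothetical, and the constant's value of 60000 ms is an assumption here, since the quoted text only names GRASP_DEF_TIMEOUT without giving a number.

```python
GRASP_DEF_TIMEOUT_MS = 60000  # assumed value; the quoted text only names the constant


def ttl_as_drafted(ttl_ms: int) -> int:
    """Behaviour in the quoted text: zero silently means the default timeout."""
    if ttl_ms < 0:
        raise ValueError("ttl must not be negative")
    return GRASP_DEF_TIMEOUT_MS if ttl_ms == 0 else ttl_ms


def ttl_strict(ttl_ms: int) -> int:
    """Alternative raised in this comment: zero is simply invalid."""
    if ttl_ms <= 0:
        raise ValueError("ttl must be a positive number of milliseconds")
    return ttl_ms
```

The strict variant has one fewer special case for receivers to get wrong.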


S 3.8.6.
   If a node receives a Request message for an objective for which no
   ASA is currently listening, it MUST immediately close the relevant
   socket to indicate this to the initiator.  This is to avoid
   unnecessary timeouts if, for example, an ASA exits prematurely but
   the GRASP core is listening on its behalf.

This is not secure. You need a secure indication of non-knowledge,
not a transport-level close.

S 3.9.5.4.
What are the semantics of a Divert URI? What do I do with the
path part?


S 3.10.4.
The semantics of "dry run" seem pretty unclear. Is it just
"tell me if you would be sad about doing this"?

No Objection (2017-05-23 for -12) Unknown

Thanks for addressing the SecDir review, as well as Ben's questions on the WG decisions for authentication & encryption and Spencer's on running in a secure ACP.  Clarifying the text for the latter would be helpful.

No Objection (2017-07-07 for -14) Unknown

Thanks for addressing my discuss in the upcoming version -15!

-----
Old comments for the record (I didn't check these):

Other mostly editorial comments:
- ASA needs to be spelled out in the intro.
- I would recommend to move section 2 and 3.3 into the appendix
- section 3.5.4.2: "A neighbor with multiple interfaces will respond with a cached discovery response if any."
"cached response" is explained in the next section and not clear in this paragraph.
- section 3.5.4.3: "After a GRASP device successfully discovers a locator for a Discovery
Responder supporting a specific objective, it MUST cache this
information, including the interface index via which it was
discovered. This cache record MAY be used for future negotiation or
synchronization, and the locator SHOULD be passed on when appropriate
as a Divert option to another Discovery Initiator."
Not sure why the first is a MUST and the latter is a SHOULD. I guess a SHOULD for caching would be sufficient.
- section 3.8.6 "If a node receives a Request message for an objective for which no
ASA is currently listening, it MUST immediately close the relevant
socket to indicate this to the initiator."
How is that indicated? This should really be clarified further.
- Also section 3.8.6: "In case of a clash, it MUST discard the Request message, in
which case the initiator will detect a timeout."
Why don't you send an error message instead? How does the initiator know that it should retry (assuming there is a TCP connection underneath that provides reliable transport)?
- Also section 3.8.9: "If not, the initiator MUST abandon or restart the negotiation
procedure, to avoid an indefinite wait."
How does the initiator decide between abandoning and restarting? Needs clarification!
- Could be useful to include an optional reasoning field in the Invalid Message and make copying the received message up to the maximum message size of this message a SHOULD (section 3.8.12.).
- Not sure I fully understand the purpose of the No Operation Message (section 3.8.13.). If you just want to open a socket for probing, you perform a TCP handshake and send a RST right after. No need for further application layer interactions. And should there also be an optional reasoning phrase?
- Not sure why the objectives flag is needed. I assume that unknown objectives are ignored anyway and if an objective is known the receiver should know if that objective is valid for the respective message type (section 3.10.2).
- section 3.10.4: "An issue requiring particular attention is that GRASP itself is a stateless protocol."
It's not. It caches information and needs to remember previous messages sent to reply correctly.
- section 5: "Generally speaking, no personal information is expected to be
involved in the signaling protocol, so there should be no direct impact on personal privacy."
I don't think this is true because the protocol is so generic that you cannot say anything about the services it is used for.
Please see also further comments from Martin's tsv-art review (Thanks again!)!

No Objection (2017-05-07 for -11) Unknown

In this text,

T6. The protocol must be capable of supporting multiple simultaneous
operations with one or more peers, especially when wait states occur.

I understand every word, but I'm not sure what this requires the protocol to do. Are you asking that the protocol be non-blocking? But that's a guess.

In this text,

A GRASP implementation will be part of the Autonomic Networking
Infrastructure in an autonomic node, which must also provide an
appropriate security environment. In accordance with
[I-D.ietf-anima-reference-model], this SHOULD be the Autonomic
Control Plane (ACP) [I-D.ietf-anima-autonomic-control-plane].

I wonder what happens if the security environment isn't the ACP. Is that obvious?

In this text,

An implementation MUST support use of TCP.
It MAY support use of another transport protocol. However, GRASP
itself does not provide for error detection or retransmission. Use
of an unreliable transport protocol is therefore NOT RECOMMENDED.

Just to educate me: is the strategy here that (for instance) if synchronization fails over an unreliable transport protocol, it will eventually be attempted again, just because the two ASAs know they aren't synchronized?

I'm really confused by this text.

Nevertheless, when running within a secure ACP on reliable
infrastructure, UDP MAY be used for unicast messages not exceeding
the minimum IPv6 path MTU; however, TCP MUST be used for longer
messages. In other words, IPv6 fragmentation is avoided. If a node
receives a UDP message but the reply is too long, it MUST open a TCP
connection to the peer for the reply. Note that when the network is
under heavy load or in a fault condition, UDP might become
unreliable. Since this is when autonomic functions are most
necessary, automatic fallback to TCP MUST be implemented. The
simplest implementation is therefore to use only TCP.

We've been having quite the discussion about how well Path MTU Discovery works, even in IPv6. Because GRASP could be running over virtual interfaces, I suspect there's a chance that you'll be running in a tunnel that will give you a Path MTU that's smaller than the IPv6 minimum. But ignoring that for now ...

IIRC, we've had poor experiences with protocols that are expected to switch from UDP transport to TCP transport in the middle of a request/response pair. But, setting THAT aside for now ...

This text correctly points out that UDP transport is most likely to fail under heavy network load or in a fault condition, when autonomic functions are most necessary. If TCP is mandatory to implement, and implementations will need to switch from UDP to TCP at the most awkward times, and that's been a problem area for other protocols in the past, why not just require TCP in the first place?

I see that the UDP/TCP question was listed as an open issue before it was closed, so I'm not balloting Discuss, because I assume I'm missing something that people will help me understand, but I thought about it for a while ...
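
For reference, the selection rule in the quoted text can be sketched as follows. The function and parameter names are hypothetical, and subtracting the fixed IPv6 and UDP headers from the 1280-byte minimum MTU is my reading of "not exceeding the minimum IPv6 path MTU", not something the draft spells out.

```python
IPV6_MIN_MTU = 1280     # minimum IPv6 link MTU, in bytes
IPV6_HEADER_LEN = 40    # fixed IPv6 header, assuming no extension headers
UDP_HEADER_LEN = 8
MAX_UDP_PAYLOAD = IPV6_MIN_MTU - IPV6_HEADER_LEN - UDP_HEADER_LEN  # 1232


def choose_transport(message_len: int, in_secure_acp: bool,
                     udp_seems_reliable: bool) -> str:
    """Pick "udp" only when every condition in the quoted text holds."""
    if in_secure_acp and udp_seems_reliable and message_len <= MAX_UDP_PAYLOAD:
        return "udp"
    # Longer messages, or any doubt about reliability, fall back to TCP.
    return "tcp"
```

Note that `udp_seems_reliable` has to be decided by the implementation at exactly the moment the network is least predictable; always returning "tcp" removes that decision entirely.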

Thanks for this text,

If no discovery response is received within a reasonable timeout
(default GRASP_DEF_TIMEOUT milliseconds, Section 3.6), the Discovery
message MAY be repeated, with a newly generated Session ID
(Section 3.7). An exponential backoff SHOULD be used for subsequent
repetitions, to limit the load during busy periods. Frequent
repetition might be symptomatic of a denial of service attack.

and especially for the warning about DoS attacks.

I found Appendix D and E useful. Thanks for including both of them.

No Objection (for -12) Unknown