Early Review of draft-ietf-lsvr-l3dl-03
review-ietf-lsvr-l3dl-03-tsvart-early-ott-2020-05-05-00

Request Review of draft-ietf-lsvr-l3dl
Requested rev. no specific revision (document currently at 07)
Type Early Review
Team Transport Area Review Team (tsvart)
Deadline 2020-04-28
Requested 2020-02-10
Requested by Wesley Eddy
Authors Randy Bush, Rob Austein, Keyur Patel
Draft last updated 2020-05-05
Completed reviews Tsvart Early review of -03 by Joerg Ott (diff)
Comments
Requested via Erik Kline.
Assignment Reviewer Joerg Ott 
State Completed
Review review-ietf-lsvr-l3dl-03-tsvart-early-ott-2020-05-05
Posted at https://mailarchive.ietf.org/arch/msg/tsv-art/rLHOXu0fRbxVOIKXrBsE-SHQdmE
Reviewed rev. 03 (document currently at 07)
Review result Ready with Issues
Review completed: 2020-05-05

Review

The draft describes a peer/neighbour discovery mechanism for large-scale L2/L3 topologies in data centres. The aim is to provide a protocol by means of which the involved nodes can learn about other nodes connected to their (broadcast or point-to-point) L2 links and about their respective supported encapsulation schemes, identifiers, L2/L3 addresses, etc. This information is then provided to a higher layer for further processing.

The document is well written and fairly easy to follow, but the introduction could benefit from a bit of extra context on the target application domain, e.g., by explaining explicitly who would talk L3DL to whom.

From a transport perspective, I see three potential issues that deserve clarification or reconsideration:

1. Section 10 spells out a default HELLO interval of 60 seconds. With a large broadcast domain, this may create quite a bit of traffic. While this may not be an issue in well-provisioned data centre networks, a remark about sensible value ranges and their implications may be worthwhile, just to provide some guidelines to implementers (who want to offer choices) and operators (who pick them).

2. Section 10 also suggests that, in response to HELLO messages, nodes will issue OPEN PDUs to newly discovered peers. This appears to bear the clear risk of an OPEN implosion when many systems come up at the same time. Shouldn't guidance be given to avoid repeated traffic surges, possible losses, and thus unnecessary delays? (I noted that other places foresee exponential backoff when retransmitting OPEN and other ACKed PDUs.)

3. When the protocol applies fragmentation, should there be a note on preventing bursts?
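
To make the three points above concrete, here is a minimal sketch of the kind of guidance I have in mind. Only the 60-second HELLO default comes from the draft. The function names, the node count, the jitter window, and the inter-fragment gap are illustrative assumptions of mine, not anything the draft specifies.

```python
import random


def hello_rate(nodes, hello_interval_s=60.0):
    """Average HELLO PDUs per second on one broadcast domain.

    Assumes every node sends one HELLO per interval (60 s is the
    draft's default; the node count is whatever the operator has).
    """
    return nodes / hello_interval_s


def open_delay(max_jitter_s, rng):
    """Hypothetical random delay before answering a HELLO with an OPEN.

    Spreading the responses over a window avoids all nodes sending
    their OPENs at the same instant after a mass restart.
    """
    return rng.uniform(0.0, max_jitter_s)


def paced_send_times(fragments, gap_s, start_s=0.0):
    """Hypothetical send schedule that spaces fragments gap_s apart
    instead of emitting them back to back as one burst."""
    return [start_s + i * gap_s for i in range(fragments)]


if __name__ == "__main__":
    rng = random.Random()
    print(hello_rate(6000))              # HELLOs per second for 6000 nodes
    print(open_delay(2.0, rng))          # some delay within a 2 s window
    print(paced_send_times(3, 0.5))      # one send time per fragment
```

Whether a random delay, pacing, or some other mechanism is the right choice is for the authors to decide; the point is only that the text should say something about it.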

Other notes:
Section 7 on the checksum needs more detail. It also talks about a "suggested" algorithm, but this should either be clearly mandated, or a way to choose one by means of configuration for a complete data centre would need to be made explicit. I also assume that the pseudo code on p.11 would benefit from a leading '0' in 0xffffffff -> 0x0ffffffff; otherwise, expansion to 64 bits might fill the high-order bits with '1's, which is clearly not intended.
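
The concern about the mask can be illustrated with a small sketch. It does not reproduce the draft's pseudo code. It assumes, for illustration only, that the constant is used to fold a 64-bit accumulator into 32 bits, and it shows what would happen if the mask's high-order bits ended up as ones. Whether a given C compiler actually widens the constant that way depends on the type of the literal; the names `MASK32`, `MASK_ALL_ONES`, and `fold32` are mine.

```python
# Intended mask: low 32 bits set, high 32 bits explicitly zero.
MASK32 = 0x0000_0000_FFFF_FFFF
# What the mask would be if widening filled the high bits with ones.
MASK_ALL_ONES = 0xFFFF_FFFF_FFFF_FFFF


def fold32(acc, mask=MASK32):
    """Fold a 64-bit accumulator into 32 bits (end-around carry).

    This is an assumed use of the mask, not the draft's algorithm.
    """
    acc &= 0xFFFF_FFFF_FFFF_FFFF  # model a 64-bit accumulator
    for _ in range(2):            # two folds suffice for 64 -> 32 bits
        acc = (acc >> 32) + (acc & mask)
    return acc


if __name__ == "__main__":
    print(hex(fold32(0x1_0000_0001)))                 # fits in 32 bits
    print(hex(fold32(0x1_0000_0001, MASK_ALL_ONES)))  # does not
```

With the intended mask the result always fits in 32 bits; with the all-ones mask the AND is a no-op and the high half is never removed.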

Section 11, p.17, second-to-last para ("If a properly authenticated..."): from the text, it is unclear what is meant by an "OPEN with the Serial Number of the last data received".

I am curious about the error code, providing 16 bits for additional explanation. Why not a text field?
I am also wondering whether repeated retries (due to failure, not lost packets) could yield fast repeated transmissions.
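
One way to bound such retries would be the same kind of exponential backoff the draft already foresees for retransmitting ACKed PDUs, with a cap. The sketch below is only an illustration of that idea; the function name and all parameter values are assumptions of mine and are not taken from the draft.

```python
def backoff_delays(initial_s=1.0, factor=2.0, cap_s=60.0, attempts=6):
    """Delays (in seconds) to wait before each successive retry.

    Each delay is the previous one multiplied by `factor`, never
    exceeding `cap_s`, so persistent failures cannot turn into a
    rapid stream of retransmissions.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, cap_s))
        delay *= factor
    return delays


if __name__ == "__main__":
    print(backoff_delays(initial_s=1.0, factor=2.0, cap_s=10.0, attempts=6))
```

With a 1-second initial delay, a factor of 2, and a 10-second cap, the waits grow as 1, 2, 4, 8 seconds and then stay at 10.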

Section 15: should the KEEPALIVE interval have suggested (lower) bounds?
At the top of p.26, it says "One per second is the default", whereas the bottom of the previous page refers to an inter-KEEPALIVE interval of ten seconds. I am not sure whether the two are the same, but I suppose so. If they are, the numbers should match. If they are not, some extra text is needed to explain the difference.

Nits:
There are two spellings of "Encapsulation", capitalised and lower case. Use one consistently.
p.10, first para: comprise -> comprising