A Common Operational Problem in DNS Servers: Failure to Communicate
RFC 8906

Note: This ballot was opened for revision 17 and is now closed.

Erik Kline Yes

Comment (2020-04-04 for -18)



* s/answers responses/responses/  (or answers)


* Is there a reference for a definition of "scrubbing service"?


* s/None the tests/None of the tests/

Warren Kumari Yes

Alvaro Retana No Objection

Benjamin Kaduk No Objection

Comment (2020-04-07 for -19)
Someone (maybe the RFC Editor) will end up tweaking a lot of commas.
I didn't try to list them all.

I didn't see a response to the secdir reviewer's question (though I'm
also not sure that there's an easy answer to it).

Section 1

   The existence of servers which fail to respond to queries results in
   developers being hesitant to deploy new standards.  Such servers need

nit: it feels a little like a juxtaposition to have "developers" that
"deploy" new standards (vs. "developers that implement" or "operators
that deploy").

   indication that the server is under attack.  Parent zone operators
   are advised to regularly check that the delegating NS records are
   consistent with those of the delegated zone and to correct them when
   they are not [RFC1034].  Doing this regularly should reduce the
   instances of broken delegations.

I can't tell if this 1034 reference is for the recommendation to
regularly check or the definition of "consistent" or something else; if
the recommendation is new, then would BCP 14 keywords be appropriate?

Section 2

   o  The AD flag bit in a response cannot be trusted to mean anything
      as some servers incorrectly copy the flag bit from the request to
      the response [RFC1035], [RFC4035].

Would it be worth a 6840 ref here as well (to catch setting AD in a
request, even though that's not exactly what's being mentioned)?

Section 3.1.2

(Do we want to remind the reader on the NOERROR vs. NXDOMAIN rules? "No"
is probably acceptable.  I see we do so later, in Section 7, so even a
forward reference might suffice.)

Where's the first reference/mention of Meta-RRs?  I see RFC 2929
(obsoleted, transitively, by 6895) that we cite for the "range reserved
for private use" but not for terminology.  Even RFC 8499 (which we don't
cite) only has "meta-RR" in a parenthetical in the description of OPT.

Section 3.1.5

micro-nit: I guess firewalls don't exactly count as "nameservers", which
seems to be the claimed scope for this document.

Section 3.2.1

This section threw me a bit, at first, as the 3.1.x had led me to expect
"nameservers should behave in this way", but this section is "here is
how to tell if a nameserver is misbehaving".  That's not necessarily a
problem, just a ... comment :)

Section 3.2.6

   Some nameservers fail to copy the DO bit to the response despite
   clearly supporting DNSSEC by returning an RRSIG records to EDNS
   queries with DO=1.

I'm not sure if we also want an explicit "nameservers should copy the
DO bit to the response when they support DNSSEC".
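
As an illustration only, the misbehaviour quoted above could be detected
roughly as follows. This is a minimal sketch, not text from the draft: the
function name and its inputs (the 16-bit EDNS flags field of the response
and a list of the RR type codes seen in the response) are hypothetical.

```python
DO_BIT = 0x8000   # DO flag: top bit of the 16-bit EDNS flags field (RFC 3225)
RRSIG = 46        # RR type code for RRSIG


def do_bit_missing(response_edns_flags, response_rr_types):
    """Return True when a reply to a DO=1 query carries RRSIG records
    but leaves the DO bit clear in its own OPT record.

    Both parameters are assumptions of this sketch: the caller is
    expected to have already parsed the response.
    """
    has_rrsig = RRSIG in response_rr_types
    do_set = bool(response_edns_flags & DO_BIT)
    return has_rrsig and not do_set
```

A server that returns signatures and also sets DO would not be flagged; a
server that returns no signatures is not judged by this check at all.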

Section 3.2.7

[similarly an affirmative statement of what nameservers should do might
be appropriate here.]

Section 4

   Firewalls and load balancers can affect the externally visible
   behaviour of a nameserver.  Tests for conformance should to be done
   from outside of any firewall so that the system is tested as a whole.

(These are conformance tests run by the nameserver's own operator, or
externally-driven tests, too?)

   However, there may be times when a nameserver mishandles messages
   with a particular flag, EDNS option, EDNS version field, opcode, type
   or class field or combination thereof to the point where the
   integrity of the nameserver is compromised.  Firewalls should offer
   the ability to selectively reject messages using an appropriately
   constructed response based on all these fields while awaiting a fix
   from the nameserver vendor.

I would suggest reiterating that this is "with a response" vs. "drop the
packet silently".

Section 5

   Ideally, Operators should run these tests against a packet scrubbing
   service to ensure that these tests are not seen as attack vectors.

It feels like maybe the most we can say here is "not seen as attack
vectors during normal operation".  We can't exclude the possibility that
some actor decides to generate a flood of messages that happens to match
the test behavior (whether by accident or design), which seems fairly
likely to lead to blocking of the test-behavior traffic as collateral damage.

Section 7

   If the server does not support EDNS at all, FORMERR is the expected
   error code.  That said a minimal EDNS server implementation requires
   parsing the OPT records and responding with an empty OPT record in
   the additional section in most cases.  There is no need to interpret
   any EDNS options present in the request as unsupported EDNS options
   are expected to be ignored [RFC6891].  Additionally EDNS flags can be
   ignored.  The only part of the OPT record that needs to be examined
   is the version field to determine if BADVERS needs to be sent or not.

It seems like there's an implied "so providing minimal EDNS support is
pretty trivial and you ought to do so already" in here; do we want to
make such sentiment explicit?
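
To make the "pretty trivial" point concrete, here is a rough sketch of the
decision the quoted paragraph describes: look only at the version field,
ignore options and flags, and answer with an empty OPT record. The function
names and the choice to model the OPT record by its 32-bit TTL field alone
are assumptions of this sketch, not anything specified in the draft.

```python
from typing import Tuple

BADVERS = 16            # extended RCODE from RFC 6891
SUPPORTED_VERSION = 0   # highest EDNS version this sketch implements


def parse_opt_ttl(ttl: int) -> Tuple[int, int, int]:
    """Split the 32-bit OPT TTL field into (ext_rcode, version, flags)."""
    ext_rcode = (ttl >> 24) & 0xFF
    version = (ttl >> 16) & 0xFF
    flags = ttl & 0xFFFF
    return ext_rcode, version, flags


def minimal_edns_reply(request_opt_ttl: int) -> Tuple[int, int]:
    """Return (rcode, reply_opt_ttl) for a request that carried an OPT record.

    EDNS options and flags in the request are deliberately ignored;
    only the version field is examined.
    """
    _, version, _ = parse_opt_ttl(request_opt_ttl)
    rcode = BADVERS if version > SUPPORTED_VERSION else 0
    # The upper 8 bits of the 12-bit RCODE travel in the OPT TTL field;
    # the lower 4 bits belong in the DNS header (not modelled here).
    reply_ttl = ((rcode >> 4) << 24) | (SUPPORTED_VERSION << 16)
    return rcode, reply_ttl
```

The reply OPT record carries no options (RDLEN of zero), which is why only
the TTL field needs to be computed in this sketch.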

Section 8

   Testing is divided into two sections.  "Basic DNS", which all servers
   should meet, and "Extended DNS", which should be met by all servers
   that support EDNS (a server is deemed to support EDNS if it gives a
   valid EDNS response to any EDNS query).  If a server does not support
   EDNS it should still respond to all the tests.

Is this "respond to all the tests, albeit with [error responses]"?

   The tests below use dig from BIND 9.11.0.

I guess this version could become important if some future version
starts setting a new flag by default (that would need to be suppressed
if that version of dig was used for many of these tests).

Section 8.1.2

   Ask for the TYPE1000 RRset at the configured zone's name.  This query
   is made with no DNS flag bits set and without EDNS.  TYPE1000 has
   been chosen for this purpose as IANA is unlikely to allocate this
   type in the near future and it is not in a range reserved for private
   use [RFC6895].  Any unallocated type code could be chosen for this

Is there a risk that since we document TYPE1000 like this some server
will implement "respond to TYPE1000" without implementing the actual
desired behavior?
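
For reference, the query described in the quoted text (TYPE1000, no flag
bits, no EDNS) is small enough to build by hand. The sketch below is
illustrative only; the draft itself uses dig, and the function name,
default message ID, and example zone name here are hypothetical.

```python
import struct


def build_plain_query(name, qtype, msg_id=0x1234):
    """Build a DNS query with no flag bits set and no OPT record."""
    # Header: ID, flags=0, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    # ARCOUNT=0 means no additional records, hence no EDNS.
    header = struct.pack("!HHHHHH", msg_id, 0, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    # QTYPE as given, QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", qtype, 1)


query = build_plain_query("example.com", 1000)
```

Any other unallocated type code could be passed as qtype, which is one way
for a tester to check that a server is not special-casing TYPE1000.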


   AD use in queries is defined in [RFC6840].

(Knowing this would have been helpful up in the toplevel section 8 where
we talk about one or both AD=1 and DO=1 being a signal to expect AD=1.)

Section 8.2.3, 8.2.6

[Same comment about option code 100 as for TYPE1000 above; the same
response is assumed.]

Section 9

   When notification is not effective at correcting problems with a
   misbehaving name server, parent operators can choose to remove NS
   record sets (and glue records below) that refer to the faulty server
   until the servers are fixed.  This should only be done as a last
   resort and with due consideration, as removal of a delegation can
   have unanticipated side effects.  [...]

I have mixed feelings about recommending "cut you off until you fix your
bugs" as an option, but not strongly enough to override WG consensus.

Martin Duke No Objection

Comment (2020-03-26 for -18)
Thanks for the draft. It's always good for congestion controls if congestion-based packet losses are disambiguated from other types.

A few nits:
- Section 1 has a number of acronyms without clear references (DANE, SPF, TLSA). Please define them on first use.

- Sec. 3.1.5. Please add a comma after "attempts"

- Sec 3.2.4 uses lower case versions of the normative keywords. Selecting a synonym would improve it.

Martin Vigoureux No Objection

Murray Kucherawy No Objection

Comment (2020-03-30 for -18)
* I tripped almost every time on saying "set FOO bit to 1" and similar because I'm used to "set" implying one and "not set" or "clear" implying zero.  In other places the prose does go with simply saying "FOO bit is set".  Maybe that's just me though; we'll see how my colleagues feel.

Section 1:
* Suggest including a reference to RFC4732 in the discussion of amplification attacks.

Section 2:
* In the discussion of abandoned transition to the SPF type, suggest a reference to RFC6686.

* "Widespread non-response to EDNS queries has lead to  ..." -- s/lead/led/
* "Widespread non-response to EDNS options, requires ..." -- remove comma
* "... requires recursive servers to have to decide ..." -- s/to have//
* "... being present, leads to ..." -- remove comma

Section 3.1.2:
A nit:
* "The exception to this are ..." -- either s/exception/exceptions/ or s/are/is/.

Section 3.1.5:
A nit:
* "While firewalls should not block TCP connection attempts if they do they should ..." -- suggest: "While firewalls should not block TCP connection attempts, those that do should ..."

Section 3.2.2:
More nits:
* "... version 0 queries but ... version numbers that are higher than zero." -- why the digit in one place but prose in the other?

Section 4:
* Paragraphs 3, 4, and 5 could be common factored very easily since most of the text is identical.

Section 5:
* I've never heard of a "scrubbing service".  Is there a reference RFC, or could we include a short definition?
* "One needs to take care when choosing a scrubbing service." -- This is vague.  What, apart from the prior sentence (whose implications I don't understand), should an operator be looking for?

Section 8:
* "Testing is divided into two sections." -- a list follows, so s/./:/

Section 9:
* The final paragraph suggests disconnection of broken nameservers.  This can have serious non-technical implications as well.  That might be worth mentioning.

* "Name server operators ..." -- s/Name server/Nameserver/, to be consistent with the rest of the document

Robert Wilton No Objection

Roman Danyliw No Objection

Comment (2020-04-07 for -20)
Thanks for this document – it allows for a very approachable way to verify conformance.

** Section 2. Per “Working around issues due to non-compliance with RFCs is not sustainable”, this seems like a bold statement.  What is the basis for it?

** Section 4.  This section repeats several times that firewalls should not drop DNS traffic with unknown parameters and that such traffic should not be construed as an attack.  In the general case with “normal clients”, this is good advice.  However, for certain highly controlled enclaves where a white-list-style approach to traffic is taken, this is not realistic.  The presence of unexpected classes of new DNS traffic would be a bad sign (e.g., of compromise, a new software load whose features were not understood, or a configuration which was not validated).

** Section 8.  For completeness, per “The tests below use dig from BIND 9.11.0”, please provide a reference.

** Section 8 dig examples.  It would be worth explaining $zone and $server.

** Section 10.  Per “Testing protocol compliance can potentially result in false reports of attempts to break services from Intrusion Detection Services and firewalls.”, thanks for calling this out.  I would recommend tuning this language:

-- s/break services/attack services/

-- to acknowledge that uncommon DNS protocol fields or traffic (from this test regime) might trigger anomaly-detection/profile-based IDS alerts too

** Editorial Nits:

-- Section 8. s/is know/is known/

Éric Vyncke No Objection

Comment (2020-04-08 for -20)
Thank you for the work put into this document. I also like the extensive test scenarios with 'dig' ;-)

To be honest, I was about to ballot a DISCUSS as I have some doubts whether the objective of removing non-compliant servers (end of section 2) is achievable... Too many non-compliant servers, probably managed by clueless people. But, hey, we can always try!

I also wonder why this document is a generic BCP when Section 8 and other parts read more like a guide to testing servers. I am balloting NO OBJECTION, but only after long hesitation over a DISCUSS.

Please address the nits found by Carlos during the INTDIR review:
https://mailarchive.ietf.org/arch/msg/int-dir/wfKo4vDmFJwPa1HeDY9wxP2JdEA (at least one nit is already addressed, thank you)

Please find below some non-blocking COMMENTs and NITs. An answer will be appreciated.

I hope that this helps to improve the document,



Generic: the objective of this document is a little unclear to me. Is it compliance testing/certification of specific DNS server software, or testing of actual DNS servers on the Internet?

-- Section 1 --
Suggest also adding middle-boxes dropping EDNS to the sentence "Due to the inability to distinguish between packet loss and nameservers dropping EDNS" (see section 4).

-- Section 4 --
Why limit the middle boxes to only firewalls and load balancers? There are many different types of middle-box (NAT, ...) also doing "packet massaging" on the fly... :-(

-- Section 10 --
The security considerations section is rather weak...

If the intent is to probe Internet servers, then why not add some text around 'do it only with prior agreement of the DNS server's operator'?

Also, if the firewall is "protecting" the DNS server (cough cough), then as a security officer I would prefer to block all unknown opcodes/types at the firewall (possibly with a reply).

== NITS ==

-- section 2 --
Please add an expansion or a reference for "AD flag bit" (and, to me as a non-native English speaker, it is a pleonasm).

(Alissa Cooper; former steering group member) No Objection

No Objection (for -20)

(Barry Leiba; former steering group member) No Objection

No Objection (2020-04-07 for -20)
Thanks for a BCP on this.

I agree with Ben about the commas.

For what it’s worth, I disagree with Martin’s comment about “should” and such: the document does not cite BCP 14, and I think that’s fine.

Some editorial stuff:

— Section 1 —

   there is still a pool of servers that don't respond to EDNS requests,
   clients have no way to know if the lack of response is due to packet
   loss, or EDNS packets not being supported,

I tripped on the meaning of “while” here, and I suggest changing it to “As long as there are still servers...”, so as to avoid the ambiguity.

— Section 2 —

   Some are caused directly from the non-compliant
   behaviour and others as a result of work-arounds

Make it “directly by”, not “from”.  And then “and others are as a result”.

   o  Widespread non-response to EDNS queries has lead to recursive

Make it “has led”.

      servers to have to decide whether to probe to see if it is the
      EDNS option or just EDNS that is causing the non response.

I would say, “the specific EDNS option or the use of EDNS in general”.

(Deborah Brungard; former steering group member) No Objection

No Objection (for -20)

(Magnus Westerlund; former steering group member) No Objection

No Objection (for -19)