Serving Stale Data to Improve DNS Resiliency
Note: This ballot was opened for revision 09 and is now closed.
(Suresh Krishnan) Yes
Comment (2019-12-04 for -09)
* Section 6: It might be useful to include a reference to DITL for some background on the dataset mentioned in this section: http://www.caida.org/projects/ditl/
(Barry Leiba) Yes
(Alexey Melnikov) Yes
(Deborah Brungard) No Objection
(Alissa Cooper) No Objection
Roman Danyliw No Objection
Comment (2019-12-03 for -09)
* I agree with Mirja, Section 8 is more informative than what is alluded to in the paragraph starting with “Several recursive resolvers …” in Section 3, and IMO is worth keeping. It struck me as odd to call out the operational practice of a particular vendor (Akamai). We might want to check if this reference is ok – Ben?
* A few reference nits:
  - Section 6. Per the mention of DNS-OARC, please provide a citation.
  - Section 6 and 9. The text references “during discussions in the IETF”. What is that specifically – WG deliberation?
* Thanks for covering the attacker use cases of stale data in Section 10.
Benjamin Kaduk No Objection
Comment (2019-12-04 for -09)
Thanks for this document; it's some good comprehensive discussion of the issues related to this topic and will improve the stability of the internet. I have several minor comments and a few side notes that are expected to lead to at most my own elucidation (but no textual changes).

Section 2

> For a comprehensive treatment of DNS terms, please see [RFC8499].

(side note: I myself would not use the word "comprehensive" when it explicitly says that "some DNS-related terms are interpreted quite differently by different DNS experts", but I understand why it is used here.)

Section 3

> There are a number of reasons why an authoritative server may become unreachable, including Denial of Service (DoS) attacks, network issues, and so on. If a recursive server is unable to contact the authoritative servers for a query but still has relevant data that

side note: the way this is worded might make a reader wonder if the recursive is expected to attempt to contact all known authoritatives before declaring failure.

> Several recursive resolver operators, including Akamai, currently use stale data for answers in some way. A number of recursive resolver

I did not follow the discussions that led to this wording, but one of my colleagues at Akamai suggested that "currently fall back to stale data for answers under some circumstances" might be a nicer wording, though I note that Adam has already proposed some text here as well, which is probably fine.

Section 4

> The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is amended to read:
>
> TTL a 32-bit unsigned integer number of seconds that specifies the duration that the resource record MAY be cached before the source of the information MUST again be consulted. Zero values are interpreted to mean that the RR can only be used for the transaction in progress, and should not be cached. Values SHOULD be capped on the orders of days to weeks, with a recommended cap of 604,800 seconds (seven days). If the data is unable to be authoritatively refreshed when the TTL expires, the record MAY be used as though it is unexpired. See the Section 5 and Section 6 sections for details.

I recommend using "[this document]" in the section references, since a reader reading the updated content in the context of RFC 1035 might look there instead of here.

Section 5

> The resolver then checks its cache for any unexpired records that satisfy the request and returns them if available. If it finds no relevant unexpired data and the Recursion Desired flag is not set in the request, it should immediately return the response without consulting the cache for expired records. Typically this response would be a referral to authoritative nameservers covering the zone, but the specifics are implementation-dependent.

side note: I'm slightly surprised that the semantics of the absence of Recursion Desired are not more tightly nailed down, but neither is it the role of this document to specify them.

> When no authorities are able to be reached during a resolution attempt, the resolver should attempt to refresh the delegation and restart the iterative lookup process with the remaining time on the query resolution timer. This resumption should be done only once during one resolution effort.

Is the "during one" more like a global cap or more like "during a given"?

Section 6

> The client response timer is another variable which deserves consideration. If this value is too short, there exists the risk that stale answers may be used even when the authoritative server is actually reachable but slow; this may result in sub-optimal answers being returned. Conversely, waiting too long will negatively impact user experience.

Not just sub-optimal but potentially even wrong or actively harmful answers, no?

> The balance for the failure recheck timer is responsiveness in detecting the renewed availability of authorities versus the extra resource use for resolution. If this variable is set too large, stale answers may continue to be returned even after the authoritative server is reachable; per [RFC2308], Section 7, this should be no more than five minutes. If this variable is too small, authoritative servers may be rapidly hit with a significant amount of traffic when they become reachable again.

I think part of the concern is also that setting the value too small will cause additional traffic towards the authoritative even while it is nonresponsive/nonreachable, which could aggravate any DoS attack ongoing against the authoritative. Which is to say, that perhaps "became reachable again" does not quite reflect the full set of considerations.

> Regarding the TTL to set on stale records in the response, historically TTLs of zero seconds have been problematic for some implementations, and negative values can't effectively be communicated to existing software. Other very short TTLs could lead to congestive collapse as TTL-respecting clients rapidly try to refresh. The recommended value of 30 seconds not only sidesteps those potential problems with no practical negative consequences, it also rate limits further queries from any client that honors the TTL, such as a forwarding resolver.

I wonder a little bit whether an RFC 8085 reference would make sense here, but that's not exactly my area of expertise.

> There's also no record of TTLs in the wild having the most significant bit set in DNS-OARC's "Day in the Life" samples. With no

Should we have a reference for DNS-OARC's samples?

> apparent reason for operators to use them intentionally, that leaves either errors or non-standard experiments as explanations as to why such TTLs might be encountered, with neither providing an obviously compelling reason as to why having the leading bit set should be treated differently from having any of the next eleven bits set and then capped per Section 4.

side note(?): This discussion, as roughly "we can't think of any reason why the change would be problematic", calls to mind the ongoing discussions of RFC (text) format changes, where arguments are being made for more-strict backwards/historical compatibility. That said, I have no reason to doubt the WG consensus position here, hence "side note".

Section 7

> Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records mingled in the expired cache with other records at the same owner name can cause surprising results. This was observed with an initial implementation in BIND when a hostname changed from having an IPv4 Address (A) record to a CNAME. The version of BIND being used did not evict other types in the cache when a CNAME was received, which in normal operations is not a significant issue. However, after both records expired and the authorities became unavailable, the fallback to stale answers returned the older A instead of the newer CNAME.

I'm not sure to what extent the lesson from this scenario is limited to "CNAME/DNAME are special" versus "when serving stale, serve the least-stale you have".

Section 8

> Details of Apple's implementation are not currently known.

I'm amenable to the other reviewer's comment that this section might be interesting to keep, RFC 6982 notwithstanding, in which case this might be more appropriately worded as "publicly disclosed" -- one assumes that the Apple employees that wrote it know what it does!

Section 10

> The most obvious security issue is the increased likelihood of DNSSEC validation failures when using stale data because signatures could be returned outside their validity period. Stale negative records can

We seem to be carefully not giving explicit guidance about using "stale" DNSSEC keys in addition to stale resolution records. If the consequences of potentially using expired key material are more severe than the consequences of potentially using expired DNS records (as it seems to me), perhaps we should explicitly reiterate that serve-stale is not an excuse to ignore key validity periods (as we are implicitly doing here)?

> In [CloudStrife], it was demonstrated how stale DNS data, namely hostnames pointing to addresses that are no longer in use by the owner of the name, can be used to co-opt security such as to get domain-validated certificates fraudulently issued to an attacker. While this document does not create a new vulnerability in this area, it does potentially enlarge the window in which such an attack could be made. A proposed mitigation is that certificate authorities should fully look up each name starting at the DNS root for every name lookup. Alternatively, CAs should use a resolver that is not serving stale data.

[I think Adam has probably already covered this one, but keeping just in case.] I note that the target of this guidance (CAs) is not obviously in the expected readership set for a document about DNS recursive resolver operational considerations. Can we do more to expand the visibility of this guidance to the audience where it would be most useful? (I don't see an obvious candidate for, e.g., an additional Updates: relationship, but perhaps someone has other ideas.)
(Mirja Kühlewind) No Objection
Comment (2019-12-02 for -09)
Two comments:

1) It seems to me that this sentence in section 7 should/could actually be phrased as a normative requirement in this document: "it is not necessary that every client request needs to trigger a new lookup flow in the presence of stale data, but rather that a good-faith effort has been recently made to refresh the stale data before it is delivered to any client." Maybe worth considering...

2) I find the Implementation Status section (8) actually quite interesting for this document and maybe it should be considered to keep it in the document for final publication.
Alvaro Retana No Objection
(Adam Roach) No Objection
Comment (2019-12-02 for -09)
Thanks to everyone who put work into documenting this useful and apparently well-deployed mechanism. I have a handful of comments on the current document.

---------------------------------------------------------------------------

§3:

> Several recursive resolver operators, including Akamai, currently use
> stale data for answers in some way.

This won't age well; and it's not clear why calling out Akamai amongst the various DNS service providers is warranted. Suggest:

  At the time of this document's publication, several recursive resolver operators use stale data for answers in some way

(If the notion of citing Akamai is to indicate the scale of such operators, I suggest "...operators, including large-scale operators, use stale...")

---------------------------------------------------------------------------

§4:

> The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
> amended to read:
>
> TTL a 32-bit unsigned integer number of seconds that specifies the
> duration that the resource record MAY be cached before the source
> of the information MUST again be consulted. Zero values are
> interpreted to mean that the RR can only be used for the
> transaction in progress, and should not be cached. Values SHOULD
> be capped on the orders of days to weeks, with a recommended cap
> of 604,800 seconds (seven days). If the data is unable to be
> authoritatively refreshed when the TTL expires, the record MAY be
> used as though it is unexpired. See the Section 5 and Section 6
> sections for details.

The addition of what I must presume is intended to be RFC 2119 language to a document that doesn't cite RFC 2119 seems questionable. I would suggest either explicitly adding RFC 2119 boilerplate to RFC 1035 as part of this update, or using plain English language to convey the same concepts as are intended.

Nit: "See the Section 5 and Section 6 sections for details" is a very awkward way to phrase the closing sentence.

More substantively: Sections 5 and 6 of RFC 1035 are "MASTER FILES" and "NAME SERVER IMPLEMENTATION" respectively. Is this final sentence intended to refer to those two sections? Or is it pointing to "Example Method" and "Implementation Considerations" of this document? If the latter, please specifically cite this document (e.g., "See Section 5 and Section 6 of [RFCXXXX] for details.")

---------------------------------------------------------------------------

§4:

> therefor leave any previous state intact. See Section 6 for a

Nit: "therefore"

---------------------------------------------------------------------------

§5:

> When a request is received by a recursive resolver, it should start
> the client response timer.

The passive tense in this sentence makes "it" linguistically ambiguous. Suggest: "When the recursive resolver receives a request, it should start..."

---------------------------------------------------------------------------

§10:

> A proposed mitigation is that certificate authorities
> should fully look up each name starting at the DNS root for every
> name lookup. Alternatively, CAs should use a resolver that is not
> serving stale data.

This seems like a perfectly good solution, although I wonder how many CAs are likely to read this document. If I were the type to engage in wagering, I'd put all of my money on "zero." I'm not sure specific action is called for prior to publication of this document as an RFC, but it seems that additional publicity of this issue and the way that serve-stale interacts with it -- e.g., to CAB Forum and its members -- is warranted.
Martin Vigoureux No Objection
Éric Vyncke No Objection
Comment (2019-12-01 for -09)
Thank you for the work put into this document. The short document is easy to read. Feel free to ignore the sentences below.

I loved the sentence "stale bread is better than no bread."; who said that I-Ds are boring? :-)

Should the assertion about DNS stale data by products (end of section 3) be documented by external documents? Somehow addressed in section 8 (to be removed...)

Finally, I am unsure whether it is worth documenting the WG discussion about EDNS.

Regards,

-éric
(Magnus Westerlund) No Objection
Warren Kumari Recuse
Comment (2019-11-18 for -09)
Recusing because I'm an author.