Hi Andreas,
Hi George,
Thank you for your quick reply. Could you please elaborate on these failure recheck measures in unbound?
I dug through the unbound code a bit, the iterator specifically, but my C is not good enough for me to claim that I understood very much.
The most common measures against failing queries have to do with
communication with the upstream nameservers. When timeouts occur, unbound
records that in the infra cache, which it also uses for the server
selection logic.
You can read more about that at
https://www.nlnetlabs.nl/documentation/unbound/info-timeout/
Unbound monitors the upstream servers' responsiveness, but it doesn't
monitor the "health" of individual queries. It caches negative answers
like NODATA and NXDOMAIN but not other errors like REFUSED.
I played around a bit with a zone of mine: I added it to the authoritative servers, removed it, and observed unbound's behavior. I could not see anything that would indicate failure recheck measures (at least not for REFUSED codes) in the way I would interpret the RFC.
The number of requests I performed against unbound was pretty much identical to the number of outgoing requests (times 6 for 3 authoritative servers with both IPv4 and IPv6).
From the description in the RFC I would have expected unbound to stop querying the authoritative servers for some time and only serve the stale data, at least with serve-expired-client-timeout set to 0. With a non-zero value for this option, always querying makes total sense.
REFUSED is a particular case, as also mentioned in the last paragraph of
section 6 of the RFC (https://tools.ietf.org/html/rfc8767#section-6).
For REFUSED, unbound does not do anything special; it is simply a
non-usable reply.
It tries to contact the next available nameserver. If all nameservers
return REFUSED (or no nameserver can give an answer), unbound will
return SERVFAIL to the client.
When the RFC behavior is enabled (serve-expired-client-timeout > 0), the
cache is going to be consulted for possible stale records instead of
returning SERVFAIL.
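
To make that flow concrete, here is a rough Python model of it. It only
illustrates the behavior described in this mail and is not unbound's
iterator code; `resolve`, `query_upstream` and `stale_cache` are names
made up for the example, and only REFUSED and timeouts are modelled as
non-usable replies.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model, not unbound's code: an upstream reply is (rcode, answer).
Reply = Tuple[str, Optional[str]]


def resolve(
    qname: str,
    nameservers: List[str],
    query_upstream: Callable[[str, str], Optional[Reply]],
    stale_cache: Dict[str, str],
    serve_expired_client_timeout: int = 0,
) -> Reply:
    """Try each nameserver in turn; REFUSED or no reply means 'try the next one'."""
    for ns in nameservers:
        reply = query_upstream(ns, qname)  # None models a timeout
        if reply is None or reply[0] == "REFUSED":
            continue  # non-usable reply, move on to the next nameserver
        return reply  # first usable reply wins

    # No nameserver gave a usable answer.
    if serve_expired_client_timeout > 0 and qname in stale_cache:
        # RFC 8767 mode: stale data is the last resort instead of SERVFAIL.
        return ("NOERROR", stale_cache[qname])
    return ("SERVFAIL", None)
```

With every nameserver answering REFUSED, this model returns SERVFAIL
when the timeout is 0 and the stale record when it is greater than 0,
which matches what you observed: the upstream servers are still queried
every time.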
I believe there may also be some confusion about unbound's serve-expired
behavior (note that I specifically used serve-expired instead of
serve-stale).
Unbound already had the serve-expired functionality before the first
draft was written.
Unbound's initial serve-expired behavior was to always try to reply from
cache (even if a record is expired) and then try to fetch an updated
record. Combined with the prefetch behavior, which prefetches (updates) a
record when it is within a percentage of its TTL, it keeps the cache
"almost" up-to-date for popular queries.
The result is an increased cache-hit ratio.
The drawback is that you may serve long-stale data for records that have
a lower TTL than the query interval from the clients and are subject to
frequent changes (e.g., short-lived A/AAAA records in a dynamically
hosted environment).
The serve-expired-ttl option could help with that.
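
As a rough illustration of how that limit interacts with the original
behavior, here is a small Python sketch. It is a simplified model and
not unbound's code; `CacheEntry` and `answer_from_cache` are made-up
names, and treating a serve-expired-ttl of 0 as "no limit" is my reading
of the unbound.conf man page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CacheEntry:
    answer: str
    expires_at: float  # absolute time (seconds) at which the TTL runs out


def answer_from_cache(
    entry: Optional[CacheEntry],
    now: float,
    serve_expired: bool = True,
    serve_expired_ttl: int = 0,
) -> Tuple[Optional[str], bool]:
    """Return (answer or None, refresh_needed) for one cache lookup."""
    if entry is None:
        return None, True           # cache miss: must resolve upstream
    if now < entry.expires_at:
        return entry.answer, False  # still fresh, nothing to do

    # The record is expired; check how long ago it expired.
    seconds_past_expiry = now - entry.expires_at
    # Assumption: 0 means "no limit on how long expired data may be served".
    within_limit = serve_expired_ttl == 0 or seconds_past_expiry <= serve_expired_ttl
    if serve_expired and within_limit:
        return entry.answer, True   # reply with expired data, then refresh
    return None, True               # too old to serve: resolve upstream first
```

In this model a record that expired 100 seconds ago is still served with
a limit of 3600, while one that expired several hours ago is not.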
That initial serve-expired behavior is still desired by operators and
still available (and the default) in unbound.
The new RFC behavior, which treats stale records as a last resort for
failing or slow queries, is also available and is enabled with
serve-expired-client-timeout > 0.
I hope that helps,
-- George