Hi Brian,
Our product uses the unbound DNS recursor as a simple forwarding
interface to remote DNS servers owned by the customer. In this case
there are two DNS servers in the customer network, and our
assumption is that unbound will choose the server based on RTT
(round-trip time).
It prefers the server with the lowest RTT, but it uses the other
servers if no answer can be obtained.
Recently, our customer had some issues with one of their DNS
servers (they were not specific), but from tcpdump output it
appears the DNS server responded to NAPTR requests very quickly (<1
msec) but had SERVFAIL (2) as the response code. The customer
claims the other DNS server did not have issues but was not chosen
(response took longer – maybe several msecs). The customer
complained that the other server should have been selected instead
of choosing the ‘bad’ server responses.
It selects the fast server first and tries it. When that does not
work, it falls back to the slower server and gets the answer.
The scenario that you sketch sounds implausible or incomplete,
particularly the RTT values suggested. Unbound has an RTT band of 400
msec. Thus, the slower server has to be more than 400 msec slower
than the fast server for unbound to not consider this server (right
away). If the fast server has a 5 msec RTT, the slow server would need
an RTT above 405 msec. That is very, very slow.
If servers fall into the RTT band, unbound picks a random server to
send to. Thus, unbound should under normal circumstances have picked
the 'slow' server fairly often, if this server had relatively normal
RTTs (<400 msec).
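
As a rough illustration of the band selection described above, here is
a minimal Python sketch. It is not unbound's actual code: the function
names and the dictionary of measured RTTs are hypothetical, and the
400 msec constant is simply the band width quoted above.

```python
import random

RTT_BAND_MS = 400  # band width quoted above; not read from unbound itself


def servers_in_band(rtt_ms, band_ms=RTT_BAND_MS):
    """Return the names of servers no more than band_ms slower than the fastest."""
    fastest = min(rtt_ms.values())
    return sorted(name for name, rtt in rtt_ms.items() if rtt - fastest <= band_ms)


def pick_server(rtt_ms, rng=random):
    """Pick one server uniformly at random from those inside the band."""
    return rng.choice(servers_in_band(rtt_ms))
```

With RTTs of 1 msec and 5 msec, both servers are inside the band and
each is picked about half of the time. With 5 msec and 500 msec, only
the fast server is a candidate for the first attempt.
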
Of course during this random selection, if the fast server responds
with SERVFAIL, unbound falls back to the slow server, after a couple
of queries, regardless of the RTT to that slow server. Unbound will
tolerate RTTs up to 2 minutes (!!) for that sort of case.
Was there something else going on? Did the slow server have other
problems (malformed responses, firewalled, RTT > 2 minutes)?
I have seen the discussion on how unbound selects which server to
use based on RTT but it seems like it is designed more for handling
network connectivity issues, timeouts and such. So what is the
expected behavior when DNS responses are received but have a
response code other than NOERROR (particularly SERVFAIL)? Are there
any documents or discussions on this? Are there any settings
(configurations) which would change behavior in this case?
It will retry and, once satisfied that the request cannot be resolved
on this server, attempt to query other servers. It continues until it
runs out of servers to query, and then unbound returns SERVFAIL.
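
The retry-then-fall-back behavior can be sketched roughly as follows.
This is a simplified illustration, not unbound's implementation:
`resolve` and `send_query` are hypothetical names, and the retry
count of 2 is an assumed stand-in for however many attempts unbound
actually makes before giving up on a server.

```python
def resolve(servers, send_query, tries_per_server=2):
    """Try servers in order; return (server, rcode) or (None, "SERVFAIL").

    send_query(server) is a hypothetical callable returning an rcode string.
    tries_per_server=2 is an assumption, not a value taken from unbound.
    """
    for server in servers:
        for _ in range(tries_per_server):
            rcode = send_query(server)
            if rcode != "SERVFAIL":
                # A usable answer: stop here and report which server gave it.
                return server, rcode
    # Every server was tried and none produced a usable answer.
    return None, "SERVFAIL"
```

In the scenario from the question, the fast server would be tried
first and keep answering SERVFAIL, after which the slow server is
queried and its answer is returned.
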
If the customer wants 'immediate query resolution' (why is msec
important anyway?), perhaps enable prefetch, so that answers for
common queries are in the cache (and perhaps increase the cache size)?
If you can also control the TTL, increasing the TTL of the answers
makes them more likely to get prefetch treatment (even at lower
traffic density).
Best regards,
Wouter