Unbound recursing and broken NS in RRSET

Sander_Smeenk · March 6, 2012, 2:50pm

Hi,

i'm running Unbound 1.4.6 on Linux for my recursing needs. It came to my
attention that this Unbound does not even answer(!) queries for domains
which have at least one malfunctioning NS in their NS RRSET.

In this case it's all about recursing 'archive.debian.org'.
Please keep in mind that debian.org is running DNSSEC enabled.
My Unbound is configured to do DNSSEC verification.

Unfortunately (:P) the situation seems all normal now, all listed
nameservers seem to be responding, making this issue a tad bit harder
to reproduce.

The domain 'debian.org' currently has four nameservers listed:

debian.org. 28800 IN NS ns1.debian.org.
debian.org. 28800 IN NS ns2.debian.org.
debian.org. 28800 IN NS ns3.debian.org.
debian.org. 28800 IN NS ns4.debian.com.

Subdomain 'archive.debian.org' has its own NS RRSET,
geo[123].debian.org, and these seem to work just fine.

From what i've seen, from time-to-time, ns4.debian.com seems not to
respond to queries which in turn makes recursing 'archive.debian.org'
(with no DNS cache) malfunction with Unbound like so:

[sanders@haze:~] % dig archive.debian.org
; <<>> DiG 9.8.1-P1 <<>> archive.debian.org
;; global options: +cmd
;; connection timed out; no servers could be reached

(i would expect SERVFAIL, at least)

At the same time a BIND9 server does not seem to have any real problems
recursing the query, it just takes a little longer for the answer to
appear as it seems to skip over the not-responding host.

I found that after the negative cache TTL expires, sometimes Unbound *is*
able to resolve the domain. This all seems to depend on which NS is first
in the RRSET returned for 'debian.org'.

Friends on IRC comment that this behaviour (broken recursing with one
malfunctioning nameserver in a larger RRSET) is seen more and more,
also across different recursors...

I skimmed through RFCs 1912, 2182, 1034 and 1035 but could not really
find the proposed way to handle situations like the above.

Could someone please comment on this?

-Sndr.

Wouter · March 6, 2012, 3:56pm

Hi Sander,

> Hi,
>
> i'm running Unbound 1.4.6 on Linux for my recursing needs. It came
> to my

Please update to 1.4.16. Any of the fixes since 1.4.6 could fix your
problem. There have also been fixes for reported (DoS) vulnerabilities.

> attention that this Unbound does not even answer(!) queries for
> domains which have at least one malfunctioning NS in their NS
> RRSET.

What sort of malfunctioning are you talking about? Downtime, I presume,
from the text below.

> In this case it's all about recursing 'archive.debian.org'. Please
> keep in mind that debian.org is running DNSSEC enabled. My Unbound
> is configured to do DNSSEC verification.
>
> Unfortunately (:P) the situation seems all normal now, all listed
> nameservers seem to be responding, making this issue a tad bit
> harder to reproduce.

> The domain 'debian.org' currently has four nameservers listed:
>
> debian.org. 28800 IN NS ns1.debian.org.
> debian.org. 28800 IN NS ns2.debian.org.
> debian.org. 28800 IN NS ns3.debian.org.
> debian.org. 28800 IN NS ns4.debian.com.

> Subdomain 'archive.debian.org' has its own NS RRSET,
> geo[123].debian.org, and these seem to work just fine.
>
> From what i've seen, from time-to-time, ns4.debian.com seems not
> to respond to queries which in turn makes recursing
> 'archive.debian.org' (with no DNS cache) malfunction with Unbound
> like so:

> [sanders@haze:~] % dig archive.debian.org
> ; <<>> DiG 9.8.1-P1 <<>> archive.debian.org
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> (i would expect SERVFAIL, at least)

That is strange, unbound should be able to fail over to ns1, ns2 and
ns3 and get the results there. This failover should happen within a
fraction of a second. It should even remember this and favor the
other nameservers afterwards.

> At the same time a BIND9 server does not seem to have any real
> problems recursing the query, it just takes a little longer for the
> answer to appear as it seems to skip over the not-responding host.

As far as I understand, unbound should behave similarly.

> I found that after the negative cache TTL expires, sometimes Unbound
> *is* able to resolve the domain. This all seems to depend on which
> NS is first in the RRSET returned for 'debian.org'.

No, unbound picks the servers randomly. The order in the NS set is
irrelevant.

(it picks them randomly to get extra randomisation out of it; as a
defense against Kaminsky things).
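
To illustrate the behaviour described above (random selection, failover, and remembering which servers did not answer), here is a minimal Python sketch. It is not Unbound's actual code: the function name, the `send_query` callable and the `unresponsive` set are assumptions made up for this example, and a real resolver also weighs round-trip times and backs off gradually.

```python
import random

def resolve_with_failover(servers, send_query, unresponsive=None):
    """Try nameservers in random order and return (server, answer).

    send_query(server) is a hypothetical callable that returns an answer
    or raises TimeoutError. `unresponsive` is a set kept across calls so
    that servers which timed out earlier are tried last.
    """
    if unresponsive is None:
        unresponsive = set()
    # Shuffle, so the order of the NS RRset does not matter.
    candidates = list(servers)
    random.shuffle(candidates)
    # Stable sort: servers already known to be unresponsive go to the back.
    candidates.sort(key=lambda s: s in unresponsive)
    for server in candidates:
        try:
            answer = send_query(server)
        except TimeoutError:
            unresponsive.add(server)
            continue
        unresponsive.discard(server)
        return server, answer
    raise RuntimeError("no nameserver responded")

# Stand-in for the situation in this thread: one of four servers is down.
NS = ["ns1.debian.org", "ns2.debian.org", "ns3.debian.org", "ns4.debian.com"]

def fake_query(server):
    if server == "ns4.debian.com":
        raise TimeoutError(server)
    return "answer from " + server

if __name__ == "__main__":
    print(resolve_with_failover(NS, fake_query))
```

With this model a single dead server in the set only costs one timeout at most, and never changes whether an answer is returned.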

> Friends on IRC comment that this behaviour (broken recursing with
> one malfunctioning nameserver in a larger RRSET) is seen more and
> more, also across different recursors...

If you give me bug reports then I can (try to) fix it (for the
unbound nameserver, that is).

> I skimmed through RFCs 1912, 2182, 1034 and 1035 but could not
> really find the proposed way to handle situations like the above.

It should try the other nameservers.

Can you enable verbosity at a higher level (say 4) and give me the
resulting log? This can be done at runtime with 'unbound-control verbosity 4'.

> Could someone please comment on this?
>
> -Sndr.

Best regards,
Wouter

Sander_Smeenk · March 17, 2012, 12:43am

Quoting W.C.A. Wijngaards (wouter@nlnetlabs.nl):

> > [..] Unbound does not even answer(!) queries for domains which have
> > at least one malfunctioning NS in their NS RRSET.
>
> What sort of malfunctioning are you talking about?

This reply took some time because of other urgent matters to attend to
at work, but i owe you and this "thread" an update on this issue after
(wrongfully) accusing Unbound as quoted above.

We managed to figure out what Unbound was doing and also fixed this
issue, which turned out to be a strict UDP-fragment filter 'protecting'
our DNS vlan.

This filter dropped UDP *fragments* but it let the *first* packet of the
fragmented flow go through. This caused a lot of retransmissions, each
failing in the same way: the first packet was received, the rest were
dropped. Retransmits. Rinse, repeat.
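
For anyone wondering why such a filter only bites on some answers, here is a rough Python model of the failure. It is only a sketch under stated assumptions: the function names are invented for this example, offsets are counted in plain bytes, and 1480 bytes per fragment assumes a 1500-byte MTU minus a 20-byte IPv4 header. Small answers fit in one packet and pass untouched; larger ones (as DNSSEC-signed answers can be) are split, lose everything after the first fragment, and can never be reassembled, no matter how often they are retransmitted.

```python
def fragment(payload: bytes, per_fragment: int = 1480):
    """Split a datagram into (offset, more_fragments, data) tuples."""
    frags = []
    for off in range(0, len(payload), per_fragment):
        chunk = payload[off:off + per_fragment]
        more = off + per_fragment < len(payload)
        frags.append((off, more, chunk))
    return frags or [(0, False, b"")]

def strict_filter(frags):
    """Model of the filter: pass packets at offset 0, drop later fragments."""
    return [f for f in frags if f[0] == 0]

def reassemble(frags):
    """Return the full datagram, or None if any fragment is missing."""
    data, expected, more = b"", 0, True
    for off, more, chunk in sorted(frags):
        if off != expected:
            return None  # hole in the middle
        data += chunk
        expected += len(chunk)
    # If the last fragment seen still says "more follows", the tail is missing.
    return None if more else data

if __name__ == "__main__":
    small = b"x" * 512    # fits in one packet
    large = b"x" * 3000   # needs three fragments
    print(reassemble(strict_filter(fragment(small))) == small)
    print(reassemble(strict_filter(fragment(large))) is None)
```

Every retransmission of the large answer goes through the same filter, so each attempt fails identically, which matches the retry loop described above.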

In fact these retransmissions took so long that dig and other stub
resolvers timed out on the query and indicated SERVFAIL.

Also, the broken nameserver for the debian.org zone was totally
unrelated to any of this as you explained in your reply.
The nameserver was fixed a few days after i sent my message.

The strange part was that, with the UDP filter in place, retrying the
same query to Unbound a few times eventually *did* turn up the correct
answer. My theory is that Unbound switches to TCP after $so_many
retransmits / failures via UDP.

I have learned to use DNS-OARC's Reply Size test[1] /before/ annoying
mailing list subscribers. Sorry.

With regards,
-Sander.
[1] https://www.dns-oarc.net/oarc/services/replysizetest