Cascading Unbound and automatic key update

Hello,

we have an internal Unbound cache that uses a second Unbound instance at the border firewall to do DNS resolution with DNSSEC enabled. Today our internal Unbound stopped working with errors like this:

Jan 10 14:33:53 mailer unbound: [27958:0] info: validation failure <www.at-web.de. A IN>: no DNSSEC records from x.x.x.x for DS at-web.de. while building chain of trust
Jan 10 14:33:53 mailer unbound: [27958:0] info: validation failure <www.heise.de. A IN>: no DNSSEC records from x.x.x.x for DS heise.de. while building chain of trust

The instance at the border firewall has no errors in its log and works fine all the time. After restarting the internal instance, it also works fine again. The auto-trust-anchor-file of the internal instance has a timestamp from the restart of the instance, so I suspect something went wrong with the update of this file, but I have no clue why the restart cured it.

Both instances run Unbound version 1.4.14 with auto-trust-anchor enabled. The forwarding from the internal to the firewall instance is done this way:

forward-zone:
    name: "."
    forward-addr: x.x.x.x

What can we do to debug this problem and prevent it from happening again?

Thanks

Andreas

Hi Andreas,

> we have an internal Unbound cache that uses a second Unbound instance
> at the border firewall to do DNS resolution with DNSSEC enabled.
> Today our internal Unbound stopped working with errors like this:
>
> Jan 10 14:33:53 mailer unbound: [27958:0] info: validation failure
> <www.at-web.de. A IN>: no DNSSEC records from x.x.x.x for DS
> at-web.de. while building chain of trust
> Jan 10 14:33:53 mailer unbound: [27958:0] info: validation failure
> <www.heise.de. A IN>: no DNSSEC records from x.x.x.x for DS
> heise.de. while building chain of trust

So, what it looked like for this server was that

    dig @x.x.x.x DS heise.de +dnssec +norec +cdflag

did not return any DNSSEC data.

As if there were fragmentation problems. And since it is internal: are
there extra firewalls or routers where that sort of thing could occur?

> The instance at the border firewall has no errors in its log and
> works fine all the time. After restarting the internal instance, it
> also works fine again. The auto-trust-anchor-file of the internal
> instance has a timestamp from the restart of the instance, so I
> suspect something went wrong with the update of this file, but I
> have no clue why the restart cured it.

No, the timestamp was probably written right when you restarted it,
because the file is written when the root DNSKEY is seen. When you
restart, the cache is empty and Unbound fetches the root DNSKEY, and
thus updates the file to note that it saw the root key.
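For reference, the mechanism that maintains this file is the RFC 5011
trust anchor tracking; a minimal sketch of the relevant unbound.conf
stanza (the file path here is only an example) looks like this:

```
server:
    # Unbound probes the root DNSKEY and rewrites this file,
    # timestamps included, whenever it sees the key, e.g. right
    # after a restart with an empty cache.
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
```

The file must be writable by the user Unbound runs as, since the
daemon itself keeps it up to date.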

> Both instances are Unbound version 1.4.14 with auto-trust-anchor
> enabled. The forwarding from internal to firewall instance is done
> this way:
>
> forward-zone:
>     name: "."
>     forward-addr: x.x.x.x

This looks fine.

> What can we do to debug this problem and prevent it from happening
> again?

There is something happening with UDP. There seems to be nothing wrong
with the key files. The error is that somehow Unbound gets no DNSSEC
data (EDNS backoff, or messages arriving 'stripped' of DNSSEC data).
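To debug this, one could raise the validator's logging and pin down
the EDNS behaviour; a sketch of unbound.conf options for the internal
instance (the values are only suggestions):

```
server:
    # log the reason for each individual validation failure
    verbosity: 1
    val-log-level: 2
    # if fragmentation is suspected, a smaller advertised EDNS
    # buffer keeps answers in unfragmented UDP packets
    edns-buffer-size: 1480
```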

Best regards,
   Wouter

Quoting "W.C.A. Wijngaards" <wouter@nlnetlabs.nl>:


> So, what it looked like for this server was that dig @x.x.x.x DS
> heise.de +dnssec +norec +cdflag did not return any DNSSEC data.

The man page of my dig version does not know "+norec", and the above command led to status REFUSED. Without the "+norec" I got the following, which looks sane to me:

; <<>> DiG 9.7.0-P1 <<>> @x.x.x.x heise.de DS +dnssec +cdflag
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13864
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;heise.de. IN DS

;; AUTHORITY SECTION:
H319DM5GC3EDEK691VQBHEHOT7VGGJ2B.de. 7136 IN NSEC3 1 1 15 BA5EBA11 H31BIK9NA0MJD5K06JE5H9BBFDBD56DB NS SOA NAPTR RRSIG DNSKEY NSEC3PARAM
H319DM5GC3EDEK691VQBHEHOT7VGGJ2B.de. 7136 IN RRSIG NSEC3 8 2 7200 20120117131500 20120110131500 30565 de. fd08T4Fapf6tVVOA2VmYXceBTUS5Ckjz8iqdBttzt4DgAq2e8bI4l/aE wHgXBl2P+CEq6m5H7d4X6WHXvoi+mWYof4LYb1cSW2l212kJ/jT4M6q4 QMYrcocKZaFzKg/X4fZwD1ma0RQ7q8Mx09heV25TlZwxSBjbpRUQv4Ez /0U=
de. 7136 IN SOA f.nic.de. its.denic.de. 2012011062 7200 7200 3600000 7200
de. 7136 IN RRSIG SOA 8 1 86400 20120117131500 20120110131500 30565 de. R5N20le84Cacq8mtIwKifWifIOgJN2tWULiJU/DGDxsBQPiqYkM9zec7 dfgfs8XQbUx3Kkymsuo7sdanAQVld7ieew+aVP9yhgZdc18cmuk4hYBB 1X1Sb8X249kv6xxR/D87pl57g86HW3OzG2pFhV+pjt5IWNUGvBCiiQkQ HUU=
UMUKTKOLDUUT050M28LQE3R399Q894KV.de. 6942 IN NSEC3 1 1 15 BA5EBA11 UMUPU1E8C10ANEOEMVVG217UL77BN1H8 A RRSIG
UMUKTKOLDUUT050M28LQE3R399Q894KV.de. 6942 IN RRSIG NSEC3 8 2 7200 20120117131500 20120110131500 30565 de. Gf4tjJyx6WwHi8tyX7UwkI2CYoyA0I3Jyjv9zqo7o/kmm9ztleOZZSFG y5DzFihl4vyvSVu6ZSmeMHjy1dniIMmvIPMOsWGK120vp/LGYjc0r+J+ KsJsqb8F6bimi6EPy4Q80/Pc2UsOpoYToOawLCqHjMHE7mn76HpPJyXK oX8=

;; Query time: 0 msec
;; SERVER: x.x.x.x#53(x.x.x.x)
;; WHEN: Tue Jan 10 16:23:29 2012
;; MSG SIZE rcvd: 742

> As if there were fragmentation problems. And since it is internal:
> are there extra firewalls or routers where that sort of thing could
> occur?

There is nothing between the two machines besides a switch. Both machines run iptables, but it is configured to let UDP/TCP port 53 pass. There are no iptables log entries from this time either.


> No, the timestamp was probably written right when you restarted it,
> because the file is written when the root DNSKEY is seen. When you
> restart, the cache is empty and Unbound fetches the root DNSKEY,
> and thus updates the file to note that it saw the root key.

That is what strikes me as odd. The internal Unbound instance was not able to fetch the key some minutes earlier, but on restart it was able to do so without problems. As .com domains were also affected, I suspect that it wasn't the key for heise.de that failed; it was simply the first to fail.


> There is something happening with UDP. There seems to be nothing
> wrong with the key files. The error is that somehow Unbound gets no
> DNSSEC data (EDNS backoff, or messages arriving 'stripped' of
> DNSSEC data).

As said, there is basically only a wire between the two. Will the keys be cached by Unbound, by the way? As mentioned, the external Unbound does not have any problem at all, while the internal one delivers only errors. Is it possible that the problem arises from DNS data delivered by external name servers?
It is very inconvenient when the central resolver cache stops working...

Thanks

Andreas

Hi Andreas,

Now I see this is a forward zone, so +norec gets no answer, because
x.x.x.x is a recursive cache. Somehow this cache has trouble
returning DNSSEC-enabled data (once in a while? A load balancer?)

Best regards,
   Wouter

Quoting "W.C.A. Wijngaards" <wouter@nlnetlabs.nl>:


> There is something happening with UDP. There seems to be nothing
> wrong with the key files. The error is that somehow Unbound gets no
> DNSSEC data (EDNS backoff, or messages arriving 'stripped' of
> DNSSEC data).

Would it be smart to set

module-config: "iterator"

on the internal Unbound cache, and do I need to set "ignore-cd-flag:
yes" at the firewall to let the border Unbound do all the validation?

That should prevent this kind of error in any case, no?
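The split described here might look like this in the two unbound.conf
files (only a sketch; whether it is the right trade-off is exactly
the question):

```
# ---- internal instance: cache and iterate only, no validation ----
server:
    module-config: "iterator"

# ---- border firewall instance: do the validating ----
# validate even for queries that arrive with the CD flag set, so a
# downstream non-validating cache still gets checked data
server:
    ignore-cd-flag: yes
```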

Many Thanks

Andreas

Quoting "W.C.A. Wijngaards" <wouter@nlnetlabs.nl>:


> Now I see this is a forward zone, so +norec gets no answer, because
> x.x.x.x is a recursive cache. Somehow this cache has trouble
> returning DNSSEC-enabled data (once in a while? A load balancer?)

Hello,

no, it is a simple two-stage Unbound cascade. The forwarder also acts as resolver cache for the DMZ mail server and, as said, had no problem resolving names during the whole outage of the internal Unbound cache. During the outage I was also able to query the forwarder from the machine running the internal cache without problems, but I only tested simple A/MX queries. I guess it will be best to dumb down the internal instance to a cache only and let the firewall do the work, no?

Many Thanks

Andreas