Problem resolving domains from a certain registrar

Dear all,

For the past month our company has been running unbound-1.4.8-1 on two RHEL 6 servers as caching, resolving servers with IPv6 and DNSSEC enabled. These two servers handle all our DNS traffic, generated by all our customers (2x 5 Mbps peak traffic). They work as standalone servers, with no complicated network components (load balancers, ...) around them.

At the beginning we had the option use-caps-for-id enabled, but after complaints from customers that certain domains were reachable everywhere in the world except through us, we decided to disable it.

Currently we face the following rather strange problem:
Under normal working conditions, 70-90% of the time our two production servers cannot resolve domains that are registered at register.be and hosted on the three authoritative name servers ns1.register.be, ns2.register.be and ns3.register.be (examples: leonidas.be, estates.lu); they return SERVFAIL. register.be itself resolves all the time. Occasionally resolution works correctly for a brief period. Even though it was not easy with thousands of packets per second passing through, I managed to trace the packets the server sends to the authoritative servers, and it does get correct answers back.
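To put a number on the failure rate, a loop along these lines can be used (a sketch only; "resolv1" stands for the resolver's address, and the awk line assumes dig's usual ";; ->>HEADER<<-" output format):

```shell
# Query the resolver repeatedly and count SERVFAIL answers.
# "resolv1" is a placeholder for the resolver being tested.
servfail=0; total=20
for i in $(seq 1 $total); do
    # Extract the status field from dig's ";; ->>HEADER<<-" line.
    status=$(dig +tries=1 +time=3 leonidas.be A @resolv1 \
        | awk '/->>HEADER<<-/ { sub(/,$/, "", $6); print $6 }')
    [ "$status" = "SERVFAIL" ] && servfail=$((servfail + 1))
done
echo "$servfail/$total queries returned SERVFAIL"
```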

I installed unbound 1.4.8 with the same configuration file (see attachment) on a desktop machine, and there was no issue: all lookups of domains at register.be were immediate and correct.

As customers continued to complain, I was forced to take one server out of production and replace it with BIND, which works correctly. So now I have one unbound server with the problem and one BIND server that works fine in production. The formerly faulty unbound server, now out of production, currently answers all tests correctly (no restart, no reboot, just the IP address switched).

Does anybody have an idea how I can solve this problem? Should I provide more technical information? Do you have further tests to suggest?

Kind regards, and thank you for any advice.

Leo Bush

(attachments)

unbound.conf (1.53 KB)

Quoting Leo Bush <leo.bush@mylife.lu>:

Looks to me like an EDNS problem. At least part of the .be zone is DNSSEC signed, and the replies get bigger than 512 bytes, e.g. with "dig x.dns.be A +dnssec". BIND has a feature to reduce the EDNS size when it runs into trouble; I am not sure whether Unbound does the same. What you should check:
- Do the troublesome names resolve with unbound if you use checking disabled (+cdflag)?
- Do you have any firewall device in front of your resolvers, perhaps a Cisco box inspecting DNS traffic?
- Have you disabled TCP in Unbound?

For some hints on the problem have a look here:
https://www.dns-oarc.net/oarc/services/replysizetest

Regards

Andreas

Add "edns-buffer-size: 512" to unbound.conf.

If that fixes your problems, check which device filters UDP packets > 512 bytes.
Cisco firewalls are well known to do such things ...
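In unbound.conf terms, the test change would look like this (a diagnostic setting only, not a recommended permanent value):

```conf
server:
    # Force small EDNS buffers so no UDP answer exceeds 512 bytes;
    # larger answers then fall back to TCP (do-tcp must stay enabled).
    edns-buffer-size: 512
    do-tcp: yes
```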


Hi,

Thank you for helping with my case. Here are my answers.
- I have no firewall or other device inspecting the traffic in front of the box, only packet filtering with iptables.
- In the config file I have:
         # Enable TCP, "yes" or "no".
         # do-tcp: yes
         # edns-buffer-size: 4096
   So I assume that TCP is enabled by default.

Following your suggestions, I tried:

(initial settings)
# dig leos.leonidas.be @resolv1 +cdflag

; <<>> DiG 9.3.4-P1 <<>> leos.leonidas.be @resolv1 +cdflag
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27603
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;leoS.leonidas.be. IN A

;; Query time: 14 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 14:35:38 2011
;; MSG SIZE rcvd: 34

(initial settings)
# dig leos.leonidas.be @resolv1 +cdflag +tcp

; <<>> DiG 9.3.4-P1 <<>> leos.leonidas.be @resolv1 +cdflag +tcp
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27736
;; flags: qr rd ra cd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;leos.leonidas.be. IN A

;; Query time: 9 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 14:35:53 2011
;; MSG SIZE rcvd: 34

(initial settings)
# dig @resolv1 rs.dns-oarc.net txt

; <<>> DiG 9.3.4-P1 <<>> @resolv1 rs.dns-oarc.net txt
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35701
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;rs.dns-oarc.net. IN TXT

;; ANSWER SECTION:
rs.dns-oarc.net. 60 IN CNAME rst.x3827.rs.dns-oarc.net.
rst.x3827.rs.dns-oarc.net. 59 IN CNAME rst.x3837.x3827.rs.dns-oarc.net.
rst.x3837.x3827.rs.dns-oarc.net. 58 IN CNAME rst.x3843.x3837.x3827.rs.dns-oarc.net.
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "194.154.192.101 DNS reply size limit is at least 3843"
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "194.154.192.101 sent EDNS buffer size 4096"
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "Tested at 2011-08-24 12:38:52 UTC"

;; AUTHORITY SECTION:
x3843.x3837.x3827.rs.dns-oarc.net. 57 IN NS ns00.x3843.x3837.x3827.rs.dns-oarc.net.

;; ADDITIONAL SECTION:
ns00.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN A 149.20.58.136

;; Query time: 5972 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 14:38:52 2011
;; MSG SIZE rcvd: 307

Then I changed the following two settings:
     do-tcp: yes
     edns-buffer-size: 512

I restarted the unbound daemon and immediately found the following messages in the log:
Aug 24 15:28:57 resolv5 unbound: [10817:1] error: mem error generating DNSKEY request
Aug 24 15:28:57 resolv5 unbound: [10817:1] error: Could not generate request: out of memory
Aug 24 15:28:57 resolv5 unbound: [10817:1] error: mem error generating DNSKEY request
Aug 24 15:28:57 resolv5 unbound: [10817:1] error: Could not generate request: out of memory

I repeated my tests from before:

# dig @resolv1 leos.leonidas.be

; <<>> DiG 9.3.4-P1 <<>> @resolv1 leos.leonidas.be
; (2 servers found)
;; global options: printcmd
;; connection timed out; no servers could be reached

One minute later:

# dig @resolv1 leos.leonidas.be +nodnssec

; <<>> DiG 9.3.4-P1 <<>> @resolv1 leos.leonidas.be +nodnssec
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65189
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;leos.leonidas.be. IN A

;; ANSWER SECTION:
leos.leonidas.be. 3600 IN A 81.246.74.153

;; Query time: 56 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 15:46:49 2011
;; MSG SIZE rcvd: 50

# dig @resolv1 leos.leonidas.be

; <<>> DiG 9.3.4-P1 <<>> @resolv1 leos.leonidas.be
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8193
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;leos.leonidas.be. IN A

;; ANSWER SECTION:
leos.leonidas.be. 2834 IN A 81.246.74.153

;; Query time: 5 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 15:59:35 2011
;; MSG SIZE rcvd: 50

# dig @resolv1 leos.leonidas.be +dnssec

; <<>> DiG 9.3.4-P1 <<>> @resolv1 leos.leonidas.be +dnssec
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26318
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 512
;; QUESTION SECTION:
;leos.leonidas.be. IN A

;; ANSWER SECTION:
leos.leonidas.be. 2825 IN A 81.246.74.153

;; Query time: 16 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 15:59:44 2011
;; MSG SIZE rcvd: 61

# dig @resolv1 rs.dns-oarc.net txt

; <<>> DiG 9.3.4-P1 <<>> @resolv1 rs.dns-oarc.net txt
; (2 servers found)
;; global options: printcmd
;; connection timed out; no servers could be reached

As in the meantime my Cacti monitoring was reporting lots of dropped packets, and the server felt slower to me (a subjective impression), I reverted to the initial settings.
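To see whether the drops Cacti reports are kernel-level UDP receive drops (which would fit the so-rcvbuf settings in my config), the UDP counters in /proc/net/snmp can be checked; a Linux-specific sketch:

```shell
# Print the UDP InErrors counter from /proc/net/snmp.
# The first "Udp:" line holds the column names, the second the values.
awk '/^Udp:/ {
    if (!header) { for (i = 1; i <= NF; i++) col[$i] = i; header = 1 }
    else print "UDP InErrors:", $col["InErrors"]
}' /proc/net/snmp
```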

# dig @resolv1 leos.leonidas.be

; <<>> DiG 9.3.4-P1 <<>> @resolv1 leos.leonidas.be
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 51586
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;leos.leonidas.be. IN A

;; Query time: 10 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 16:06:58 2011
;; MSG SIZE rcvd: 34

# dig @resolv1 rs.dns-oarc.net txt

; <<>> DiG 9.3.4-P1 <<>> @resolv1 rs.dns-oarc.net txt
; (2 servers found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9723
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;rs.dns-oarc.net. IN TXT

;; ANSWER SECTION:
rs.dns-oarc.net. 60 IN CNAME rst.x3827.rs.dns-oarc.net.
rst.x3827.rs.dns-oarc.net. 59 IN CNAME rst.x3837.x3827.rs.dns-oarc.net.
rst.x3837.x3827.rs.dns-oarc.net. 58 IN CNAME rst.x3843.x3837.x3827.rs.dns-oarc.net.
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "xxxxx DNS reply size limit is at least 3843"
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "xxxxx sent EDNS buffer size 4096"
rst.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN TXT "Tested at 2011-08-24 14:07:15 UTC"

;; AUTHORITY SECTION:
x3843.x3837.x3827.rs.dns-oarc.net. 57 IN NS ns00.x3843.x3837.x3827.rs.dns-oarc.net.

;; ADDITIONAL SECTION:
ns00.x3843.x3837.x3827.rs.dns-oarc.net. 57 IN A 149.20.58.136

;; Query time: 1073 msec
;; SERVER: xxxxx#53(xxxxx)
;; WHEN: Wed Aug 24 16:07:15 2011
;; MSG SIZE rcvd: 307

kind regards

Leo Bush

Quoting Leo Bush <leo.bush@mylife.lu>:

> I restarted the unbound daemon and immediately found the following messages in the log:
> Aug 24 15:28:57 resolv5 unbound: [10817:1] error: mem error generating DNSKEY request
> Aug 24 15:28:57 resolv5 unbound: [10817:1] error: Could not generate request: out of memory

This doesn't look good in any case. Are you low on memory? What do the other unbound settings look like?

> # dig @resolv1 rs.dns-oarc.net txt
> ;; connection timed out; no servers could be reached

There was lately an issue where priming the root with DNSSEC took very long in some cases...
What are the settings for your trusted keys, and do you use IPv6?

Regards

Andreas

Hello

> This doesn't look good in any case. Are you low on memory? What do the other unbound settings look like?

Until now I never had these errors; they only appeared yesterday, after setting edns-buffer-size: 512.
This is a snapshot of the current state:
Mem: 2055060k total, 1984176k used, 70884k free, 89028k buffers
Swap: 2064376k total, 7932k used, 2056444k free, 114800k cached

My current config file looks as follows (all other lines are commented out):
server:
         verbosity: 1
         statistics-interval: 0
         statistics-cumulative: no
         extended-statistics: yes
         num-threads: 2
         interface: 0.0.0.0
         interface: 2001:7e8:f00:1::1
         interface-automatic: no
         outgoing-range: 768
         so-rcvbuf: 2m
         so-sndbuf: 2m
         msg-cache-size: 350m
         msg-cache-slabs: 2
         rrset-cache-size: 700m
         rrset-cache-slabs: 2
         infra-cache-slabs: 2
         access-control: 127.0.0.0/8 allow
         access-control: ::1 allow
         access-control: 83.199.0.0/17 allow
         ...
         chroot: ""
         username: "unbound"
         directory: "/etc/unbound"
         log-time-ascii: yes
         pidfile: "/var/run/unbound/unbound.pid"
         hide-identity: yes
         hide-version: yes
         harden-glue: yes
         harden-dnssec-stripped: yes
         harden-referral-path: yes
         use-caps-for-id: no
         unwanted-reply-threshold: 10000000
         prefetch: yes
         prefetch-key: yes
         auto-trust-anchor-file: "/etc/unbound/root.key"
         val-clean-additional: yes
         val-permissive-mode: no
         val-log-level: 2
         key-cache-size: 16m
         key-cache-slabs: 2
         neg-cache-size: 2m
remote-control:
         control-enable: yes
         server-key-file: "/etc/unbound/unbound_server.key"
         server-cert-file: "/etc/unbound/unbound_server.pem"
         control-key-file: "/etc/unbound/unbound_control.key"
         control-cert-file: "/etc/unbound/unbound_control.pem"
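As an aside on the out-of-memory errors: the cache sizes configured above already add up to over 1 GB before allocator overhead, on a box showing about 2 GB of RAM (a rough back-of-the-envelope sum; actual usage is typically higher than the configured totals):

```shell
# Sum of the configured cache sizes in MB (msg + rrset + key + neg caches);
# real memory usage is usually noticeably higher than this figure.
echo $((350 + 700 + 16 + 2))   # prints 1068
```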

> There was lately an issue where priming the root with DNSSEC took very long in some cases...
> What are the settings for your trusted keys, and do you use IPv6?

We use IPv6.

[resolv1 ~]$ ls -l "/etc/unbound/root.key"
-rw-r--r--. 1 unbound unbound 759 Aug 25 15:40 /etc/unbound/root.key
[resolv1 ~]$ cat "/etc/unbound/root.key"
; autotrust trust anchor file
;;id: . 1
;;last_queried: 1314279650 ;;Thu Aug 25 15:40:50 2011
;;last_success: 1314279650 ;;Thu Aug 25 15:40:50 2011
;;next_probe_time: 1314322249 ;;Fri Aug 26 03:30:49 2011
;;query_failed: 0
;;query_interval: 43200
;;retry_time: 8640
. 172800 IN DNSKEY 257 3 8 AwEAAagAIKlVZrpC6Ia7gEzahOR+9W29euxhJhVVLOyQbSEW0O8gcCjFFVQUTf6v58fLjwBd0YI0EzrAcQqBGCzh/RStIoO8g0NfnfL2MTJRkxoXbfDaUeVPQuYEhg37NZWAJQ9VnMVDxP/VHL496M/QZxkjf5/Efucp2gaDX6RS6CXpoY68LsvPVjR0ZSwzz1apAzvN9dlzEheX7ICJBBtuA6G3LQpzW5hOA2hzCTMjJPJ8LbqF6dsV6DoBQzgul0sGIcGOYl7OyQdXfZ57relSQageu+ipAdTTJ25AsRTAoub8ONGcLmqrAmRLKBP1dfwhYB4N7knNnulqQxA+Uk1ihz0= ;{id = 19036 (ksk), size = 2048b} ;;state=2 [ VALID ] ;;count=0 ;;lastchange=1308043580 ;;Tue Jun 14 11:26:20 2011

This afternoon I tried blocking traffic from my resolver to one or two of the three name servers of register.be. Especially when blocking outgoing traffic towards ns3.register.be, unbound's behavior improved (roughly 60% success), but the problem did not disappear. I still got SERVFAILs.

kind regards

Leo Bush

Dear all,

My problem is not gone yet. I analysed it as far as I could, following your keywords EDNS and packet loss or fragmentation, which looked promising.
I disabled the iptables rule set, restarted unbound and let it run for a couple of days: same result. Sometimes the domains hosted on ns1.register.be and ns2.register.be were resolved; most of the time unbound returned SERVFAIL.
Then I disabled DNSSEC and let it run as well. Again the same result.

I nevertheless noticed better results for domains like leonidas.be, which are hosted on ns1.register.be, ns2.register.be and ns3.register.be, than for domains like estates.lu, which is delegated only to ns1.register.be and ns2.register.be.

Then I read about «Unbound Timeout Information» and the flush_infra and dump_infra commands in the unbound documentation. I think this time I am on a promising track, but I do not understand well enough how it all works and interacts. I did the following checks:

[resolv ~]# unbound-control lookup leonidas.be
The following name servers are used for lookup of leonidas.be.
;rrset 85727 2 0 2 0
leonidas.be. 85727 IN NS ns1.register.be.
leonidas.be. 85727 IN NS ns2.register.be.
;rrset 37 1 0 8 3
ns2.register.be. 37 IN A 194.78.23.152
;rrset 37 1 0 8 3
ns1.register.be. 37 IN A 80.169.63.207
Delegation with 2 names, of which 0 can be examined to query further addresses.
It provides 2 IP addresses.
80.169.63.207 rto 120000 msec, ttl 412, ping 0 var 94 rtt 376, EDNS 0 assumed.
194.78.23.152 rto 120000 msec, ttl 478, ping 0 var 94 rtt 376, EDNS 0 assumed.

[root@resolv ~]# unbound-control flush_infra 80.169.63.207 && unbound-control flush_infra 194.78.23.152; while [ 1 ] ; do date; unbound-control dump_infra | grep -E "80.169.63.207|194.78.23.152|91.121.5.186"; sleep 30; echo; done
ok
ok
Thu Sep 8 18:02:51 CEST 2011
91.121.5.186 ttl 717 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:03:21 CEST 2011
80.169.63.207 ttl 875 ping 2 var 60 rtt 242 rto 242 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 889 ping 5 var 16 rtt 69 rto 69 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 687 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:03:52 CEST 2011
80.169.63.207 ttl 844 ping 2 var 60 rtt 242 rto 242 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 858 ping 5 var 16 rtt 69 rto 69 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 656 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:04:23 CEST 2011
80.169.63.207 ttl 813 ping 2 var 60 rtt 242 rto 3872 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 827 ping 5 var 16 rtt 69 rto 4416 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 625 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:04:53 CEST 2011
80.169.63.207 ttl 783 ping 2 var 60 rtt 242 rto 15488 ednsknown 1 edns 0 delay 9
194.78.23.152 ttl 797 ping 5 var 16 rtt 69 rto 17664 ednsknown 1 edns 0 delay 13
91.121.5.186 ttl 595 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:05:24 CEST 2011
80.169.63.207 ttl 752 ping 2 var 60 rtt 242 rto 30976 ednsknown 1 edns 0 delay 13
194.78.23.152 ttl 766 ping 5 var 16 rtt 69 rto 35328 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 564 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:05:55 CEST 2011
80.169.63.207 ttl 721 ping 2 var 60 rtt 242 rto 61952 ednsknown 1 edns 0 delay 54
194.78.23.152 ttl 735 ping 5 var 16 rtt 69 rto 35328 ednsknown 1 edns 0 delay 8
91.121.5.186 ttl 533 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:06:25 CEST 2011
80.169.63.207 ttl 691 ping 2 var 60 rtt 242 rto 61952 ednsknown 1 edns 0 delay 24
194.78.23.152 ttl 705 ping 5 var 16 rtt 69 rto 70656 ednsknown 1 edns 0 delay 53
91.121.5.186 ttl 503 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:06:56 CEST 2011
80.169.63.207 ttl 660 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 674 ping 5 var 16 rtt 69 rto 70656 ednsknown 1 edns 0 delay 22
91.121.5.186 ttl 472 ping 1 var 9 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:07:26 CEST 2011
80.169.63.207 ttl 630 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 644 ping 5 var 16 rtt 69 rto 120000 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 442 ping 4 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
^C

I noticed that the rto value increases very quickly towards 120000 for the two name servers ns1.register.be and ns2.register.be, and it stays there. Resolution works for a certain time; afterwards unbound returns SERVFAIL.
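A small filter can flag the servers whose rto has reached that 120000 ms ceiling in the dump_infra output (sample lines copied from the dump above; in practice, pipe `unbound-control dump_infra` into the awk command instead of the here-document):

```shell
# Print the address of every server whose rto field has reached the
# 120000 ms maximum, i.e. servers unbound currently backs off from.
awk '{ for (i = 2; i < NF; i++) if ($i == "rto" && $(i+1) >= 120000) print $1 }' <<'EOF'
80.169.63.207 ttl 630 ping 2 var 60 rtt 242 rto 120000 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 644 ping 5 var 16 rtt 69 rto 120000 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 442 ping 4 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
EOF
```

With the sample above this prints only the two register.be addresses.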

I did the same test with a twin unbound name server (carrying no traffic) and could not observe the same behaviour.

unbound-control flush_infra 80.169.63.207 && unbound-control flush_infra 194.78.23.152; while [ 1 ] ; do date; unbound-control dump_infra | grep -E "80.169.63.207|194.78.23.152|91.121.5.186"; sleep 30; echo; done
ok
Thu Sep 8 17:55:56 CEST 2011

Thu Sep 8 17:56:26 CEST 2011
80.169.63.207 ttl 877 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 877 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 877 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep 8 17:56:57 CEST 2011
80.169.63.207 ttl 846 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 846 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 846 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep 8 17:57:27 CEST 2011
80.169.63.207 ttl 816 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 816 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 816 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep 8 17:57:57 CEST 2011
80.169.63.207 ttl 786 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 786 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 786 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Thu Sep 8 17:58:27 CEST 2011
80.169.63.207 ttl 756 ping 9 var 16 rtt 73 rto 73 ednsknown 1 edns 0 delay 0
91.121.5.186 ttl 756 ping 0 var 20 rtt 80 rto 80 ednsknown 1 edns 0 delay 0
194.78.23.152 ttl 756 ping 0 var 13 rtt 52 rto 52 ednsknown 1 edns 0 delay 0

Now I did the following test on the "buggy" unbound server:
[root@resolv ~]# unbound-control flush_infra 80.92.67.140 && unbound-control flush_infra 80.92.65.2; while [ 1 ] ; do date; unbound-control dump_infra | grep -E "80.92.67.140|80.92.65.2"; sleep 30; echo; done
ok
ok
Thu Sep 8 17:58:38 CEST 2011

Thu Sep 8 17:59:08 CEST 2011
80.92.67.140 ttl 870 ping 21 var 9 rtt 57 rto 57 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 872 ping 9 var 8 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 17:59:39 CEST 2011
80.92.67.140 ttl 839 ping 20 var 8 rtt 52 rto 52 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 841 ping 11 var 10 rtt 51 rto 51 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:00:09 CEST 2011
80.92.67.140 ttl 808 ping 23 var 11 rtt 67 rto 67 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 810 ping 26 var 29 rtt 142 rto 142 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:00:41 CEST 2011
80.92.67.140 ttl 777 ping 21 var 9 rtt 57 rto 57 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 779 ping 8 var 8 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:01:11 CEST 2011
80.92.67.140 ttl 747 ping 18 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 749 ping 9 var 7 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:01:42 CEST 2011
80.92.67.140 ttl 716 ping 18 var 5 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 718 ping 8 var 6 rtt 50 rto 50 ednsknown 1 edns 0 delay 0

Thu Sep 8 18:02:12 CEST 2011
80.92.67.140 ttl 685 ping 23 var 13 rtt 75 rto 75 ednsknown 1 edns 0 delay 0
80.92.65.2 ttl 687 ping 11 var 7 rtt 50 rto 50 ednsknown 1 edns 0 delay 0
^C

Does anybody have an explanation or a suggestion for this?

Regards

Leo Bush

> Zitat von Leo Bush <leo.bush@mylife.lu>:
>
>> Dear all,
>>
>> Since one month our company uses unbound-1.4.8-1 on two RH6 servers as caching and resolving servers with IPv6 and DNSSec enabled. These two servers deal with all our DNS traffic, generated by all our customers (2x 5Mbps peak traffic). They work as stand alone servers, no complicated network components (Load balancer...) around.
>>
>> At the beginning we used to activate the option use-caps-for-id, but since we got complaints from customers that certain domains were available everywhere in the world except at us, we preferred to deactivate.
>>
>> Currently we face the following rather strange problem:
>> Under normal working conditions, in 70-90% of the time our two production servers cannot resolve domains registered at register.be and lying on the three authoritative name servers ns1.register.be, ns3.register.be, ns2.register.be (example: leonidas.be, estates.lu). They return me a SERVFAIL. register.be itself works all the time. By chance it sometimes works correctly for a brief period of time. Even though it was not easy due to the thousands of packets passing through in a second, I succeeded to trace the packets the server sends to the authoritative servers and it gets correct answers back.
>>
>> I tried to install unbound 1.4.8 with the same configuration file (see attachment) on a desktop machine and there was no issue. All resolutions against domains at register.be were immediate and correct.
>>
>> As customers continued to complain I was forced to take one server out of production and to replace it with bind which works correctly. Now I have one server with unbound that has the problem and one server with bind, that works fine in production. The formerly faulty unbound server that is now offloaded currently responds correctly at all tests (no restart done, no reboot done, just IP address switched).
>>
>> Does anybody have an idea how I can solve this problem? Shall I offer you more technical information? Do you have further tests to suggest?
>>
>
> This looks to me like an EDNS problem. At least part of the .be zone is DNSSEC signed and the replies get bigger than 512 bytes, as with "dig x.dns.be A +dnssec". BIND has a feature to reduce the EDNS size in case of trouble; I am not sure if Unbound does the same. What you should check:
> - Do the troublesome domains/names resolve with unbound if you use checking disabled (+cdflag)?
> - Do you have any firewall device in front of your resolvers, maybe a Cisco box inspecting DNS traffic?
> - Have you disabled TCP in Unbound?
>
> For some hints on the problem have a look here:
> https://www.dns-oarc.net/oarc/services/replysizetest
>
> Regards
>
> Andreas
>
> _______________________________________________
> Unbound-users mailing list
> Unbound-users@unbound.net
> http://unbound.nlnetlabs.nl/mailman/listinfo/unbound-users
>

Hi Leo,

I do not have a solution for you, but wanted to help read the output.
An rto value of 120000 means timeouts. This means that the host is
timing out and does not reply to you.

rto: roundtrip-timeout value. The roundtrip time modified by
exponential backoff due to timeouts. The 'ping' value would be the
ping time when the host does respond to you (msec).

The leonidas.be servers seem to have blacklisted you? Or some firewall
or other script is throttling your traffic to zero? So, it works for
a bit, then it blacklists you, and then it stops working and times out.

Best regards, Wouter
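Wouter's description of rto as a roundtrip value modified by exponential backoff can be illustrated with a toy model. This is a sketch of the general technique, not unbound's actual code; the doubling factor and the 120000 ms cap are assumptions based on the values quoted in this thread:

```python
# Toy model of exponential backoff on a round-trip timeout (rto).
# Not unbound's implementation; it only shows how repeated timeouts
# drive rto from tens of milliseconds up to a 120000 ms ceiling.
def next_rto(rto, cap=120000):
    # Each unanswered query doubles the timeout, up to the cap.
    return min(rto * 2, cap)

rto = 50  # a healthy starting value, as seen in the dump_infra output
history = [rto]
for _ in range(12):  # twelve consecutive timeouts
    rto = next_rto(rto)
    history.append(rto)
print(history)  # 50, 100, 200, ... 102400, 120000
```

Under this model roughly a dozen consecutive timeouts are enough to take a healthy 50 ms server to the 120000 ms ceiling, which matches how quickly the problem appears once a zone stops answering.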

Dear all,

Thank you, Wouter, for your last answer. It pushed me to get into contact with the operator in question, but we found no new lead until I came across the following explanation for the problem today:
- For weeks now, our unbound resolving server has been receiving a request for www.coolbox.be A every minute from a device in our network.
- unbound tries to get the answer from ns1.register.be or ns2.register.be -> in both cases no answer -> timeout -> the rto quickly climbs to 120000.
- In parallel, our unbound server receives various requests for domains hosted on ns1.register.be or ns2.register.be. Normally they all get answered quickly. But we noticed that once the rto has reached 120000 (because of www.coolbox.be), unbound no longer tries to contact the remote authoritative servers and only returns SERVFAILs, even though an answer would be available. The registrar's whole nameserver farm is blacklisted because one zone no longer works.
- This explains why I noticed the error only for ns1.register.be and ns2.register.be and not for ns3.register.be: coolbox.be is not delegated to the third server.
- This explains why the resolution works correctly in rare moments and fails most of the time (it works during the short periods before the nameservers are blacklisted again).
- I would say unbound does poor negative caching for two nameservers that only respond when they have something to answer. If they are asked about things they no longer know, they do not answer at all (no REJECT). This penalizes all communication between the (unbound) resolvers and the authoritative servers.
- I found the following text in RFC 2308:

7 - Other Negative Responses
... not covered by any existing RFC.

7.1 Server Failure (OPTIONAL)

    ... a resolver MAY cache a server failure response. If it
    does so it MUST NOT cache it for longer than five (5) minutes, and it
    MUST be cached against the specific query tuple <query name, type,
    class, server IP address>.

7.2 Dead / Unreachable Server (OPTIONAL)

    Dead / Unreachable servers are servers that fail to respond in any
    way to a query or where the transport layer has provided an
    indication that the server does not exist or is unreachable. A
    server may be deemed to be dead or unreachable if it has not
    responded to an outstanding query within 120 seconds.

    Examples of transport layer indications are:

       ICMP error messages indicating host, net or port unreachable.
       TCP resets
       IP stack error messages providing similar indications to those above.

    A server MAY cache a dead server indication. If it does so it MUST
    NOT be deemed dead for longer than five (5) minutes. The indication
    MUST be stored against query tuple <query name, type, class, server
    IP address> unless there was a transport layer indication that the
    server does not exist, in which case it applies to all queries to
    that specific IP address.

- Can you tell me if my interpretation is correct: requests that do not get answered make unbound blacklist the whole server, so that it does not even send requests for working domains that would get answered. Does unbound cache over the complete tuple <query name, type, class, server IP address>?
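The per-tuple versus per-address distinction that RFC 2308 draws can be made concrete with a toy dead-server cache. This is only an illustration of the two keying strategies, not unbound's internals; the behaviour observed above (one dead zone blacklisting the whole server) is what a per-address key would produce:

```python
import time

# Toy dead-server cache illustrating RFC 2308 section 7.2: an entry may
# be kept per full query tuple, or per server IP address only.
# Names and the 5-minute TTL follow the RFC text; nothing here is
# unbound's actual data structure.
class DeadServerCache:
    def __init__(self, per_tuple, ttl=300):  # RFC 2308: at most 5 minutes
        self.per_tuple = per_tuple
        self.ttl = ttl
        self.entries = {}

    def _key(self, qname, qtype, qclass, ip):
        return (qname, qtype, qclass, ip) if self.per_tuple else ip

    def mark_dead(self, qname, qtype, qclass, ip):
        self.entries[self._key(qname, qtype, qclass, ip)] = time.time() + self.ttl

    def is_dead(self, qname, qtype, qclass, ip):
        expiry = self.entries.get(self._key(qname, qtype, qclass, ip))
        return expiry is not None and expiry > time.time()

# Per-tuple caching blacklists only the failing query ...
per_tuple = DeadServerCache(per_tuple=True)
per_tuple.mark_dead("www.coolbox.be", "A", "IN", "80.92.67.140")
print(per_tuple.is_dead("leonidas.be", "A", "IN", "80.92.67.140"))  # False

# ... while per-address caching blacklists every query to that server.
per_addr = DeadServerCache(per_tuple=False)
per_addr.mark_dead("www.coolbox.be", "A", "IN", "80.92.67.140")
print(per_addr.is_dead("leonidas.be", "A", "IN", "80.92.67.140"))   # True
```

With a per-tuple key, leonidas.be would still resolve while only www.coolbox.be stays blacklisted, which is exactly the difference Leo is asking about.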

kind regards

Leo Bush

FWIW, this sounds like normal behaviour of an authoritative server
running tinydns.