Setting max-time before servfail

Hi,

I am in the process of moving a number of caching boxes to unbound.

One thing I have noticed is the time it takes for a servfail to get generated should a domain not be available/visible.

Example.

With unbound I get a timeout (which some clients see as the dns server failing and not answering)

dig bagmail.com mx @dnscache1-ctn.is.co.za

; <<>> DiG 9.6.1-P2 <<>> bagmail.com mx @unbound_server
;; global options: +cmd
;; connection timed out; no servers could be reached

With our current product I get a servfail.

dig bagmail.com mx @current_cache

; <<>> DiG 9.6.1-P2 <<>> bagmail.com mx @dnscache2-ctn.is.co.za
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 35397
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;bagmail.com. IN MX

;; Query time: 5000 msec

;; WHEN: Fri Jan 15 16:00:17 2010
;; MSG SIZE rcvd: 29

The issue with this specific domain is that its name servers, ns1.goldkey.com and ns2.goldkey.com, don’t exist.

bagmail.com. 172800 IN NS ns1.goldkey.com.
bagmail.com. 172800 IN NS ns2.goldkey.com.

unbound-control lookup on that domain shows the following

unbound-control lookup bagmail.com

The following name servers are used for lookup of bagmail.com.
;rrset 84946 2 0 2 0
bagmail.com. 171346 IN NS ns1.goldkey.com.
bagmail.com. 171346 IN NS ns2.goldkey.com.
;rrset 84946 1 0 1 0
ns2.goldkey.com. 171346 IN A 206.83.79.29
;rrset 84946 1 0 1 0
ns1.goldkey.com. 171346 IN A 64.95.64.222
Delegation with 2 names, of which 2 can be examined to query further addresses.
It provides 2 IP addresses.
64.95.64.222 rtt 120000 msec, 12 lost. noEDNS probed.
206.83.79.29 rtt 120000 msec, 17 lost. noEDNS probed.
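The per-address lines at the end of that output can be checked mechanically. The following is a minimal sketch, assuming the line format shown above; `parse_servers`, `all_unresponsive`, and the 120000 msec threshold are names and values chosen for illustration, not part of unbound or unbound-control.

```python
import re

# Matches lines such as "64.95.64.222 rtt 120000 msec, 12 lost. noEDNS probed."
# (format assumed from the output quoted above).
SERVER_LINE = re.compile(r"^(?P<addr>\S+) rtt (?P<rtt>\d+) msec, (?P<lost>\d+) lost\.")


def parse_servers(text):
    """Return a list of (address, rtt_msec, lost) tuples found in the text."""
    servers = []
    for line in text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            servers.append(
                (match.group("addr"), int(match.group("rtt")), int(match.group("lost")))
            )
    return servers


def all_unresponsive(servers, rtt_threshold_msec=120000):
    """True if every listed server has lost packets and an rtt at or above
    the threshold. The threshold is an assumption taken from the values
    seen in this one example."""
    return bool(servers) and all(
        rtt >= rtt_threshold_msec and lost > 0 for _, rtt, lost in servers
    )


sample = """\
It provides 2 IP addresses.
64.95.64.222 rtt 120000 msec, 12 lost. noEDNS probed.
206.83.79.29 rtt 120000 msec, 17 lost. noEDNS probed.
"""
servers = parse_servers(sample)
```

Feeding it the output of unbound-control lookup for a domain gives a quick yes/no on whether every known address for the delegation looks dead.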

Is there any way to get unbound to return a servfail straight away?

Thanks

Gareth

Hi Gareth,

The lookup is really taking very long and unbound assumes that you
should keep waiting for the answer. Unbound does not know what the
client's timeout is, so it cannot send a servfail before that timeout expires.

Perhaps the clients should have longer timeouts? Or how else can they
insist on an answer within some time? This is not part of the DNS
protocol? They are obviously broken.

Now, to step back from ranting about broken other stuff, in reality, you
want stuff to work. Right now unbound does not do what you want. What
would work well?

Best regards,
   Wouter

Hi Wouter,

I am extremely happy with the way unbound works, and thank you very much for the work that you
have done.

I’m just not looking forward to the customer queries about why “no dns servers could be reached”
and other odd error messages. Explaining to Joe Public that the name servers are
in fact broken, and that this is why they don’t get the response they expect, is always challenging.
Their reply is usually “but if I use opendns or googledns it answers” (never mind the fact that they
are still answering with a servfail.)

An example I am seeing on one of the unbound caches is as follows.

The nameservers for 233.165.in-addr.arpa are broken

233.165.in-addr.arpa. 59408 IN NS dbndns1.ifusion.co.za.
233.165.in-addr.arpa. 59408 IN NS jhbdns1.ifusion.co.za.
;; BAD (HORIZONTAL) REFERRAL
;; Received 133 bytes from 165.233.48.99#53(jhbdns1.ifusion.co.za) in 21 ms

233.165.in-addr.arpa. 82036 IN NS jhbdns1.ifusion.co.za.
233.165.in-addr.arpa. 82036 IN NS dbndns1.ifusion.co.za.
;; BAD (HORIZONTAL) REFERRAL
;; Received 133 bytes from 165.233.152.114#53(dbndns1.ifusion.co.za) in 35 ms

233.165.in-addr.arpa. 59408 IN NS dbndns1.ifusion.co.za.
233.165.in-addr.arpa. 59408 IN NS jhbdns1.ifusion.co.za.
;; BAD (HORIZONTAL) REFERRAL
;; Received 133 bytes from 165.233.48.99#53(jhbdns1.ifusion.co.za) in 20 ms

233.165.in-addr.arpa. 82036 IN NS jhbdns1.ifusion.co.za.
233.165.in-addr.arpa. 82036 IN NS dbndns1.ifusion.co.za.
;; BAD (HORIZONTAL) REFERRAL
;; Received 133 bytes from 165.233.152.114#53(dbndns1.ifusion.co.za) in 33 ms

233.165.in-addr.arpa. 59408 IN NS dbndns1.ifusion.co.za.
233.165.in-addr.arpa. 59408 IN NS jhbdns1.ifusion.co.za.
;; BAD (HORIZONTAL) REFERRAL
;; Received 133 bytes from 165.233.48.99#53(jhbdns1.ifusion.co.za) in 21 ms

An unbound-control dump_requestlist shows the following.

232 PTR IN 17.75.233.165.in-addr.arpa. 117.590562 iterator wait for 165.233.48.99
233 PTR IN 18.75.233.165.in-addr.arpa. 106.083322 iterator wait for 165.233.48.99
234 PTR IN 19.75.233.165.in-addr.arpa. 110.606661 iterator wait for 165.233.48.99
235 PTR IN 20.75.233.165.in-addr.arpa. 116.093442 iterator wait for 165.233.48.99
236 PTR IN 21.67.233.165.in-addr.arpa. 105.611471 iterator wait for 165.233.48.99
237 PTR IN 21.75.233.165.in-addr.arpa. 115.076346 iterator wait for 165.233.48.99
238 PTR IN 22.75.233.165.in-addr.arpa. 114.074878 iterator wait for 165.233.48.99
239 PTR IN 23.75.233.165.in-addr.arpa. 113.083954 iterator wait for 165.233.48.99
240 PTR IN 24.75.233.165.in-addr.arpa. 112.056811 iterator wait for 165.233.48.99
241 PTR IN 25.75.233.165.in-addr.arpa. 111.071265 iterator wait for 165.233.48.99
242 PTR IN 26.75.233.165.in-addr.arpa. 110.086471 iterator wait for 165.233.48.99
243 PTR IN 27.75.233.165.in-addr.arpa. 109.110294 iterator wait for 165.233.48.99
244 PTR IN 28.74.233.165.in-addr.arpa. 60.251117 iterator wait for 165.233.48.99
245 PTR IN 28.75.233.165.in-addr.arpa. 108.101261 iterator wait for 165.233.48.99
246 PTR IN 29.75.233.165.in-addr.arpa. 107.636158 iterator wait for 165.233.48.99

Would it be possible to get unbound to send a servfail if all nameservers give a bad referral? The above
seems to indicate it will continue trying until it gets the data it is looking for, but in this case it never will and
the query times out. The same query against google dns gives a servfail.
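To make the request concrete, here is a rough sketch of the check I have in mind. It is not how unbound works internally; `classify_referral` and `should_servfail` are hypothetical names, and the rule (a referral must point strictly below the zone that was asked) is my simplification of what dig flags as a horizontal referral.

```python
def labels(name):
    """Split a domain name into lowercase labels, ignoring the trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_subdomain(name, zone):
    """True if name is equal to zone or lies below it."""
    n, z = labels(name), labels(zone)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def classify_referral(current_zone, referral_zone):
    """Classify a referral received while querying servers for current_zone."""
    if not is_subdomain(referral_zone, current_zone):
        return "upward_or_sideways"  # points outside the zone being queried
    if len(labels(referral_zone)) == len(labels(current_zone)):
        return "horizontal"  # points back at the same zone: no progress
    return "downward"  # a normal delegation to a child zone


def should_servfail(current_zone, referral_zones):
    """True if every name server tried returned a referral that makes no
    progress, so retrying cannot produce an answer."""
    return bool(referral_zones) and all(
        classify_referral(current_zone, zone) != "downward"
        for zone in referral_zones
    )
```

For the zone above, both servers refer 233.165.in-addr.arpa back to 233.165.in-addr.arpa, so `should_servfail("233.165.in-addr.arpa.", ["233.165.in-addr.arpa.", "233.165.in-addr.arpa."])` would be true under this rule.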

unbound cache

time dig 24.75.233.165.in-addr.arpa PTR @dnscache1-ctn.is.co.za

; <<>> DiG 9.6.1-P2 <<>> 24.75.233.165.in-addr.arpa PTR @dnscache1-ctn.is.co.za
;; global options: +cmd
;; connection timed out; no servers could be reached

real 0m15.285s
user 0m0.000s
sys 0m0.012s

google cache

time dig 24.75.233.165.in-addr.arpa PTR @8.8.8.8

; <<>> DiG 9.6.1-P2 <<>> 24.75.233.165.in-addr.arpa PTR @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33284
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;24.75.233.165.in-addr.arpa. IN PTR

;; Query time: 577 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Mon Jan 18 18:32:30 2010
;; MSG SIZE rcvd: 44

real 0m0.585s
user 0m0.008s
sys 0m0.001s

Thanks again

Cheers

Gareth

Hi Gareth,

So, I am not arguing about the customer pleasing part. That sounds
good. My trouble is how to do that.

max-time-before-servfail: 1 second is an option that I could possibly add.

But my question (to the others on the list): is that a good idea? Could
that be a good default?

Basically 2-second lookups are then no longer possible, since you get
servfail after 1 second. (Hopefully the 2-second lookup ends up in
the cache and can thus be queried later.)
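To illustrate the behaviour such an option would have, here is a small sketch. It is only a model of the idea, not unbound code: the option does not exist yet, and `resolve`, `slow_lookup`, and the in-memory `cache` are stand-ins invented for the example. The client gets servfail once the deadline passes, while the lookup carries on in the background and fills the cache.

```python
import concurrent.futures
import time

SERVFAIL = "SERVFAIL"

# Hypothetical stand-ins for the resolver's cache and worker threads.
cache = {}
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def slow_lookup(name, seconds):
    """Pretend to resolve name, taking the given time, then cache the result."""
    time.sleep(seconds)
    cache[name] = "answer for " + name
    return cache[name]


def resolve(name, lookup_seconds, max_time_before_servfail=1.0):
    """Answer from the cache, or wait at most the deadline, then give up."""
    if name in cache:
        return cache[name]
    future = executor.submit(slow_lookup, name, lookup_seconds)
    try:
        return future.result(timeout=max_time_before_servfail)
    except concurrent.futures.TimeoutError:
        # The client is told servfail now; the lookup keeps running in
        # the background and stores its answer for later queries.
        return SERVFAIL


# A lookup slower than the deadline: servfail first, a cached answer later.
first = resolve("slow.example.", lookup_seconds=0.2, max_time_before_servfail=0.05)
time.sleep(0.6)
second = resolve("slow.example.", lookup_seconds=0.2, max_time_before_servfail=0.05)
```

The trade-off is visible in the last three lines: the first client never sees the real answer, and only a retry after the lookup has finished benefits from it.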

Does the customer actually see a different response from their
internet-app? I understand we are the small guys here and changing,
say, IE display of error reports is not an option.

Best regards,
   Wouter