odd increase in SERVFAIL with "misc failure" reason

Hi!

I've been operating a small private (low-volume) recursor for my own purposes
for years, using Unbound since about 1.6.x, without noticeable issues so far.

But with 1.22+ I noticed some oddities with unexpected SERVFAILs.

Incoming requests arrive via DoT on port 853 and locally via classic port 53.
My config mostly uses defaults, except for [0].

I first noticed it through failed mail reception from GMX, because Unbound
occasionally was unable to resolve the PTR RRs of their outgoing mail
relays. With "verbosity: 1" and "log-servfail: yes" the log showed only
error: SERVFAIL <18.15.227.212.in-addr.arpa. PTR IN>: misc failure

A closer look at the logs showed a lot of rather odd "misc failure"s, e.g.:
error: SERVFAIL <ctldl.windowsupdate.com. AAAA IN>: misc failure
error: SERVFAIL <alexa.amazon.de. A IN>: misc failure
error: SERVFAIL <www.paypal.com. A IN>: misc failure

All of them resolved as expected on a later retry.

I searched the source for the "misc failure" message and found the (at least
to me) new option "max-global-quota" as one possible cause. I then raised
the verbosity to 3 to see more details. At the same time I added
  msg-cache-size: 4m
  num-queries-per-thread: 4096
  rrset-cache-size: 8m
  cache-min-ttl: 10
  cache-max-negative-ttl: 3600
  infra-cache-min-rtt: 100
to [0], but I still didn't change the "max-global-quota" default.

To my surprise this also reduced the "misc failure" rate, and only some
"in-addr.arpa" queries still SERVFAILed. They all triggered the
"request xxxx has exceeded the maximum global quota on number of upstream
queries yyy" message in the debug log.

I then removed those modifications from the config again and returned to
plain [0], and the elevated rate of "misc failure"s, including for quite
prominent zones, returned as well.

e.g.:
debug: request 3.pool.ntp.org. has exceeded the maximum global quota on
number of upstream queries 155
debug: return error response SERVFAIL
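For illustration, a toy model of the quota behavior as I understand it (this
is NOT Unbound's actual code; names and numbers are my own assumptions): each
client request counts the upstream queries it spawns, and once that counter
reaches max-global-quota the request is answered with SERVFAIL.

```python
# Toy model of the per-request global quota; purely illustrative,
# not Unbound source code.

MAX_GLOBAL_QUOTA = 128  # the 1.22.0 default under discussion

def resolve(upstream_queries_needed, quota=MAX_GLOBAL_QUOTA):
    """Simulate one client request that needs N upstream queries."""
    sent = 0
    for _ in range(upstream_queries_needed):
        if sent >= quota:
            return "SERVFAIL"  # logged as 'misc failure'
        sent += 1
    return "NOERROR"

# A cold cache needs many upstream queries (NS, A/AAAA, DS, DNSKEY, ...):
print(resolve(155))  # SERVFAIL, like the 3.pool.ntp.org example above
print(resolve(40))   # NOERROR once the cache is warm
```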

Searching for the highest "number of upstream queries" gave 180 for
error: SERVFAIL <at.mirrors.cicku.me. AAAA IN>: misc failure

This one failed again, with "139", when I retried while writing this mail.
The second try gave the correct answer.

Obviously the cache size, and primarily its contents, influence the
maximum number of upstream requests needed.

I'm wondering if I'm the only one seeing this?

IMO either the default of 128 is simply too low for low-volume recursors, or
there is some other oddity with this option.

Greetings,
Wolfgang Breyha

[0] config (stripped access, tls keys, common stuff)
        outgoing-port-permit: 32768-60999
        outgoing-port-avoid: 0-32767
        so-rcvbuf: 4m
        so-sndbuf: 4m
        so-reuseport: yes
        ip-transparent: yes
        max-udp-size: 4096
        log-servfail: yes
        harden-glue: yes
        harden-dnssec-stripped: yes
        harden-below-nxdomain: yes
        harden-referral-path: yes
        qname-minimisation: yes
        aggressive-nsec: yes
        use-caps-for-id: no
        unwanted-reply-threshold: 10000000
        prefetch: yes
        prefetch-key: yes
        rrset-roundrobin: yes
        minimal-responses: no
        val-clean-additional: yes
        val-permissive-mode: no
        serve-expired: no
        val-log-level: 1

Hi Wolfgang,

I observe the same increased SERVFAILs ("misc failure") after updating
to Unbound 1.22.0. Also on a low-volume recursor.

I have not had the opportunity to take a closer look, but wanted to
provide anecdotal evidence that you are not alone!

Cheers,
Otto

Hi Wolfgang, Otto,

Thanks for bringing this up!
We also had other operational feedback about the value, and we decided to bump it up from the initial 128 to 200.
This still keeps the possible amplification factor for CAMP-style issues in the hundreds.

https://github.com/NLnetLabs/unbound/commit/fd1a1d5fa0f012e8eeaa0ecc89da52d9ca25c216

Best regards,
-- Yorgos

Here we observe a big increase in outgoing UDP queries after updating from 1.21.0 to 1.22.0 (a factor of ~4).
It doesn't seem to be linked to max-global-quota.

Ok, tried that as well. Changed config only by adding
max-global-quota: 200

With an empty cache, a query for
at.mirrors.cicku.me. AAAA
produced

debug: request at.mirrors.cicku.me. has exceeded the maximum global quota
on number of upstream queries 201

with a final top of 222 :-( I think even 200 is too low, especially with a
nearly empty cache, or there is something else involved as well.

I'm tempted to raise the bar to a full 8 bits ;-)

Greetings, Wolfgang

Seems too low as well.

dl.acronis.com. A
reached
"number of upstream queries 292"
immediately after server reload.

This happened while I was trying to see whether at.mirrors.cicku.me AAAA is
reproducible if I flush the caches using "unbound-control reload". It is,
and it reached the formerly reported >200 requests as well.

Greetings, Wolfgang

… and CPU usage is up by a factor of ~1.5.
Something seems wrong.

Hi Olivier,

This (more upstream queries) may be related to another issue that we are investigating.
Do you happen to have serve-expired enabled in Unbound?

Best regards,
-- Yorgos

Hi Wolfgang,

This domain gives a lot of IPv6 timeouts for me while resolving (which contribute to the global send out quota).
Could you try with 'do-ip6: no' or 'prefer-ip4: yes' and see if the problem persists?
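For reference, a minimal unbound.conf fragment for trying this (assuming the
usual server clause; both options exist in current Unbound):

```
server:
    do-ip6: no
    # or, less drastic, keep IPv6 but prefer IPv4 transport upstream:
    # prefer-ip4: yes
```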

Best regards,
-- Yorgos

Hi Yorgos,

yes, regarding serve-expired we have:

serve-expired: yes
serve-expired-ttl: 172800
serve-expired-reply-ttl: 300
serve-expired-client-timeout: 1600
ede-serve-expired: yes

I’ve just downgraded one server in our DNS farm from 1.22.0 to 1.21.1; the outgoing UDP query count was immediately halved.

Hmm, my gut reaction is that there must be something wrong with how
the resulting queries are attributed to the original recursive query.
E.g. if I ask a name server which doesn't know acronis.com about the A
record for dl.acronis.com, I get back:

$ dig dl.acronis.com. a +norec

; <<>> DiG 9.18.24 <<>> dl.acronis.com. a +norec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51324
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 27

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 9c1f694d66e16b4e01000000672bc46570aadef12d60dc52 (good)
;; QUESTION SECTION:
;dl.acronis.com. IN A

;; AUTHORITY SECTION:
com. 89946 IN NS j.gtld-servers.net.
com. 89946 IN NS a.gtld-servers.net.
com. 89946 IN NS e.gtld-servers.net.
com. 89946 IN NS l.gtld-servers.net.
com. 89946 IN NS g.gtld-servers.net.
com. 89946 IN NS f.gtld-servers.net.
com. 89946 IN NS b.gtld-servers.net.
com. 89946 IN NS c.gtld-servers.net.
com. 89946 IN NS d.gtld-servers.net.
com. 89946 IN NS h.gtld-servers.net.
com. 89946 IN NS m.gtld-servers.net.
com. 89946 IN NS i.gtld-servers.net.
com. 89946 IN NS k.gtld-servers.net.

;; ADDITIONAL SECTION:
a.gtld-servers.net. 64485 IN A 192.5.6.30
b.gtld-servers.net. 64485 IN A 192.33.14.30
c.gtld-servers.net. 64485 IN A 192.26.92.30
d.gtld-servers.net. 64485 IN A 192.31.80.30
e.gtld-servers.net. 64485 IN A 192.12.94.30
f.gtld-servers.net. 64485 IN A 192.35.51.30
g.gtld-servers.net. 64485 IN A 192.42.93.30
h.gtld-servers.net. 64485 IN A 192.54.112.30
i.gtld-servers.net. 64485 IN A 192.43.172.30
j.gtld-servers.net. 64485 IN A 192.48.79.30
k.gtld-servers.net. 64485 IN A 192.52.178.30
l.gtld-servers.net. 64485 IN A 192.41.162.30
m.gtld-servers.net. 64485 IN A 192.55.83.30
a.gtld-servers.net. 64485 IN AAAA 2001:503:a83e::2:30
b.gtld-servers.net. 64485 IN AAAA 2001:503:231d::2:30
c.gtld-servers.net. 64485 IN AAAA 2001:503:83eb::30
d.gtld-servers.net. 64485 IN AAAA 2001:500:856e::30
e.gtld-servers.net. 64485 IN AAAA 2001:502:1ca1::30
f.gtld-servers.net. 64485 IN AAAA 2001:503:d414::30
g.gtld-servers.net. 64485 IN AAAA 2001:503:eea3::30
h.gtld-servers.net. 64485 IN AAAA 2001:502:8cc::30
i.gtld-servers.net. 64485 IN AAAA 2001:503:39c1::30
j.gtld-servers.net. 64485 IN AAAA 2001:502:7094::30
k.gtld-servers.net. 64485 IN AAAA 2001:503:d2d::30
l.gtld-servers.net. 64485 IN AAAA 2001:500:d937::30
m.gtld-servers.net. 64485 IN AAAA 2001:501:b1f9::30

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Wed Nov 06 20:32:53 CET 2024
;; MSG SIZE rcvd: 870

$

Now, how many of those NS records need to be resolved to an address
to successfully make progress in resolving the original query? One?
Two? All of them? And ... when unbound is configured to do DNSSEC
validation, is it then effectively prevented from using glue records
from the additional section? I guess that at least in this case
"yes", since they are ... not in a subzone of .com. And then we pile
on queries about DS and DNSKEY records, but still... 200ish queries to
resolve a single 3-layer name? Even with two CNAME records inside the
Akamai maze that seems like an awful lot to blame on the original
recursive query?
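As a back-of-envelope, here is how the count could plausibly add up under a
cold cache (all per-step costs below are my assumptions, purely illustrative,
not measured against Unbound):

```python
# Back-of-envelope: upstream queries for one cold-cache DNSSEC lookup.
# All numbers are assumptions for illustration only.

NS_PER_ZONE = 13  # e.g. the .com delegation shown above

# Zone cuts touched while chasing dl.acronis.com through a CNAME chain:
zones = ["com.", "acronis.com.", "cname-zone-1", "cname-zone-2"]

total = 0
for zone in zones:
    total += 1                # the delegation query itself
    total += 2                # DS + DNSKEY for validation
    # If glue is unusable (out-of-zone, or distrusted for validation),
    # each NS name may cost both an A and an AAAA lookup:
    total += 2 * NS_PER_ZONE

print(total)  # 116, before any retries or IPv6 timeouts
```

Even this crude model lands near the quota without counting retries, which
may be why timeouts push real queries past 200.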

Regards,

- Håvard

… we’ve downgraded everything to 1.21.1, and CPU and outgoing UDP queries are back to normal.
1.22.0 definitely has a problem.

Hi Olivier,

We are already working on it and we believe we have a solution.
Would you like to be involved with preliminary testing?

Best regards,
-- Yorgos

Yorgos Thessalonikefs via Unbound-users:
> We are already working on it and we believe we have a solution.
> Would you like to be involved with preliminary testing?

if you release a beta, I would also run my usual tests...
Andreas

Hi Andreas,

This is pre-RC talk to see if our current fix addresses the issue properly. When we know we have a solution, we'll go ahead with planning a release.

If you want to test a patch for this issue on top of 1.22.0 let me know and I will contact you off-list.

Best regards,
-- Yorgos

Hi Yorgos,

sure, we could try it on one low-load server here.