Jostle logic seems to randomly stop working

Hello,

I have set up unbound as a caching DNS server on my home router, and I have been
having intermittent problems with it. In my attempt to fix it, I recently
switched my desktop DNS resolver to systemd-resolved, and now the problem
appears a little more clear-cut:

Most of the time, unbound works great. However, it seems that every day when I
come back home and fire up my PC, a vast majority of queries no longer work.
Restarting unbound solves the issue.
Turning on maximum logging, I notice that the syslog is filling up with:
"Too many queries. dropping incoming query."

Running "unbound-control stats_noreset" also shows the
`total.requestlist.exceeded` stat at several hundred within minutes, and
actively climbing as systemd-resolved keeps trying to resolve names.

OK, so I looked at the code a bit, and I'm guessing that I'm hitting the
`num-queries-per-thread` limit, which I have configured at 225. However, based
on the documentation and the code, stale queries should be jostled out after a
delay (I have it configured as 500ms, since DNS over TLS is a bit slower).
Yet I can sit for several minutes and watch the vast majority of queries fail.
It appears that stale queries aren't being jostled out as they should be.
Interestingly, every once in a while a query makes it through, so it's not a
100% failure rate.

I'm not sure how to proceed to debug this further. I confess I don't have
intimate knowledge of the details of DNS, and my setup is somewhat experimental
so that I can play with things (the router is a custom Linux box with manually
written firewall rules, aggressive QoS settings, and the like), so it's
possible the issue lies outside of my unbound configuration.

-Dmitri

Snippets of my logs and config file(s):

### Snippet from syslog during failing condition:
...
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: answer from the cache failed
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: udp request from ip4 192.168.0.2 port 41884 (len 16)
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: Too many queries. dropping incoming query.
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: cache memory msg=495151 rrset=506194 infra=7600 val=109330
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: answer from the cache failed
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: udp request from ip4 192.168.0.2 port 33122 (len 16)
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: Too many queries. dropping incoming query.
May 15 18:50:55 homebrew unbound[575]: [575:0] debug: cache memory msg=495151 rrset=506194 infra=7600 val=109330
...

### Info from stats_noreset:
...
total.num.queries=4666
total.num.queries_ip_ratelimited=0
total.num.cachehits=844
total.num.cachemiss=3822
total.num.prefetch=8
total.num.zero_ttl=0
total.num.recursivereplies=3064
total.requestlist.avg=1.11906
total.requestlist.max=14
total.requestlist.overwritten=0
total.requestlist.exceeded=758
total.requestlist.current.all=0
total.requestlist.current.user=700
total.recursion.time.avg=0.247922
total.recursion.time.median=0.13798
total.tcpusage=0
time.now=1526424323.220624
time.up=85257.239793
time.elapsed=85257.239793

### unbound.conf:
server:
   aggressive-nsec: yes
   jostle-timeout: 500
   key-cache-slabs: 2
   low-rtt: 160
   low-rtt-pct: 900
   msg-cache-slabs: 2
   num-queries-per-thread: 225
   num-threads: 2
   outgoing-range: 450
   rrset-roundrobin: yes
   so-reuseport: yes
   trust-anchor-file: trusted-key.key
   use-caps-for-id: yes
   use-syslog: yes
   username: "unbound"
   verbosity: 2
   directory: "/etc/unbound"
   root-hints: root.hints

   interface: 127.0.0.1
   interface: 192.168.0.1
   access-control: 192.168.0.0/24 allow

   msg-cache-size: 256m
   rrset-cache-size: 512m

   neg-cache-size: 32m
   prefetch: yes
   prefetch-key: yes
   qname-minimisation: yes
   tls-cert-bundle: /etc/ca-certificates/extracted/ca-bundle.trust.crt
   tls-upstream: yes

auth-zone:
   name: "."
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: root.keys
   master: f.root-servers.net
   master: k.root-servers.net
   master: g.root-servers.net
   master: xfr.lax.dns.icann.org
   master: c.root-servers.net
   master: xfr.cjr.dns.icann.org
   master: b.root-servers.net

forward-zone:
   name: "."
   forward-addr: 199.58.81.218@853#dns.cmrg.net
   forward-addr: 1.0.0.1@853#cloudflare-dns.com
   forward-addr: 1.1.1.1@853#cloudflare-dns.com
   forward-addr: 2606:4700:4700::1111@853#cloudflare-dns.com
   forward-addr: 2606:4700:4700::1001@853#cloudflare-dns.com
   forward-addr: 9.9.9.9@853#dns.quad9.net
   forward-addr: 149.112.112.112@853#dns.quad9.net
   forward-addr: 2620:fe::fe@853#dns.quad9.net

remote-control:
   control-enable: yes

Dmitri Kourennyi via Unbound-users:

Most of the time, unbound works great. However, it seems that every day when I
come back home and fire up my PC, a vast majority of queries no longer work.
Restarting unbound solves the issue.

Hello,

I have a system that has also required restarting unbound for the last 1-2 weeks,
maybe related to the recent updates to 1.7.0 or 1.7.1.
I also have another system that stops responding to some queries but becomes
healthy again without a restart.
But I also have other systems without such symptoms. All are configured mostly the same way.

No idea whether there is a relation to your problem, or how to debug mine.

Andreas

Adding a little bit more information:

When the issue cropped up again, I tried running
"unbound-control dump_requestlist" to inspect the list, expecting it to be full
of requests that are stuck for some reason. Instead I found that it was
completely empty. Yet incoming queries were constantly being rejected.

Then I tried "unbound-control flush_requestlist" to see if it had any effect,
and sure enough, it resolved the issue, even though there didn't appear to be
anything to flush.

When I get a spare moment, I'll try to add some more logging in that part of
the code to determine which specific condition is failing and causing the
queries to be dropped.

More investigation results:

I think the issue appears when unbound is trying to probe the master servers for
the auth_zone section. The logs show unbound doing lookups on all the auth_zone
domain names in my config, and I think unbound is answering them from its own
cache. After the lookups, I get the following in the logs:

May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 13 recursion states (462 with reply, 13 detached), 13 waiting replies, 1444 recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 0RDd mod0 b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 1RDd mod0 c.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 2RDd mod0 f.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 3RDd mod0 g.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 4RDd mod0 k.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 5RDd mod0 xfr.cjr.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 6RDd mod0 xfr.lax.dns.icann.org. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 7RDd mod0 b.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 8RDd mod0 c.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 9RDd mod0 g.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 10RDd mod0 k.root-servers.net. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 11RDd mod0 xfr.cjr.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: 12RDd mod0 xfr.lax.dns.icann.org. AAAA IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 12 recursion states (462 with reply, 12 detached), 12 waiting replies, 1444 recursion replies sent, 1 replies dropped, 0 states jostled out

The above pattern repeats, with the last entry in the RDd list dropping off, and
the `detached` and `waiting reply` counters going down by one each time. Once
the list is emptied I get:


May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1 recursion states (462 with reply, 1 detached), 1 waiting replies, 1444 recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
0RDd mod0 b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: mesh_run: end 1 recursion states (462 with reply, 0 detached), 0 waiting replies, 1444 recursion replies sent, 1 replies dropped, 0 states jostled out
[*snip* histogram]
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone .: soa probe serial is 2018051800
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth_zone unchanged, new lease, wait
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth zone . timeout in 1800 seconds
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: close fd 15

From this point onward, queries are rejected. Half an hour later, the auth-zone
lookup attempt repeats, and it all looks like above. 4 hours later, I get a
bunch of lines like this:

May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point start listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: Reading tcp query of length 17540
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point stop listening 15
May 18 16:59:11 homebrew unbound[30620]: [30620:0] debug: comm point start listening 15

May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: Reading tcp query of length 6523
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: comm point stop listening 15
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: xfr b.root-servers.net: last AXFR packet

Followed by a big list:

May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr at[0] . SOA
May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr at[1] . RRSIG

May 18 16:59:12 homebrew unbound[30620]: [30620:0] debug: apply_axfr at[3028] camera. DS
May 18 16:59:57 homebrew systemd-journald[199]: Suppressed 19535 messages from unbound.service

This looks like it got new records for the auth-zone? I'm not sure. It still
repeats the auth-zone lookups every half hour.

I added some logging to mesh_make_new_space(), which showed that no jostle is
attempted because mesh->jostle_first is NULL. I don't know if that's
normal.

Since it seems to coincide with the auth-zone lookups, I will try disabling it
to see if the issue resolves itself.

Dmitri Kourennyi via Unbound-users:

More investigation results:

Since it seems to coincide with the auth-zone lookups, I will try disabling it to see if the issue resolves itself.

similar observation: the systems I mentioned also had auth-zones enabled.
I removed the feature and now they seem to run more reliably.

unbound-1.7.1

problematic, now disabled config:
auth-zone:
         name: "."
         for-downstream: no
         for-upstream: yes
         fallback-enabled: yes
         master: b.root-servers.net
         master: c.root-servers.net
         master: e.root-servers.net
         master: f.root-servers.net
         master: g.root-servers.net
         master: k.root-servers.net
         zonefile: "auth-zones/root-rfc7706.zone"

auth-zone:
         name: "arpa."
         for-downstream: no
         for-upstream: yes
         fallback-enabled: yes
         master: b.root-servers.net
         master: c.root-servers.net
         master: e.root-servers.net
         master: f.root-servers.net
         master: g.root-servers.net
         master: k.root-servers.net
         zonefile: "auth-zones/arpa-rfc7706.zone"

Andreas

I reported an issue with 1.7.1rc1 (on this list). After this information I
can say the issue was caused by the auth-zone for "." which I tried. So
1.7.0 worked, but 1.7.1rc1 and 1.7.1 are broken.

Hi Dmitri,

More investigation results:

I think the issue appears when unbound is trying to probe the master servers
for the auth_zone section. The logs show unbound doing lookups on all the
auth_zone domain names in my config, and I think unbound is answering them
from its own cache. After the lookups, I get the following in the logs:

Can you show the work that it does for looking up one of the root servers?
Not getting an address must be causing it to not have content. But you say
it does work eventually: once the long list appears, the AXFR has completed.
Meanwhile you should have normal service, because the zone is loaded (the
configured zonefile has content, right?). When a normal query arrives, it
should just be answered from the auth-zone.

The bug that was fixed in 1.7.1 (and causes the problem now?) supposedly
fixes behaviour with respect to the configured forward-zones. Is that still
not right somehow? Note that with both a forward-zone for "." and an
RFC 7706 auth-zone for the root, 1.7.1 answers only queries for the root
itself (the domain ".") from the auth-zone, and all other queries from the
forward-zone. In 1.7.0 it would pick the auth-zone referral and go make
authoritative queries (that was a bug, and it was fixed). So if 1.7.1 does
not work, perhaps authoritative queries work but the forward-zone does not
work so well, and if you remove the forward-zone, unbound starts making
authoritative queries again.
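In config terms, the combination being described is roughly the following (a
minimal hypothetical fragment; the forward-addr is just an example taken from
the config earlier in the thread):

```
forward-zone:
   name: "."
   forward-addr: 1.1.1.1@853#cloudflare-dns.com

auth-zone:
   name: "."
   for-upstream: yes
   for-downstream: no
   fallback-enabled: yes
   zonefile: root.zone
```

With this combination, 1.7.1 answers only queries at the root apex (".")
from the auth-zone and sends everything else to the forward-addr, whereas
1.7.0 would follow the auth-zone referral and make authoritative queries.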

That the root zone is downloaded every half hour is normal, that is
exactly the AXFR of the root zone that the auth-zone is supposed to do.
So that seems to be working fine and is keeping the root zone up to date.

Best regards, Wouter

I hit this same issue with 1.7.1rc1 with just the root zone and without any
forward-zone. In my small installation unbound stopped answering queries
after a day or two. The fix was removing the RFC 7706 auth-zone. This
worked fine on 1.7.0.

My original reply was rejected by the server (I think) because I included too
much log data. Apologies if this gets posted twice.

Can you show the work that it does for looking up one of the root servers?
Not getting an address must be causing it to not have content. But you say
it does work eventually: once the long list appears, the AXFR has completed.
Meanwhile you should have normal service, because the zone is loaded (the
configured zonefile has content, right?). When a normal query arrives, it
should just be answered from the auth-zone.

Detailed log of lookups included below. I would like to add that the issue
may be associated specifically with the lookup of the auth-zone domain
names. At work, I use unbound as a local caching server on my laptop with a
configuration similar to my home router's, but there I used IP addresses
instead of hostnames, and I have not had any problems with it (this might be
a useful workaround for anyone who still wants to use the auth-zone
functionality):

auth-zone:
   name: "."
   master: 193.0.14.129 # k.root-servers.net
   master: 192.112.36.4 # g.root-servers.net
   master: 2001:500:2f::f # f.root-servers.net
   master: 2001:500:84::b # b.root-servers.net
   master: 192.0.47.132 # xfr.cjr.dns.icann.org
   master: 2001:7fd::1 # k.root-servers.net
   master: 192.0.32.132 # xfr.lax.dns.icann.org
   master: 192.33.4.12 # c.root-servers.net
   master: 2620:0:2830:202::132 # xfr.cjr.dns.icann.org
   master: 2620:0:2d0:202::132 # xfr.lax.dns.icann.org
   master: 192.228.79.201 # b.root-servers.net
   master: 192.5.5.241 # f.root-servers.net
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: root.zone

Looking at the logs again, I noticed there was one incident of a dropped
query ~15 min before all queries start to fail. However, that query ends up
completing, I think, and at 14:43:40 a different query goes through without
issue. It's at 15:00 that queries begin to fail. I don't know whether this
error before the root-server lookup is part of the cause or a red herring.

May 18 14:43:35 homebrew unbound[30620]: [30620:0] info: 0RDd mod1 rep www.google.com. AAAA IN
May 18 14:43:35 homebrew unbound[30620]: [30620:0] info: 1RDd mod1 rep connectivitycheck.gstatic.com. AAAA IN
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: cache memory msg=382505 rrset=462002 infra=7600 val=135128
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: answer from the cache failed
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: udp request from ip4 192.168.0.154 port 64667 (len 16)
May 18 14:43:35 homebrew unbound[30620]: [30620:0] warning: No jostle attempt. m=(nil) # <= This is logging I added to find out which if condition is failing
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: Too many queries. dropping incoming query.
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: cache memory msg=382505 rrset=462002 infra=7600 val=135128
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: comm point listen_for_rw 15 0
May 18 14:43:35 homebrew unbound[30620]: [30620:0] debug: comm point listen_for_rw 16 0

Log of the root-server lookups below. Note that this occurs when I am not
home, with a single tablet as the only active device, so it's not that
there's a large amount of traffic or anything like that.

May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: auth zone .: master lookup for task_probe b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: mesh_run: start
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: validator[module 0] operate: extstate:module_state_initial event:module_event_new
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: validator operate: query b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: validator: pass to next module
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: mesh_run: validator module exit state is module_wait_module
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: iterator[module 1] operate: extstate:module_state_initial event:module_event_pass
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: process_request: new external request event
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: iter_handle processing q with state INIT REQUEST STATE
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: resolving b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: request has dependency depth of 0
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: msg from cache lookup ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
                                          ;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
                                          ;; QUESTION SECTION:
                                          b.root-servers.net. IN A

                                          ;; ANSWER SECTION:
                                          b.root-servers.net. 1841075 IN A 199.9.14.201

                                          ;; AUTHORITY SECTION:

                                          ;; ADDITIONAL SECTION:
                                          ;; MSG SIZE rcvd: 52
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: msg ttl is 41390, prefetch ttl 32750
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: returning answer from cache.
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: finishing processing for b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: mesh_run: iterator module exit state is module_finished
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: validator[module 0] operate: extstate:module_wait_module event:module_event_moddone
May 18 14:59:11 homebrew unbound[30620]: [30620:0] info: validator operate: query b.root-servers.net. A IN
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: validator: nextmodule returned
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: response has already been validated: sec_status_insecure
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: mesh_run: validator module exit state is module_finished
May 18 14:59:11 homebrew unbound[30620]: [30620:0] debug: auth host b.root-servers.net lookup 199.9.14.201

Hi Dmitri,

There is a fix slated for the next release, included as a patch below. I
think this will solve those list-too-full errors where nothing gets jostled.
It decrements the num_reply_states counter, so incoming queries won't get
dropped any more because that counter grew too big.

Best regards, Wouter

diff -Naur tags/release-1.7.1/services/mesh.c trunk/services/mesh.c
--- tags/release-1.7.1/services/mesh.c 2018-05-07 10:17:33.605761180 +0200
+++ trunk/services/mesh.c 2018-05-07 13:28:16.907313134 +0200
@@ -1173,6 +1173,10 @@
         while((c = mstate->cb_list) != NULL) {
                 /* take this cb off the list; so that the list can be
                  * changed, eg. by adds from the callback routine */
+                if(!mstate->reply_list && mstate->cb_list && !c->next) {
+                        /* was a reply state, not anymore */
+                        mstate->s.env->mesh->num_reply_states--;
+                }
                 mstate->cb_list = c->next;
                 if(!mstate->reply_list && !mstate->cb_list &&
                         mstate->super_set.count == 0)

I deployed that version on the host I mentioned earlier[1].
I'll watch it over the next few days and hope nothing happens...

Andreas

[1] https://unbound.nlnetlabs.nl/pipermail/unbound-users/2018-May/005227.html

There is a fix slated for the next release, included as a patch below. I
think this will solve those list-too-full errors where nothing gets jostled.
It decrements the num_reply_states counter, so incoming queries won't get
dropped any more because that counter grew too big.

Thank you for the patch. I've applied it to my router, will monitor it
for a few days to see how it works out.

-Dmitri