Unbound stops resolving some hosts sporadically

Hi,
I'm running unbound 1.4.6-1 from Debian squeeze on a couple of machines as a resolver and have come across a strange problem where unbound stops resolving some hosts. It seems sporadic as to what works and what doesn't, but nagios generally reports that it timed out after 10s when trying to resolve www.google.com. I sometimes notice it before nagios does when using that resolver myself. I've tried sniffing traffic, but it's hard to pinpoint the queries as the servers are quite busy.
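For reference, a quick way to reproduce it by hand (the resolver address below is just a placeholder) is something along the lines of:

# dig @192.0.2.53 www.google.com +time=10 +tries=1

which either answers straight away or runs into the same 10s timeout that nagios sees.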

What I've noticed, though, is that sometimes I won't be able to resolve, for example, yahoo.com, but after a minute or two it works again, only to fail again a few minutes later. I don't believe there is a network problem, as it fails for some domains while working for others that sit on the same name servers.

If I restart unbound the problem clears up, but this is obviously less than ideal. I'm tending to do this once a day now, sometimes more frequently. I've moved one of the resolvers over to other software for now, to avoid the situation where both are broken at the same time, for obvious reasons.

The server config looks like this on the one that is still in use (minus the interface, outgoing-interface and access-control lines):

server:
     verbosity: 1
     statistics-interval: 86400
     num-threads: 2
     outgoing-range: 256
     msg-cache-size: 128m
     num-queries-per-thread: 1024
     rrset-cache-size: 256m
     do-ip6: no
     chroot: ""
     root-hints: /etc/unbound/named.cache
     hide-identity: yes
     hide-version: yes

On the other one I tweaked the socket receive buffers slightly, in case it was using up all its sockets, but it didn't make any difference:

server:
     verbosity: 2
     statistics-interval: 86400
     num-threads: 2
     outgoing-range: 462
     so-rcvbuf: 4m
     msg-cache-size: 128m
     num-queries-per-thread: 1024
     rrset-cache-size: 256m
     do-ip6: no
     chroot: ""
     logfile: "/var/log/unbound.log"
     root-hints: /etc/unbound/named.cache
     hide-identity: yes
     hide-version: yes

# cat /proc/sys/net/core/rmem_max
4194304
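(That kernel limit generally needs to be at least as large as so-rcvbuf for the setting to take effect; if it isn't, the unbound man page suggests raising it with something like:

# sysctl -w net.core.rmem_max=4194304

which matches the 4m value used in the config above.)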

The logs didn't really show much, and they produce too much data to trawl through easily.

This morning I changed to running one thread, and when I hit the problem just now I dumped the stats, which were as follows:

# unbound-control status
version: 1.4.6
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 3095 seconds
unbound (pid 24730) is running...

# unbound-control stats_noreset
thread0.num.queries=87145
thread0.num.cachehits=38156
thread0.num.cachemiss=48989
thread0.num.prefetch=0
thread0.num.recursivereplies=41036
thread0.requestlist.avg=327.143
thread0.requestlist.max=1091
thread0.requestlist.overwritten=3188
thread0.requestlist.exceeded=18
thread0.requestlist.current.all=1091
thread0.requestlist.current.user=1024
thread0.recursion.time.avg=27.134193
thread0.recursion.time.median=0.00895803
total.num.queries=87145
total.num.cachehits=38156
total.num.cachemiss=48989
total.num.prefetch=0
total.num.recursivereplies=41036
total.requestlist.avg=327.143
total.requestlist.max=1091
total.requestlist.overwritten=3188
total.requestlist.exceeded=18
total.requestlist.current.all=1091
total.requestlist.current.user=1024
total.recursion.time.avg=27.134193
total.recursion.time.median=0.00895803
time.now=1301394872.957300
time.up=3099.708199
time.elapsed=3099.708199

I've read the changelogs of newer versions but can't see anything that looks like this problem. I'd prefer to avoid upgrading to the latest source distribution on the off-chance it will fix it, as that just seems like clutching at straws.

Any ideas?

Cheers,

john

Hi John,

Your outgoing-range is 256 and your num-queries-per-thread is 1024. Below in
your stats you show that you have enough traffic to fill the 1024-entry list
completely. Having 1024 queries compete for only 256 sockets causes this
problem: they have to wait until a socket becomes free and available before
they can do any work.

Recompile with libevent (if you have not already).
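For example, building from the source tarball with the libevent development headers installed (a rough sketch, your paths and packaging steps may differ):

./configure --with-libevent
make && make install

Without libevent, unbound falls back to select() and can only handle around a thousand sockets, so the large outgoing-range below only works with a libevent build.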

Then set outgoing-range to 1.5x the num-queries-per-thread (or 2x), for example:

outgoing-range: 8192
num-queries-per-thread: 4096

I also increased num-queries-per-thread, because the stats below show that
you need more.

Then there is enough capacity and queries do not have to wait for a socket.
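After the change you can verify that it helped with, for example:

# unbound-control stats_noreset | grep requestlist

and check that requestlist.overwritten and requestlist.exceeded stay near zero, and that requestlist.max stays well below the new num-queries-per-thread.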

>     outgoing-range: 256
>     msg-cache-size: 128m
>     num-queries-per-thread: 1024
>     rrset-cache-size: 256m

> On the other one I tweaked the socket receive buffers slightly, in case it
> was using up all its sockets, but it didn't make any difference:
>
> server:
>     verbosity: 2
>     statistics-interval: 86400
>     num-threads: 2
>     outgoing-range: 462
>     so-rcvbuf: 4m
>     msg-cache-size: 128m
>     num-queries-per-thread: 1024
>     rrset-cache-size: 256m
>     do-ip6: no
>     chroot: ""
>     logfile: "/var/log/unbound.log"
>     root-hints: /etc/unbound/named.cache
>     hide-identity: yes
>     hide-version: yes
>
> # cat /proc/sys/net/core/rmem_max
> 4194304
>
> The logs didn't really show much, and they produce too much data to trawl
> through easily.
>
> This morning I changed to running one thread, and when I hit the problem
> just now I dumped the stats, which were as follows:
>
> # unbound-control status
> version: 1.4.6
> verbosity: 1
> threads: 1
> modules: 2 [ validator iterator ]
> uptime: 3095 seconds
> unbound (pid 24730) is running...
>
> # unbound-control stats_noreset
> thread0.num.queries=87145
> thread0.num.cachehits=38156
> thread0.num.cachemiss=48989
> thread0.num.prefetch=0
> thread0.num.recursivereplies=41036
> thread0.requestlist.avg=327.143
> thread0.requestlist.max=1091
The max hits your num-queries-per-thread (and exceeds it, because of
internally generated queries, usually for nameservers). Time to increase
num-queries-per-thread.

> thread0.requestlist.overwritten=3188

And this is what happened as a result: queries are getting dropped. Unbound
tries to give priority to your www.google.com query, but you have also
configured so few sockets that those queries then had to wait as well.

> thread0.requestlist.exceeded=18

Life was bad, but not that bad. While it dropped 3188 queries on the floor,
those were 'old' queries that had already been waiting for a while. Here, 18
were dropped because there was no way at all to make space (even after
considering dropping other queries). This number would go up if you were
under a very high query rate, thousands per millisecond or so.

> thread0.requestlist.current.all=1091
> thread0.requestlist.current.user=1024

Here you see the requestlist is completely full: 1024 of 1024 user queries,
and 1091 entries including the internal ones. It would be good to have 1091
sockets in this situation (outgoing-range: 1091), but since we need to
increase the requestlist size anyway, that number needs to increase as well.

> thread0.recursion.time.avg=27.134193

Yes, you had to wait 27 seconds on average.

> thread0.recursion.time.median=0.00895803

But the median is very low, meaning the bulk was answered very fast anyway
(once a socket was available), while some queries took very long, which
causes the high average.

> total.num.queries=87145
> total.num.cachehits=38156
> total.num.cachemiss=48989

The cache hits are only 38156 out of 87145, about 44%, which is pretty low.
Maybe configure more memory for the cache. It could also be because you had
only just started and the cache has not warmed up yet.
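For example (only a suggestion, size it to the RAM you can spare; the rrset cache is commonly kept at roughly twice the msg cache):

     msg-cache-size: 256m
     rrset-cache-size: 512m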

> total.num.prefetch=0

No prefetch enabled.
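If you want it, it is a single option (it should be available in your 1.4.6, if I remember correctly), and it makes unbound refresh popular cache entries shortly before they expire:

     prefetch: yes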

> total.num.recursivereplies=41036
> total.requestlist.avg=327.143
> total.requestlist.max=1091
> total.requestlist.overwritten=3188
> total.requestlist.exceeded=18
> total.requestlist.current.all=1091
> total.requestlist.current.user=1024
> total.recursion.time.avg=27.134193
> total.recursion.time.median=0.00895803
> time.now=1301394872.957300
> time.up=3099.708199
> time.elapsed=3099.708199

> I've read the changelogs of newer versions but can't see anything that
> looks like this problem. I'd prefer to avoid upgrading to the latest
> source distribution on the off-chance it will fix it, as that just seems
> like clutching at straws.
>
> Any ideas?

Yes: you need more capacity, and at the moment the queries are waiting for resources.
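Putting it together, the relevant part of the server clause could look roughly like this (a sketch only; the exact numbers are a judgment call for your traffic, and the larger outgoing-range needs the libevent build):

server:
     outgoing-range: 8192
     num-queries-per-thread: 4096
     msg-cache-size: 256m
     rrset-cache-size: 512m
     prefetch: yes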

Best regards,
   Wouter