Unbound periodically stops responding

Hi,

I'm using unbound 1.4.6 on 40ish machines, and for the most part it works fantastically. On one my machines however it seems to randomly stop responding. I'm not convinced the fault here lies with unbound, but as far as I can tell it's the only thing that seems to fail (which triggers everything else since they can no longer resolve anything).

When this issue happens, I can't communicate with unbound via unbound-control and it will never resolve anything. I can cleanly shut it down and start a new instance and it will behave exactly the same. The only solution I've found is to restart the VPS. I have another VPS from the same provider which is setup almost identically and it has never had this issue.

I've got debug logging on and the log file has dozens of requests per minute up until the issue occurred, then nothing until it received SIGTERM and shut down. The next time this happens (could be hours/days/weeks) is there anything else I can check that would explain why it does not appear to be even receiving the lookup requests?

Thanks,
--Will

Hi Will,

Hi,

I'm using unbound 1.4.6 on 40ish machines, and for the most part it
works fantastically. On one my machines however it seems to randomly
stop responding. I'm not convinced the fault here lies with unbound, but
as far as I can tell it's the only thing that seems to fail (which
triggers everything else since they can no longer resolve anything).

That sounds bad.

When this issue happens, I can't communicate with unbound via
unbound-control and it will never resolve anything. I can cleanly shut
it down and start a new instance and it will behave exactly the same.
The only solution I've found is to restart the VPS. I have another VPS
from the same provider which is setup almost identically and it has
never had this issue.

So, it is somehow unique to that machine. Can you see in 'top' what
unbound is doing? (is it using cpu, 100% in a busy loop?, it is not
responding to unbound-control, so it must be completely hosed somehow)

netstat -su may be interesting (packet counters for UDP).

Another thing you can do is use 'gcore' to make a coredump of the
'failed' unbound process. (and then kill it and start a new unbound for
your production). Then you can use 'gdb' and your compiled unbound
executable to read the core image and produce a stack backtrace what it
is doing.

I've got debug logging on and the log file has dozens of requests per
minute up until the issue occurred, then nothing until it received
SIGTERM and shut down. The next time this happens (could be
hours/days/weeks) is there anything else I can check that would explain
why it does not appear to be even receiving the lookup requests?

Well it should respond to the unbound-control utility. If it does not
this means it is somehow no longer processing the main loop, or that
network traffic does not reach it.

Best regards,
   Wouter

So, it is somehow unique to that machine. Can you see in 'top' what
unbound is doing? (is it using cpu, 100% in a busy loop?, it is not
responding to unbound-control, so it must be completely hosed somehow)

...the sysadmin in me usually reaches for 'strace' at this point for
clues.

Another thing you can do is use 'gcore' to make a coredump of the
'failed' unbound process. (and then kill it and start a new unbound for
your production). Then you can use 'gdb' and your compiled unbound
executable to read the core image and produce a stack backtrace what it
is doing.

...and when strace fails me, gdb is "bring out the cannon".

Cheers

+1

Depending on your OS, dtrace/SystemTap can be useful too.

When this issue happens, I can’t communicate with unbound via
unbound-control and it will never resolve anything. I can cleanly shut
it down and start a new instance and it will behave exactly the same.
The only solution I’ve found is to restart the VPS. I have another VPS
from the same provider which is setup almost identically and it has
never had this issue.

So, it is somehow unique to that machine. Can you see in ‘top’ what
unbound is doing? (is it using cpu, 100% in a busy loop?, it is not
responding to unbound-control, so it must be completely hosed somehow)

Sorry I meant to include that in my original email. It does not appear to be in a busy loop; top shows 0% CPU usage for unbound.

netstat -su may be interesting (packet counters for UDP).

Okay, I’ll remember to take a look, see if the packets are sitting unread.

Another thing you can do is use ‘gcore’ to make a coredump of the
‘failed’ unbound process. (and then kill it and start a new unbound for
your production). Then you can use ‘gdb’ and your compiled unbound
executable to read the core image and produce a stack backtrace what it
is doing.

I’m not familiar with “gcore” can I just configure ulimit to allow core dumps then send the ABRT signal? I’ll make sure I install the debug libraries so I get something useful there. The weird thing is restarting unbound won’t fix it. I really have to restart the machine (so it’s likely something else is really broken).

Well it should respond to the unbound-control utility. If it does not
this means it is somehow no longer processing the main loop, or that
network traffic does not reach it.

Interesting, all the requests should be done over localhost. My resolv.conf only contains the line “nameserver 127.0.0.1” and doing “dig @localhost foo.com” also fails. I can check the routing table and do the obvious pings and see if those at least work.

I did run strace last time this happened, but I wasn’t really sure what to look for; I was really just checking that it was doing something and not just hanging. Next time I’ll capture the output and try and take a better look. If it matters, this is on an amd64 Debian GNU/Linux Squeeze (6.0) system.

Thanks for the tips,
–Will

To add some resolution to this issue, this is clearly not unbound's fault. When this situation is triggered I cannot locally ping any of the IPv4 addresses on the machine, so clearly the communication to unbound as a DNS lookup or via unbound-control are going to fail. I'm at a loss as to explain why this happens :slight_smile:

Regards,
--Will

I'm using 'Unbound' v1.4.18 on Windows XP SP3 4GB RAM 32bit Dual Core
AMD CPU. Unbound is configured with "validator iterator" mode.
"target-fetch-policy" is currently "2 1 0 0 0 0". DLV option is enabled.
It stops responding periodically in my side as well :frowning:
I installed windows process monitoring tools like, Process Hacker,
Process Explorer, etc and also have firewall able to show, warn, block
any active network connections. Nothing is blocked for unbound in
firewall, only set to show messages/info on what unbound is doing.
Firewall is also set to show message/info what app is trying to
communicate (send DNS query) with local resolver (the unbound).
When user like me tries to do a ping or do a nslookup or do a DiG on an
internet host, or when a web-browser or any other internet service
client app tries to send DNS query via unbound (working on 127.0.0.1 udp
port 53), then at first attempt, unbound internally does its query very
slowly (or sometime does not work), then query sender app shows server
could not be reached or servfail, etc error/result. 'Unbound' starts to
use around 98% or more cpu resources at that point. So other apps, mouse
becomes non or less responsive. After about 1 min or 2 mins, cpu usage
goes down to normal level. And then, if 2nd attempt is done on the same
internet site or host, then 'unbound' usually sends the answer back very
quickly and can reach sites.
If a different fetch policy is used then how will it affect? We need a
better fetch policy. Even when i specified it to use 1 Thread, it
sometime uses even 3 or 4 threads. If "iterator validator" is used, then
will it work better ? then what fetch policy will be better ?
-- Bry8Star.

I will try to help myself & others.

The "iterator validator" option will not work/validate.

Below config file gave me better result (on Windows XP), you may try
this out and suit to your need:

- - - - - - - - - - - - - - - -
# BEGIN of service.conf / unbound.conf file
# Last Modified 2012-08-31 01:30
# Copyright (C) 2012 Bry8Star. (bry8 star a.t ya hoo d.o.t c om)
server:
verbosity: 1 # logs errors & operational info
#verbosity: 0 # logs errors
statistics-interval: 0
statistics-cumulative: "no"
extended-statistics: "no"
num-threads: 1
interface: 127.0.0.1
interface: 192.168.0.10 # My Network Adapter's IP adrs
interface: ::1
interface-automatic: "no"
port: 53
outgoing-interface: 192.168.0.10
outgoing-range: 950
outgoing-port-permit: 52000-56096
outgoing-port-avoid:
"22,25,26,37,53,54,55,67,68,69,80,110,123,135,137,138,139,143,443,445,465,500,587,843,990,912,993,995,1025,1863,1935,2082,2083,2096,2400,4242,4400,4421,4444,4445,4480,4500,4569,5038,5050,5060,5061,5062,5063,5064,5065,5198,5199,5200,5222,5555,5800,5801,5900,5901,6666,6667,6668,6669,7000,7001,7002,7003,7004,7005,7006,7658,7659,7660,7777,8050,8052,8054,8056,8058,8060,8080,8110,8118,8120,8123,8125,8143,8210,8225,8243,8998,9001,9022,9030,9050,9051,9052,9053,9054,9055,9056,9057,9058,9059,9060,9080,10000,15000,15001,15002,15003,15004,16001,16999,20000,20001,25000,26999,30600,31000,32000,36999,50300"
outgoing-num-tcp: 25
incoming-num-tcp: 25
so-rcvbuf: 8m
so-sndbuf: 8m
edns-buffer-size: 4096
msg-buffer-size: 65552
msg-cache-size: 48m
msg-cache-slabs: 1
num-queries-per-thread: 475
jostle-timeout: 200
rrset-cache-size: 96m
rrset-cache-slabs: 1
cache-min-ttl: 0
cache-max-ttl: 21600 # 6 hours
infra-host-ttl: 900
infra-cache-slabs: 1
infra-cache-numhosts: 10000
do-ip4: "yes"
do-ip6: "no" # for now
do-udp: "yes"
do-tcp: "yes"
tcp-upstream: "no"
do-daemonize: "yes"
access-control: 0.0.0.0/0 refuse
access-control: ::0/0 refuse
access-control: 127.0.0.0/8 allow
access-control: 192.168.0.10/24 allow
access-control: ::1 allow
logfile: "C:\Program Files\Unbound\unbound.log"
use-syslog: "no"
log-time-ascii: "yes"
log-queries: "no"
root-hints: "C:\Program Files\Unbound\named.cache"
hide-identity: "yes"
hide-version: "yes"
identity: "DNS"
version: "1.0.0"
target-fetch-policy: "0 0 0 0 0 0"
harden-short-bufsize: "no"
harden-large-queries: "no"
harden-glue: "yes"
harden-dnssec-stripped: "yes"
harden-below-nxdomain: "no"
harden-referral-path: "no"
use-caps-for-id: "no"
unwanted-reply-threshold: 8000
prefetch: "yes"
prefetch-key: "yes"
rrset-roundrobin: "yes"
minimal-responses: "no"
module-config: "validator iterator"
dlv-anchor-file: "C:\Program Files\Unbound\dlv.isc.org.key"
# Downloaded from http://ftp.isc.org/www/dlv/dlv.isc.org.key
# DLV, DNS Lookaside Validation, for the root
auto-trust-anchor-file: "C:\Program Files\Unbound\root.key"
#domain-insecure: "TLD" # TLDs from various TLD providers
val-bogus-ttl: 60
val-sig-skew-max: 86400
val-clean-additional: "yes"
val-permissive-mode: "no"
ignore-cd-flag: "yes"
val-log-level: 1 # log validation failed queries
#val-nsec3-keysize-iterations: "1024 150 2048 500 4096 2500"
key-cache-size: 48m
key-cache-slabs: 1
neg-cache-size: 36m
# Blocking below TLDs, can also be used to block sites
local-zone: "onion." refuse # disallow to go via public route
local-zone: "i2p." refuse # suppose to go via proxy route
remote-control:
control-enable: "no"
# stub-zones SZ, for TLDs from other TLD providers (root opr)
# Forward zones FZ, if used hostname/namesrvr in stub-zones
# Default Forward Root Zone:
#forward-zone:
#name: "."
# You may use your ISP dns, for bit faster results.
#forward-addr: i.p.adrs.1 # ISP DNS / Recursive/Caching
#forward-addr: i.p.adrs.2 # ISP DNS / Recursive/Caching
# Or use other root caching or recursive dns servers.
# END of service.conf / unbound.conf file
- - - - - - - - - - - - - - - -

I express thanks to various users from various IRC channels who has
helped with various suggestions.

If you have better performing config file, then please share, thanks in
advance.

And use this below technique to run the 'Unbound DNS Validator' with
"Below Normal" Priority, so it does not affect other processes, it is
temporary fix.
(1) Start Windows Task Manager like this:
ntsd -c qd taskmgr.exe
(2) goto "Processes" tab > select "Show Processes from All Users".
(3) find 'Unbound.exe" in the process list. Right click on it > Set
Priority > select "BelowNormal". Ok.
(4) close Task manager.
There are script/batch file as well to do automatically like above when
windows starts up. Dont know of a registry hack to do that. If any1
knows, then please share.

-- Bry8Star.

By using "Process Explorer" or "Process Hacker" tool,
i can see that,
unbound.exe windows service is using two threads:
(1) advapi32.dll!CryptVerifySignatureW+0x17
(2) unbound.exe
And this 1st thread, uses high amount of CPU resource periodically and
when any app is requesting DNS queries.

What can be done to lower the cpu usage of that thread or improve
performance ?

If i were to change thread priority of that to BelowNormal, will it
affect only Unbound Validator windows service, or, will affect the
entire Windows system ?

-- Bry8Star.

I have done few more tests and what i observed are mentioned below:

By using 'Process Hacker' type of windows process manipulation and
management tool:

Test & result 1 : i have lowered the 'unbound.exe' process priority from
'Normal' to 'BelowNormal', unbound still uses lot of cpu resources when
an app tries to resolve DNS queries. Specially the thread
"advapi32.dll!CryptVerifySignatureW+0x17" uses around 48% or more cpu,
and, dont know what exactly, but something else also uses very high cpu
at that moment, and total cpu resource usage reaches close to 98% or
100% ! then apps, audio-device, mouse etc stops working or stops
responding. :frowning:

Test & result 2 : I have lowered the 'unbound.exe' process priority from
'Normal' to 'BelowNormal', and also changed 'unbound.exe' &
'advapi32.dll!CryptVerifySignatureW+0x17' both thread's priority from
'Normal' into 'Below Normal', ... which did not lower the cpu resource
usage when an app does dns queries. Unbound was causing around 2 or more
seconds of non-responsiveness time, after each 2 seconds or so :frowning:
Unbound process sometime also used another (3rd) thread
'mswsock.dll!WSPStartup+0x102b" other than previously mentioned two
threads. did not change anything on that thread.

Test & result 3 : I have changed the 'unbound.exe' process priority from
'Normal' to 'BelowNormal', then kept priority settings of
"advapi32.dll!CryptVerifySignatureW+0x17" thread at 'Normal' level, and
changed it's CPU affinity from all cores ('CPU 0' & 'CPU 1'), into only
1 core ('CPU 0'). And then changed "unbound.exe" thread's priority into
"Below Normal" and changed it's affinity from all cores ('CPU 0' & 'CPU
1'), into 1 core ('CPU 0') only, ... then it seems/appearing that,
unbound is performing little bit better than before, that is, even when
the "advapi32.dll!CryptVerifySignatureW+0x17" thread (and something
else) uses very high cpu resources, windows system does not become very
very non-responsive. i observed a non-responsive time for almost around
0.5 second after each 2 seconds when an application was doing DNS
queries for new domains/hostnames.

Test & result 4 : I have changed 'unbound.exe' process priority from
'Normal' to 'BelowNormal', then made following changes for the
"advapi32.dll!CryptVerifySignatureW+0x17" & "unbound.exe" both thread :
Changed both thread's priority from 'Normal' into 'Below Normal', and,
changed both thread's CPU affinity from all cores ('CPU 0' & 'CPU 1'),
into 1 core ('CPU 0') only, ... it seems/appears now, unbound is
performing little bit better than before, that is, even when the
"advapi32.dll!CryptVerifySignatureW+0x17" thread (and something else)
uses very high cpu resources, windows system does not become very very
non-responsive like before (when an app was doing DNS queries for new
domains/hostnames). i observed, around 0.25 second of non-responsiveness
time, after each 1 to 1.5 second of time. :slight_smile: so a micro non-responsive
moment was better than longer non-responsive moment.

Solution candidates : if we could use few 'wmic' commands or a tool to
set those threads on such priority and affinity level, along with
setting 'process' priority level, then that would have been great. And
to automate such when Windows starts-up, windows 'Scheduled Tasks' app
can be used. It would be a temporary fix, until the source code is fixed.

Following command allows to change a windows service or app's process
priority level to 'BelowNormal'. (Initially freenode user 'PZt' has
suggested this wmic command to change 'explorer.exe' process priority, i
have changed that and also added in 'Scheduled Tasks', and worked:

C:\Windows\system32\wbem\wmic.exe process where name="unbound.exe" call
setpriority 16384

Now we need command(s) for changing thread priority & affinity.

Can 'Unbound' be re-coded ? to use its own DLL and CryptVerifySignatureW
function calls on windows for DNSSEC validations.

-- Bry8Star.

Hi Bry8,

I have done few more tests and what i observed are mentioned
below:

By using 'Process Hacker' type of windows process manipulation and
management tool:

Can 'Unbound' be re-coded ? to use its own DLL and
CryptVerifySignatureW function calls on windows for DNSSEC
validations.

Unbound only starts one thread on windows (unless you configure
num-threads differently), from unbound.exe. Unbound does not use the
windows-api-CryptVerifySignatureW. Unbound uses OpenSSL for
verification, and this happens inside its thread.

advapi32.dll is some part of the windows API. Unbound makes use of
few OS calls (start the service, simple read/write files, network stuff).

Thus, Unbound is already re-coded, and it uses OpenSSL routines for
verification of DNSSEC signatures. I do not know where this other
thread came from.

Best regards,
   Wouter

-- Bry8Star.

By using "Process Explorer" or "Process Hacker" tool, i can see
that, unbound.exe windows service is using two threads: (1)
advapi32.dll!CryptVerifySignatureW+0x17 (2) unbound.exe And this
1st thread, uses high amount of CPU resource periodically and
when any app is requesting DNS queries.

What can be done to lower the cpu usage of that thread or
improve performance ?

If i were to change thread priority of that to BelowNormal, will
it affect only Unbound Validator windows service, or, will affect
the entire Windows system ?

-- Bry8Star.

I will try to help myself & others.

The "iterator validator" option will not work/validate.

Below config file gave me better result (on Windows XP), you
may try this out and suit to your need:

- - - - - - - - - - - - - - - - # BEGIN of service.conf /
unbound.conf file # Last Modified 2012-08-31 01:30 # Copyright
(C) 2012 Bry8Star. (bry8 star a.t ya hoo d.o.t c om) server:
verbosity: 1 # logs errors & operational info #verbosity: 0 #
logs errors statistics-interval: 0 statistics-cumulative: "no"
extended-statistics: "no" num-threads: 1 interface: 127.0.0.1
interface: 192.168.0.10 # My Network Adapter's IP adrs
interface: ::1 interface-automatic: "no" port: 53
outgoing-interface: 192.168.0.10 outgoing-range: 950
outgoing-port-permit: 52000-56096 outgoing-port-avoid:
"22,25,26,37,53,54,55,67,68,69,80,110,123,135,137,138,139,143,443,445,465,500,587,843,990,912,993,995,1025,1863,1935,2082,2083,2096,2400,4242,4400,4421,4444,4445,4480,4500,4569,5038,5050,5060,5061,5062,5063,5064,5065,5198,5199,5200,5222,5555,5800,5801,5900,5901,6666,6667,6668,6669,7000,7001,7002,7003,7004,7005,7006,7658,7659,7660,7777,8050,8052,8054,8056,8058,8060,8080,8110,8118,8120,8123,8125,8143,8210,8225,8243,8998,9001,9022,9030,9050,9051,9052,9053,9054,9055,9056,9057,9058,9059,9060,9080,10000,15000,15001,15002,15003,15004,16001,16999,20000,20001,25000,26999,30600,31000,32000,36999,50300"

outgoing-num-tcp: 25

Hi Wouter,
Few emails back i posted my unbound.conf/service.conf file that i use on
Windows XP.
on Windows XP i'm having these responsiveness issues.
on WinXP, unbound uses those mentioned DLLs/APIs.
My observations, tests & results are from Windows XP.
on Windows 7, i did not notice those API callings.
on Win7 unbound worked so far fine.

Solution (partial) :
in WinXP, any of these next tools can be used to change Processor
Affinity of running unbound.exe service:
ImageCFG.exe , ProcAff.exe
i have used/applied these, and worked.
i have assigned only 1 cpu core for unbound.exe windows service, and
using windows task Scheduler to do it automatically.
Now, i'm observing, cpu usage still jumping up (used by unbound.exe,
when an app tries to resolve dns), but at least it is not as much as
before. Mouse movement & audio, etc are responsive & working. :slight_smile:
-- Bry8Star.