Issues with Unbound 1.4.18

Hi,

I have 6 DNS caching Resolvers serving many thousands queries. Until recently all 6 were running Bind. As i've heard such great things about Unbound I decided to switch 3 of them to Unbound. So i've installed the latest Unbound 1.4.18 with Libevent and at first things seemed good but i've noticed at least 4 times a day Unbound is crashing with:

  kernel: pid 82301 (unbound), uid 59: exited on signal 11

I have a script that automatically restarts programs that should be running, so I didn't notice at first. I've tried tweaking loads of config options but nothing seems to make a difference. The servers aren't exactly that High spec:

FreeBSD 8.3
Dual Core CPU
4G Ram

However with Bind running the server is stable and if anything Bind uses a lot more CPU than Unbound for the same task. Can someone give me some idea's on what I can do to find out whats causing it to crash.

Regards

Phil

Hi Phil,

Hi,

I have 6 DNS caching Resolvers serving many thousands queries.
Until recently all 6 were running Bind. As i've heard such great
things about Unbound I decided to switch 3 of them to Unbound. So
i've installed the latest Unbound 1.4.18 with Libevent and at first
things seemed good but i've noticed at least 4 times a day Unbound
is crashing with:

kernel: pid 82301 (unbound), uid 59: exited on signal 11

I have a script that automatically restarts programs that should
be running, so I didn't notice at first. I've tried tweaking loads
of config options but nothing seems to make a difference. The
servers aren't exactly that High spec:

FreeBSD 8.3 Dual Core CPU 4G Ram

However with Bind running the server is stable and if anything Bind
uses a lot more CPU than Unbound for the same task. Can someone
give me some idea's on what I can do to find out whats causing it
to crash.

Get it to store coredumps. Compile with debuginfo (-g option to gcc,
enabled by default I believe). Store coredump with a ulimit statement
before you start the daemon to allow coredumps; I think it ends up in
the working directory. Then use gdb executable coredump; bt for stack
backtrace. (coredump without executable is not workable for me, I
would rather have the stack backtrace than the coredump itself).

A quick question, some people have ulimit set up to limit memory size
(heap size), and then configure unbound to use, e.g., XGb memory.
Does this seem the case? (this seems more common on BSDs) (if its out
of memory unbound should still not segv, but print errors).

If the above fails, tcpdump to capture the packets at the time of crash.

If you compile with assertions enabled (debug-mode), then you might
get an assertion failure (and what line number it crashed on), and not
sigsegv. Assertions are fairly cheap in performance terms.

The worst option for performance is to enable verbosity (level 4 or
so), it would likely slow down your system so that it could not handle
your incoming traffic any longer. Produce Gbs of logfile. More
useful if you can trigger the error yourself (thus keep logfile short).

Best regards,
   Wouter

Phil,

multiple unbound processes running here for more then 4 months
but is't 1.4.16 ...

# unbound-control status
version: 1.4.16
verbosity: 1
threads: 4
modules: 2 [ validator iterator ]
uptime: 10955807 seconds
unbound (pid 12419) is running...

# unbound-control stats_noreset | grep QUERY
num.query.opcode.QUERY=124185112

-> ~ 10 per second, that's not really 'load'

Hi Wouter,

Thanks for the advice , I've recompiled unbound with debug enabled, So I'll wait until the next crash and see if I get any extra information. In terms of load these servers are doing about 1400 queries per second and when they are running handle this with about 10-15% CPU. As for ulimit I haven't changed any defaults.

Regards

Phil

Hi Phil,

Hi Wouter,

Thanks for the advice , I've recompiled unbound with debug enabled,
So I'll wait until the next crash and see if I get any extra
information. In terms of load these servers are doing about 1400
queries per second and when they are running handle this with about
10-15% CPU. As for ulimit I haven't changed any defaults.

The default could be small for the data seg size (one Freebsd server
tells me 512M, another 1G). That could be easily too small for
unbound if you unbound.conf a large cache. Configuring unbound to use
500 M according to config would use about 1000M in practice (due to
malloc). Monitor your process size, or check the config file (default
unbound easily fits in the 512Mb default, but then would be a bit
small for the amount of queries per second you get). I would guess
that you want to use a big chunk of memory of your machine to service
that queryload, multiple GBs, and this does not fit in the default ulimit.

Also, I am interested if unbound prints errors to syslog? Or the
logfile if you enabled one?

Best regards,
   Wouter

Hi.

I think you have 512M default data seg size. You can check it “ulimit -d”.
If is it true, try tune “kern.maxdsiz”.

Hi,

Here's my ulimit output:

  ulimit -a
socket buffer size (bytes, -b) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 33554432
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 24716
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 524288
cpu time (seconds, -t) unlimited
max user processes (-u) 5547
virtual memory (kbytes, -v) unlimited
swap size (kbytes, -w) unlimited

Its set to 32G which seems to be the default on FreeBSD 8.3 amd64 as I have checked this value on a number of severs running at least 8.2 amd64 and they are all the same.

Phil

Hi Wouter,

I managed to get unbound to generate a core dump. The back trace suggests a possible issue with DNSSEC and OpenSSL related to when verifying rsa keys any way here it is:

# gdb /usr/local/sbin/unbound /usr/local/etc/unbound/unbound.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `unbound'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/libssl.so.8...done.
Loaded symbols for /usr/local/lib/libssl.so.8
Reading symbols from /usr/local/lib/libldns.so.1...done.
Loaded symbols for /usr/local/lib/libldns.so.1
Reading symbols from /lib/libutil.so.8...done.
Loaded symbols for /lib/libutil.so.8
Reading symbols from /usr/local/lib/libevent-1.4.so.4...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.4
Reading symbols from /usr/local/lib/libcrypto.so.8...done.
Loaded symbols for /usr/local/lib/libcrypto.so.8
Reading symbols from /lib/libthr.so.3...done.
Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /usr/lib/libz.so...done.
Loaded symbols for /usr/lib/libz.so
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x00000000005b10c0 in rbtree_null_node ()
[New Thread 8013bc1c0 (LWP 100105/unbound)]
[New Thread 8012041c0 (LWP 100174/unbound)]
(gdb) bt
#0 0x00000000005b10c0 in rbtree_null_node ()
#1 0x0000000800ca001a in int_rsa_verify (dtype=672, m=0x7fffffffdab0
"?�\203�)�!\232��\202�\206zK\224�5\204\026�D\216m\233\r��\213��0@��\002\b",
m_len=32, rm=0x0, prm_len=0x0, sigbuf=0x82967e77d "\\N���", siglen=128,
rsa=0x82be59940) at rsa_sign.c:199
#2 0x0000000800ca032b in RSA_verify (dtype=Variable "dtype" is not
available.
) at rsa_sign.c:317
#3 0x0000000800ca48d3 in pkey_rsa_verify (ctx=Variable "ctx" is not
available.
) at rsa_pmeth.c:371
#4 0x0000000800ccd2c2 in EVP_VerifyFinal (ctx=0x7fffffffdbb0,
sigbuf=0x82967e77d "\\N���", siglen=Variable "siglen" is not available.
) at p_verify.c:91
#5 0x0000000000477e10 in verify_canonrrset (buf=0x80266cac0, algo=8,
sigblock=0x82967e77d "\\N���", sigblock_len=128, key=0x82976b562
"\001\003��m�֮�B\034\t`W�$�ȽY�|\215\204\203l\027�D}\215�\177���2\237�+\036�hc�>�p_\211U\031\205�\177�y\227\211K\207�g\234�o\214<AYT\034\016��p|�\001��AvOpI;S$�\004�i\211ZA��\211ͳ�l\177W\030X\216\\�#�xP�lo[F�F\"��\006���G�\001\027",
keylen=130, reason=0x7fffffffe1c0) at validator/val_secalgo.c:534
#6 0x000000000047b05d in dnskey_verify_rrset_sig (region=0x802737000,
buf=0x80266cac0, ve=0x801243150, now=1346696668, rrset=0x82967e3d0,
dnskey=0x7fffffffdfa0, dnskey_idx=1, sig_idx=0, sortree=0x7fffffffdec8,
buf_canon=0x7fffffffdd38, reason=0x7fffffffe1c0) at
validator/val_sigcrypt.c:1349
#7 0x0000000000479639 in dnskeyset_verify_rrset_sig (env=0x8013fd3e8,
ve=0x801243150, now=1346696668, rrset=0x82967e3d0,
dnskey=0x7fffffffdfa0, sig_idx=0, sortree=0x7fffffffdec8,
reason=0x7fffffffe1c0) at validator/val_sigcrypt.c:607
#8 0x0000000000479127 in dnskeyset_verify_rrset (env=0x8013fd3e8,
ve=0x801243150, rrset=0x82967e3d0, dnskey=0x7fffffffdfa0,
sigalg=0x82976b700 "\b", reason=0x7fffffffe1c0) at
validator/val_sigcrypt.c:504
#9 0x000000000047bbc3 in val_verify_rrset (env=0x8013fd3e8,
ve=0x801243150, rrset=0x82967e3d0, keys=0x7fffffffdfa0,
sigalg=0x82976b700 "\b", reason=0x7fffffffe1c0) at validator/val_utils.c:334
#10 0x000000000047be11 in val_verify_rrset_entry (env=0x8013fd3e8,
ve=0x801243150, rrset=0x82967e3d0, kkey=0x82976b360,
reason=0x7fffffffe1c0) at validator/val_utils.c:380
#11 0x0000000000475c3b in list_is_secure (env=0x8013fd3e8,
ve=0x801243150, list=0x82967e308, num=3, kkey=0x82976b360,
reason=0x7fffffffe1c0) at validator/val_nsec3.c:1351
#12 0x0000000000475dab in nsec3_prove_nods (env=0x8013fd3e8,
ve=0x801243150, list=0x82967e308, num=3, qinfo=0x82967c080,
kkey=0x82976b360, reason=0x7fffffffe1c0) at validator/val_nsec3.c:1378
#13 0x000000000046cc9d in ds_response_to_ke (qstate=0x829768080,
vq=0x82976b250, id=0, rcode=0, msg=0x82967e298, qinfo=0x82967c080,
ke=0x7fffffffe250) at validator/validator.c:2438
#14 0x000000000046d0d1 in process_ds_response (qstate=0x829768080,
vq=0x82976b250, id=0, rcode=0, msg=0x82967e298, qinfo=0x82967c080,
origin=0x82967eb40) at validator/validator.c:2555
#15 0x000000000046dda6 in val_inform_super (qstate=0x82967c080, id=0,
super=0x829768080) at validator/validator.c:2895
#16 0x00000000004425c8 in mesh_walk_supers (mesh=0x802a64480,
mstate=0x82967c030) at services/mesh.c:924
#17 0x0000000000442b0b in mesh_continue (mesh=0x802a64480,
mstate=0x82967c030, s=module_finished, ev=0x7fffffffe32c) at
services/mesh.c:1041
#18 0x0000000000442cc1 in mesh_run (mesh=0x802a64480,
mstate=0x82967c030, ev=module_event_moddone, e=0x0) at services/mesh.c:1072
#19 0x0000000000440ae7 in mesh_report_reply (mesh=0x802a64480,
e=0x82967e278, reply=0x7fffffffe680, what=0) at services/mesh.c:488
#20 0x00000000004167b5 in worker_handle_service_reply (c=0x80265f140,
arg=0x82967e278, error=0, reply_info=0x7fffffffe680) at daemon/worker.c:287
#21 0x00000000004899f1 in serviced_callbacks (sq=0x8044db1c0, error=0,
c=0x80265f140, rep=0x7fffffffe680) at services/outside_network.c:1511
#22 0x000000000048ab18 in serviced_udp_callback (c=0x80265f140,
arg=0x8044db1c0, error=0, rep=0x7fffffffe680) at
services/outside_network.c:1782
#23 0x00000000004866d3 in outnet_udp_cb (c=0x80265f140, arg=0x801bf1500,
error=0, reply_info=0x7fffffffe680) at services/outside_network.c:462
#24 0x000000000047f4dc in comm_point_udp_callback (fd=179, event=2,
arg=0x80265f140) at util/netevent.c:656
#25 0x0000000800ac2302 in event_process_active (base=0x8013cf900) at
event.c:400
#26 0x0000000800ac269c in event_base_loop (base=0x8013cf900, flags=0) at
event.c:552
#27 0x0000000800ac236a in event_base_dispatch (event_base=0x8013cf900)
at event.c:420
#28 0x000000000047e16b in comm_base_dispatch (b=0x8013c3360) at
util/netevent.c:265
#29 0x0000000000419d99 in worker_work (worker=0x8013fc000) at
daemon/worker.c:1208
#30 0x000000000040cee6 in daemon_fork (daemon=0x80120b140) at
daemon/daemon.c:487
#31 0x0000000000416059 in run_daemon (cfgfile=0x48dafe "/unbound.conf",
cmdline_verbose=0, debug_mode=0) at daemon/unbound.c:662
#32 0x0000000000416261 in main (argc=0, argv=0x7fffffffe970) at
daemon/unbound.c:756
(gdb)

For reference i'm using openssl version : openssl-1.0.1_4 (installed from the ports) . Let me know if you need any further information.

Thanks

Phil Davies

Hi Phil,

Thanks for the coredump! This bug is related to your query volume and
openssl locking, it has been fixed in the development version at 1 August.

workarounds would be to run with num-threads: 1 (probably easiest), or
to have unbound compiled --without-pthreads (forks a new process
instead of a new thread for multicore machines). Or disable
validation of DNSSEC. Or use the patch.

There is also a patch, attached.

It is caused because, although unbound does not use openssl's locking
and threading, openssl has global tables in which it needs locking.
This is why it can run so long without a problem, these tables simply
do not change, so the absence of locking is no problem, and the race
condition window is very small, but something happens inside openssl
memory allocation (reference counting) that causes problems. The
patch initializes locks for openssl.

Best regards,
   Wouter

Hi Wouter,

I managed to get unbound to generate a core dump. The back trace
suggests a possible issue with DNSSEC and OpenSSL related to when
verifying rsa keys any way here it is:

# gdb /usr/local/sbin/unbound /usr/local/etc/unbound/unbound.core
GNU gdb 6.1.1 [FreeBSD] Copyright 2004 Free Software Foundation,
Inc. GDB is free software, covered by the GNU General Public
License, and you are welcome to change it and/or distribute copies
of it under certain conditions. Type "show copying" to see the
conditions. There is absolutely no warranty for GDB. Type "show
warranty" for details. This GDB was configured as
"amd64-marcel-freebsd"... Core was generated by `unbound'. Program
terminated with signal 11, Segmentation fault. Reading symbols from
/usr/local/lib/libssl.so.8...done. Loaded symbols for
/usr/local/lib/libssl.so.8 Reading symbols from
/usr/local/lib/libldns.so.1...done. Loaded symbols for
/usr/local/lib/libldns.so.1 Reading symbols from
/lib/libutil.so.8...done. Loaded symbols for /lib/libutil.so.8
Reading symbols from /usr/local/lib/libevent-1.4.so.4...done.
Loaded symbols for /usr/local/lib/libevent-1.4.so.4 Reading symbols
from /usr/local/lib/libcrypto.so.8...done. Loaded symbols for
/usr/local/lib/libcrypto.so.8 Reading symbols from
/lib/libthr.so.3...done. Loaded symbols for /lib/libthr.so.3
Reading symbols from /lib/libc.so.7...done. Loaded symbols for
/lib/libc.so.7 Reading symbols from /usr/lib/libz.so...done. Loaded
symbols for /usr/lib/libz.so Reading symbols from
/libexec/ld-elf.so.1...done. Loaded symbols for
/libexec/ld-elf.so.1 #0 0x00000000005b10c0 in rbtree_null_node ()
[New Thread 8013bc1c0 (LWP 100105/unbound)] [New Thread 8012041c0
(LWP 100174/unbound)] (gdb) bt #0 0x00000000005b10c0 in
rbtree_null_node () #1 0x0000000800ca001a in int_rsa_verify
(dtype=672, m=0x7fffffffdab0
"?�\203�)�!\232��\202�\206zK\224�5\204\026�D\216m\233\r��\213��0@��\002\b",

m_len=32, rm=0x0, prm_len=0x0, sigbuf=0x82967e77d "\\N���", siglen=128,

(attachments)

patch_openssl_threads.diff (2.58 KB)

The OpenSSL Dev mailing-list has just discussed a fix for an OpenSSL
deadlock, with a thread trying to re-acquire an already-held lock, as a
rare occurrence; changelogged as "Fix possible deadlock when decoding
public keys". The report I read was a report from the OpenLDAP folks.

RT ticket (login guest/guest):
  OpenLDAP:
    http://rt.openssl.org/Ticket/Display.html?id=2866
  Earlier/original report:
    http://rt.openssl.org/Ticket/Display.html?id=2813

Fix in:
  http://cvs.openssl.org/chngview?cn=22570

Perhaps you're hitting the same issue? Does that patch help?

-Phil