deadlock in unbound-control commands

Hello,

I have encountered an issue in which unbound appears to hang when issuing commands via unbound-control. I am running unbound 1.5.4 built on an Ubuntu 12.04 system. I've configured unbound with the same options available from the unbound package for Ubuntu precise:

configured for x86_64-unknown-linux-gnu on Wed Feb 8 05:48:44 UTC 2012 with options: '--prefix=/usr' '--sysconfdir=/etc' '--disable-rpath' '--with-pidfile=/var/run/unbound.pid' '--with-libevent' '--with-pythonmodule' '--with-pyunbound'

This issue is somewhat difficult to reproduce, as it doesn't manifest itself regularly (perhaps 1 in 100 unbound-controll calls), but I have captured what info I can. I see first that the unbound-control shell call hangs; strace of the unbound-control process shows it is blocked on a read() call, reading from the unbound daemon. Unbound itself is hanging at the following syscall:

root@server:~# strace -p 538
Process 538 attached - interrupt to quit
futex(0x2f8ad3c, FUTEX_WAIT_PRIVATE, 0, NULL

Below is a backtrace, showing what appears to be a deadlock in local_zones_add_RR:

(gdb) bt full
#0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1 0x000000000040eccd in local_zones_add_RR (zones=0x2026d70, rr=0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68")
    at services/localzone.c:1352
        lockret_err = <optimized out>
        rr_name = 0x2a497d0 "\aprivfoo\005p-hex\003com"
        rr_class = 1
        len = 19
        labs = 4
        z = <optimized out>
        r = <optimized out>
#2 0x0000000000429011 in execute_cmd.7956 (rc=0x2021930, ssl=0x2a45430, cmd=<optimized out>, worker=0x1ef3010) at daemon/remote.c:1146
        arg = 0x38c7dcec35c "privfoo.p-hex.com IN A 10.10.10.68"
        worker = 0x1ef3010
        ssl = 0x2a45430
        p = 0x38c7dcec351 "local_data privfoo.p-hex.com IN A 10.10.10.68"
#3 0x000000000046ea9b in handle_req.isra.19 (rc=0x2021930, ssl=0x2a45430) at daemon/remote.c:2555
        r = <optimized out>
        pre = "UBCT1 \000\000", <incomplete sequence \352>
        magic = "UBCT1 "
        buf = " local_data privfoo.p-hex.com IN A 10.10.10.68\000\000\f\000\000\000\000\000\000\000\240\306\316}\214\003\000\000\005\000\000\000\000\000\000\000@\252\f\260<\003\000\000pi\244\002\000\000\000\000\355s֯<\003\000\000\220\304\316}\214\003\000\000\002E߯<\003\000\000\200\257\244\002\000\000\000\000\250\222U\260<\003\000\000\326\301b\330֡Cˠ\306\316}\214\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000\340\304\316}\214\003\000\000\244W\244\002\000\000\000\000\060\000\000\000\060\000\000\000\364pW\260<\003\000\000 \306\316}\214\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005\000\000\000\200"...
#4 0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x2a341e0, err=<optimized out>, rep=<optimized out>)
    at daemon/remote.c:2624
        s = 0x2a341e0
        rc = 0x2021930
        r = 1
#5 0x0000033cb02f89cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
---Type <return> to continue, or q <return> to quit---
        retval = <optimized out>
#7 0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8 daemon_fork (daemon=0x1e723f0) at daemon/daemon.c:566
No locals.
#9 0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
    at daemon/unbound.c:671
        cfg = <optimized out>
        daemon = <optimized out>
        done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
        c = <optimized out>
        cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
        winopt = <optimized out>
        cmdline_verbose = 0
        debug_mode = 0

The local_zones struct 'zones' passed into local_zones_add_RR appears as such:

(gdb) print *(struct local_zones *)0x2026d70
$2 = {lock = {__data = {__lock = 0, __nr_readers = 1, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0,
      __nr_writers_queued = 1, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0},
    __size = "\000\000\000\000\001", '\000' <repeats 15 times>, "\001", '\000' <repeats 34 times>, __align = 4294967296}, ztree = {
    root = 0x208e000, count = 103, cmp = 0x4489c0 <local_zone_cmp>}}

In another instance of this I saw this hang in daemon/remote.c:

(gdb) bt full
#0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals.
#1 0x0000000000428d00 in do_zone_remove (arg=<optimized out>, worker=0x15a6e90, ssl=0x21996e0) at daemon/remote.c:1131
        lockret_err = <optimized out>
        nm = <optimized out>
        nmlabs = <optimized out>
        nmlen = <optimized out>
        z = 0x3cab0594990
#2 execute_cmd.7956 (rc=0x1551a20, ssl=0x21996e0, cmd=0x3cab0594990 " local_zone_remove p-hex.com", worker=0x15a6e90)
    at daemon/remote.c:2462
        p = 0x3cab0594991 "local_zone_remove p-hex.com"
#3 0x000000000046ea9b in handle_req.isra.19 (rc=0x1551a20, ssl=0x21996e0) at daemon/remote.c:2555
        r = <optimized out>
        pre = "UBCT1 \000\000\020p"
        magic = "UBCT1 "
        buf = " local_zone_remove p-hex.com\000\000\000\000\f\000\000\000\000\000\000\000\240\236\031\002\000\000\000\000\f\000\000\000\000\000\000\000\340LY\260\312\003\000\000\005\000\000\000\000\000\000\000@j.\252\212\002\000\000@\326\031\002\000\000\000\000\355\063\370\251\212\002\000\000\320JY\260\312\003\000\000\002\005\001\252\212\002\000\000\240\211\030\002\000\000\000\000\250Rw\252\212\002\000\000\326\301b\330֡C\313\340LY\260\312\003\000\000\f\000\000\000\326QS\330\000\001\004\000\000\000\000\000 KY\260\312\003\000\000T\232\031\002\000\000\000\000\060\000\000\000\060\000\000\000\364\060y\252\212\002\000\000`LY\260\312\003", '\000' <repeats 18 times>, "\017\000\000\000\060\000\000\000\017\000\000\000\000\000\000\000\060", '\000' <repeats 15 times>, "0\000\000\000\005"...
#4 0x000000000046ec09 in remote_control_callback (c=<optimized out>, arg=0x21884c0, err=<optimized out>, rep=<optimized out>)
    at daemon/remote.c:2624
        s = 0x21884c0
        rc = 0x1551a20
        r = 1
#5 0x0000028aaa5149cc in event_base_loop () from /usr/lib/libevent-2.0.so.5
No symbol table info available.
#6 0x0000000000430b9c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:305
        retval = <optimized out>
#7 0x0000000000467ac6 in worker_work (worker=<optimized out>) at daemon/worker.c:1311
No locals.
#8 daemon_fork (daemon=0x1526270) at daemon/daemon.c:566
No locals.
#9 0x0000000000411eaa in run_daemon (debug_mode=0, cmdline_verbose=0, cfgfile=0x48bc49 "/etc/unbound/unbound.conf")
    at daemon/unbound.c:671
        cfg = <optimized out>
        daemon = <optimized out>
        done_setup = 1
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:766
        c = <optimized out>
        cfgfile = 0x48bc49 "/etc/unbound/unbound.conf"
        winopt = <optimized out>
        cmdline_verbose = 0
        debug_mode = 0

In both of these cases it appears to be hanging on the 'lock_rw_t lock' member of the local_zones struct. Below is the relevant portions of the config (excluding only access control):

root@server:~# sed -e '/#.*$/d' -e '/^$/d' /etc/unbound/unbound.conf
remote-control:
  control-enable: yes
server:
  verbosity: 4
  statistics-interval: 0
  statistics-cumulative: no
  extended-statistics: no
  num-threads: 1
  interface: 10.242.128.232
  port: 53
  do-ip4: yes
  do-udp: yes
  do-tcp: yes
  chroot: ""
  directory: "/etc/unbound"
  logfile: "/var/log/unbound.log"
  use-syslog: "no"
  pidfile: "/var/run/unbound.pid"
  auto-trust-anchor-file: /var/lib/unbound/root.key

root@server:~# unbound -h
usage: unbound [options]
  start unbound daemon DNS resolver.
-h this help
-c file config file to read instead of /etc/unbound/unbound.conf
  file format is described in unbound.conf(5).
-d do not fork into the background.
-v verbose (more times to increase verbosity)
Version 1.5.4
linked libs: libevent 2.0.16-stable (it uses epoll), OpenSSL 1.0.1 14 Mar 2012
linked modules: dns64 validator iterator
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl

I have found this problems occurs whether num-threads is set to 1 or 8 (number of cores in my system). There are roughly 15000 local-data entries in this machine, from half a dozen local-zone typetransparent entries. Unbound logging doesn't indicate any failure (the last entry is the one generated from the unbound-control call). This seems to occur most often when adding new local-data entries, but I have also seen this occur when calling 'local_zone_remove' (in this case it was also hanged on pthread_rwlock_wrlock). I have not been able to replicate this with the version of Unbound made available from the Ubuntu precise package (1.4.16). Please let me know what further information I can provide to assist in debugging this problem. Thank you!

Hi Poprocks,

Hello,

I have encountered an issue in which unbound appears to hang when
issuing commands via unbound-control. I am running unbound 1.5.4
built on an Ubuntu 12.04 system. I've configured unbound with the
same options available from the unbound package for Ubuntu
precise:

Thank you for the report with all the details. I could not reproduce
it, but code inspection reveals a deadlock that is caused when you
interrupt the printout of unbound-control list_local_data.

You have so much local data (10000s) that the interrupt carries back
to the daemon, and then there is a bug in the code path that deals
with the interrupted TLS stream. That does not properly unlock the
zones.lock and thus the next lock attempt hangs.

It is fixed in the code repository for the next release. Or, as a
workaround, do not ^C the printout of unbound-control list_local_data.

Best regards,
   Wouter

configured for x86_64-unknown-linux-gnu on Wed Feb 8 05:48:44 UTC
2012 with options: '--prefix=/usr' '--sysconfdir=/etc'
'--disable-rpath' '--with-pidfile=/var/run/unbound.pid'
'--with-libevent' '--with-pythonmodule' '--with-pyunbound'

This issue is somewhat difficult to reproduce, as it doesn't
manifest itself regularly (perhaps 1 in 100 unbound-controll
calls), but I have captured what info I can. I see first that the
unbound-control shell call hangs; strace of the unbound-control
process shows it is blocked on a read() call, reading from the
unbound daemon. Unbound itself is hanging at the following
syscall:

root@server:~# strace -p 538 Process 538 attached - interrupt to
quit futex(0x2f8ad3c, FUTEX_WAIT_PRIVATE, 0, NULL

Below is a backtrace, showing what appears to be a deadlock in
local_zones_add_RR:

(gdb) bt full #0 pthread_rwlock_wrlock () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:83
No locals. #1 0x000000000040eccd in local_zones_add_RR
(zones=0x2026d70, rr=0x38c7dcec35c "privfoo.p-hex.com IN A
10.10.10.68") at services/localzone.c:1352 lockret_err = <optimized
> rr_name = 0x2a497d0 "\aprivfoo\005p-hex\003com" rr_class = 1
len = 19 labs = 4 z = <optimized out> r = <optimized out> #2
0x0000000000429011 in execute_cmd.7956 (rc=0x2021930,
ssl=0x2a45430, cmd=<optimized out>, worker=0x1ef3010) at
daemon/remote.c:1146 arg = 0x38c7dcec35c "privfoo.p-hex.com IN A
10.10.10.68" worker = 0x1ef3010 ssl = 0x2a45430 p = 0x38c7dcec351
"local_data privfoo.p-hex.com IN A 10.10.10.68" #3
0x000000000046ea9b in handle_req.isra.19 (rc=0x2021930,
ssl=0x2a45430) at daemon/remote.c:2555 r = <optimized out> pre =
"UBCT1 \000\000", <incomplete sequence \352> magic = "UBCT1 " buf =
" local_data privfoo.p-hex.com IN A
10.10.10.68\000\000\f\000\000\000\000\000\000\000\240\306\316}\214\003

\000\000\005\000\000\000\000\000\000\000@\252\f\260<\003\000\000pi\244\0
02\000\000\000\000\355s?<\003\000\000\220\304\316}\214\003\000\000\002E?
<\003\000\000\200\257\244\002\000\000\000\000\250\222U\260<\003\000\000\
326\301b\330?C?\306\316}\214\003\000\000\f\000\000\000\326QS\330\000\001
\004\000\000\000\000\000\340\304\316}\214\003\000\000\244W\244\002\000\0
00\000\000\060\000\000\000\060\000\000\000\364pW\260<\003\000\000

Hello!