Unbound 1.4.18 crashes under high query load

Hi,

While trying to debug my previous Unbound crash issue (see the thread
from yesterday), this time on Debian sid using Unbound 1.4.18, I
encountered another OpenSSL related issue.

It looks like a double free of OpenSSL's EVP_PKEY structure (or
alternatively memory corruption). This issue happens also under heavy
query load but it happens much more frequently than my previous issue.
Also this issue comes up even when "num-threads" is 1. Thus it is not
related to OpenSSL thread safety issues.

See the gdb backtrace at the bottom of the message.

Additionally I noticed a minor issue: In validator/val_secalgo.c the
function setup_key_digest() may return failure if a call to
EVP_PKEY_new() returns NULL. In that case the calling function
verify_canonrrset() will try to do EVP_PKEY_free() with the NULL value.
I am unsure if EVP_PKEY_free() has internal safeguards against this
problem (if it does have, it is not documented). I have been able to
avoid touching OpenSSL so far, thus I am not familiar with it.

Any thoughts? This problem is also difficult to debug because a high
query load is required to trigger it. Therefore setting breakpoints and
single-stepping is out of question. In my test setup I am testing by
resolving names from Alexa top 1 million web sites list. I am feeding
that list to a Go program which constantly runs 100 DNS lookups in
parallel and sends those queries to Unbound -- which always eventually
crashes.

Please tell me if I can assist further in resolving this. I like
Unbound, I wish it was just a bit more robust...

Program received signal SIGABRT, Aborted.
0x00007facd2d18475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007facd2d18475 in *__GI_raise (sig=<optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007facd2d1b6f0 in *__GI_abort () at abort.c:92
#2 0x00007facd2d5232b in __libc_message (do_abort=<optimized out>,
    fmt=<optimized out>) at ../sysdeps/unix/sysv/linux/libc_fatal.c:189
#3 0x00007facd2d5bb76 in malloc_printerr (action=3,
    str=0x7facd2e32768 "double free or corruption (!prev)",
    ptr=<optimized out>) at malloc.c:6283
#4 0x00007facd2d608ac in *__GI___libc_free (mem=<optimized out>)
    at malloc.c:3738
#5 0x00007facd3301f1d in CRYPTO_free (str=0x82c640) at mem.c:397
#6 0x00007facd336c6a8 in engine_free_util (e=e@entry=0x82c640,
    locked=locked@entry=0) at eng_lib.c:136
#7 0x00007facd336d2c2 in engine_unlocked_finish (e=e@entry=0x82c640,
    unlock_for_handlers=unlock_for_handlers@entry=1) at eng_init.c:112
#8 0x00007facd336d401 in ENGINE_finish (e=0x82c640) at eng_init.c:146
#9 0x00007facd335e660 in RSA_free (r=0x7facccd33a50) at rsa_lib.c:238
#10 0x00007facd338854b in EVP_PKEY_free_it (x=x@entry=0x7facccb81550)
    at p_lib.c:415
#11 0x00007facd3388cd8 in EVP_PKEY_free (x=0x7facccb81550) at p_lib.c:405
#12 0x0000000000437fde in verify_canonrrset (buf=0x7faccc611d20, algo=8,
    sigblock=0x116dfd5
"D\243[\270r\262\271\071\362\355\227\364\310ф#\320\341˼\177\233`\325Or\346\003\035\345z\206r\374\207\245\\\367\222\062>Z-l\256\302\316\323c\217\061\177\225J#\377\370\327@s?\373\030\006b\035Qy@\004,\310\310\060\363+s\232\364Ȼ\ti\273%6$އ_\330\301w\234lf\212+:bquu,zӠ\n\364˾D\032G\001I\245m\bS[p\255\032\270rv1\227",
sigblock_len=128,
    key=0x7facc86cb36a "\001\003\260\304\071[r\365\314\tB\033t@\340\310b
\277nzL]\210\231գ\250\005\244\255\276N\340\365\332j+^잫\035\032=&'\373t
\317\353\231\336\372\220\017!\262&3kB\273Q\232\n\243v\225!Y\261\020\260\371\f\365\225\236\367\210\347\210\332\177\224\217w\266\303\356\303?m\201\234̐y\006\204\256\004\016b\"S|\220\071\323\357\270\316\020\"\331\352L7\376w\035\250o\243\275\237\244\001\345\001\006\001\001\003\b\001\003\303\316WM\230\313\331\025~\rp\322t\270I\312\016\016\355\232\377\305\334̐GIi\006e\\5\313\b\263<M\027\033\001|\243V\364\226\002b\252b\223\315\372\350\261;U\262\034\065\034ߧ"...,
keylen=130,
    reason=0x7fffdf646b90) at validator/val_secalgo.c:541
#13 0x000000000042f990 in dnskey_verify_rrset_sig (region=0x7faccc460e80,
    buf=0x7faccc611d20, ve=0x8637a0, now=1350998202, rrset=0x116dbf8,
    dnskey=0x7fffdf646950, dnskey_idx=0, sig_idx=0, sortree=0x7fffdf646768,
    buf_canon=0x7fffdf6466ec, reason=0x7fffdf646b90)
    at validator/val_sigcrypt.c:1349
#14 0x0000000000439416 in dnskeyset_verify_rrset_sig (env=0x8bfd78,
    ve=0x8637a0, now=1350998202, rrset=0x116dbf8, dnskey=0x7fffdf646950,
    sig_idx=0, sortree=0x7fffdf646768, reason=0x7fffdf646b90)
    at validator/val_sigcrypt.c:607
#15 0x0000000000438e9a in dnskeyset_verify_rrset (env=0x8bfd78, ve=0x8637a0,
    rrset=0x116dbf8, dnskey=0x7fffdf646950, sigalg=0x7facc86cb610 "\b",
    reason=0x7fffdf646b90) at validator/val_sigcrypt.c:504
#16 0x00000000004304de in val_verify_rrset (env=0x8bfd78, ve=0x8637a0,
    rrset=0x116dbf8, keys=0x7fffdf646950, sigalg=0x7facc86cb610 "\b",
    reason=0x7fffdf646b90) at validator/val_utils.c:334
#17 0x0000000000430754 in val_verify_rrset_entry (env=0x8bfd78, ve=0x8637a0,
    rrset=0x116dbf8, kkey=0x7facc86cb240, reason=0x7fffdf646b90)
    at validator/val_utils.c:380
#18 0x0000000000435cc8 in list_is_secure (env=0x8bfd78, ve=0x8637a0,
    list=0x116dad0, num=3, kkey=0x7facc86cb240, reason=0x7fffdf646b90)
    at validator/val_nsec3.c:1351
#19 0x0000000000435e06 in nsec3_prove_nods (env=0x8bfd78, ve=0x8637a0,
    list=0x116dad0, num=3, qinfo=0x116b5c0, kkey=0x7facc86cb240,
    reason=0x7fffdf646b90) at validator/val_nsec3.c:1378
#20 0x00000000004434e4 in ds_response_to_ke (qstate=0x7facc86c7860,
    vq=0x7facc86cb128, id=0, rcode=0, msg=0x116da60, qinfo=0x116b5c0,
    ke=0x7fffdf646c20) at validator/validator.c:2438
#21 0x0000000000443914 in process_ds_response (qstate=0x7facc86c7860,
    vq=0x7facc86cb128, id=0, rcode=0, msg=0x116da60, qinfo=0x116b5c0,
    origin=0x116e3f8) at validator/validator.c:2555
#22 0x000000000044464c in val_inform_super (qstate=0x116b5c0, id=0,
    super=0x7facc86c7860) at validator/validator.c:2895
#23 0x0000000000467178 in mesh_walk_supers (mesh=0x7faccc6014e0,
    mstate=0x116b570) at services/mesh.c:924
#24 0x000000000046768d in mesh_continue (mesh=0x7faccc6014e0,
    mstate=0x116b570, s=module_finished, ev=0x7fffdf646cfc)
    at services/mesh.c:1041
#25 0x0000000000467843 in mesh_run (mesh=0x7faccc6014e0, mstate=0x116b570,
    ev=module_event_moddone, e=0x0) at services/mesh.c:1072
#26 0x000000000046f570 in mesh_report_reply (mesh=0x7faccc6014e0,
e=0x116da40,
    reply=0x7fffdf647000, what=0) at services/mesh.c:488
#27 0x00000000004912a5 in worker_handle_service_reply (c=0x7faccc3f6f40,
    arg=0x116da40, error=0, reply_info=0x7fffdf647000) at
daemon/worker.c:287
#28 0x000000000040b26f in serviced_callbacks (sq=0x7facccb14190, error=0,
    c=0x7faccc3f6f40, rep=0x7fffdf647000) at services/outside_network.c:1511
#29 0x000000000040c315 in serviced_udp_callback (c=0x7faccc3f6f40,
    arg=0x7facccb14190, error=0, rep=0x7fffdf647000)
    at services/outside_network.c:1782
#30 0x00000000004119bb in outnet_udp_cb (c=0x7faccc3f6f40,
arg=0x7faccc154d60,
    error=0, reply_info=0x7fffdf647000) at services/outside_network.c:462
#31 0x0000000000416e79 in comm_point_udp_callback (fd=354, event=2,
    arg=0x7faccc3f6f40) at util/netevent.c:656
#32 0x00007facd3d86ccc in event_base_loop ()
   from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#33 0x00000000004160ed in comm_base_dispatch (b=0x8639d0)
    at util/netevent.c:265
#34 0x0000000000494310 in worker_work (worker=0x8be990) at
daemon/worker.c:1208
#35 0x00000000004a038d in daemon_fork (daemon=0x814010) at
daemon/daemon.c:487
#36 0x0000000000497988 in run_daemon (
    cfgfile=0x4c31b9 "/etc/unbound/unbound.conf", cmdline_verbose=0,
    debug_mode=0) at daemon/unbound.c:660
#37 0x0000000000497b88 in main (argc=0, argv=0x7fffdf647360)
    at daemon/unbound.c:754

Hi Janne,

Hi,

While trying to debug my previous Unbound crash issue (see the
thread from yesterday), this time on Debian sid using Unbound
1.4.18, I encountered another OpenSSL related issue.

This is the same bug with openssl-locking. The newer openssl requires
the locking, where the previous versions did not need locking for the
functions that unbound uses. This is causing the double-free and
other failure that you noted. It is fixed in svn trunk of unbound.

The crash with num-threads 1 is odd, can you reproduce that with the
svn-trunk version?

Best regards,
   Wouter

It looks like a double free of OpenSSL's EVP_PKEY structure (or
alternatively memory corruption). This issue happens also under
heavy query load but it happens much more frequently than my
previous issue. Also this issue comes up even when "num-threads" is
1. Thus it is not related to OpenSSL thread safety issues.

See the gdb backtrace at the bottom of the message.

Additionally I noticed a minor issue: In validator/val_secalgo.c
the function setup_key_digest() may return failure if a call to
EVP_PKEY_new() returns NULL. In that case the calling function
verify_canonrrset() will try to do EVP_PKEY_free() with the NULL
value. I am unsure if EVP_PKEY_free() has internal safeguards
against this problem (if it does have, it is not documented). I
have been able to avoid touching OpenSSL so far, thus I am not
familiar with it.

Any thoughts? This problem is also difficult to debug because a
high query load is required to trigger it. Therefore setting
breakpoints and single-stepping is out of question. In my test
setup I am testing by resolving names from Alexa top 1 million web
sites list. I am feeding that list to a Go program which constantly
runs 100 DNS lookups in parallel and sends those queries to Unbound
-- which always eventually crashes.

Please tell me if I can assist further in resolving this. I like
Unbound, I wish it was just a bit more robust...

Program received signal SIGABRT, Aborted. 0x00007facd2d18475 in
*__GI_raise (sig=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64 64
../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or
directory. (gdb) bt #0 0x00007facd2d18475 in *__GI_raise
(sig=<optimized out>) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x00007facd2d1b6f0
in *__GI_abort () at abort.c:92 #2 0x00007facd2d5232b in
__libc_message (do_abort=<optimized out>, fmt=<optimized out>) at
../sysdeps/unix/sysv/linux/libc_fatal.c:189 #3 0x00007facd2d5bb76
in malloc_printerr (action=3, str=0x7facd2e32768 "double free or
corruption (!prev)", ptr=<optimized out>) at malloc.c:6283 #4
0x00007facd2d608ac in *__GI___libc_free (mem=<optimized out>) at
malloc.c:3738 #5 0x00007facd3301f1d in CRYPTO_free (str=0x82c640)
at mem.c:397 #6 0x00007facd336c6a8 in engine_free_util
(e=e@entry=0x82c640, locked=locked@entry=0) at eng_lib.c:136 #7
0x00007facd336d2c2 in engine_unlocked_finish (e=e@entry=0x82c640,
unlock_for_handlers=unlock_for_handlers@entry=1) at eng_init.c:112
#8 0x00007facd336d401 in ENGINE_finish (e=0x82c640) at
eng_init.c:146 #9 0x00007facd335e660 in RSA_free
(r=0x7facccd33a50) at rsa_lib.c:238 #10 0x00007facd338854b in
EVP_PKEY_free_it (x=x@entry=0x7facccb81550) at p_lib.c:415 #11
0x00007facd3388cd8 in EVP_PKEY_free (x=0x7facccb81550) at
p_lib.c:405 #12 0x0000000000437fde in verify_canonrrset
(buf=0x7faccc611d20, algo=8, sigblock=0x116dfd5
"D\243[\270r\262\271\071\362\355\227\364\310ф#\320\341˼\177\233`\325Or\346\003\035\345z\206r\374\207\245\\\367\222\062>Z-l\256\302\316\323c\217\061\177\225J#\377\370\327@s?\373\030\006b\035Qy@\004,\310\310\060\363+s\232\364Ȼ\ti\273%6$އ_\330\301w\234lf\212+:bquu,zӠ\n\364˾D\032G\001I\245m\bS[p\255\032\270rv1\227",

sigblock_len=128,

key=0x7facc86cb36a
"\001\003\260\304\071[r\365\314\tB\033t@\340\310b \277nzL]\210\231գ
\250\005\244\255\276N\340\365\332j+^잫\035\032=&'\373t
\317\353\231\336\372\220\017!\262&3kB\273Q\232\n\243v\225!Y\261\020\260\371\f\365\225\236\367\210\347\210\332\177\224\217w\266\303\356\303?m\201\234̐y\006\204\256\004\016b\"S|\220\071\323\357\270\316\020\"\331\352L7\376w\035\250o\243\275\237\244\001\345\001\006\001\001\003\b\001\003\303\316WM\230\313\331\025~\rp\322t\270I\312\016\016\355\232\377\305\334̐GIi\006e\\5\313\b\263<M\027\033\001|\243V\364\226\002b\252b\223\315\372\350\261;U\262\034\065\034ߧ"...,

keylen=130,

This is the same bug with openssl-locking. The newer openssl
requires the locking, where the previous versions did not need
locking for the functions that unbound uses. This is causing the
double-free and other failure that you noted. It is fixed in svn
trunk of unbound.

Thank you. The backtrace looked somewhat different and I thought it
got triggered with num-threads == 1 so I thought it would be a
different issue.

The crash with num-threads 1 is odd, can you reproduce that with
the svn-trunk version?

I now believe that this must have been a mistake on my part. I can not
reproduce this with num-threads == 1 any more. I must have originally
forgotten to restart Unbound after changing num-threads. Sorry about that.

I am happy to confirm that the patch at the following URL seems to
have corrected the problem:

https://unbound.net/pipermail/unbound-users/2012-September/002571.html

I will continue testing over the weekend to make sure the problem is
gone. I have applied the patch against Unbound 1.4.18 and 1.4.16 (the
latter with minor tweaks) and both seem to be running fine so far.

I have updated my Ubuntu bug report accordingly:

https://bugs.launchpad.net/ubuntu/+source/unbound/+bug/1070247

Best Regards and Thank You,