1.4.1 crashing

Hi,

We are switching to version 1.4.1 with this patch (FreeBSD 8):
http://www.freebsd.org/cgi/cvsweb.cgi/ports/dns/unbound/files/patch-fix-ipv6?rev=1.1

and unbound is crashing:
pid 1597 (unbound), uid 59: exited on signal 10
pid 18943 (unbound), uid 59: exited on signal 11

I have tried to run with:
- 1 and 2 threads
- with/without ipv6 patch, no luck.

Now I'm running it without libevent (libevent-1.4.13).

From the statistics I can see that it crashes when a certain amount of
memory has been used, maybe when it starts cleaning the cache?

Has anyone seen this?

Hello,

Yes, we also have crashes (but with libev, and on FreeBSD 7; FreeBSD 8 crashes as well).
Please try to obtain a crashdump, so we can see whether they are the same.

Wouter wrote that I should run unbound in valgrind, but here it's not an easy task...

This is my output:
gdb jail/dns/bin/unbound unbound.core
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `unbound'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000063ff88 in free ()
[New Thread 8009030b0 (LWP 100172)]
(gdb) bt
#0 0x000000000063ff88 in free ()
#1 0x0000000000476756 in lruhash_insert (table=0x800926280, hash=Variable "hash" is not available.
)
   at util/storage/lruhash.c:333
#2 0x0000000000456cea in rrset_cache_update (r=0x800934c00, ref=0x813bbb688,
   alloc=0x800bf0078, timenow=1263147325) at services/cache/rrset.c:210
#3 0x00000000004556aa in dns_cache_store_msg (env=0x800bf13f0,
   qinfo=0x7fffffffe610, hash=4162435340, rep=0x813bbb650, leeway=Variable "leeway" is not available.
)
   at services/cache/dns.c:66
#4 0x0000000000455887 in dns_cache_store (env=0x800bf13f0,
   msgqinf=0x81215ab10, msgrep=Variable "msgrep" is not available.
) at services/cache/dns.c:760
#5 0x000000000043b39f in processQueryResponse (qstate=0x812159080,
   iq=0x8121592b0, id=2) at iterator/iterator.c:1554
#6 0x000000000043bc08 in iter_handle (qstate=0x812159080, iq=0x8121592b0,
   ie=0x8009207c0, id=2) at iterator/iterator.c:2186
#7 0x000000000043ce14 in iter_operate (qstate=0x812159080,
   event=module_event_reply, id=2, outbound=0x81215aaf0)
   at iterator/iterator.c:2321
#8 0x000000000045b991 in mesh_run (mesh=0x800ab9820, mstate=0x812159030,
   ev=module_event_reply, e=0x81215aaf0) at services/mesh.c:958
#9 0x00000000004332a6 in worker_handle_service_reply (c=0x8011bc0c0,
   arg=0x81215aaf0, error=0, reply_info=0x7fffffffeaa0) at daemon/worker.c:280
#10 0x000000000045f6a2 in serviced_callbacks (sq=0x808a11f00, error=0,
   c=0x8011bc0c0, rep=0x7fffffffeaa0) at services/outside_network.c:1381
---Type <return> to continue, or q <return> to quit---
#11 0x000000000045fb8b in serviced_udp_callback (c=0x8011bc0c0,
   arg=0x808a11f00, error=0, rep=0x7fffffffeaa0)
   at services/outside_network.c:1564
#12 0x0000000000460383 in outnet_udp_cb (c=0x8011bc0c0, arg=0x800bf60c0, error=Variable "error" is not available.
) at services/outside_network.c:374
#13 0x0000000000473dc9 in comm_point_udp_callback (fd=57, event=Variable "event" is not available.
)
   at util/netevent.c:571
#14 0x0000000000400456 in ev_invoke_pending ()
#15 0x00000000004050e3 in ev_loop ()
#16 0x0000000000405e79 in event_base_loop ()
#17 0x00000000004734ec in comm_base_dispatch (b=Variable "b" is not available.
) at util/netevent.c:218
#18 0x000000000042d033 in daemon_fork (daemon=0x800909040)
   at daemon/daemon.c:457
#19 0x00000000004318a4 in main (argc=Variable "argc" is not available.
) at daemon/unbound.c:569
(gdb)

Artis Caune wrote:

unbound without libevent is crashing too.
Looks like 1.3.4 is working fine.

The stacktrace is nice, but the lruhash code did not change between
1.3.4 and 1.4.1.

Artis noted that it seems to be running fine with "validator iterator"
module-config statement (not without the validator). Could this be the
same for you?

Other than that I have no clues, except that it seems to fail for both of
you on FreeBSD. But my tests on FreeBSD8 all work well.
libevent/libev/builtin does not make a difference. You could check
with and without pthreads. You could try without a cache (0 sizes, the
malloc behaviour should then be very simple). valgrind is typically very
good at spotting such heap corruption or double frees (if it works well
enough for you).

But if those stackdumps are the same that would be interesting...

Best regards,
   Wouter
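
The experiments suggested above could be set up with a configuration fragment along these lines. This is only a sketch: the option names are standard unbound.conf options, but the values are illustrative, and whether a size of 0 is accepted and effectively disables a cache in 1.4.1 is an assumption that should be checked against the unbound.conf manual page for that version.

```
server:
    # Compare behaviour with and without the validator module.
    module-config: "validator iterator"
    # module-config: "iterator"

    # Assumption: 0 makes the caches hold (almost) nothing, so that
    # the malloc/free pattern stays simple.
    msg-cache-size: 0
    rrset-cache-size: 0
    key-cache-size: 0

    # Rule out threading issues by testing with a single thread.
    num-threads: 1
```

Changing one setting at a time makes it easier to tell which one affects the crash.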

unfortunately one instance just crashed, it was running 10+ hours.
Looks like with validator it just takes longer.

Hi, I committed r1959. This has a different structure alignment, which
had changed in svn trunk (but is not in 1.4.0 or in the patch itself).
This could help with the bus errors, and perhaps with alignment problems
for malloc itself, especially on 64bit. I am not sure, but the alignment
certainly changed and I think that could produce this sort of result.
Could you try that?

Perhaps the misaligned read only causes trouble on a page boundary or
something like that and this is what trips it up only occasionally?

Best regards,
   Wouter
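
To illustrate why alignment can matter here, the following is a minimal C sketch. It is not taken from the unbound sources or from r1959; the struct and the helper names (`entry_hdr`, `align_up`, `is_aligned`, `load_u64`) are hypothetical. It shows how an offset can be rounded up to an alignment boundary, and how a 64-bit value can be read from a possibly misaligned address without undefined behaviour. Casting such an address to `uint64_t *` and dereferencing it is undefined in C and may raise a bus error on hardware that enforces alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry header, for illustration only. */
struct entry_hdr {
    uint8_t  flags;   /* 1 byte; the compiler inserts padding after it */
    uint64_t expiry;  /* placed at an offset suitable for uint64_t */
};

/* Round n up to the next multiple of align.
 * align must be a power of two. */
size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Return 1 if p is a multiple of align (a power of two), else 0. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Read a 64-bit value from any address, aligned or not.
 * memcpy is well defined for every alignment. */
uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

A custom allocator that hands out storage at offsets that are not rounded up with something like `align_up` can return pointers for which `is_aligned` fails; whether that is what happened in this case is not established in the thread.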

Hello,

It still crashes for us. :(
I thought about saving the traffic and trying to replay that to another (idle) server, but couldn’t yet find the time to do that…

W.C.A. Wijngaards wrote:

same here.

Hi,

I have been trying to reproduce this, especially the bus signal that you
reported, but have not been able to. I have created some extra
portability stuff, but it is unlikely that this caused your troubles
(it's all in svn).

Since bus errors are hardware and OS specific, and most platforms tend to
'silently' handle them for you, I would like to know if (by any chance)
you both have a very specific hardware/OS combination.

We are working towards reproducing it but haven't succeeded yet. Trying
different ideas and getting more information is all we can do for now.

Best regards,
   Wouter