unbound-1.19.0 alloc_reg_obtain() core dumps

Hello,

I am working for Cloudflare where we seem to have a problem with
unbound-1.19.0. Estimated time in between crashes is around 450 days on a
single server. Cloudflare has lots of servers, and reasonable queries going
through systems, which means we see these crashes frequently enough that
this is a real problem.

By quickly looking the backtraces it looks like allocator is receiving a
null pointer and failing to follow ->next via that pointer. Something
somewhere earlier must have gone a bit wrong the pointer to be what it is.
This is where better expertise would be helpful.

To assist debug here are some backtrace samples. They do not have
absolutely all frames, but hopefully enough context to figure out how to
move forward with debugging if not even fix the issue.

Some what interestingly the second to last frame is on line
services/mesh.c:568 in three first samples but daemon/worker.c:2395 in last
one. I wonder if this a concurrency bug.

-- the bt -full samples --

#0 0x00005618c813fb21 in alloc_reg_obtain (alloc=0x5618ca1f3c90) at
util/alloc.c:333
        r = 0x0
#1 mesh_state_create (env=0x5618ca68d218, qinfo=0x7fffef530ba0,
cinfo=0x0, qflags=256, prime=0, valrec=0) at services/mesh.c:824
        region = <optimized out>
        mstate = <optimized out>
        i = <optimized out>
#2 0x00005618c81d2789 in mesh_new_callback.constprop.0
(mesh=0x5618cd665270, qinfo=qinfo@entry=0x7fffef530ba0,
edns=edns@entry=0x7fffef530e40, buf=buf@entry=0x5618cd675b20,
cb=cb@entry=0x5618c817b1f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x5618ca1f29e0, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:568
        s = <optimized out>
        unique = 0
        timeout = 0
        was_detached = 0
        was_noreply = 0
        added = 0
#3 0x00005618c817a683 in xfr_probe_lookup_host (env=0x5618ca68d218,
xfr=0x5618ca1f29e0) at services/authzone.c:6630
        master = 0x5618ca1f2d10
        qinfo = {qname = 0x7fffef530c40 "\001f\froot-servers\003net",
qname_len = 20, qtype = 28, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001f\froot-servers\003net\000\001\000\000\000@\fS\357\377\177\000\000\000\000\000\000\000\000\000\000\270\322\205\312\030V\000\000\024\000\000\000\000\000\000\000\004\000\000\000\030V",
'\000' <repeats 18 times>,
"\001\000\205\312\030V\000\000\230\320\205\312\030V\000\000`\017S\357\377\177\000\000X\323\205\312\030V\000\000\230\320\205\312\030V\000\000vl\023\310\030V",
'\000' <repeats 90 times>...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 0}
        buf = 0x5618cd675b20
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 20 times>,
"<\346\022t\003\000\000\000\004\000\000\000\003\000\000\0004\fS\357\377\177\000\000H\320\205\312\030V\000\000\220ch\315\030V\000\000\003\000\000\000\000\000\000\000`\017S\357\377\177\000\000h\017S\357\377\177\000\000X\323\205\312\030V\000\000\000\000\000\000\000\000\000\000\005\b\023\310\030V\000\000\000\000\000\000\002\000\000",
__ss_align = 94664525374352}
        addrlen = 16
        addr = <optimized out>
        addrlen = <optimized out>
        master = <optimized out>
        qinfo = <optimized out>
        qflags = <optimized out>
        dname = <optimized out>
        edns = <optimized out>
        buf = <optimized out>
        buf1 = <optimized out>
        buf2 = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
#4 xfr_probe_send_or_end (xfr=0x5618ca1f29e0, env=0x5618ca68d218) at
services/authzone.c:6646
No locals.
#5 0x00005618c81394b0 in mesh_do_callback (m=m@entry=0x5618ca85d048,
rcode=<optimized out>, rep=rep@entry=0x5618ca85d4c8, r=0x5618ca85d2d0,
start_time=start_time@entry=0x7fffef5310f0) at services/mesh.c:1215
        udp_size = 65535
        secure = <optimized out>
        reason = <optimized out>
        was_ratelimited = 0
        __func__ = "mesh_do_callback"
#6 0x00005618c813a70b in mesh_query_done
(mstate=mstate@entry=0x5618ca85d048) at services/mesh.c:1553
        r = <optimized out>
        prev = <optimized out>
        prev_buffer = <optimized out>
        c = <optimized out>
        rep = <optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
        i = <optimized out>
#7 0x00005618c813ed98 in mesh_continue (ev=<synthetic pointer>,
s=module_finished, mstate=0x5618ca85d048, mesh=0x5618cd665270) at
services/mesh.c:1813
        qinfo = 0x0
        opt_list = 0x0
        qflags = <optimized out>
        rpz_p = 0
        addr = {ss_family = 0, __ss_padding = '\000' <repeats 117

, __ss_align = 0}

        __func__ = <optimized out>
#8 mesh_run (mesh=0x5618cd665270, mstate=0x5618ca85d048,
ev=<optimized out>, e=0x0) at services/mesh.c:1874
        s = module_finished
        __func__ = "mesh_run"
#9 0x00005618c81d27af in mesh_new_callback.constprop.0
(mesh=0x5618cd665270, qinfo=qinfo@entry=0x7fffef5312b0,
edns=edns@entry=0x7fffef531550, buf=buf@entry=0x5618cd675b20,
cb=cb@entry=0x5618c817b1f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x5618ca1f29e0, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:623
        s = 0x5618ca85d048
        unique = <optimized out>
        timeout = 0
        was_detached = 1
        was_noreply = 1
        added = 1
#10 0x00005618c817a683 in xfr_probe_lookup_host (env=0x5618ca68d218,
xfr=0x5618ca1f29e0) at services/authzone.c:6630
        master = 0x5618ca1f2d10
        qinfo = {qname = 0x7fffef531350 "\001f\froot-servers\003net",
qname_len = 20, qtype = 1, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001f\froot-servers\003net\000\001\000\000\000P\023S\357\377\177\000\000\000\000\000\000\000\000\000\000\260\321\202\312\030V\000\000\024\000\000\000\000\000\000\000\004\000\000\000\030V",
'\000' <repeats 18 times>,
"\001\000\202\312\030V\000\000\220\317\202\312\030V\000\000p\026S\357\377\177\000\000P\322\202\312\030V\000\000\220\317\202\312\030V\000\000vl\023\310\030V",
'\000' <repeats 90 times>...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 0}
        buf = 0x5618cd675b20
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 20 times>,
"<\346\022t\003\000\000\000\004\000\000\000\003\000\000\000D\023S\357\377\177\000\000@\317\202\312\030V\000\000\220ch\315\030V\000\000\003\000\000\000\000\000\000\000p\026S\357\377\177\000\000x\026S\357\377\177\000\000P\322\202\312\030V\000\000\000\000\000\000\000\000\000\000\005\b\023\310\030V\000\000\000\000\000\000\002\000\000",
__ss_align = 94664525374352}
[...]

(gdb) bt -full
#0 0x000055898c582b21 in alloc_reg_obtain (alloc=0x55898dc41ca0) at
util/alloc.c:333
        r = 0x0
#1 mesh_state_create (env=0x55898e0db228, qinfo=0x7ffc904b6d30,
cinfo=0x0, qflags=256, prime=0, valrec=0) at services/mesh.c:824
        region = <optimized out>
        mstate = <optimized out>
        i = <optimized out>
#2 0x000055898c615789 in mesh_new_callback.constprop.0
(mesh=0x5589910b3280, qinfo=qinfo@entry=0x7ffc904b6d30,
edns=edns@entry=0x7ffc904b6fd0, buf=buf@entry=0x5589910c3b30,
cb=cb@entry=0x55898c5be1f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x55898dc409f0, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:568
        s = <optimized out>
        unique = 0
        timeout = 0
        was_detached = 0
        was_noreply = 0
        added = 0
#3 0x000055898c5bd683 in xfr_probe_lookup_host (env=0x55898e0db228,
xfr=0x55898dc409f0) at services/authzone.c:6630
        master = 0x55898dc40cc0
        qinfo = {qname = 0x7ffc904b6dd0 "\001e\froot-servers\003net",
qname_len = 20, qtype = 28, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001e\froot-servers\003net\000\001\000\000\000\320mK\220\374\177\000\000\000\000\000\000\000\000\000\000\260\270+\216\211U\000\000\024\000\000\000\000\000\000\000\004\000\000\000\211U",
'\000' <repeats 18 times>,
"\001\000+\216\211U\000\000\220\266+\216\211U\000\000\360pK\220\374\177\000\000P\271+\216\211U\000\000\220\266+\216\211U\000\000v\234W\214\211U",
'\000' <repeats 90 times>...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 1}
        buf = 0x5589910c3b30
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 20 times>,
"<\026Wt\003\000\000\000\004\000\000\000\003\000\000\000\304mK\220\374\177\000\000@\266+\216\211U\000\000\240C\r\221\211U\000\000\003\000\000\000\000\000\000\000\360pK\220\374\177\000\000\370pK\220\374\177\000\000P\271+\216\211U\000\000\000\000\000\000\000\000\000\000\0058W\214\211U\000\000\000\000\000\000\002\000\000",
__ss_align = 94049332446112}
        addrlen = 16
        addr = <optimized out>
        addrlen = <optimized out>
        master = <optimized out>
        qinfo = <optimized out>
        qflags = <optimized out>
        dname = <optimized out>
        edns = <optimized out>
        buf = <optimized out>
        buf1 = <optimized out>
        buf2 = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
#4 xfr_probe_send_or_end (xfr=0x55898dc409f0, env=0x55898e0db228) at
services/authzone.c:6646
No locals.
#5 0x000055898c57c4b0 in mesh_do_callback (m=m@entry=0x55898e2bb640,
rcode=<optimized out>, rep=rep@entry=0x55898e2bbac0, r=0x55898e2bb8c8,
start_time=start_time@entry=0x7ffc904b7280) at services/mesh.c:1215
        udp_size = 65535
        secure = <optimized out>
        reason = <optimized out>
        was_ratelimited = 0
        __func__ = "mesh_do_callback"
#6 0x000055898c57d70b in mesh_query_done
(mstate=mstate@entry=0x55898e2bb640) at services/mesh.c:1553
        r = <optimized out>
        prev = <optimized out>
        prev_buffer = <optimized out>
        c = <optimized out>
        rep = <optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
        i = <optimized out>
#7 0x000055898c581d98 in mesh_continue (ev=<synthetic pointer>,
s=module_finished, mstate=0x55898e2bb640, mesh=0x5589910b3280) at
services/mesh.c:1813
        qinfo = 0x0
        opt_list = 0x0
        qflags = <optimized out>
        rpz_p = 0
        addr = {ss_family = 0, __ss_padding = '\000' <repeats 117

, __ss_align = 0}

        __func__ = <optimized out>
#8 mesh_run (mesh=0x5589910b3280, mstate=0x55898e2bb640,
ev=<optimized out>, e=0x0) at services/mesh.c:1874
        s = module_finished
        __func__ = "mesh_run"
#9 0x000055898c6157af in mesh_new_callback.constprop.0
(mesh=0x5589910b3280, qinfo=qinfo@entry=0x7ffc904b7440,
edns=edns@entry=0x7ffc904b76e0, buf=buf@entry=0x5589910c3b30,
cb=cb@entry=0x55898c5be1f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x55898dc409f0, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:623
        s = 0x55898e2bb640
        unique = <optimized out>
        timeout = 0
        was_detached = 1
        was_noreply = 1
        added = 1
[...]

(gdb) bt full
#0 0x000055807cd46b21 in alloc_reg_obtain (alloc=0x55807e9089b0) at
util/alloc.c:333
        r = 0x0
#1 mesh_state_create (env=0x55807ed4ed08, qinfo=0x7fffa45fadd0,
cinfo=0x0, qflags=256, prime=0, valrec=0) at services/mesh.c:824
        region = <optimized out>
        mstate = <optimized out>
        i = <optimized out>
#2 0x000055807cdd9789 in mesh_new_callback.constprop.0
(mesh=0x558081d26e00, qinfo=qinfo@entry=0x7fffa45fadd0,
edns=edns@entry=0x7fffa45fb070, buf=buf@entry=0x558081d376b0,
cb=cb@entry=0x55807cd821f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x55807e914020, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:568
        s = <optimized out>
        unique = 0
        timeout = 0
        was_detached = 0
        was_noreply = 0
        added = 0
#3 0x000055807cd81683 in xfr_probe_lookup_host (env=0x55807ed4ed08,
xfr=0x55807e914020) at services/authzone.c:6630
        master = 0x55807e9073c0
        qinfo = {qname = 0x7fffa45fae70 "\001f\froot-servers\003net",
qname_len = 20, qtype = 28, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001f\froot-servers\003net\000\001\000\000\000p\256_\244\377\177\000\000\000\000\000\000\000\000\000\000\330\255\362~\200U\000\000\024\000\000\000\000\000\000\000\004\000\000\000\200U",
'\000' <repeats 18 times>,
"\001\000\362~\200U\000\000\270\253\362~\200U\000\000\220\261_\244\377\177\000\000x\256\362~\200U\000\000\270\253\362~\200U\000\000v\334\323|\200U",
'\000' <repeats 90 times>...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 0}
        buf = 0x558081d376b0
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 20 times>,
"<V\323t\003\000\000\000\004\000\000\000\003\000\000\000d\256_\244\377\177\000\000h\253\362~\200U\000\000
\177\324\201\200U\000\000\003\000\000\000\000\000\000\000\220\261_\244\377\177\000\000\230\261_\244\377\177\000\000x\256\362~\200U\000\000\000\000\000\000\000\000\000\000\005x\323|\200U\000\000\000\000\000\000\002\000\000",
__ss_align = 94010422361888}
        addrlen = 16
        addr = <optimized out>
        addrlen = <optimized out>
        master = <optimized out>
        qinfo = <optimized out>
        qflags = <optimized out>
        dname = <optimized out>
        edns = <optimized out>
        buf = <optimized out>
        buf1 = <optimized out>
        buf2 = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
#4 xfr_probe_send_or_end (xfr=0x55807e914020, env=0x55807ed4ed08) at
services/authzone.c:6646
No locals.
#5 0x000055807cd404b0 in mesh_do_callback (m=m@entry=0x55807ef2ab68,
rcode=<optimized out>, rep=rep@entry=0x55807ef2afe8, r=0x55807ef2adf0,
start_time=start_time@entry=0x7fffa45fb320) at services/mesh.c:1215
        udp_size = 65535
        secure = <optimized out>
        reason = <optimized out>
        was_ratelimited = 0
        __func__ = "mesh_do_callback"
#6 0x000055807cd4170b in mesh_query_done
(mstate=mstate@entry=0x55807ef2ab68) at services/mesh.c:1553
        r = <optimized out>
        prev = <optimized out>
        prev_buffer = <optimized out>
        c = <optimized out>
        rep = <optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
        i = <optimized out>
#7 0x000055807cd45d98 in mesh_continue (ev=<synthetic pointer>,
s=module_finished, mstate=0x55807ef2ab68, mesh=0x558081d26e00) at
services/mesh.c:1813
        qinfo = 0x0
        opt_list = 0x0
        qflags = <optimized out>
        rpz_p = 0
        addr = {ss_family = 0, __ss_padding = '\000' <repeats 117

, __ss_align = 0}

        __func__ = <optimized out>
#8 mesh_run (mesh=0x558081d26e00, mstate=0x55807ef2ab68,
ev=<optimized out>, e=0x0) at services/mesh.c:1874
        s = module_finished
        __func__ = "mesh_run"
#9 0x000055807cdd97af in mesh_new_callback.constprop.0
(mesh=0x558081d26e00, qinfo=qinfo@entry=0x7fffa45fb4e0,
edns=edns@entry=0x7fffa45fb780, buf=buf@entry=0x558081d376b0,
cb=cb@entry=0x55807cd821f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x55807e914020, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:623
        s = 0x55807ef2ab68
        unique = <optimized out>
        timeout = 0
        was_detached = 1
        was_noreply = 1
        added = 1
#10 0x000055807cd81683 in xfr_probe_lookup_host (env=0x55807ed4ed08,
xfr=0x55807e914020) at services/authzone.c:6630
        master = 0x55807e9073c0
        qinfo = {qname = 0x7fffa45fb580 "\001f\froot-servers\003net",
qname_len = 20, qtype = 1, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001f\froot-servers\003net\000\001\000\377\377\200\265_\244\377\177\000\000\000\000\000\000\f\000\000\000\340\354\357~\200U\000\000\024\000\000\000\000\000\000\000\004\000\000\000\200U\000\000@\266_\244\377\177\000\000\377\275\363L\202\177\000\000\001\000\357~\200U\000\000\300\352\357~\200U\000\000\240\270_\244\377\177\000\000\200\355\357~\200U\000\000\300\352\357~\200U\000\000v\334\323|\200U\000\000\034j\371L\202\177\000\000\005\000\000\000\000\000\000\000+\364\326\201\200U\000\000\020\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377+\364\326\201\200U\000\000\340\327\006M\202\177\000\000\020\000\000\000\000\000\000\000\034j\371L\202\177\000\000"...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 0}
        buf = 0x558081d376b0
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 20 times>,
"<V\323t\003\000\000\000\004\000\000\000\003\000\000\000t\265_\244\377\177\000\000p\352\357~\200U\000\000
\177\324\201\200U\000\000\003\000\000\000\000\000\000\000\240\270_\244\377\177\000\000\250\270_\244\377\177\000\000\200\355\357~\200U\000\000\000\000\000\000\000\000\000\000\005x\323|\200U\000\000O\022\367L\002\000\000",
__ss_align = 94010422361888}
        addrlen = 16
        addr = <optimized out>
        addrlen = <optimized out>
        master = <optimized out>
        qinfo = <optimized out>
        qflags = <optimized out>
        dname = <optimized out>
[...]

#0 0x00005561119b9be4 in alloc_reg_obtain (alloc=0x556112b209b0) at
util/alloc.c:333
        r = 0x0
#1 outnet_serviced_query (callback=<optimized out>,
was_ratelimited=0x7ffe214998d8, env=0x556112f66d08,
buff=0x5561156a6760, callback_arg=0x556113112460,
qstate=0x55611310eaa0, zonelen=18, zone=0x55611310eed8
"\froot-servers\003net", addrlen=28, addr=0x7ffe214998e0,
tls_auth_name=0x0, ssl_upstream=0, tcp_upstream=0, check_ratelimit=1,
nocaps=0, want_dnssec=0, dnssec=32784,
    flags=0, qinfo=0x55611310ee40, outnet=<optimized out>) at
services/outside_network.c:3410
        sq = <optimized out>
        region = <optimized out>
        timenow = 0
        cb = <optimized out>
        client_string_addr = <optimized out>
        backed_up_opt_list = 0x0
        per_upstream_opt_list = 0x0
        sq = <optimized out>
        cb = <optimized out>
        client_string_addr = <optimized out>
        region = <optimized out>
        backed_up_opt_list = <optimized out>
        per_upstream_opt_list = <optimized out>
        timenow = <optimized out>
#2 worker_send_query (qinfo=0x55611310ee40, flags=<optimized out>,
dnssec=32784, want_dnssec=0, nocaps=0, check_ratelimit=1,
addr=0x7ffe214998e0, addrlen=28, zone=0x55611310eed8
"\froot-servers\003net", zonelen=18, tcp_upstream=0, ssl_upstream=0,
tls_auth_name=0x0, q=0x55611310eaa0, was_ratelimited=0x7ffe214998d8)
at daemon/worker.c:2395
        worker = <optimized out>
        e = 0x556113112460
#3 0x00005561119d46ce in processQueryTargets (qstate=<optimized out>,
iq=0x55611310ed28, ie=<optimized out>, id=<optimized out>) at
iterator/iterator.c:2906
        tf_policy = <optimized out>
        target = <optimized out>
        outq = <optimized out>
        real_addr = {ss_family = 10, __ss_padding =
"\0005\000\000\000\000
\001\005\003\f'\000\000\000\000\000\000\000\002\0000\000\000\000\000\000\000\000\000\000\020
\001\005\003\r-\000\000\000\000\000\000\000\000\0000\021\023aU\000\000\000\020\021\023aU\000\000\220\f\021\023aU\000\000\220\f\021\023aU\000\000\002\000\0005\3000O\036",
'\000' <repeats 31 times>, __ss_align = 0}
        real_addrlen = 28
        auth_fallback = 1
        qout_orig = <optimized out>
        qout_orig_len = <optimized out>
        sq_check_ratelimit = 1
        sq_was_ratelimited = 0
        can_do_promisc = <optimized out>
        __func__ = "processQueryTargets"
#4 0x00005561119da97e in iter_handle (qstate=<optimized out>,
iq=0x55611310ed28, ie=0x556112b2b3d0, id=1) at
iterator/iterator.c:4031
        cont = 1
#5 0x00005561119e6760 in mesh_run (mesh=0x556115f3ee00,
mstate=0x55611310ea50, ev=<optimized out>, e=<optimized out>) at
services/mesh.c:1864
        s = <optimized out>
        __func__ = "mesh_run"
#6 0x0000556111a7a7af in mesh_new_callback.constprop.0
(mesh=0x556115f3ee00, qinfo=qinfo@entry=0x7ffe2149a140,
edns=edns@entry=0x7ffe2149a3e0, buf=buf@entry=0x556115f4f6b0,
cb=cb@entry=0x556111a231f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x556112b1f6b0, rpz_passthru=0, qid=0, qflags=256)
at services/mesh.c:623
        s = 0x55611310ea50
        unique = <optimized out>
        timeout = 0
        was_detached = 1
        was_noreply = 1
        added = 1
#7 0x0000556111a22683 in xfr_probe_lookup_host (env=0x556112f66d08,
xfr=0x556112b1f6b0) at services/authzone.c:6630
        master = 0x556112b1f980
        qinfo = {qname = 0x7ffe2149a1e0 "\001e\froot-servers\003net",
qname_len = 20, qtype = 1, qclass = 1, local_alias = 0x0}
        qflags = 256
        dname =
"\001e\froot-servers\003net\000aU\000\000S%\241\021aU\000\000\340\355\362\025aU\000\000
\356\362\025\000\000\000\000`\244I!\376\177", '\000' <repeats 18

, " \356\362\025aU\000\000\020\257\020\023aU\000\000

\242I!\376\177\000\000\260\366\364\025aU\000\0000\242I!\376\177\000\000\000\000\000\000\000\000\000\0008\261\020\023aU\000\000\000\000\000\000\000\000\000\000\020\257\020\023aU\000\000
\356\362\025aU\000\000\027)\241\021aU\000\000
\356\362\025aU\000\000`\262\020\023aU\000\000X\262\020\023aU\000\000\227K\241\021aU\000\000,\243I!\376\177\000\000"...
        edns = {ext_rcode = 0 '\000', edns_version = 0 '\000', bits =
32768, udp_size = 65535, opt_list_in = 0x0, opt_list_out = 0x0,
opt_list_inplace_cb_out = 0x0, padding_block_size = 0, edns_present =
1, cookie_present = 0, cookie_valid = 0, cookie_client = 0}
        buf = 0x556115f4f6b0
        addr = {ss_family = 2,
          __ss_padding = "\0005", '\000' <repeats 12 times>, "
\356\362\025aU\000\000\300\356\362\025aU\000\000
\356\362\025aU\000\0000\242I!\376\177\000\0000\242I!\376\177\000\000\200\306\257\021aU\000\000(\242I!\376\177\000\000\t#\241\021aU\000\000`\244I!\376\177\000\000
\356\362\025aU\000\000`\244I!\376\177\000\000Zy\237\021aU\000\000\000\000\000\000\000\000\000",
          __ss_align = 93875420049720}
        addrlen = 16
        addr = <optimized out>
        addrlen = <optimized out>
        master = <optimized out>
        qinfo = <optimized out>
        qflags = <optimized out>
        dname = <optimized out>
        edns = <optimized out>
        buf = <optimized out>
        buf1 = <optimized out>
        buf2 = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
        lockret_err = <optimized out>
#8 xfr_probe_send_or_end (xfr=0x556112b1f6b0, env=0x556112f66d08) at
services/authzone.c:6646
No locals.

#9 0x00005561119e14b0 in mesh_do_callback (m=m@en=0x556113102a20,
# rcode=<optimized out>, rep=rep@entry=0x556113106688, r=0x556113102ca8,
# start_time=start_time@entry=0x7ffe2149a690) at services/mesh.c:1215

        udp_size = 65535
        secure = <optimized out>
        reason = <optimized out>
        was_ratelimited = 0
        __func__ = "mesh_do_callback"
#10 0x00005561119e270b in mesh_query_done
(mstate=mstate@entry=0x556113102a20) at services/mesh.c:1553
        r = <optimized out>
        prev = <optimized out>
        prev_buffer = <optimized out>
        c = <optimized out>
        rep = <optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
        i = <optimized out>
#11 0x00005561119e6d98 in mesh_continue (ev=<synthetic pointer>,
s=module_finished, mstate=0x556113102a20, mesh=0x556115f3ee00) at
services/mesh.c:1813
        qinfo = 0x0
        opt_list = 0x0
        qflags = <optimized out>
        rpz_p = 0
        addr = {ss_family = 0, __ss_padding = '\000' <repeats 117

, __ss_align = 0}

        __func__ = <optimized out>
[...]

Cheers, Sami

Good day,

Slightly different looking core today.

(gdb) bt -full
#0 process_dnskey_response (origin=0x563b8cf29aa0,
qinfo=0x563b8cf25c50, msg=<optimized out>, rcode=<optimized out>,
id=<optimized out>, vq=0x563b8cb0fb48, qstate=0x563b8cf71d80) at
validator/validator.c:2861
        old = 0x563b8cb0fd20
        dnskey = <optimized out>
        reason = 0x0
        ve = 0x563b8c6f9c00
        downprot = <optimized out>
        reason_bogus = LDNS_EDE_DNSSEC_BOGUS
        ve = <optimized out>
        old = <optimized out>
        dnskey = <optimized out>
        downprot = <optimized out>
        reason = <optimized out>
        reason_bogus = <optimized out>
#1 val_inform_super (qstate=0x563b8cf25c50, id=<optimized out>,
super=0x563b8cf71d80) at validator/validator.c:2973
        vq = 0x563b8cb0fb48
# and so on

(gdb) print *(struct query_info*)0x563b8cf25c50
$4 = {qname = 0x563b8cf25e70 "\003net", qname_len = 5, qtype = 48,
qclass = 1, local_alias = 0x0}

(gdb) print *(struct val_qstate*)0x563b8cb0fb48
$1 = {state = VAL_FINDKEY_STATE, orig_msg = 0x563b8cc851c0,
restart_count = 0, chain_blacklist = 0x0, qchase = {qname =
0x563b8cf71fa0 "\001a\froot-servers\003net", qname_len = 20, qtype =
28, qclass = 1, local_alias = 0x0}, chase_reply = 0x563b8cb0fbf0,
rrset_skip = 0, trust_anchor_name = 0x0, trust_anchor_labs = 0,
trust_anchor_len = 0, ds_rrset = 0x563b8cb107c0,
  empty_DS_name = 0x0, empty_DS_len = 0, key_entry = 0x563b8cf71d30,
subtype = VAL_CLASS_UNTYPED, signer_name = 0x0, signer_len = 0,
wait_prime_ta = 0}

(gdb) print *(struct key_entry_key*)0x563b8cf71d30
$3 = {entry = {lock = {__data = {__readers = 0, __writers = 0,
__wrphase_futex = 0, __writers_futex = 0, __pad3 = 0, __pad4 = 0,
__cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 =
"\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = '\000'
<repeats 55 times>, __align = 0}, overflow_next = 0x0, lru_next = 0x0,
lru_prev = 0x0, hash = 0,
    key = 0x563b8cf71d30, data = 0x563b8cf71db8}, name =
0x563b8cf71db0 "\003net", namelen = 5, key_class = 1}

(gdb) print *(struct val_env*)0x563b8c6f9c00
$2 = {kcache = 0x563b8c6f9cb0, neg_cache = 0x563b8c90f350,
date_override = 0, skew_min = 3600, skew_max = 86400, max_restart = 5,
bogus_ttl = 60, nsec3_keyiter_count = 3, nsec3_keysize =
0x563b8c90f310, nsec3_maxiter = 0x563b8c90f330, bogus_lock = {__data =
{__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
__spins = 0, __elision = 0, __list = {__prev = 0x0,
        __next = 0x0}}, __size = '\000' <repeats 39 times>, __align =
0}, num_rrset_bogus = 0}

I hope that helps figuring out what is going on.

Hi Sami,

I'll have a look but probably next week.
I am a little confused by your previous wording:
"... have a problem with unbound-1.19.0. Estimated time in between crashes is around 450 days on a single server."
Did these start specifically with 1.19.0?
How often do you see those crashes (per single server)?

Best regards,
-- Yorgos

I'll have a look but probably next week.

Thank you.

There are more cores. Many repeats of the same old same old. One is
something I have not seen before.

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0x7fc0b79aba40 (LWP 2878537) 0x0000000000000000 in ?? ()
  2 Thread 0x7fc0b67546c0 (LWP 2878538) warning: Section
`.reg-xstate/2878538' in core file too small.
0x00007fc0b6b27e26 in splice (fd_in=150, off_in=0x7fc0b0000ef0,
fd_out=32, off_out=0x7fc0b6b27e26 <splice+38>, len=0,
flags=3061135592) at ../sysdeps/unix/sysv/linux/splice.c:25
  3 Thread 0x7fc08ffff6c0 (LWP 2878549) warning: Section
`.reg-xstate/2878549' in core file too small.
0x00007fc0b6b27e26 in splice (fd_in=186, off_in=0x7fc074000ef0,
fd_out=32, off_out=0x7fc0b6b27e26 <splice+38>, len=0,
flags=2415913192) at ../sysdeps/unix/sysv/linux/splice.c:25
[snip snip all thread busy with splice]
  23 Thread 0x7fc0ad7fa6c0 (LWP 2878547) warning: Section
`.reg-xstate/2878547' in core file too small.
0x00007fc0b6b27e26 in splice (fd_in=180, off_in=0x7fc07c000ef0,
fd_out=32, off_out=0x7fc0b6b27e26 <splice+38>, len=0,
flags=2910820584) at ../sysdeps/unix/sysv/linux/splice.c:25
  24 Thread 0x7fc0797fa6c0 (LWP 2878561) warning: Section
`.reg-xstate/2878561' in core file too small.
0x00007fc0b6b27e26 in splice (fd_in=224, off_in=0x7fc04c000e70,
fd_out=32, off_out=0x7fc0b6b27e26 <splice+38>, len=0, flags=0) at
../sysdeps/unix/sysv/linux/splice.c:25
  25 Thread 0x7fc08effd6c0 (LWP 2878551) warning: Section
`.reg-xstate/2878551' in core file too small.
0x00007fc0b6b27e26 in splice (fd_in=193, off_in=0x7fc064000ef0,
fd_out=32, off_out=0x7fc0b6b27e26 <splice+38>, len=0,
flags=2399127784) at ../sysdeps/unix/sysv/linux/splice.c:25

(gdb) bt -full
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.

Looking 'layout asm' I get impression 0x7fc0b79aba40 is invalid stack pointer.

And why threads were in middle of running splice()? Was the service in
middle of systemd upgrade? That could explain how next entry in
reg_list is null.

Meanwhile I need to look if there is a reason in operating environment
why systemctl restarts happen (when I do not expect them).

I am a little confused by your previous wording:
"... have a problem with unbound-1.19.0. Estimated time in between
crashes is around 450 days on a single server."
Did these start specifically with 1.19.0?

Sorry. I should have been more clear. Couple weeks ago I was informed
unbound is crashing, version at the time was 1.17.1. Unfortunately I
did not have access to core files at the time so I could not look into
backtraces. But I thought that's an old version, let me just upgrade
and not think about this too hard.

After upgrade cores kept on coming. That is when I arranged access to
debugging facilities.

How often do you see those crashes (per single server)?

I have not seen crash repeated on a server (yet). For some reason
cores are coming from servers in India more than anywhere else. The
450 day interval on a single server is calculated by taking rate of
crashes and compared it on number of servers running unbound. Where I
was trying to get to is; this looks pretty rare condition, but with
enough servers are things happen often.

Have a nice weekend, Sami

Some debugging observations since previous message.

* The cores appear to come only from servers that connectivity problems.
* Configuration has defines both prefetch and serve-expired.
* Sometime last message before crash is "all servers for this domain failed,
  at zone" other times it is "all servers for this domain failed, at zone".
  ACK, this might not mean anything at all.

While ago I received this backtrace, that looks different than all the other
crashes so far. The infra_lookup_nottl() does not manipulate infra pointer,
but when function received the pointer it was different than at time of use.
Unfortunately '(gdb) info threads' does not provide hints what else was
happening.

-- snip
Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt -full
#0 0x000055b596e194f9 in infra_lookup_nottl (infra=0x55b597000000,
infra@entry=0x55b597fbe230, addr=0x0, addr@entry=0x55b5986b2198,
addrlen=1633904901, addrlen@entry=16, name=0x6772 <error: Cannot
access memory at address 0x6772>, name@entry=0x55b5988cc338 "rg",
namelen=namelen@entry=11, wr=-1763693056, wr@entry=0) at
services/cache/infra.c:383
        k = {addr = {ss_family = 2,
            __ss_padding =
"\0005\307\004\2125\000\000\000\000\000\000\000\000\n\000\0005\000\000\000\000
e\346\226\265U\000\000\003\000\000\000\000\000\000\000@\205m\233\265U\000\000
\303\213\230\265U\000\000\320\302\213\230\265U\000\000\320\302\213\230\265U\000\000&\230\343\226\265U\000\000\000\363\222\272\374\177\000\000`\342\213\230\265U\000\000\000\000\000\000\000\000\000\000\003\000\000\000\001\000\000\000\020\000\000\0000\000\000",
__ss_align = 140723438679024}, addrlen = 16, zonename = 0x55b5988cc338
"rg", namelen = 11, entry = {lock = {__data = {__readers = 2531963245,
__writers = 21941, __wrphase_futex = 2532224530, __writers_futex =
21941, __pad3 = 0, __pad4 = 0, __cur_writer = 16, __shared = 48,
__rwelision = 48 '0',
                __pad1 = "\364\222\272\374\177\000", __pad2 =
140723438678896, __flags = 3469711703}, __size =
"m\261\352\226\265U\000\000\022\256\356\226\265U\000\000\000\000\000\000\000\000\000\000\020\000\000\0000\000\000\0000\364\222\272\374\177\000\000p\363\222\272\374\177\000\000W\231\317\316\222\177\000",
__align = 94238409404781}, overflow_next = 0x55b5986e8bc8,
            lru_next = 0x196e4a95a, lru_prev = 0xe5, hash =
1679659776, key = 0x3000000010, data = 0x7ffcba92f470}}
#1 0x000055b596e19578 in infra_host (infra=0x55b597fbe230,
addr=0x55b5986b2198, addrlen=16, nm=0x55b5988cc338 "rg", nmlen=11,
timenow=1707222584, edns_vs=0x7ffcba92f468,
edns_lame_known=0x7ffcba92f467 "", to=0x7ffcba92f46c) at
services/cache/infra.c:451
        e = <optimized out>
        data = <optimized out>
        wr = <optimized out>
#2 0x000055b596ead09b in serviced_udp_send
(sq=sq@entry=0x55b5986b2140, buff=buff@entry=0x55b59ae3fea0) at
services/outside_network.c:2890
        rtt = 0
        vs = 2
        edns_lame_known = 0 '\000'
        now = <optimized out>
#3 0x000055b596eb052b in serviced_timer_cb (arg=0x55b5986b2140) at
services/outside_network.c:2585
        sq = 0x55b5986b2140
        outnet = <optimized out>
#4 0x00007f92cecf0482 in ?? ()
No symbol table info available.
...
#15 0x0000000000000000 in ?? ()
No symbol table info available.

(gdb) frame 0
#0 0x000055b596e194f9 in infra_lookup_nottl (infra=0x55b597000000,
infra@entry=0x55b597fbe230, addr=0x0, addr@entry=0x55b5986b2198,
addrlen=1633904901, addrlen@entry=16, name=0x6772 <error: Cannot
access memory at address 0x6772>, name@entry=0x55b5988cc338 "rg",
namelen=namelen@entry=11, wr=-1763693056, wr@entry=0) at
services/cache/infra.c:383
383 in services/cache/infra.c
(gdb) print *infra
Cannot access memory at address 0x55b597000000
(gdb) print *name
Cannot access memory at address 0x6772

(gdb) frame 1
#1 0x000055b596e19578 in infra_host (infra=0x55b597fbe230,
addr=0x55b5986b2198, addrlen=16, nm=0x55b5988cc338 "rg", nmlen=11,
timenow=1707222584, edns_vs=0x7ffcba92f468,
edns_lame_known=0x7ffcba92f467 "", to=0x7ffcba92f46c) at
services/cache/infra.c:451
451 in services/cache/infra.c
(gdb) print *infra
$2 = {hosts = 0x55b597ba7cb0, host_ttl = 86400, infra_keep_probing =
1, domain_rates = 0x55b597fd3680, domain_limits = {root =
0x55b596f4f680 <rbtree_null_node>, count = 0, cmp = 0x55b596e431c0
<name_tree_compare>}, client_ip_rates = 0x55b597fd4150}
(gdb) print *infra->hosts
$3 = {size = 128, mask = 4261412864, shift = 25, array = 0x55b597fbe270}
(gdb) print **infra->hosts->array
$4 = {lock = 1, sizefunc = 0x55b596e14a00 <infra_sizefunc>, compfunc =
0x55b596e18cc0 <infra_compfunc>, delkeyfunc = 0x55b596e15140
<infra_delkeyfunc>, deldatafunc = 0x55b596e14d60 <infra_deldatafunc>,
markdelfunc = 0x0, cb_arg = 0x0, size = 32, size_mask = 31, array =
0x55b597fbe710, lru_start = 0x55b598274c28, lru_end = 0x55b598274c28,
num = 1, space_used = 307,
  space_max = 242187, max_collisions = 0}
-- snip

Cheers Sami

Hello again,

Crashes keep on happening. Instead of reporting the same backtraces here
are some that I think I have not reported yet. Different debugging attempts
are marked with with '== <date> ==' header so that it is easy to tell which
backtraces and value print outs are related to each other.

== 2024-02-09t03-11-37 ==

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt -full
#0 0x00005564b581c4f9 in infra_lookup_nottl (infra=0x5564b7646663,
infra@entry=0x5564b7bd9230, addr=0x0, addr@entry=0x5564b8042518,
addrlen=0, addrlen@entry=16, name=0x0, name@entry=0x5564b844b0d8 "rg",
namelen=namelen@entry=11, wr=-1207894016, wr@entry=1)
    at services/cache/infra.c:383
        k = {addr = {ss_family = 2,
            __ss_padding =
"\0005\307\004\2125\000\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000P5/\273dU\000\000\200\262J\270dU\000\0000\262J\270dU\000\0000\262J\270dU\000\000&\310\203\265dU\000\000\240\210\241\226\374\177\000\000\360\323J\270dU\000\000\030\351g\347\353\177\000\000\300\210\241\226\374\177\000\000\020\000\000\0000\000\000\000\220\211\241\226\374\177\000\000\320\210\241\226\374\177\000",
__ss_align = 0}, addrlen = 16, zonename = 0x5564b844b0d8 "rg", namelen
= 11, entry = {lock = {__data = {__readers = 0, __writers = 0,
                __wrphase_futex = 16, __writers_futex = 48, __pad3 =
2527168976, __pad4 = 32764, __cur_writer = -1767798512, __shared =
32764, __rwelision = 87 'W', __pad1 = "\331\t\350\353\177\000", __pad2
= 93891074126344, __flags = 3045382490},
              __size =
"\000\000\000\000\000\000\000\000\020\000\000\0000\000\000\000\320\211\241\226\374\177\000\000\020\211\241\226\374\177\000\000W\331\t\350\353\177\000\000\b\362\036\270dU\000\000Z\331\204\265\001\000\000",
__align = 0}, overflow_next = 0xe3,
            lru_next = 0xe7f29b0d76027000, lru_prev = 0x3000000010,
hash = 2527169040, key = 0x7ffc96a18950, data = 0x0}}
#1 0x00005564b581ca2b in infra_rtt_update (infra=0x5564b7bd9230,
addr=0x5564b8042518, addrlen=16, nm=0x5564b844b0d8 "rg", nmlen=11,
qtype=1, roundtrip=-1, orig_rtt=12000, timenow=1707448279) at
services/cache/infra.c:579
        e = <optimized out>
        data = <optimized out>
        needtoinsert = <optimized out>
        expired = <optimized out>
        rto = <optimized out>
        oldprobedelay = <optimized out>
#2 0x00005564b58b3dea in serviced_udp_callback (c=0x5564bb2aa9e0,
arg=0x5564b80424c0, error=error@entry=-2, rep=rep@entry=0x0) at
services/outside_network.c:3270
        sq = 0x5564b80424c0
        outnet = 0x5564baa5abe0
        now = <optimized out>
        p = <optimized out>
#3 0x00005564b58afcf6 in pending_udp_timer_cb (arg=0x5564b81b36a0) at
services/outside_network.c:1592
        p = 0x5564b81b36a0
        outnet = 0x5564baa5abe0
        __func__ = "pending_udp_timer_cb"
#4 0x00007febe8094482 in ?? ()
No symbol table info available.
#5 0x00005564ba902cf0 in ?? ()
No symbol table info available.
...
#15 0x0000000000000000 in ?? ()
No symbol table info available.

(gdb) print *addr@entry
$1 = {ss_family = 2, __ss_padding = "\0005\307\004\2125", '\000'
<repeats 15 times>, "S", '\000' <repeats 95 times>, __ss_align = 0}
(gdb) print *addr
Cannot access memory at address 0x0
(gdb) print *infra
Cannot access memory at address 0x5564b7646663
(gdb) print k
$2 = {addr = {ss_family = 2,
    __ss_padding =
"\0005\307\004\2125\000\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000P5/\273dU\000\000\200\262J\270dU\000\0000\262J\270dU\000\0000\262J\270dU\000\000&\310\203\265dU\000\000\240\210\241\226\374\177\000\000\360\323J\270dU\000\000\030\351g\347\353\177\000\000\300\210\241\226\374\177\000\000\020\000\000\0000\000\000\000\220\211\241\226\374\177\000\000\320\210\241\226\374\177\000",
__ss_align = 0}, addrlen = 16, zonename = 0x5564b844b0d8 "rg", namelen
= 11, entry = {lock = {__data = {__readers = 0, __writers = 0,
__wrphase_futex = 16,
        __writers_futex = 48, __pad3 = 2527168976, __pad4 = 32764,
__cur_writer = -1767798512, __shared = 32764, __rwelision = 87 'W',
__pad1 = "\331\t\350\353\177\000", __pad2 = 93891074126344, __flags =
3045382490},
      __size = "\000\000\000\000\000\000\000\000\020\000\000\0000\000\000\000\320\211\241\226\374\177\000\000\020\211\241\226\374\177\000\000W\331\t\350\353\177\000\000\b\362\036\270dU\000\000Z\331\204\265\001\000\000",
__align = 0}, overflow_next = 0xe3,
    lru_next = 0xe7f29b0d76027000, lru_prev = 0x3000000010, hash =
2527169040, key = 0x7ffc96a18950, data = 0x0}}
(gdb) print wr
$3 = -1207894016
(gdb) print *wr
Cannot access memory at address 0xffffffffb8010000

It looks like 'addr' became null while infra_lookup_nottl() was running.

== 2024-02-08t02-10-49 ==

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt -full
#0 rbtree_find_less_equal (rbtree=0x56182f18b550, key=0x7ffc99f47c90,
result=0x7ffc99f47c88) at util/rbtree.c:527
        r = <optimized out>
        node = 0x0
        __func__ = "rbtree_find_less_equal"
#1 0x000056182a24b3f1 in rbtree_search (key=0x7ffc99f47c90,
rbtree=<optimized out>) at util/rbtree.c:285
        node = 0x0
        node = <optimized out>
#2 mesh_area_find (mesh=<optimized out>, cinfo=<optimized out>,
qinfo=<optimized out>, qflags=<optimized out>, prime=<optimized out>,
valrec=<optimized out>) at services/mesh.c:1592
        key = {node = {parent = 0x56182bb0dca8, left = 0x0, right =
0x0, key = 0x7ffc99f47c90, color = 0 '\000'}, run_node = {parent =
0x0, left = 0x56182e79c42c, right = 0x23, key = 0x100000003, color = 0
'\000'}, s = {qinfo = {qname = 0x7ffc99f48050
"\001e\froot-servers\003net",
              qname_len = 20, qtype = 1, qclass = 1, local_alias =
0x0}, query_flags = 256, is_priming = 0, is_valrec = 0, reply = 0x0,
return_msg = 0x0, return_rcode = 0, reply_origin = 0x7ffc99f482f0,
blacklist = 0x56182bae22d0, region = 0x4, errinf = 0x0, curmod = 3,
ext_state = {
              module_state_initial, module_state_initial,
module_state_initial, 707042248, 22040, 790017248, 22040,
module_restart_next, module_state_initial, 2582937080, 32764,
1883532271, 32511, 4294967295, module_state_initial, 779732000}, minfo
= {0x0, 0x7ffc99f47ff0,
              0x7ffc99f481c0, 0x45, 0x43, 0x7ffc99f480f0,
0x7ffc99f480f0, 0x7eff70449afd, 0x56182e79c3f0, 0x56182f16b4e0,
0x60a03274372, 0x0, 0xffffffff00000000, 0xa0289dd641a06f00, 0x0,
0x7ffc99f47ff0}, env = 0x56182a35ccc8
<dnstap.message.field_descriptors.lto_priv+936>,
            mesh_info = 0x84, prefetch_leeway = 140722891423984,
serve_expired_data = 0x7ffc99f480f0, edns_opts_front_in =
0x7ffc99f480f0, edns_opts_back_out = 0x7eff70448ab3, edns_opts_back_in
= 0xe, edns_opts_front_out = 0x81, no_cache_lookup = 0, no_cache_store
= 0,
            need_refetch = -1712030080, was_ratelimited = 32764,
qstarttime = 0, is_cachedb_answer = 0, client_info = 0x0,
respip_action_info = 0x7eff70446e64, rpz_passthru = 0, tcp_required =
0, is_drop = 1883532271}, reply_list = 0x0, cb_list = 0x7ffc99f47ef0,
super_set = {
            root = 0x0, count = 139635549566676, cmp =
0x56002bd196d0}, sub_set = {root = 0xa0289dd641a06f00, count = 53, cmp
= 0x56182a25abb5 <ipstrtoaddr+213>}, num_activated = 140722891423872,
prev = 0x7eff704499bd, next = 0x0, list_select = (unknown:
0x99f47ef0), unique = 0x0,
          replies_sent = 92 '\\'}
        result = <optimized out>
#3 0x000056182a2df54f in mesh_new_callback.constprop.0
(mesh=0x56182f18b520, qinfo=qinfo@entry=0x7ffc99f47fb0,
edns=edns@entry=0x7ffc99f48250, buf=buf@entry=0x56182f19bdd0,
cb=cb@entry=0x56182a2881f0 <auth_xfer_probe_lookup_callback>,
cb_arg=cb_arg@entry=0x56182bd26350,
    rpz_passthru=0, qid=0, qflags=256) at services/mesh.c:559
        s = 0x0
        unique = 0
        timeout = 0
        was_detached = 0
        was_noreply = 0
        added = 0
#4 0x000056182a287683 in xfr_probe_lookup_host (env=0x56182c1b3518,
xfr=0x56182bd26350) at services/authzone.c:6630
        and so on

(gdb) print *node
Cannot access memory at address 0x0

Red-black tree node became null while it was examined.

== 2024-02-08t04-40-48 ==

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0 0x000055da1457abeb in mesh_state_compare (ap=0x55da15570c00,
bp=0x0) at services/mesh.c:144
#1 0x000055da1459195a in rbtree_find_less_equal
(rbtree=rbtree@entry=0x55da183c5058, key=0x55da15570c00,
result=result@entry=0x7ffea92b1638) at util/rbtree.c:527
#2 0x000055da145919e7 in rbtree_search (key=<optimized out>,
rbtree=0x55da183c5058) at util/rbtree.c:285
#3 rbtree_delete (rbtree=0x55da183c5058, key=<optimized out>) at
util/rbtree.c:333
#4 0x000055da1462439c in mesh_state_delete.part.0.isra.0
(qstate=<optimized out>) at services/mesh.c:984
#5 0x000055da145808c2 in mesh_state_delete (qstate=0x55da15570c50) at
services/mesh.c:952
#6 mesh_continue (ev=<synthetic pointer>, s=<optimized out>,
mstate=0x55da15570c00, mesh=0x55da183c5040) at services/mesh.c:1843
#7 mesh_run (mesh=0x55da183c5040, mstate=0x55da15570c00,
ev=<optimized out>, e=0x0) at services/mesh.c:1874
#8 0x000055da14549f13 in mesh_report_reply (what=0,
reply=0x7ffea92b1a40, e=0x55da155749a0, mesh=<optimized out>) at
services/mesh.c:816
#9 worker_handle_service_reply (c=0x55da1837fa30, arg=0x55da155749a0,
error=0, reply_info=0x7ffea92b1a40) at daemon/worker.c:268
#10 0x000055da145f7069 in serviced_callbacks
(sq=sq@entry=0x55da183f6a60, error=error@entry=0,
c=c@entry=0x55da1837fa30, rep=rep@entry=0x7ffea92b1a40) at
services/outside_network.c:3044
#11 0x000055da145f7e30 in serviced_udp_callback (c=0x55da1837fa30,
arg=0x55da183f6a60, error=error@entry=0, rep=rep@entry=0x7ffea92b1a40)
at services/outside_network.c:3384
#12 0x000055da145f3b7c in outnet_udp_cb (c=<optimized out>,
arg=0x55da17b2c6d0, error=<optimized out>, reply_info=0x7ffea92b1a40)
at services/outside_network.c:1537
#13 0x000055da145eba58 in comm_point_udp_callback (fd=236,
event=<optimized out>, arg=<optimized out>) at util/netevent.c:1145
#14 0x00007f6cf021d585 in ?? ()
#15 0x000055da179d4790 in ?? ()
#16 0x000055da179d4920 in ?? ()
#17 0x000000000013ce8c in ?? ()
#18 0x0000000025b244d4 in ?? ()
#19 0x000000000013cef5 in ?? ()
#20 0x33eddbe1ed004d00 in ?? ()
#21 0x000000000013ce7d in ?? ()
#22 0x000055da179d44d0 in ?? ()
#23 0x000055da179d4628 in ?? ()
#24 0x33eddbe1ed004d00 in ?? ()
#25 0x0000000000000000 in ?? ()

(gdb) print *a
$1 = {node = {parent = 0x55da155b0d00, left = 0x55da14696680
<rbtree_null_node>, right = 0x55da14696680 <rbtree_null_node>, key =
0x55da15570c00, color = 1 '\001'}, run_node = {parent = 0x55da14696680
<rbtree_null_node>, left = 0x55da14696680 <rbtree_null_node>,
    right = 0x55da14696680 <rbtree_null_node>, key = 0x55da15570c00,
color = 0 '\000'}, s = {qinfo = {qname = 0x55da15570e70
"\froot-servers\003net", qname_len = 18, qtype = 43, qclass = 1,
local_alias = 0x0}, query_flags = 272, is_priming = 0, is_valrec = 1,
reply = 0x0,
    return_msg = 0x55da155749c0, return_rcode = 0, reply_origin =
0x55da155d52c8, blacklist = 0x0, region = 0x55da15570bc0, errinf =
0x0, curmod = 0, ext_state = {module_finished, module_finished,
module_state_initial <repeats 14 times>}, minfo = {0x0,
0x55da15570eb8,
      0x0 <repeats 14 times>}, env = 0x55da153ecf48, mesh_info =
0x55da15570c00, prefetch_leeway = 0, serve_expired_data = 0x0,
edns_opts_front_in = 0x0, edns_opts_back_out = 0x0, edns_opts_back_in
= 0x0, edns_opts_front_out = 0x0, no_cache_lookup = 0, no_cache_store
= 0,
    need_refetch = 0, was_ratelimited = 0, qstarttime = 1707367230,
is_cachedb_answer = 0, client_info = 0x0, respip_action_info = 0x0,
rpz_passthru = 0, tcp_required = 0, is_drop = 0}, reply_list = 0x0,
cb_list = 0x0, super_set = {root = 0x55da15570e88, count = 1,
    cmp = 0x55da1457adf0 <mesh_state_ref_compare>}, sub_set = {root =
0x55da14696680 <rbtree_null_node>, count = 0, cmp = 0x55da1457adf0
<mesh_state_ref_compare>}, num_activated = 4, prev = 0x0, next = 0x0,
list_select = mesh_no_list, unique = 0x0, replies_sent = 1 '\001'}
(gdb) print *b
Cannot access memory at address 0x0

(gdb) frame 1
#1 0x000055da1459195a in rbtree_find_less_equal
(rbtree=rbtree@entry=0x55da183c5058, key=0x55da15570c00,
result=result@entry=0x7ffea92b1638) at util/rbtree.c:527
527 util/rbtree.c: No such file or directory.
(gdb) print *node
$2 = {parent = 0x0, left = 0x0, right = 0x0, key = 0x0, color = 0 '\000'}

Got a bit further with red-black tree handling, but ultimately failed due
null key.

== 2024-02-09t00-07-58 ==

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt -full
#0 0x000055ec98cde3ef in new_entry (infra=infra@entry=0x55ec9a467230,
addr=addr@entry=0x55ec9a950958, addrlen=0, addrlen@entry=16, name=0x0,
name@entry=0x55ec9ad6ee88 "consul\005cfops\003net",
namelen=namelen@entry=11, tm=tm@entry=1707437259) at
services/cache/infra.c:440
        data = 0x550000000000
        key = 0x0
#1 0x000055ec98cdec79 in infra_rtt_update (infra=0x55ec9a467230,
addr=0x55ec9a950958, addrlen=16, nm=0x55ec9ad6ee88
"consul\005cfops\003net", nmlen=11, qtype=1, roundtrip=-1,
orig_rtt=12000, timenow=1707437259) at services/cache/infra.c:586
        e = <optimized out>
        data = <optimized out>
        needtoinsert = 0
        expired = 0
        rto = 1
        oldprobedelay = 0
#2 0x000055ec98d75dea in serviced_udp_callback (c=0x55ec9db37580,
arg=0x55ec9a950900, error=error@entry=-2, rep=rep@entry=0x0) at
services/outside_network.c:3270
        sq = 0x55ec9a950900
        outnet = 0x55ec9d2e6840
        now = <optimized out>
        p = <optimized out>
#3 0x000055ec98d71cf6 in pending_udp_timer_cb (arg=0x55ec9a7eb900) at
services/outside_network.c:1592
        p = 0x55ec9a7eb900
        outnet = 0x55ec9d2e6840
        __func__ = "pending_udp_timer_cb"
#4 0x00007f2462df7482 in ?? ()
No symbol table info available.
...
#15 0x0000000000000000 in ?? ()
No symbol table info available.

Trying to 'key->addrlen = addrlen' in to null key, which is really strange
point to fail because three earlier lines of code were key->value
assignments and they worked.

== 2024-02-04t08-56-00 ==

Program terminated with signal SIGABRT, Aborted.

(gdb) bt -full
#0 0x00007f01a06a9e2c in __pthread_mutex_cond_lock_full
(mutex=0x130fd1) at ../nptl/pthread_mutex_lock.c:210
        assume_other_futex_waiters = <optimized out>
        oldval = <optimized out>
        id = -1507605342
        __PRETTY_FUNCTION__ = "__pthread_mutex_cond_lock_full"
#1 0x0000555873f8f97e in iter_handle (qstate=<optimized out>,
iq=0x555874edec28, ie=0x55587489c710, id=1) at
iterator/iterator.c:4031
        cont = 1
#2 0x00007f01a07bb138 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3 0x00007fff00000018 in ?? ()
No symbol table info available.
#4 0x00007fff288d2de0 in ?? ()
No symbol table info available.
#5 0x00007fff288d2d70 in ?? ()
No symbol table info available.
#6 0xa623c4a208356200 in ?? ()
No symbol table info available.
#7 0x0000003000000010 in ?? ()
No symbol table info available.
#8 0x00007fff288d2e50 in ?? ()
No symbol table info available.
#9 0x00007f01a07bb138 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#10 0x000007f01a07f242 in ?? ()
No symbol table info available.
#11 0x000055587400d0b0 in reuse_cmp (key1=0x7f01a069e24e
<__libc_message+366>, key2=0x7f01a07bb138) at
services/outside_network.c:175
        r = <optimized out>
#12 0x00007f01a07f2440 in main_arena () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#13 0x0000000000000000 in ?? ()
No symbol table info available.

Was the pthread destroyed before unlocking was attempted?

== 2024-02-03t16-54-26 ==

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt -full
#0 0x00007ff192d717db in __memcpy_avx_unaligned_erms () at
../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:557
No locals.
#1 0x0000564423d53032 in regional_alloc_init (size=5,
init=0x5644253bfe58, r=0x7ff192df23f0 <main_arena+1936>) at
util/regional.c:189
        s = <optimized out>
        s = <optimized out>
#2 serviced_create (region=0x7ff192df23f0 <main_arena+1936>,
alloc=0x564424d56ca0, pad_queries_block_size=0, opt_list=<optimized

, qtype=1, zonelen=5, zone=0x5644253bfe58 "\003net", addrlen=16,

addr=0x7ffe944f2db0, tls_auth_name=0x0, ssl_upstream=0,
tcp_upstream=0, nocaps=0, want_dnssec=1, dnssec=32784, buff=<optimized

, outnet=0x56442792f2f0)

    at services/outside_network.c:2628
        sq = 0x5644251363c0
        t = {tv_sec = 16, tv_usec = 94850680552456}
        sq = <optimized out>
        t = <optimized out>
#3 outnet_serviced_query (callback=<optimized out>,
was_ratelimited=0x7ffe944f2da8, env=0x5644251efc08, buff=<optimized

, callback_arg=0x5644253c3600, qstate=0x5644253bfa40, zonelen=5,

zone=0x5644253bfe58 "\003net", addrlen=16, addr=0x7ffe944f2db0,
tls_auth_name=0x0, ssl_upstream=0, tcp_upstream=0, check_ratelimit=1,
nocaps=0, want_dnssec=1, dnssec=32784,
    flags=<optimized out>, qinfo=<optimized out>, outnet=<optimized

) at services/outside_network.c:3466

        sq = <optimized out>
        region = 0x7ff192df23f0 <main_arena+1936>
        timenow = 1706979247
        cb = <optimized out>
        client_string_addr = <optimized out>
        backed_up_opt_list = <optimized out>
        per_upstream_opt_list = 0x0
        sq = <optimized out>
        cb = <optimized out>
        client_string_addr = <optimized out>
        region = <optimized out>
        backed_up_opt_list = <optimized out>
        per_upstream_opt_list = <optimized out>
        timenow = <optimized out>
#4 worker_send_query (qinfo=<optimized out>, flags=<optimized out>,
dnssec=32784, want_dnssec=1, nocaps=0, check_ratelimit=1,
addr=0x7ffe944f2db0, addrlen=16, zone=0x5644253bfe58 "\003net",
zonelen=5, tcp_upstream=0, ssl_upstream=0, tls_auth_name=0x0,
q=0x5644253bfa40, was_ratelimited=0x7ffe944f2da8) at
daemon/worker.c:2395
        etc etc...

How is it even possible to get memory alignment issue?

-- snip

As mentioned earlier these are coming from systems that have connectivity
problems with configuration to keep items in cache. Unfortunately I have
not examined unbound code base to the point where I could say if failing
requests can cause entries to be removed from cache.

Please let me know if there is anything I can do to make debugging easier.

Hi Sami,

I am quiet because I have nothing to share at the moment but thanks for these!

Best regards,
-- Yorgos

I am quiet because I have nothing to share at the moment but thanks for
these!

Hi Yorgos, et.al.,

I wonder could the issue be similar (but not the same) as earlier issue. Perhaps
something like:

Organization: Cloudflare

Similar-to: https://github.com/NLnetLabs/unbound/commit/cd22fdc28d35e3a659c89f346ff5e36b64e44845

Hi Sami,

I believe this alloc_reg_release() needs to be there and I don't see it being called twice on failure, unless I miss something.

As for the cores I started looking around; the latest ones seem really weird especially with pointers that seem to change between calls.

Did you find out if there was a reason systemctl restarts would happen unexpectedly, so that we can rule that out?

Best regards,
-- Yorgos

Hi Sami,

An extra question while looking around; when reloading, do you maybe use the reload_keep_cache command?

Best regards,
-- Yorgos

I believe this alloc_reg_release() needs to be there and I don't see it
being called twice on failure, unless I miss something.

Hi Yorgos,

No problem, I just thought maybe the previous crash and this mystery case
have something in common. Unfortunately I haven't had time to really start
understand unbound internals, so my 'fix might be needed here' guesses are
not very educated.

As for the cores I started looking around; the latest ones seem really
weird especially with pointers that seem to change between calls.

Would you like to receive my backtrace collection? There are a lot of
similar looking backtraces, but maybe with enough samples something catches
eye.

https://github.com/NLnetLabs/unbound/issues/1017

Pretty much everything stops SIGSEGV, exceptions are marked in the files.

Did you find out if there was a reason systemctl restarts would happen
unexpectedly, so that we can rule that out?

Only thing I have seen systemd doing is start the service after crash.
Number of restarts is already fairly large, and I have not debugged each of
them what comes to logs so I cannot rule out there could confirm all cases
the same - just the ones I looked.

An extra question while looking around; when reloading, do you maybe use
the reload_keep_cache command?

The reload_keep_cache is not in use.

Hi Yorgos, and other,

Some good and bad news.

The good news, core dumps stopped happening.

Bad news. Locations where cores were coming got fixes to networking,
most recent updates including kernel, latest libc, etc and so on. With
this much change I have no glue what exactly fixed the issue.

Well... if something similar happens to someone else at least there
is documented case that may provide insights what conditions trigger
the issue. I guess that's the end of this for now.

Sorry for nothingburger.

Have a nice weekend, Sami