NSD 4.2.0 intermittent segfaults @ libssl?

I just bumped NSD

  nsd -v
    NSD version 4.2.0

on a linux64 VM.

On axfrs, I'm seeing segfaults in libssl

  Jun 11 08:53:24 ns03 nsd[12296]: axfr for example1.com. from 109.74.194.10
  Jun 11 08:53:24 ns03 kernel: [35762.840704] nsd[12296]: segfault at 560244acb618 ip 00007fefedb81406 sp 00007ffe6c552ac0 error 7 in libssl.so.1.1[7fefedb43000+86000]
  Jun 11 08:53:24 ns03 nsd[12246]: server 12296 died unexpectedly, restarting
  Jun 11 08:53:24 ns03 nsd[12225]: [2019-06-11 08:53:24.960] nsd[12246]: warning: server 12296 died unexpectedly, restarting
  Jun 11 08:53:25 ns03 nsd[12246]: process 12296 terminated with status 139
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.030] nsd[12246]: warning: process 12296 terminated with status 139
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example2.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.557] nsd[12297]: info: axfr for example2.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.557] nsd[12297]: info: axfr for example3.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example3.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example4.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.565] nsd[12297]: info: axfr for example4.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 kernel: [35763.583172] nsd[12297]: segfault at 560244acb618 ip 00007fefedb81406 sp 00007ffe6c552ac0 error 7 in libssl.so.1.1[7fefedb43000+86000]

Note, *NOT* on every axfr; some seem to work.

Just starting to troubleshoot here ...

Any obvious issues that are already known that might cause this?

fwiw, dropback from NSD_4_2_0_REL -> clean build of NSD_4_1_27_REL, with no other config changes, cures the issue ... no segfaults on axfr.

I just bumped NSD

  nsd -v
    NSD version 4.2.0

on a linux64 VM.

On axfrs, I'm seeing segfaults in libssl

  Jun 11 08:53:24 ns03 nsd[12296]: axfr for example1.com. from 109.74.194.10
  Jun 11 08:53:24 ns03 kernel: [35762.840704] nsd[12296]: segfault at 560244acb618 ip 00007fefedb81406 sp 00007ffe6c552ac0 error 7 in libssl.so.1.1[7fefedb43000+86000]
  Jun 11 08:53:24 ns03 nsd[12246]: server 12296 died unexpectedly, restarting
  Jun 11 08:53:24 ns03 nsd[12225]: [2019-06-11 08:53:24.960] nsd[12246]: warning: server 12296 died unexpectedly, restarting
  Jun 11 08:53:25 ns03 nsd[12246]: process 12296 terminated with status 139
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.030] nsd[12246]: warning: process 12296 terminated with status 139
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example2.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.557] nsd[12297]: info: axfr for example2.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.557] nsd[12297]: info: axfr for example3.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example3.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12297]: axfr for example4.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 nsd[12225]: [2019-06-11 08:53:25.565] nsd[12297]: info: axfr for example4.com. from 207.192.70.10
  Jun 11 08:53:25 ns03 kernel: [35763.583172] nsd[12297]: segfault at 560244acb618 ip 00007fefedb81406 sp 00007ffe6c552ac0 error 7 in libssl.so.1.1[7fefedb43000+86000]

Note, *NOT* on every axfr; some seem to work.

Just starting to troubleshoot here ...

Any obvious issues that are already known that might cause this?

Hello "PGNet Dev"

now, as you ask:

I see similar messages before I updated from 4.1.27 to 4.2.0
And now, as you mentioned that issue, I also found the segfault message :slight_smile:

From what I see something bad must happen *after* AXFR is completed.

- I see no warning/error on the consumer side
- I don't use TLS for AXFR
- it happened also on 4.1.27
- I can't reproduce in an lab environment
- none of my users asked me that they miss something

BTW: There is a draft ¹) "Message Digest for DNS Zones" to prove a transferred zone was received complete

I've added ²) the ldns-zone-digest tool in my ldns instance and can create and verify zone files.
Unfortunately not in this particilar installation :-/
Would be helpful if nsd could check such ZONEMD if available

Andreas

¹) https://tools.ietf.org/html/draft-wessels-dns-zone-digest-06
²) https://open.nlnetlabs.nl/pipermail/ldns-users/2018-November/000934.html

if that matters: I *do* use TSIG-Keys to secure the AXFR

Andreas

Here, all the slave AXFRs are configured NOKEY.

Hi,

- I don't use TLS for AXFR

if that matters: I *do* use TSIG-Keys to secure the AXFR

Here, all the slave AXFRs are configured NOKEY.

So it is not SSL. Then it could be TCP Fast Open (server-side fast
open, from the logs I see this is the AXFR server). Or perhaps the file
descriptor events could be mangled, in the event handling. This patch
may help? Otherwise, perhaps a stack trace of the issue could be useful
to have, or a way to reproduce it.

https://github.com/NLnetLabs/nsd/commit/75c7e1a2fab265e504e8aa2f93dff195dc5b8394

Best regards, Wouter

I applied similiar changes to 4.2.0 but this make things worse.
https://andreasschulze.de/tmp/axfr_segfault_fix.patch

Jun 12 17:48:57 dnsserver02 kernel: [2448848.071188] show_signal_msg: 7871 callbacks suppressed
Jun 12 17:48:57 dnsserver02 kernel: [2448848.071191] nsd[59588]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.071684] nsd[59589]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.072179] nsd[59590]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.072672] nsd[59591]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.073452] nsd[59592]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.073972] nsd[59593]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.075735] nsd[59594]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.076253] nsd[59595]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.077077] nsd[59596]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]
Jun 12 17:48:57 dnsserver02 kernel: [2448848.077576] nsd[59597]: segfault at 4 ip 00007f09e0d92ba3 sp 00007ffe460a0978 error 6 in libev.so.4.0.0[7f09e0d89000+d000]

Andreas

I'm unable to produce core files.

- ulimit -c unlimited is set
- cat /proc/sys/kernel/core_pattern: /core/corefile

/core exist outside and inside chroot and is writeable by nsd user
nsd started chroot and without chroot
nsd crash but: no corefile :-/

Any more magic required?

Andreas

Try running nsd in the foreground directly under gdb?

OK here some result:

I use to start nsd as "nsd -d -c /path/to/nsd.conf

this start 3 processes
33786 ? Ss 0:00 \_ sshd: root@pts/1
33798 pts/1 Ss 0:00 | \_ -bash
30910 pts/1 S+ 0:00 | \_ /usr/sbin/nsd -c /chroot/etc/nsd/nsd.conf
30911 pts/1 S+ 0:00 | \_ /usr/sbin/nsd -c /chroot/etc/nsd/nsd.conf
32456 pts/1 S+ 0:00 | \_ /usr/sbin/nsd -c /chroot/etc/nsd/nsd.conf

# gdb /usr/sbin/nsd
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html&gt;
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/&gt;\.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/&gt;\.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/nsd...(no debugging symbols found)...done.
(gdb) attach 32456
Attaching to program: /usr/sbin/nsd, process 32456
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.1.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libprotobuf-c.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libfstrm.so.0...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.1.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libev.so.4...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_compat.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnsl.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_nis.so.2...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libnss_files.so.2...(no debugging symbols found)...done.
0x00007ffbed5602e3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) continue
Continuing.

{{{{{ now I start "nsd_control force_transfer" @ the SLAVE-Server ... }}}}

Program received signal SIGSEGV, Segmentation fault.
0x00007ffbee331785 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1
(gdb) bt
#0 0x00007ffbee331785 in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x000055d50cb2ce0e in ?? ()
#2 0x000055d50cb2dfd8 in ?? ()
#3 0x00007ffbed81a413 in ev_invoke_pending () from /lib/x86_64-linux-gnu/libev.so.4
#4 0x00007ffbed81da06 in ev_run () from /lib/x86_64-linux-gnu/libev.so.4
#5 0x00007ffbed81ffd9 in event_base_loop () from /lib/x86_64-linux-gnu/libev.so.4
#6 0x000055d50cb30bdd in ?? ()
#7 0x000055d50cb30fcb in ?? ()
#8 0x000055d50cb3244c in ?? ()
#9 0x000055d50cae3ff7 in ?? ()
#10 0x00007ffbed4972e1 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x000055d50cae45fa in ?? ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 32456] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/nsd, process 32456

Hope, this helps ...

Andreas

now with debug symbols installed :slight_smile:

(gdb) bt
#0 0x00007ffbee334a8a in SSL_free () from /lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x000055d50cb2ce1a in cleanup_tcp_handler (data=0x55d50ec17d80) at server.c:3001
#2 0x000055d50cb2dfd8 in handle_tcp_reading (fd=<optimized out>, event=<optimized out>, arg=0x55d50ec17d80) at server.c:3188
#3 0x00007ffbed81a413 in ev_invoke_pending () from /lib/x86_64-linux-gnu/libev.so.4
#4 0x00007ffbed81da06 in ev_run () from /lib/x86_64-linux-gnu/libev.so.4
#5 0x00007ffbed81ffd9 in event_base_loop () from /lib/x86_64-linux-gnu/libev.so.4
#6 0x000055d50cb30bdd in server_child (nsd=nsd@entry=0x55d50cdbffa0 <nsd>) at server.c:2652
#7 0x000055d50cb30fcb in restart_child_servers (nsd=nsd@entry=0x55d50cdbffa0 <nsd>, region=region@entry=0x55d50ebf9490, netio=netio@entry=0x55d50ec0f7c0, xfrd_sock_p=0x55d50ebca532) at server.c:427
#8 0x000055d50cb3244c in server_main (nsd=0x55d50cdbffa0 <nsd>) at server.c:2200
#9 0x000055d50cae3ff7 in main (argc=<optimized out>, argv=<optimized out>) at nsd.c:1147

A. Schulze:

now with debug symbols installed :slight_smile:

(gdb) bt
#0 0x00007ffbee334a8a in SSL_free () from /lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x000055d50cb2ce1a in cleanup_tcp_handler (data=0x55d50ec17d80) at server.c:3001
#2 0x000055d50cb2dfd8 in handle_tcp_reading (fd=<optimized out>, event=<optimized out>, arg=0x55d50ec17d80) at server.c:3188
#3 0x00007ffbed81a413 in ev_invoke_pending () from /lib/x86_64-linux-gnu/libev.so.4
#4 0x00007ffbed81da06 in ev_run () from /lib/x86_64-linux-gnu/libev.so.4
#5 0x00007ffbed81ffd9 in event_base_loop () from /lib/x86_64-linux-gnu/libev.so.4
#6 0x000055d50cb30bdd in server_child (nsd=nsd@entry=0x55d50cdbffa0 <nsd>) at server.c:2652
#7 0x000055d50cb30fcb in restart_child_servers (nsd=nsd@entry=0x55d50cdbffa0 <nsd>, region=region@entry=0x55d50ebf9490, netio=netio@entry=0x55d50ec0f7c0, xfrd_sock_p=0x55d50ebca532) at server.c:427
#8 0x000055d50cb3244c in server_main (nsd=0x55d50cdbffa0 <nsd>) at server.c:2200
#9 0x000055d50cae3ff7 in main (argc=<optimized out>, argv=<optimized out>) at nsd.c:1147

should I try to provide more details?

Andreas

Hi Andreas,

A. Schulze:

now with debug symbols installed :slight_smile:

(gdb) bt
#0 0x00007ffbee334a8a in SSL_free () from
/lib/x86_64-linux-gnu/libssl.so.1.1
#1 0x000055d50cb2ce1a in cleanup_tcp_handler (data=0x55d50ec17d80) at
server.c:3001
#2 0x000055d50cb2dfd8 in handle_tcp_reading (fd=<optimized out>,
event=<optimized out>, arg=0x55d50ec17d80) at server.c:3188
#3 0x00007ffbed81a413 in ev_invoke_pending () from
/lib/x86_64-linux-gnu/libev.so.4
#4 0x00007ffbed81da06 in ev_run () from /lib/x86_64-linux-gnu/libev.so.4
#5 0x00007ffbed81ffd9 in event_base_loop () from
/lib/x86_64-linux-gnu/libev.so.4
#6 0x000055d50cb30bdd in server_child (nsd=nsd@entry=0x55d50cdbffa0
<nsd>) at server.c:2652
#7 0x000055d50cb30fcb in restart_child_servers
(nsd=nsd@entry=0x55d50cdbffa0 <nsd>,
region=region@entry=0x55d50ebf9490, netio=netio@entry=0x55d50ec0f7c0,
xfrd_sock_p=0x55d50ebca532) at server.c:427
#8 0x000055d50cb3244c in server_main (nsd=0x55d50cdbffa0 <nsd>) at
server.c:2200
#9 0x000055d50cae3ff7 in main (argc=<optimized out>, argv=<optimized
>) at nsd.c:1147

should I try to provide more details?

That does not seem necessary, this looks like bug #15, with a commit for it:
https://github.com/NLnetLabs/nsd/commit/ec8d14f8aa32903e894033c678a7268091002a92

Perhaps that commit can solve your problem?

With this patch nsd master server no longer segfault
even when more then 100 axfr are triggered by "nsd-control force_transfer" @ a slave.

Thanks, Wouter!

Andreas