Segfault when using dnstap at high load

Hi,

I'm trying to test unbound with dnstap. It works fine under low load, but exits with a segfault under high load. The segfault only happens when dnstap is enabled in the configuration.

I am using the Debian package (version 1.5.3) available in [1], recompiled with dnstap enabled.
I'm following the instructions described in [2] and using fstrm version 0.2.0.

To test the server, I'm using dnsblast [3] with the following command:

./dnsblast <server address> 50000 500

[1] https://packages.debian.org/experimental/unbound
[2] http://dnstap.info/Source/
[3] https://github.com/jedisct1/dnsblast

Hi, Rogerio:

Sorry to hear that. I would be happy to help debug dnstap (I wrote the
dnstap patchset for Unbound). Can I get some information about your
environment?

Can you show the "dnstap:" block of settings from your config, and the
"num-threads" server setting?

Does fstrm's "make check" test suite succeed?

What version of protobuf-c are you using? (Did you compile from source,
or did you use a packaged version?)

What OS version are you using? (Based on your mention of the Debian
package from experimental, I would guess Debian or Ubuntu.)

Are you using a uniprocessor or SMP machine? Also, since there are some
architecture-specific parts in fstrm, what architecture are you using?

Thanks!

Can you show the "dnstap:" block of settings from your config, and the
"num-threads" server setting?

I'm using optimisation settings based on [1] (the Debian version is compiled with libevent):

server:
     num-threads: 2

     msg-cache-slabs: 2
     rrset-cache-slabs: 2
     infra-cache-slabs: 2
     key-cache-slabs: 2

     rrset-cache-size: 100m
     msg-cache-size: 50m

     outgoing-range: 8192
     num-queries-per-thread: 4096

     so-rcvbuf: 4m
     so-sndbuf: 4m

I'm using the example from dnstap's site [2]:

dnstap:
     dnstap-enable: yes
     dnstap-socket-path: "/var/run/unbound/dnstap.sock"
     dnstap-send-identity: yes
     dnstap-send-version: yes
     dnstap-log-resolver-response-messages: yes
     dnstap-log-client-query-messages: yes

Does fstrm's "make check" test suite succeed?

Yes, all tests pass.

What version of protobuf-c are you using? (Did you compile from source,
or did you use a packaged version?)

The packaged version from Debian Jessie (version 1.0.2).

What OS version are you using? (Based on your mention of the Debian
package from experimental, I would guess Debian or Ubuntu.)

Debian Jessie, the next-stable version.

Are you using a uniprocessor or SMP machine? Also, since there are some
architecture-specific parts in fstrm, what architecture are you using?

I'm using an amd64 virtual machine with a two-core CPU.

[1] https://www.unbound.net/documentation/howto_optimise.html
[2] http://dnstap.info/Examples/

Hi, Rogerio:

Thanks for these details, I can easily spin up a dual core amd64 VM
running Debian jessie soon and try to replicate the problem.

Do you get a segfault immediately, or does it only occur after running
for some time under load?

Can you try testing with "num-threads: 1"? (This will still result in
multiple threads running in the Unbound process, but the dnstap I/O
thread will only be consuming data from a single worker thread.)
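To make the suggestion above concrete, here is a minimal sketch of the data flow it describes: each worker owns one queue, and a single I/O consumer visits every queue in turn. It is only a single-threaded model; threading and locking are deliberately left out, and the names (`worker_queue`, `queue_push`, `io_drain`, `QUEUE_CAP`) are hypothetical and are not taken from Unbound or fstrm.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 8  /* hypothetical capacity, chosen for illustration */

/* Hypothetical fixed-size queue owned by exactly one worker. */
struct worker_queue {
    int items[QUEUE_CAP];
    size_t head;  /* count of items consumed so far */
    size_t tail;  /* count of items produced so far */
};

/* Worker side: enqueue one message; returns 0 if the queue is full. */
static int queue_push(struct worker_queue *q, int item) {
    assert(q != NULL);
    if (q->tail - q->head == QUEUE_CAP)
        return 0;  /* full: the message is dropped in this sketch */
    q->items[q->tail % QUEUE_CAP] = item;
    q->tail++;
    return 1;
}

/* I/O side: drain every worker queue once; returns messages consumed. */
static size_t io_drain(struct worker_queue *queues, size_t nqueues) {
    size_t consumed = 0;
    for (size_t i = 0; i < nqueues; i++) {
        while (queues[i].head != queues[i].tail) {
            queues[i].head++;  /* a real consumer would write the item out */
            consumed++;
        }
    }
    return consumed;
}
```

In this model, "num-threads: 1" corresponds to calling `io_drain` with `nqueues` equal to 1. If the crash still occurs in that setup, interaction between several worker queues is unlikely to be the cause.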

Also, can you compile your unbound package with debugging symbols and
obtain a backtrace from a crash? You should be able to build a
debugging enabled package with:

    DEB_BUILD_OPTIONS='nostrip debug' dpkg-buildpackage -b -uc -us

Then, run "gdb --args unbound -d" until it crashes, and at the gdb
prompt run:

    thread apply all bt full

Thanks!

Rogerio Bastos wrote:

Hi,

I appreciate your help.

Do you get a segfault immediately, or does it only occur after running
for some time under load?

The segfault occurs after some time.

Can you try testing with "num-threads: 1"? (This will still result in
multiple threads running in the Unbound process, but the dnstap I/O
thread will only be consuming data from a single worker thread.)

I get the same error with "num-threads: 1".

Also, can you compile your unbound package with debugging symbols and
obtain a backtrace from a crash? You should be able to build a
debugging enabled package with:

    DEB_BUILD_OPTIONS='nostrip debug' dpkg-buildpackage -b -uc -us

Then, run "gdb --args unbound -d" until it crashes, and at the gdb
prompt run:

    thread apply all bt full

The output is attached.

(attachments)

gdb.out (3.26 KB)

Hi, Rogerio:

Based on the stack trace, it looks like the crash is occurring when a
TCP query times out, and we try to log it as if it were a normal TCP
response. Could you try the attached patch and see if it avoids the
crash?

Thanks!
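The kind of guard described in this message can be sketched as follows: a TCP callback logs a response only when the event reports success and a reply buffer is actually present, so a timeout never reaches the logging path. This is an illustration under assumptions, not the attached patch; the event codes, the `tcp_reply` struct, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes; the real names in Unbound may differ. */
enum net_event { EVENT_OK = 0, EVENT_TIMEOUT = -1, EVENT_CLOSED = -2 };

/* Hypothetical reply record: buf is only valid when a reply arrived. */
struct tcp_reply {
    const unsigned char *buf;
    size_t len;
};

static int logged_count = 0;  /* stands in for the real logging side effect */

/* Logging path: assumes a complete reply, so it must not see timeouts. */
static void log_response(const struct tcp_reply *r) {
    assert(r != NULL && r->buf != NULL);
    logged_count++;
}

/* Callback: returns 1 if the event was logged, 0 if it was skipped. */
static int on_tcp_event(enum net_event ev, const struct tcp_reply *r) {
    if (ev != EVENT_OK || r == NULL || r->buf == NULL)
        return 0;  /* timeout or closed connection: nothing to log */
    log_response(r);
    return 1;
}
```

Without the check in `on_tcp_event`, a timeout event would reach `log_response` with no valid reply buffer, which matches the failure mode suggested by the stack trace.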

Rogerio Bastos wrote:

(attachments)

0001-dnstap-Only-try-to-log-TCP-outside-responses-on-NETE.patch (958 Bytes)

Hi, Rogerio:

Based on the stack trace, it looks like the crash is occurring when a
TCP query times out, and we try to log it as if it were a normal TCP
response. Could you try the attached patch and see if it avoids the
crash?

Yes, this patch fixes the crash.

Thanks!
