Unbound crashing on FreeBSD

Hi,

I've upgraded one of my resolvers to FreeBSD 10.0, and since then unbound
(1.4.21) has been crashing regularly (about once a day) with messages like:

Jan 30 12:49:45 resolver3 unbound: [96044:2] fatal error: event_dispatch
returned error -1, errno is Capabilities insufficient

Any hints on what may be wrong?

Regards,

Hi Mathieu,

FreeBSD 10. Does that have a fine-grained user capabilities thing?
event_dispatch would run kqueue for unbound (if you compiled with
libevent). Does it not have permission to use kqueue?

Without an event loop there is very little that unbound can do; no
events means no information about network sockets.

If you compile --without-libevent, then unbound uses select() which
may avoid this.

Perhaps this is about the number of open sockets, i.e. the file
descriptor limit in the ulimit settings? You configured unbound for high
performance with many open sockets, but when it actually opens that many
(when it gets busy once a day) the OS gives this error? That would be
strange, because unbound checks the rlimits (resource limits) when it
starts. Or does it run out of memory: about once a day the cache fills
up, and something set the ulimit on heap size (or something like that)
to, say, 1G while you configured unbound to use 2G, so it gets killed
when it crosses the 1G line? (Though it would be weird for kqueue to
give an error in that case.)

What version of libevent are you using?

Best regards,
   Wouter
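
To rule out the resource-limit theory above, one could print the limits the
process actually runs under and compare them with the configured numbers. The
following is only a minimal sketch using POSIX getrlimit(); the helper names
(soft_limit, fits_under_limit, print_limits) are made up for illustration and
are not part of unbound.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft limit for one resource.
 * Returns 0 on success, -1 if getrlimit() fails. */
int soft_limit(int resource, rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: returns 1 if `wanted` fits under the soft limit,
 * 0 if it does not, -1 if the limit could not be read. */
int fits_under_limit(int resource, rlim_t wanted)
{
    rlim_t cur;

    if (soft_limit(resource, &cur) != 0)
        return -1;
    return (cur == RLIM_INFINITY || wanted <= cur) ? 1 : 0;
}

/* Print the two limits discussed above: open file descriptors and
 * data segment (heap) size. */
void print_limits(void)
{
    rlim_t v;

    if (soft_limit(RLIMIT_NOFILE, &v) == 0)
        printf("max open files (soft): %llu\n", (unsigned long long)v);
    if (soft_limit(RLIMIT_DATA, &v) == 0) {
        if (v == RLIM_INFINITY)
            printf("data segment (soft): unlimited\n");
        else
            printf("data segment (soft): %llu bytes\n", (unsigned long long)v);
    }
}
```

Calling print_limits() from the same environment that starts the daemon (same
user, same rc script) shows the values that matter; fits_under_limit() can
then be fed the socket count or memory size from the configuration.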

Hi Mathieu,

It is not the number of sockets or the heap limits, but capsicum.

From the FreeBSD documentation I learned that this errno indicates that
the capabilities associated with a socket did not permit an operation
to be performed. One of the capabilities is the capability to use the
kqueue socket for kqueue polling. But no doubt there are also other
capabilities. It says capabilities can be reduced but not expanded by
the program. This is great, but why does a particular fd have its
capabilities reduced (unbound does not mess with socket capabilities)?

I have no idea why the capability reduction happens. ktrace is
probably too expensive in its logging fervor?

Best regards,
   Wouter
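
For reference, the errno in question can be provoked on purpose. The sketch
below is an assumption-laden illustration, not something taken from unbound or
libevent: it limits a pipe descriptor to CAP_READ only (leaving out CAP_EVENT)
and then tries to register it with kqueue, which should fail with ENOTCAPABLE,
the errno whose message is "Capabilities insufficient". The function name is
hypothetical, and the code is only meaningful on FreeBSD with Capsicum enabled
in the kernel; elsewhere it just reports that it cannot run.

```c
#include <errno.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
/* Assumption: on FreeBSD 10.0 itself this header was still named
 * <sys/capability.h>; later releases provide <sys/capsicum.h>. */
#include <sys/capsicum.h>
#endif

/* Hypothetical demo function.
 * Returns  1 if kevent() rejected the limited descriptor with ENOTCAPABLE,
 *          0 if anything else happened,
 *         -1 if this platform has no Capsicum support compiled in here. */
int limited_fd_rejected_by_kqueue(void)
{
#if defined(__FreeBSD__)
    int fds[2];
    int kq = -1;
    int result = 0;
    cap_rights_t rights;
    struct kevent change;

    if (pipe(fds) != 0)
        return 0;

    /* Allow reading only; deliberately omit CAP_EVENT. */
    cap_rights_init(&rights, CAP_READ);
    if (cap_rights_limit(fds[0], &rights) == 0) {
        kq = kqueue();
        EV_SET(&change, fds[0], EVFILT_READ, EV_ADD, 0, 0, NULL);
        /* With no event list, a failed registration is reported as
         * return value -1 with errno set. */
        if (kq >= 0 &&
            kevent(kq, &change, 1, NULL, 0, NULL) == -1 &&
            errno == ENOTCAPABLE)
            result = 1;
    }

    if (kq >= 0)
        close(kq);
    close(fds[0]);
    close(fds[1]);
    return result;
#else
    return -1;
#endif
}
```

This only shows how a limited descriptor produces the error; it does not
explain where a limited descriptor would come from in a daemon that never
calls cap_rights_limit() itself.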

This is the Capsicum capabilities system; a lot more is available to
read at:
  http://www.cl.cam.ac.uk/research/security/capsicum/

Man-pages specific to the new capabilities system are:
  http://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4
  http://www.freebsd.org/cgi/man.cgi?query=rights&sektion=4
and a bunch more linked therefrom. The full list of capabilities is in the
rights(4) manpage, URL just above.

(I haven't looked into this specific issue, just know some background
which _might_ be useful).

+rwatson@FreeBSD.org

Hello, Robert!

Could you give us a hint about this problem with Capsicum and Unbound, please?

30.01.2014 18:52, W.C.A. Wijngaards wrote:

Forward Robert's answer to the list.

31.01.2014 15:30, Robert N. M. Watson wrote:

Hi Sergey, Robert,

31.01.2014 15:30, Robert N. M. Watson wrote:

I've added Pawel to the CC line in case he has any insights.

Capability limits, in general, apply only to file descriptors
that have been explicitly limited using cap_rights_limit(),
implicitly as a result of accept() from a limited socket, or open
via openat() on a limited directory descriptor. It seems like
there's scope for several possible bugs here:

(1) A previously undetected bug means that the wrong file
descriptor (but correctly limited) is being passed to the system
call -- it's just no one noticed before because waiting on the
wrong event can have subtle-to-spot outcomes sometimes (whereas
writing to the wrong file descriptor is more often obvious!).

So we do not run into regressions, and the software seems to work for
a long time. This must then be some sort of odd corner case in the
code. Generally that would mean a file descriptor is interchanged with
another, and it is closed while still being used. Would that not
result in other error output? How would we find out which file
descriptor is wrongly being used?

(2) A file descriptor is unexpectedly (but correctly) limited --
perhaps returned by a library or inherited from another process,
in which case we need to work out how to limit it less, or at
least figure out what is going on and prevent the problem.

NSD does not make libcapsicum calls, and it uses very few libraries;
OpenSSL is used for nsd-control connections. So I would think this
option does not apply.

(3) A bug exists in the Capsicum implementation, which manifests
once in a while due to a race condition or similar, causing
rights to be lost from capabilities improperly: you make a
legitimate request but the rights are undesirably gone.

This could be the case; from my point of view NSD4 works great except
on FreeBSD10-capsicum.

(4) A bug exists in the file-descriptor implementation such that
you specify the right unlimited file descriptor, but get an
operation on the wrong one which fails.

There should not be 'limited' file descriptors around at all.

The usual tools for debugging this sort of thing are ktrace and
procstat -fC. The latter gets a snapshot while the former
provides lots of detail. You can make ktrace scale somewhat
better by asking it not to log I/O and various other events that
seem less relevant, but agreed that it may be a bit painful if
long runtimes are required to reproduce the problem (and if it's
a race condition, there's a chance you mask it by changing
timing.)

Regardless of where the bug is, some sort of trace would probably be
very helpful. Catching it at the point where it exits is hard since
it takes all day and happens at an unexpected point in time (note that
NSD has multiple processes active; the one listening to nsd-control
uses libevent, the 'master' uses select() to avoid forking issues, and
the 'servers' use libevent to listen to UDP and a number of TCP
connections).

Best regards,
   Wouter
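
Since catching the failure with a full trace is expensive, another option is
to inspect descriptor rights at the moment the event loop reports the error,
just before exiting. The following is a rough sketch under the assumption that
cap_rights_get() and cap_rights_is_set() from FreeBSD's Capsicum API are
available; the function name and the idea of calling it from the fatal-error
path are illustrative only and not existing daemon code.

```c
#include <stdio.h>
#if defined(__FreeBSD__)
/* Assumption: on FreeBSD 10.0 itself this header was still named
 * <sys/capability.h>; later releases provide <sys/capsicum.h>. */
#include <sys/capsicum.h>
#endif

/* Hypothetical diagnostic: walk descriptors 0..max_fd and report every
 * open one that lacks CAP_EVENT, the right needed to poll it.
 * Returns the number of such descriptors, or -1 on platforms where this
 * sketch has no Capsicum support compiled in. */
int report_fds_missing_cap_event(int max_fd)
{
#if defined(__FreeBSD__)
    int missing = 0;
    int fd;

    for (fd = 0; fd <= max_fd; fd++) {
        cap_rights_t rights;

        /* Fails for descriptors that are not open; skip those. */
        if (cap_rights_get(fd, &rights) != 0)
            continue;
        if (!cap_rights_is_set(&rights, CAP_EVENT)) {
            fprintf(stderr, "fd %d lacks CAP_EVENT\n", fd);
            missing++;
        }
    }
    return missing;
#else
    (void)max_fd;
    return -1;
#endif
}
```

A non-zero result would point at option (2), an unexpectedly limited
descriptor; a result of zero right after an ENOTCAPABLE failure would make
options (3) or (4) look more likely. The same snapshot can be taken from
outside the process with procstat -fC, as suggested above.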