Unbound should probably manage other RLIMITs as well

I have unbound-1.3.0 running on a pair of local hosts as the main
recursive resolvers on my local IPv4-only network, and sometimes one or
the other instance dies. This time I've got enough records to show that
it was claiming it ran out of memory.

Syslog messages from just prior to, and during, the crash are appended
for completeness.

I have the following non-default memory-usage related settings:

  num-threads: 1
  rrset-cache-size: 256m
  rrset-cache-slabs: 8

However, note that the soft datasize RLIMIT is exactly 256m, through
some accident in how my system and kernel are configured, and this is
of course slightly too small for Unbound's current configuration:

  # sysctl proc.25953
  proc.25953.corename = %n.core
  proc.25953.rlimit.cputime.soft = unlimited
  proc.25953.rlimit.cputime.hard = unlimited
  proc.25953.rlimit.filesize.soft = unlimited
  proc.25953.rlimit.filesize.hard = unlimited
  proc.25953.rlimit.datasize.soft = 268435456
  proc.25953.rlimit.datasize.hard = 3221225472
  proc.25953.rlimit.stacksize.soft = 67108864
  proc.25953.rlimit.stacksize.hard = 67108864
  proc.25953.rlimit.coredumpsize.soft = unlimited
  proc.25953.rlimit.coredumpsize.hard = unlimited
  proc.25953.rlimit.memoryuse.soft = 3920551936
  proc.25953.rlimit.memoryuse.hard = 3920551936
  proc.25953.rlimit.memorylocked.soft = 1306850645
  proc.25953.rlimit.memorylocked.hard = 3920551936
  proc.25953.rlimit.maxproc.soft = 2068
  proc.25953.rlimit.maxproc.hard = 2068
  proc.25953.rlimit.descriptors.soft = 6668
  proc.25953.rlimit.descriptors.hard = 6668
  proc.25953.rlimit.sbsize.soft = unlimited
  proc.25953.rlimit.sbsize.hard = unlimited
  proc.25953.stopfork = 0
  proc.25953.stopexec = 0
  proc.25953.stopexit = 0

So, this crash seems to have happened because the cache finally fills
close to the specified limit, but since the process can't grow that
big, malloc() reports being out of memory and eventually Unbound gives
up and dies.

I think Unbound should probably try to manage other RLIMIT values
besides just the number of open files.
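The RLIMIT_NOFILE adjustment Unbound already performs could, in
principle, be extended to RLIMIT_DATA. A minimal sketch of such a
startup check (ensure_data_rlimit is an invented name, not Unbound
code; raising the soft limit needs no privilege as long as it stays at
or below the hard limit):

```c
#include <sys/resource.h>

/* Hypothetical startup check: make sure the RLIMIT_DATA soft limit can
 * cover the configured cache sizes, analogous to the RLIMIT_NOFILE
 * adjustment Unbound already performs.  Returns 0 on success, -1 when
 * the hard limit forbids the required size (the right moment to log a
 * warning and refuse to start, or shrink the caches). */
static int ensure_data_rlimit(rlim_t need)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_DATA, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= need)
        return 0;                     /* soft limit already big enough */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < need)
        return -1;                    /* even the hard limit is too small */
    rl.rlim_cur = need;               /* raise soft limit; no privilege is
                                       * needed below the hard limit */
    return setrlimit(RLIMIT_DATA, &rl) == 0 ? 0 : -1;
}
```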

It could even try using getrusage(2), where available, to discover its
initial memory requirements at startup (though on many systems
getrusage(2) doesn't really provide enough useful information).
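For reference, sampling the process's own usage might look like the
following sketch. The caveat above applies: which fields getrusage(2)
fills in, and their units, vary between systems (ru_maxrss is in
kilobytes on Linux and most BSDs):

```c
#include <sys/resource.h>

/* Sketch: sample this process's own peak resident set size, e.g. right
 * after initialisation, to estimate its baseline memory requirements.
 * Portability caveat: some kernels leave several rusage fields zero,
 * and ru_maxrss units differ across systems. */
static long startup_maxrss_kb(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```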

Oct 27 01:54:42 once unbound: [12262:0] info: server stats for thread 0: 450 queries, 56 answers from cache, 394 recursions
Oct 27 01:54:42 once unbound: [12262:0] info: server stats for thread 0: requestlist max 9 avg 4.35787 exceeded 0
Oct 27 01:54:42 once unbound: [12262:0] info: average recursion processing time 8.471841 sec
Oct 27 01:54:42 once unbound: [12262:0] info: histogram of recursion processing times
Oct 27 01:54:42 once unbound: [12262:0] info: [25%]=0.016384 median[50%]=0.00119156 [75%]=0.118561
Oct 27 01:54:42 once unbound: [12262:0] info: lower(secs) upper(secs) recursions
Oct 27 01:54:42 once unbound: [12262:0] info: 0.000000 0.000001 4
Oct 27 01:54:42 once unbound: [12262:0] info: 0.001024 0.002048 47
Oct 27 01:54:42 once unbound: [12262:0] info: 0.002048 0.004096 10
Oct 27 01:54:42 once unbound: [12262:0] info: 0.004096 0.008192 3
Oct 27 01:54:42 once unbound: [12262:0] info: 0.032768 0.065536 2
Oct 27 01:54:42 once unbound: [12262:0] info: 0.065536 0.131072 130
Oct 27 01:54:42 once unbound: [12262:0] info: 0.131072 0.262144 110
Oct 27 01:54:42 once unbound: [12262:0] info: 0.262144 0.524288 22
Oct 27 01:54:42 once unbound: [12262:0] info: 0.524288 1.000000 3
Oct 27 01:54:42 once unbound: [12262:0] info: 1.000000 2.000000 3
Oct 27 01:54:42 once unbound: [12262:0] info: 2.000000 4.000000 3
Oct 27 01:54:42 once unbound: [12262:0] info: 4.000000 8.000000 2
Oct 27 01:54:42 once unbound: [12262:0] info: 8.000000 16.000000 4
Oct 27 01:54:42 once unbound: [12262:0] info: 16.000000 32.000000 8
Oct 27 01:54:42 once unbound: [12262:0] info: 32.000000 64.000000 19
Oct 27 01:54:42 once unbound: [12262:0] info: 64.000000 128.000000 24
Oct 27 01:58:41 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 01:59:42 once unbound: [12262:0] info: server stats for thread 0: 552 queries, 126 answers from cache, 426 recursions
Oct 27 01:59:42 once unbound: [12262:0] info: server stats for thread 0: requestlist max 8 avg 3.6385 exceeded 0
Oct 27 01:59:42 once unbound: [12262:0] info: average recursion processing time 9.901999 sec
Oct 27 01:59:42 once unbound: [12262:0] info: histogram of recursion processing times
Oct 27 01:59:42 once unbound: [12262:0] info: [25%]=0.0156245 median[50%]=0.0624979 [75%]=0.114123
Oct 27 01:59:42 once unbound: [12262:0] info: lower(secs) upper(secs) recursions
Oct 27 01:59:42 once unbound: [12262:0] info: 0.000000 0.000001 2
Oct 27 01:59:42 once unbound: [12262:0] info: 0.001024 0.002048 56
Oct 27 01:59:42 once unbound: [12262:0] info: 0.002048 0.004096 4
Oct 27 01:59:42 once unbound: [12262:0] info: 0.004096 0.008192 1
Oct 27 01:59:42 once unbound: [12262:0] info: 0.016384 0.032768 2
Oct 27 01:59:42 once unbound: [12262:0] info: 0.032768 0.065536 7
Oct 27 01:59:42 once unbound: [12262:0] info: 0.065536 0.131072 151
Oct 27 01:59:42 once unbound: [12262:0] info: 0.131072 0.262144 116
Oct 27 01:59:42 once unbound: [12262:0] info: 0.262144 0.524288 28
Oct 27 01:59:42 once unbound: [12262:0] info: 0.524288 1.000000 11
Oct 27 01:59:42 once unbound: [12262:0] info: 1.000000 2.000000 5
Oct 27 01:59:42 once unbound: [12262:0] info: 2.000000 4.000000 1
Oct 27 01:59:42 once unbound: [12262:0] info: 4.000000 8.000000 2
Oct 27 01:59:42 once unbound: [12262:0] info: 8.000000 16.000000 5
Oct 27 01:59:42 once unbound: [12262:0] info: 16.000000 32.000000 11
Oct 27 01:59:42 once unbound: [12262:0] info: 32.000000 64.000000 11
Oct 27 01:59:42 once unbound: [12262:0] info: 64.000000 128.000000 12
Oct 27 01:59:42 once unbound: [12262:0] info: 128.000000 256.000000 3
Oct 27 01:59:42 once unbound: [12262:0] info: 256.000000 512.000000 4
Oct 27 02:01:24 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:01:24 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:01:24 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:01:24 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:01:25 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:01:26 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:01:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:01:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:01:56 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:01:57 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:03 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:03 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:03 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:03 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:13 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:13 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:14 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:15 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:16 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:17 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:35 once last message repeated 2 times
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:35 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:35 once last message repeated 2 times
Oct 27 02:02:35 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:36 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:36 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:36 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:36 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:38 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:38 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:38 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:38 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:39 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:39 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:40 once unbound: [12262:0] error: find_delegation: addrs out of memory
Oct 27 02:02:40 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:44 once last message repeated 3 times
Oct 27 02:02:44 once unbound: [12262:0] error: mesh_state_attachment: out of memory
Oct 27 02:02:44 once last message repeated 275 times
Oct 27 02:02:44 once unbound: [12262:0] error: malloc failure: allocating incoming dns_msg
Oct 27 02:02:45 once last message repeated 18 times
Oct 27 02:02:48 once unbound: [12262:0] fatal error: prealloc: out of memory

Hi Greg,

it was claiming that it ran out of memory.
I have the following non-default memory-usage related settings:
  rrset-cache-size: 256m

Note that, due to malloc overhead, I expect unbound to grow to double
the size of what it actually uses. So, this grows to 512m plus some for
your other caches. (This is because of the libc memory allocator's
algorithm.)

Note that it is also useful to increase the message cache size: you
have been using 256m to store rrsets, which helped recursion but not
answering clients from cache. Try msg-cache-size: 128m.

Since you do not seem to need that much cache, perhaps:
rrset-cache-size: 64m
msg-cache-size: 32m
That totals 96m; add some more for other structures, double it for
malloc overhead, and this fits in 256m much more easily.
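The arithmetic above can be written out as a rule of thumb (the
doubling factor is an allocator-overhead estimate from this message,
not a measured constant, and the helper name is invented):

```c
#include <stddef.h>

/* Rule of thumb from the advice above: the process grows to roughly
 * twice the configured cache totals, plus whatever its other
 * structures need. */
static size_t expected_process_bytes(size_t rrset_cache, size_t msg_cache,
                                     size_t other)
{
    return 2 * (rrset_cache + msg_cache) + other;
}
```

With rrset-cache-size 64m and msg-cache-size 32m this gives 192m, which
fits under the 256m datasize limit; the original 256m rrset cache alone
implies more than 512m, which does not.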

Or change the rlimit, of course.

However, note that the soft datasize RLIMIT is exactly 256m, through
some accident in how my system and kernel are configured, and this is
of course slightly too small for Unbound's current configuration:

So, this crash seems to have happened because the cache finally fills
close to the specified limit, but since the process can't grow that
big, malloc() reports being out of memory and eventually Unbound gives
up and dies.

Yes, I have looked through the kindly attached logs and fixed the
offending error path to fail more softly. This should let unbound
continue operation a little longer under such conditions.

I think Unbound should probably try to manage other RLIMIT values
besides just the number of open files.

I am hesitant, for 'code bloat' reasons, to do this. I see this is
mostly something for BSD and Solaris, as Linux does not seem to impose
limits by default.

It could even try using getrusage(2), where available, to discover its
initial memory requirements at startup (though on many systems
getrusage(2) doesn't really provide enough useful information).

Oct 27 02:02:48 once unbound: [12262:0] fatal error: prealloc: out of memory

getrlimit(RLIMIT_DATA) is more useful, I guess. The code bloat is in
trying to estimate its own memory usage.

Best regards,
    Wouter

Why not try to auto-size the various memory pools based on the memory
rlimit, and warn if configured pool sizes are likely to exceed the rlimit?

Tony.

>
> > I think Unbound should probably try to manage other RLIMIT values
> > besides just the number of open files.
>
> I am hesitant, for 'code bloat' reasons, to do this. I see this is
> mostly something for BSD and Solaris, as Linux does not seem to impose
> limits by default.

I really can't see the sense (in the server realm at least, and
especially outside of the low-end embedded world) of trying to use a
server in production that doesn't support some kind of per-process
resource limit controls; and if I'm not mistaken, all modern POSIX-like
server OS platforms do support such controls, mostly in a way directly
compatible with getrlimit(2)/setrlimit(2).

Note that getrlimit() and setrlimit() have been part of the Single UNIX
Specification in Version 2 since 1997, that's well over a decade now:

  http://opengroup.org/onlinepubs/007908799/xsh/getrlimit.html

It's not really relevant whether any given kernel or environment sets
such limits by default, just so long as they can be set.

Why not try to auto-size the various memory pools based on the memory
rlimit, and warn if configured pool sizes are likely to exceed the rlimit?

Something like that would probably make a little bit more sense.

However, if sizing could be controlled by simply growing until there's
a malloc() failure, then that would solve lots of problems in what I
think would be the simplest way possible; i.e. eliminating all system
dependencies, all need for special in-application user configuration
controls, the need to estimate malloc() overhead (since that can change
per runtime instance), etc.

(Of course that would mean allocating growing things in whole units so
that a failure could be rerouted to reuse of an existing allocation, and
being sure to pre-allocate everything else needed at or near startup.)

(And of course some initial table size ratios could be automatically
determined by first querying getrlimit() to see what total usage will
be allowed.)
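A sketch of what such ratio-based auto-sizing from getrlimit() might
look like, assuming the 2:1 rrset:msg split suggested elsewhere in this
thread and the factor-of-two malloc-overhead estimate (autosize_caches
is an invented name):

```c
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical auto-sizing: reserve some headroom from the RLIMIT_DATA
 * soft limit, halve the rest for allocator overhead, and split what is
 * left 2:1 between the rrset and message caches. */
static void autosize_caches(rlim_t data_limit, size_t headroom,
                            size_t *rrset, size_t *msg)
{
    size_t budget = ((size_t)data_limit - headroom) / 2; /* halve for
                                                          * malloc overhead */

    *rrset = budget * 2 / 3;     /* 2:1 rrset:msg split */
    *msg   = budget - *rrset;
}
```

With a 256m limit and 64m of headroom this yields a 64m rrset cache and
a 32m message cache, matching the sizes suggested earlier in the thread.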

If you want to make things a little more complicated then you could
"trap" on malloc() failure and try to increase the soft limit (up to the
hard limit of course) in order to allow for retry on the current
allocation and then set the table size limits at the current usage.
Then you'd just have to work out some reasonable values to suggest to
admins so that they can give some headroom in the soft limit.
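The "trap and retry" idea above might look roughly like this
(xalloc_retry is an invented name; note also that whether RLIMIT_DATA
actually constrains malloc() varies by OS and allocator):

```c
#include <stdlib.h>
#include <sys/resource.h>

/* Sketch: on malloc() failure, try raising the RLIMIT_DATA soft limit
 * toward the hard limit (allowed without privilege) and retry the
 * allocation once. */
static void *xalloc_retry(size_t n)
{
    void *p = malloc(n);
    struct rlimit rl;

    if (p != NULL)
        return p;
    if (getrlimit(RLIMIT_DATA, &rl) == 0 && rl.rlim_cur < rl.rlim_max) {
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_DATA, &rl) == 0)
            p = malloc(n);            /* one retry under the raised limit */
    }
    return p;  /* may still be NULL: the caller should then fall back to
                * recycling cache entries */
}
```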

I'm not sure I understand. What is the relationship between the
msg-cache-size and the rrset-cache-size? They seem to default to being
the same size, but I thought that was just a coincidence.

I'm not even sure what msg-cache-size is used for. The documentation
for msg-cache-size just says, very unhelpfully:

    "Number of bytes size of the message cache"

The on-line "how to optimise" page gives the rule you suggest, but also
without any explanation:

    "Use roughly twice as much rrset cache memory as you use msg cache
     memory."

I have assumed, perhaps wrongly, that the msg-cache is simply the size
of the cache for messages that have been received from clients and
upstream nameservers but not yet parsed and handled. I'm guessing it
could be used to format replies to be sent to clients too. I'm not sure
why it would be useful for it to be up to 1/2 the size of
rrset-cache-size.

If there is such a fixed relationship, then should it not be encoded
into the configuration heuristics, such that the user need only specify
one value for sizing the caches?

I wish I had time to learn the code!

* Greg A. Woods:

I really can't see the sense (in the server realm at least, and
especially outside of the low-end embedded world) of trying to use a
server in production that doesn't support some kind of per-process
resource limit controls; and if I'm not mistaken, all modern POSIX-like
server OS platforms do support such controls, mostly in a way directly
compatible with getrlimit(2)/setrlimit(2).

On the other hand, most server software does NOT cope gracefully with
resource allocation failures (there's a popular enterprise software
stack which requires a server restart if a memory allocation failure
occurs, for instance).

(And for historic reasons, most DNS resolvers do not run with
configured cache size limits.)

Note that getrlimit() and setrlimit() have been part of the Single UNIX
Specification in Version 2 since 1997, that's well over a decade now:

  http://opengroup.org/onlinepubs/007908799/xsh/getrlimit.html

It's not really relevant whether any given kernel or environment sets
such limits by default, just so long as they can be set.

I think those limits are only meaningful if you implement your own
memory allocator. You can't really know why you're approaching the
process limit if all you do is call malloc to get heap blocks of
varying lengths. Operating systems and OS versions vary wildly.

On the other hand, most server software does NOT cope gracefully with
resource allocation failures (there's a popular enterprise software
stack which requires a server restart if a memory allocation failure
occurs, for instance).

One should not aspire to meet just the lowest common denominator, and
certainly never to meet just the minimum of all alternatives.

BIND-8 and newer (i.e. since at least 1997) has had code to allow it
to set various RLIMIT values at startup, thus at least crudely imposing
its own controls upon itself.

That's sort of where I got the idea, though here my initial idea was to
automatically make sure _all_ of the RLIMIT settings were not in
conflict with the other application-specific settings (i.e. in the same
way RLIMIT_NOFILE is currently adjusted to meet requirements,
RLIMIT_DATA and RLIMIT_STACK could also be automatically adjusted to
meet expected requirements). Much better, though, would be automatic
sizing of each tunable based on the current RLIMIT values set by the
system manager. Though Unbound starts as root and could raise soft
limits, it's best to honour them instead.

Indeed, BIND-8 and newer have punted on the auto-sizing feature I
describe, and they do behave badly if malloc() fails for whatever
reason. The BIND-9 documentation recommends against using the RLIMIT
datasize setting to restrict the process size.

(And for historic reasons, most DNS resolvers do not run with
configured cache size limits.)

Well, BIND-9 has had cache size controls for well over 8 years now
(since 9.2.0, in official releases) -- I'm not sure what you mean by
"configured cache size limits" though. I'd say only a very naive
sysadmin would let his caching resolvers run unbounded if at all
possible.

I think those limits are only meaningful if you implement your own
memory allocator. You can't really know why you're approaching the
process limit if all you do is call malloc to get heap blocks of
varying lengths. Operating systems and OS versions vary wildly.

Well, that's sort of what I was suggesting, though not in the sense of
replacing malloc(), but rather in being smart about how any memory
allocations which "grow" during runtime are managed such that a failure
is not fatal to the current operation, but instead simply causes reuse
of the LRU entry in the cache.
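That reuse-on-failure behaviour can be illustrated with a toy
doubly-linked LRU list built from fixed-size units (all names are
invented for this sketch; this is not Unbound's cache code):

```c
#include <stdlib.h>
#include <string.h>

/* Entries are fixed-size units, so when a fresh allocation fails the
 * least-recently-used entry's memory is recycled instead of failing
 * the current operation. */
struct entry {
    struct entry *prev, *next;   /* MRU list links */
    char data[256];              /* fixed-size unit */
};

struct lru_cache {
    struct entry *head;          /* most recently used */
    struct entry *tail;          /* least recently used */
};

/* Obtain memory for a new entry, recycling the LRU entry on failure. */
static struct entry *cache_slot(struct lru_cache *c)
{
    struct entry *e = malloc(sizeof *e);

    if (e == NULL) {
        e = c->tail;             /* out of memory: reuse the LRU entry */
        if (e == NULL)
            return NULL;         /* nothing to recycle either */
        c->tail = e->prev;
        if (c->tail != NULL)
            c->tail->next = NULL;
        else
            c->head = NULL;
    }
    memset(e->data, 0, sizeof e->data);
    e->prev = NULL;              /* link in at the MRU position */
    e->next = c->head;
    if (c->head != NULL)
        c->head->prev = e;
    else
        c->tail = e;
    c->head = e;
    return e;
}
```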

In fact what Wouter said in reply to me was that the next release will
actually try to behave much better when it runs out of memory
"unexpectedly". Perhaps this behaviour can be perfected.

From an operations point of view one doesn't really care how big the
internal cache table is in bytes or numbers of records, etc. One only
cares that the cache stays within the bounds that make sense in the
context it is running in, e.g. so that it doesn't cause undue paging by
exceeding the total real-memory requirements of the whole system's
working page set.

In fact trying to meet operational requirements by having to calculate
how many bytes or records can safely be used by a cache or by each of a
set of caches is simply punting all the mundane calculations onto the
poor human operator.

The best way to meet this kind of operational requirement is to allow
the operator to tune the various resource limits applied to each
process, and then to have the processes deal with those limits in the
most appropriate way. For a process that leaks memory, the best action
is to die, and hopefully be restarted. For a process that intentionally
tries to use as much memory as is available for caching, the best
approach to automatic tuning is to keep using memory until it runs out
and then reuse old entries: perhaps first freeing some additional old
entries to give headroom for other infrequent uses, and perhaps
retrying allocations periodically in case more resources are made
available to the process.