Unbound memory resource consumption?

Hi,

I've been running a set of recursive resolvers using both BIND and
unbound for quite a while. This was set up at a time when BIND
supported neither DNS-over-TLS nor DNS-over-HTTPS, so I had set up
unbound to serve those protocols, and from unbound forwarded all the
queries to the BIND instance via

server:
  # Must allow query localhost since we use separate instances:
  do-not-query-localhost: no

# Forward all queries to local recursor
forward-zone:
  name: "."
  forward-addr: 127.0.0.1
  # Use the forwarders cache, don't build own cache
  forward-no-cache: yes

One would have thought that "forward-no-cache" would do what it
claims to do, i.e. not build a cache, and so keep unbound from
ballooning in size. That appears not to be the case: unbound still
grew in both virtual and resident size according to "top". But
while the query volume was low to moderate, it didn't really matter
all that much.

However, I recently came across RFC 9462, "Discovery of Designated
Resolvers", whereby a plain recursive resolver can be set up to
serve "resolver.arpa" and, via the "_dns" label and SVCB records on
that node, indicate to recursive resolver clients where they can
find endpoints for DNS-over-TLS or DNS-over-HTTPS.

As a consequence of setting this up, on the busiest node in our
setup, the DoT query volume on the unbound side increased around
tenfold, and DoH picked up from around zero to around the same
level, up to around 500qps for each of DoT and DoH.

Unbound was running with default (unconfigured) sizing, and at one
point both unbound and BIND were killed by the kernel for exhausting
swap space (and real memory), both at 32GB. Obviously this is "out
of control". So I tried to configure some basic limits on a few of
the most important data structures in unbound via

  # Put some limits on virtual memory consumption
  # to avoid being killed due to "out of swap"...
  rrset-cache-size: 8G
  msg-cache-size: 4G
  key-cache-size: 800m

The strange thing is that these limits are not observably being
obeyed -- "top" currently reports:

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
27045 unbound 25 0 156G 15G CPU/1 23.6H 689% 689% unbound

The virtual size is obviously "over-committed" with respect to
what's actually available (32GB RAM and 32GB swap), possibly via
mmap-based memory allocation(?), and one can *maybe* see it adhering
to the configured limits with respect to the resident set size(?)
However, 25GB of swap is already in use (the OS is not doing
excessive paging, though), and in the time it's taken to write this
message, swap usage has expanded to 29GB, so it's on its way to
being killed again. The elevated CPU time consumption for unbound
is also a bit suspicious; that appears to have started around 3-4
hours ago, whereas unbound was upgraded to 1.20.0 (from 1.19.1) and
restarted around midnight, now more than ten hours ago.
(I have system monitoring via collectd.)
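
For what it's worth, unbound's own view of the cache sizes can be
compared against these process-level numbers; a quick sketch (the
mem.cache.* gauges are part of the normal statistics output):

  % unbound-control stats_noreset | egrep '^mem\.cache\.'
  mem.cache.rrset=...
  mem.cache.message=...

If those gauges stayed far below the configured limits while the
process size kept growing, the growth would be happening outside
the caches.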

Meanwhile, the BIND instance which sees all the queries and
builds its own cache sits at

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
12487 named 85 0 894M 405M kqueue/9 411:21 37.40% 37.40% named

i.e. sub-1GB in both virtual and resident size.

This is all on NetBSD/amd64 9.3.

Obviously the unbound resource consumption is "out of control",
and the configured resource limits appear to do little to nothing
with the virtual memory consumption. I find this concerning.

Ideally I would have liked to bring unbound's virtual memory
consumption in under administrative control. Is that at all
possible?

Or is this simply a memory leak related to either of DoT or DoH?

Guidance sought.

Best regards,

- Håvard

Hm,

no response to the earlier message?

The host in question here is a (backup) recursive resolver node
for our customers, and it has daily query volume peaks somewhere
between 1000 and 1500 qps.

Since my previous message, I've upgraded to unbound 1.21.0, and in
conjunction with that I've started collecting some of the "cache
statistics" from unbound using "unbound-control stats_noreset".

I'm running with the default "cache-max-ttl" setting of 86400, and
analyzing a cache dump shows that all the RRsets have a TTL at or
below this threshold. See attached plot #1 for the distribution per
60s bin. No big surprise there.
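
A sketch of one way to extract such a distribution from a dump
(the ";rrset <ttl> ..." line format is assumed from unbound-control
dump_cache output, so verify the field positions against your own
version):

  % unbound-control dump_cache | \
        awk '$1 == ";rrset" { print 60 * int($2 / 60) }' | \
        sort -n | uniq -c

This counts, per 60-second bin, the cached RRsets whose remaining
TTL falls in that bin.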

Now, one would therefore expect that the RRsets cached during one
day would no longer occupy the cache the following day?

That is, however, not the behaviour I observe. If I plot
mem.cache.rrset and rrset.cache.count over time, I get a steady
increase; see the attached plots #2 and #3 respectively. As a
result, I get ever-increasing memory consumption, instead of memory
usage stabilizing after a while.
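
I sample these via collectd, but the same two series can be
collected with something as simple as this sketch (stats_noreset
avoids clearing the counters between samples):

  % while sleep 300; do
        echo "$(date +%s) $(unbound-control stats_noreset | \
            egrep '^(mem\.cache\.rrset|rrset\.cache\.count)=' | \
            tr '\n' ' ')"
    done >> cache-growth.log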

As stated earlier, my unbound is configured with

  # Put some limits on virtual memory consumption
  # in attempt to avoid being killed due to "out of swap"...
  rrset-cache-size: 8G
  msg-cache-size: 4G
  key-cache-size: 800m

OK, so I've not yet reached any of those thresholds; the RRset
cache memory is at the moment closing in on 1.4G, while unbound
itself as a process is nearing 3.6G:

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
  124 unbound 85 0 3594M 3457M kqueue/0 906:08 12.06% 12.06% unbound

In my perhaps naive assumption as an operator, I would have
thought that the memory used to store a now-expired RRset would
be released, and the count of cached RRsets would be decremented.
But despite the cap on TTL, the count of cached RRsets seems to
be ever increasing, and the same goes for the consumed memory.
Is the memory only going to be released when one of the configured
sizes is approached or exceeded?

I'm trying to understand how I as an operator am supposed to
configure unbound so as to not grow ever larger, falling victim
to having to use swap (slow!), or eventually exceeding the
configured swap space (ouch!).

It's not behaving the way I anticipated.

"Help!"

Best regards,

- Håvard

(attachments)



Hi Håvard,

Unbound does not have a cleaning strategy for the cache.
Records are not evicted based on their TTL status.
Instead, Unbound will try to fill all of the configured memory with data.
Then, when a new entry needs to be cached and there is no space left, the least recently used entry (based on an LRU list) is dropped to free space for the new entry.

Unfortunately, the right size for the cache is not trivial to determine, because it depends heavily on client traffic patterns.

Monitoring the cache-hit rate with different memory configurations can suggest a preferred size for a specific installation.
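
For example, the overall hit rate can be derived from the
cumulative counters; a sketch, with the counter names as they
appear in the stats output:

  % unbound-control stats_noreset | awk -F= '
        $1 == "total.num.cachehits" { hits = $2 }
        $1 == "total.num.cachemiss" { miss = $2 }
        END { if (hits + miss > 0)
                  printf("hit rate: %.1f%%\n",
                         100 * hits / (hits + miss)) }'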

Best regards,
-- Yorgos

Following up on an old message (from Sep 2024):

Unbound does not have a cleaning strategy for the cache.
Records are not evicted based on their TTL status.
Instead, Unbound will try to fill all of the configured memory with
data.
Then, when a new entry needs to be cached and there is no space left,
the least recently used entry (based on an LRU list) is dropped to
free space for the new entry.

OK, that sort of matches up with what I observe, in that the
memory consumption of unbound only increases.

Unfortunately, the right size for the cache is not trivial to
determine, because it depends heavily on client traffic patterns.

Monitoring the cache-hit rate with different memory configurations
can suggest a preferred size for a specific installation.

Well, there are these configuration knobs that I have tuned for
cache size limitation, in the hope that they would be respected:

# grep cache-size unbound.conf
        # rrset-cache-size: 4m
  rrset-cache-size: 4G
        # msg-cache-size: 4m
  msg-cache-size: 3G
        # key-cache-size: 4m
  key-cache-size: 500m
        # neg-cache-size: 1m

You want to disable the Linux transparent hugepage feature (which is a real mess with Unbound -- and not only Unbound, actually) if not already done:

apt install sysfsutils
echo "kernel/mm/transparent_hugepage/enabled = madvise" >> /etc/sysfs.d/transparent_hugepage.conf
echo "kernel/mm/transparent_hugepage/defrag = madvise" >> /etc/sysfs.d/transparent_hugepage.conf
systemctl restart sysfsutils
systemctl restart unbound

You want to disable the Linux transparent hugepage feature (which is
a real mess with Unbound -- and not only Unbound, actually) if not
already done:

apt install sysfsutils
echo "kernel/mm/transparent_hugepage/enabled = madvise" >> /etc/sysfs.d/transparent_hugepage.conf
echo "kernel/mm/transparent_hugepage/defrag = madvise" >> /etc/sysfs.d/transparent_hugepage.conf
systemctl restart sysfsutils
systemctl restart unbound

Thanks for the reply / hint.

Unfortunately, this does not apply in this case, quoting from the
tail of my message from yesterday:

   And lastly, this is on NetBSD/amd64 10.0, using net/unbound
   packaged from pkgsrc.

Just FYI, unbound is continuing to balloon, this is the same
instance as started yesterday:

load averages: 1.00, 1.10, 1.02; up 5+20:37:29 10:03:58
76 processes: 73 sleeping, 1 stopped, 2 on CPU
CPU states: 20.7% user, 0.0% nice, 3.4% system, 1.0% interrupt, 74.9% idle
Memory: 12G Act, 164M Inact, 17M Wired, 27M Exec, 2366M File, 8135M Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 3147M Used / Network: 2298K In, 33

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 38 0 18G 9875M CPU/3 669:49 103% 103% unbound

Regards,

- Håvard

Hello Havard,

So ... what I'm looking for is ... what, if anything, can I do to
find and stop what looks like a massive memory leak? Or ... is
anyone else observing similar symptoms?

I've seen similar issues with Unbound, BIND 9 and other DNS resolvers on Linux, BSD and other Unix systems.

Part of the problem can be the memory allocator used in the OS or in the application. With the many small allocations that are typical for a DNS resolver, memory fragments quickly.

I usually plan for at least 50%, sometimes 100% or 200%, more memory than is configured in the DNS resolver application. That is, if the resolver is configured with 1 GB of cache memory, I make sure there is at least 1.5-3 GB of physical memory available, to account for memory fragmentation.

Testing other memory allocators (ones better suited to many small allocations) might help.

ISC is now recommending "jemalloc" (https://jemalloc.net/) for BIND 9.

I've heard good things about "mimalloc" (https://github.com/microsoft/mimalloc), but never had the chance to test it on a really busy DNS resolver.
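
A low-effort way to experiment is to preload an alternative
allocator at startup rather than recompiling; a sketch, where the
library path is an assumption and will differ per system and
packaging:

LD_PRELOAD=/usr/pkg/lib/libjemalloc.so \
    /usr/pkg/sbin/unbound -c /usr/pkg/etc/unbound/unbound.conf

Whether the dynamic linker honors LD_PRELOAD, and what the library
is called, varies per platform, so rebuilding against the allocator
is the more robust route.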

I did a webinar in 2022 for ISC on the topic (Slide 4.10ff):

https://www.isc.org/docs/2022-webinar-bind-memory-management.pdf

Measuring the "working set" is important (see Slide 5.12ff).

Other interesting resources on memory allocators:

https://blog.reverberate.org/2009/02/one-malloc-to-rule-them-all.html

https://web.archive.org/web/20110607183215/http://blog.pavlov.net/2007/11/21/malloc-replacements/

http://hoard.org/

The books from Brendan Gregg are excellent, but mostly Linux-centric these days (the 1st edition was also about Solaris):

Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)
https://www.brendangregg.com/systems-performance-2nd-edition-book.html

While I run my own little DNS servers on NetBSD and OpenBSD, all my experience with large ISP resolvers is on Linux.

When I was operating a farm of DNS resolvers for a cable ISP in Germany with > 1M subscribers, we configured the DNS resolver cache to be around 1G. At that time (5 years ago), that was the sweet spot for cache size. Larger caches did not help with query performance or DNS resolver load, as most cached records were only queried once and never again from the cache. Your customer base might vary, but a large cache is not always the optimal configuration.

If you need help compiling Unbound or BIND 9 with a custom allocator for/on NetBSD, let me know.

Greetings

Carsten

Hi,

and thanks for the feedback, the general advice, and the pointer to
jemalloc. I may look into that a bit later.

However, in the meantime I have come to the conclusion that there
may be a correlation between, on the one hand, enabling DoH and DoT
and using RFC 9462 to direct clients which probe for
_dns.resolver.arpa to those endpoints, and on the other hand what
really does look like a massive memory leak in unbound. If that is
true, which malloc() you use should not make much of a difference.

To test this hypothesis, I turned off DoH and DoT (diff to the
config attached below; it was only turned on about a month ago),
and also stopped serving resolver.arpa, and then restarted unbound.
Here are a few "top" displays taken over the span of a few hours.
First, after this config change:

load averages: 0.26, 0.20, 0.25; up 6+00:57:31 14:24:00
79 processes: 76 sleeping, 1 stopped, 2 on CPU
CPU states: 4.5% user, 0.0% nice, 2.2% system, 1.0% interrupt, 92.2% idle
Memory: 2702M Act, 7948K Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1574K In, 16

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 398M 268M CPU/2 6:55 30.22% 30.22% unbound

load averages: 0.13, 0.17, 0.21; up 6+01:49:28 15:15:57
79 processes: 77 sleeping, 1 stopped, 1 on CPU
CPU states: 2.8% user, 0.0% nice, 2.0% system, 0.6% interrupt, 94.5% idle
Memory: 2847M Act, 11M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 1234K In, 13

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 85 0 544M 417M kqueue/2 18:13 38.23% 38.23% unbound

load averages: 0.22, 0.11, 0.10; up 6+03:55:58 17:22:27
90 processes: 87 sleeping, 1 stopped, 2 on CPU
CPU states: 1.2% user, 0.0% nice, 1.1% system, 0.2% interrupt, 97.3% idle
Memory: 3040M Act, 18M Inact, 17M Wired, 27M Exec, 2367M File, 17G Free
Swap: 14G Total, 32M Used, 14G Free / Pools: 3149M Used / Network: 648K In, 700

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14982 unbound 43 0 738M 604M CPU/2 38:45 3.61% 3.61% unbound

If we compare this to what I experienced with these options turned
on and a number of DoH / DoT clients using those endpoints, quoting
from yesterday's e-mail:

load averages: 0.86, 0.94, 0.92; up 5+00:58:04 14:24:33
86 processes: 83 sleeping, 1 stopped, 2 on CPU
CPU states: 14.8% user, 0.0% nice, 1.3% system, 0.8% interrupt, 83.0% idle
Memory: 3035M Act, 68M Inact, 17M Wired, 21M Exec, 14M File, 17G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1322K In, 1906K Out

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 40 0 5408M 3033M CPU/2 183:17 78.47% 78.47% unbound

load averages: 0.52, 0.53, 0.52; up 5+02:22:23 15:48:52
85 processes: 82 sleeping, 1 stopped, 2 on CPU
CPU states: 11.4% user, 0.0% nice, 1.8% system, 1.0% interrupt, 85.7% idle
Memory: 3815M Act, 81M Inact, 17M Wired, 21M Exec, 14M File, 16G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2885M Used / Network: 1509K In, 19

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 84 0 6863M 3825M kqueue/0 236:12 39.55% 39.55% unbound

load averages: 0.19, 0.35, 0.41; up 5+04:50:24 18:16:53
85 processes: 1 runnable, 82 sleeping, 1 stopped, 1 on CPU
CPU states: 11.3% user, 0.0% nice, 1.2% system, 0.0% interrupt, 87.4% idle
Memory: 5085M Act, 99M Inact, 17M Wired, 21M Exec, 14M File, 15G Free
Swap: 14G Total, 38M Used, 14G Free / Pools: 2886M Used / Network: 79G In, 107G

  PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
14678 unbound 85 0 9358M 5118M RUN/1 319:53 29.30% 29.30% unbound

You'll notice the difference is quite stark.

Not only is the CPU time much lower (OK, crypto costs, I guess), but
also the trajectory of the virtual size is vastly different:

5408M -> 6863M (1:24h later) -> 9358M (3:52h after 0th measurement)

compared to

398M -> 544M (51m later) -> 738M (2:58h after 0th measurement)

And according to "unbound-control stats" the query rate is
comparable to what it was yesterday.
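
A sketch of one way to compute that rate from the stats output
(total.num.queries and time.elapsed are standard fields;
time.elapsed counts seconds since the last counter reset):

  % unbound-control stats_noreset | awk -F= '
        $1 == "total.num.queries" { q = $2 }
        $1 == "time.elapsed"      { t = $2 }
        END { if (t > 0) printf("average qps: %.1f\n", q / t) }'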

So I suspect there is a serious memory leak, possibly in unbound,
related to the code which does DoH and/or DoT handling.

Diff to our unbound.conf compared to yesterday attached.

Regards,

- Håvard

I suspect there may be a leak in the DoH library, not in unbound itself. I have had it running for years of uptime without any signs of leaks. However, I have previously seen a leak in one of my setups with a (third-party) DoH library.

I suspect there may be a leak in the DoH library, not in
unbound itself. I have had it running for years of uptime
without any signs of leaks. However, I have previously seen a
leak in one of my setups with a (third-party) DoH library.

That's possible. Of the external dependencies for this package
in our setup, the only plausible candidate is nghttp2, and the
installed version is 1.64.0. The dynamic dependencies of unbound
are:

% ldd /usr/pkg/sbin/unbound
/usr/pkg/sbin/unbound:
        -lunbound.8 => /usr/pkg/lib/libunbound.so.8
        -lssl.15 => /usr/lib/libssl.so.15
        -lcrypto.15 => /usr/lib/libcrypto.so.15
        -lcrypt.1 => /lib/libcrypt.so.1
        -lc.12 => /usr/lib/libc.so.12
        -lutil.7 => /usr/lib/libutil.so.7
        -levent.5 => /usr/lib/libevent.so.5
        -lpthread.1 => /usr/lib/libpthread.so.1
        -lnghttp2.14 => /usr/pkg/lib/libnghttp2.so.14
%

I would think this is a "garden-variety" configuration?

Regards,

- Håvard

Yep

14.03.2025 16:05, Havard Eidnes wrote:

Maybe this is off-topic, but I want to know whether you got
_dns.resolver.arpa to work in your environment.

The only device I have that uses the RFC 9462 _dns.resolver.arpa
mechanism is Apple's iPhone. But I can never get the iPhone to use
the DoH or DoT endpoint specified in the _dns.resolver.arpa SVCB
record.

Even when I use Cloudflare's SVCB record, the iPhone still doesn't
want to use it. I mean, the iPhone queries _dns.resolver.arpa SVCB
and one.one.one.one HTTPS/A/AAAA, and then ignores the result.
Maybe it tries to make a TLS connection to one.one.one.one, then
disconnects and gives up.

    _dns.resolver.arpa. IN SVCB 1 one.one.one.one. alpn="h2,h3" port=443 ipv4hint=1.1.1.1,1.0.0.1 ipv6hint=2606:4700:4700::1111,2606:4700:4700::1001 key7="/dns-query{?dns}"

Do you have any suggestion ?

  Cowbay

Maybe this is off-topic, but I want to know whether you got
_dns.resolver.arpa to work in your environment.

I got it to work in that I got a rather sharp increase in DoT and
DoH clients.

However, what I do not have is any identity of the querying
clients, so I can't tell you which OS / version is responding to
the presence of these records in the DNS lookup view.

The _dns records in my local resolver.arpa zone looked like this:

_dns SVCB 1 dns-resolver1.uninett.no. (
                        alpn=h2 dohpath=/dns-query{?dns} )
_dns SVCB 2 dns-resolver1.uninett.no. (
                        alpn=dot port=853 )
_dns SVCB 3 dns-resolver2.uninett.no. (
                        alpn=h2 dohpath=/dns-query{?dns} )
_dns SVCB 4 dns-resolver2.uninett.no. (
                        alpn=dot port=853 )

Best regards,

- Håvard

So...

I've finally done some testing with gcc's -fsanitize=leak, and
after a few false starts I have now managed to get some results.
I've submitted my findings in

   https://github.com/NLnetLabs/unbound/issues/1264

The TL;DR of it is that the leak sanitizer found leaks related
to the use of libnghttp2.
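
For anyone wanting to reproduce this, the build boils down to
something like the following sketch (a plain source build is shown;
the pkgsrc integration and the config path are illustrative):

  % ./configure CC=gcc CFLAGS="-g -O1 -fsanitize=leak" \
        LDFLAGS="-fsanitize=leak"
  % make
  # run in the foreground; LeakSanitizer prints its report on exit
  % ./unbound -d -c /usr/pkg/etc/unbound/unbound.conf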

Regards,

- Håvard