I have just ordered four SuperMicro servers with quad-core Xeon CPUs and 36 GB of RAM to use in a geographically redundant setup behind a load-balancing cluster. We are using this configuration already, but with older servers that need to be replaced. Unbound is doing a great job already, but what we really need to keep our customers happy is a massive cache.
So, what are your suggestions for cache size settings with 36 GB of RAM? I think it would be reasonable to let Unbound use 32 GB so the servers have plenty of room to breathe.
Generally, a larger cache can improve latency by raising the cache hit
ratio, because more records can be kept in memory. I'm not certain
whether too much memory introduces anything harmful (beyond power
consumption and the failure rate of the memory modules themselves).
But according to my tests, around 3-4 GB of memory per Unbound cache
server seems to be enough (e.g. rrset-cache-size: 2g, msg-cache-size:
1g), even in a large ISP environment. More memory won't improve the
cache hit ratio meaningfully. The PowerDNS recursor documentation gives
a similar suggestion [1].
Note that Unbound will consume more memory than you specify due to
malloc() overhead [2], and we will need more cache memory as
DNSSEC-signed zones are deployed more widely.
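To make that concrete, a minimal unbound.conf sketch of that sizing
(the values mirror the ones above and are illustrative, not a
recommendation for your hardware; because of the malloc() overhead,
budget roughly double the configured totals for actual memory use):

    server:
        # configured cache sizes; expect actual resident memory to be
        # noticeably higher than the sum of these (malloc() overhead)
        rrset-cache-size: 2g
        msg-cache-size: 1g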
This is an ISP setup, of course, designed to withstand just about anything we have had issues with in the past.
We used PowerDNS a while ago and it was quite speedy, but I think Unbound outperforms just about anything. Could it perhaps be more efficient at cache management as well?
I have seen configurations in the thread with many gigabytes of cache, so how does it work for you all?
I run Unbound with rrset-cache-size = msg-cache-size = 8000m, as
I found the two grew fairly in parallel. Filling those 8 GB of cache
took a long while too (several days at 30-40k qps), so without having
looked too closely at the details, it should give far more caching
than is really needed to keep the long tail (rare, but high-TTL
queries) happy.
Unbound runs with 4 threads, which keeps it comfortable even in
failover (double or triple load conditions). I started out with 16
threads, but it ended up doing more thread housekeeping than real work.
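For reference, a sketch of what that configuration looks like; the
cache sizes and thread count are from the setup described above, while
the *-slabs settings are my addition (the usual advice is a power of
two close to num-threads, to reduce lock contention on the shared
caches):

    server:
        num-threads: 4
        msg-cache-size: 8000m
        rrset-cache-size: 8000m
        # one slab per thread (power of 2) to reduce lock contention
        msg-cache-slabs: 4
        rrset-cache-slabs: 4
        infra-cache-slabs: 4
        key-cache-slabs: 4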
Some other notes: Beware of Unbound 1.4.17! We've been hit maybe 5-10
times by this bug over the last year or so: http://unbound.nlnetlabs.nl/pipermail/unbound-users/2012-July/002493.html
We work around it by monitoring the Unbound process with monit, and
I'll still do that for newer versions since the last message in the
thread ended up in a sort of cliffhanger with no conclusion.
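In case it helps anyone, a minimal monit sketch of that watchdog; the
pidfile and init script paths are assumptions for a typical Linux
install, so adjust to taste:

    check process unbound with pidfile /var/run/unbound/unbound.pid
        start program = "/etc/init.d/unbound start"
        stop program = "/etc/init.d/unbound stop"
        # restart if the daemon stops answering DNS on localhost
        if failed host 127.0.0.1 port 53 type udp protocol dns then restart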
You're doing proper failover and balancing, naturally, but make sure
the processes are kept alive. Also, don't assume that users and stub
resolvers will cleverly switch to your secondary resolver if the
first one is unavailable (libc tries the primary resolver first on
every lookup, regardless of any 'rotate' option in resolv.conf, for
example) -- always, always, always keep your primary available.
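For what it's worth, you can at least shorten the pain on the client
side with glibc resolv.conf options (the addresses below are
placeholders and the timings illustrative), though this only reduces
the per-lookup delay rather than fixing the behaviour:

    nameserver 192.0.2.1    # primary -- keep this one alive
    nameserver 192.0.2.2
    options timeout:1 attempts:2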
For further tweaking, consider enabling Unbound's prefetch option,
although the impact of leaving it disabled is probably negligible.
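Enabling it is a one-liner; prefetch-key is a related option I'd
mention alongside it for DNSSEC-validating setups (my addition, not
something implied above):

    server:
        # refresh popular cache entries shortly before their TTL expires
        prefetch: yes
        # fetch DNSKEY records early during validation
        prefetch-key: yes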
Same here, and while it performs nicely, two things kept cropping up:
* In rare cases, cache entries never expired and were kept around
forever, meaning that it kept responding with the old data way after
TTL expiry. It was pretty much impossible to debug, and ended up
being a dealbreaker, unfortunately.
* Upon start/restart, the flood of new requests hitting a cold cache
would somehow cause it to fail a lot of queries (queue full?) and
return SERVFAIL. This response would then be cached for quite a while
(60s?), causing it to serve SERVFAIL for even the most popular names
like Facebook, Google and YouTube. It can be resolved with a slow-start
mechanism in the load balancer, though.
As for performance itself, have you observed any significant and/or
reproducible differences between PowerDNS and Unbound?
> We work around it by monitoring the Unbound process with monit, and
> I'll still do that for newer versions since the last message in
> the thread ended up in a sort of cliffhanger with no conclusion.
That bug has been fixed, but it did take some time. I have made two
fixes for that same bug; one as a fallback for the other, for extra
robustness in the code with regard to the issue (a
callback-within-a-callback type of issue).