Resolving Timeouts/Issues

Hello!

I'm looking at using Unbound as a replacement for BIND 9 for our datacenters' caching nameservers. BIND is overly bloated and complex for something as simple as providing a DNS resolver for our customers. So far Unbound seems streamlined and fast; I like it.

Earlier this morning I implemented Unbound on one of our lesser-used caching nameservers to try it out under some actual load. I came across a problem, and hopefully you all can give me a hand with it. Here is a snippet of some applicable logs.

[1223447403] unbound[4318:0] info: validator operate: query <fox.com. A IN>
[1223447403] unbound[4318:0] info: resolving <fox.com. A IN>
[1223447403] unbound[4318:0] info: resolving (init part 2): <fox.com. A IN>
[1223447403] unbound[4318:0] info: resolving (init part 3): <fox.com. A IN>
[1223447403] unbound[4318:0] info: processQueryTargets: <fox.com. A IN>
[1223447403] unbound[4318:0] info: sending query: <fox.com. A IN>
[1223447403] unbound[4318:0] info: 345RDd mod1 rep <fox.com. A IN>
[1223447403] unbound[4318:0] info: 345RDd mod1 rep <fox.com. A IN>
[1223447403] unbound[4318:0] info: 345RDd mod1 rep <fox.com. A IN>
[1223447403] unbound[4318:0] info: 345RDd mod1 rep <fox.com. A IN>
[1223447403] unbound[4318:0] info: 345RDd mod1 rep <fox.com. A IN>
…insert hundreds of repeats of this log entry…
[1223447441] unbound[4318:0] info: 339RDdc mod1 rep <fox.com. A IN>
[1223447441] unbound[4318:0] info: 339RDdc mod1 rep <fox.com. A IN>
[1223447441] unbound[4318:0] info: 339RDdc mod1 rep <fox.com. A IN>
[1223447441] unbound[4318:0] info: 339RDdc mod1 rep <fox.com. A IN>
[1223447441] unbound[4318:0] info: 339RDdc mod1 rep <fox.com. A IN>
[1223447441] unbound[4318:0] info: iterator operate: query <fox.com. A IN>
[1223447441] unbound[4318:0] info: scrub for <fox.com. NS IN>
[1223447441] unbound[4318:0] info: response for <fox.com. A IN>
[1223447441] unbound[4318:0] info: reply from <fox.com.> 212.187.244.39#53
;; fox.com. IN A
fox.com. 600 IN A 69.10.20.100
[1223447441] unbound[4318:0] info: finishing processing for <fox.com. A IN>
[1223447441] unbound[4318:0] info: validator operate: query <fox.com. A IN>

After 4-5 queries and timeouts with nslookup/dig I eventually get the response shown above. Any ideas? Is something wrong with my config?

Thanks!

Config:

cache-ns6:/usr/local/etc/unbound# cat unbound.conf
server:
directory: "/usr/local/etc/unbound"
username: unbound
chroot: "/usr/local/etc/unbound"
logfile: "/usr/local/etc/unbound/unbound.log"
pidfile: "/usr/local/etc/unbound/unbound.pid"
interface: 0.0.0.0
access-control: 0.0.0.0/0 allow
root-hints: “/usr/local/etc/unbound/named.cache”
do-ip6: no
outgoing-num-tcp: 100
incoming-num-tcp: 100
msg-cache-size: 1500m
msg-cache-slabs: 8
statistics-interval: 30

Hi Dave,

OK, the hundreds of repeats are because you are running at high
verbosity; after each query it prints the full requestlist (it is
servicing hundreds of other queries). I conclude this is an excerpt of
the logfile, which is probably huge.

The entry:
  345RDd mod1 rep <fox.com. A IN>
means:
it was number 345 in the requestlist. RD means recursion is desired by
the client, and 'd' means detached (no parent queries). mod1 is the
iterator by default. 'rep' means a user is waiting for the reply.

It takes a long time, and then it becomes:
  339RDdc mod1 rep <fox.com. A IN>
The added 'c' means that it has child states: the fox.com query is
waiting on other queries. Usually these are deeper queries for
nameserver addresses.

The only reason it would query for additional nameservers in this manner
is when the current one is not responding.
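That flag notation can be decoded mechanically. Here is a small illustrative Python parser I sketched from the explanation above (my own code, not anything from unbound itself):

```python
import re

# Illustrative parser for requestlist entries such as
# "339RDdc mod1 rep <fox.com. A IN>", per the explanation above:
# leading number = position in the requestlist, "RD" = recursion
# desired by the client, "d" = detached (no parent queries),
# "c" = has child states (waiting on deeper sub-queries).
ENTRY_RE = re.compile(
    r"(?P<num>\d+)"          # position in the requestlist
    r"(?P<rd>RD)?"           # recursion desired
    r"(?P<detached>d)?"      # detached
    r"(?P<children>c)?"      # has child states
    r" (?P<mod>mod\d+)"      # module servicing it (mod1 = iterator)
    r" rep <(?P<query>[^>]+)>"
)

def parse_entry(line):
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return {
        "position": int(m.group("num")),
        "recursion_desired": m.group("rd") is not None,
        "detached": m.group("detached") is not None,
        "has_children": m.group("children") is not None,
        "module": m.group("mod"),
        "query": m.group("query"),
    }

print(parse_entry("339RDdc mod1 rep <fox.com. A IN>"))
```

Read this way, a long run of 'RDd' entries followed by 'RDdc' entries means the query sat queued for a while and then started waiting on child lookups.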

Sorry for the rambling. I think these timeouts are caused by not
having enough file descriptor space; it cannot keep up and the queue
grows very long. Try:
  outgoing-range: 900
  outgoing-num-tcp: 30
  incoming-num-tcp: 30
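The descriptor arithmetic behind that advice can be sanity-checked. The 20-descriptor overhead below is my assumption for listening sockets, logfile and such, not a figure from unbound:

```python
# File-descriptor budget for the suggested settings, against the
# common 1024-per-process default soft limit (ulimit -n).
outgoing_range = 900     # UDP sockets for queries to upstream servers
outgoing_num_tcp = 30    # TCP connections to upstream servers
incoming_num_tcp = 30    # TCP connections from clients
overhead = 20            # assumed: listening sockets, logfile, pidfile

total = outgoing_range + outgoing_num_tcp + incoming_num_tcp + overhead
print(f"{total} descriptors needed of a typical 1024 soft limit")
assert total <= 1024  # fits; the old 100/100 TCP settings left less room
```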

Best regards,
   Wouter

Dave Ellis wrote:

Wouter,

Thanks for the info, I made the changes to my config and I'm still
experiencing the same issue. Here is what I've got in my config now.

Hi Dave,

Aha, you need to configure for more performance then. I assume you have
a multiprocessor machine (since you set your slabs to 8). Because of
port randomization it needs loads of file descriptors.

Recompile unbound, but before recompiling do
  ./configure --without-pthreads --without-solaris-threads
then recompile and reinstall. This enables a special forking mode.

config changes, add the line:
  num-threads: 8 # if you have 8 cores on the machine.

It is also possible to reconfigure using --with-libevent with almost the
same config. Depending on OS and libevent version, that may be faster
or buggy.

Also, you should increase the rrset cache. You need about twice as
much rrset cache as msg cache.
  rrset-cache-size: 1024m
  msg-cache-size: 512m
Right now it is running with the 4m default rrset cache...
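That rule of thumb can be checked against the machine's RAM; sizes below are in MB, and note the caches are not the only memory unbound uses:

```python
# Sanity-check of "about 2x rrset cache vs msg cache", sizes in MB.
msg_cache = 512
rrset_cache = 1024
ram = 2048  # the 2 GB machine discussed in this thread

assert rrset_cache == 2 * msg_cache
total = msg_cache + rrset_cache
print(f"caches may grow to {total} MB of {ram} MB RAM")
assert total < ram  # leaves headroom for the OS and unbound itself
```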

Best regards,
   Wouter

Dave Ellis wrote:

I recompiled as suggested and made the configuration changes.
Everything is running much better now, although I'm still getting some
timeouts, just not nearly as quickly. Anything else I can improve on to
get rid of the timeout problem?

This server is a dual quad-core Xeon at 2 GHz with 4 MB of CPU cache
and 2 GB of RAM, just to give you an idea of the specs.

Looking through the logs I found the following right after start-up;
not sure if it's helpful.

[1223458526] unbound[23824:0] info: 16.000000 32.000000 11
[1223458526] unbound[23824:0] info: 32.000000 64.000000 1
[1223458648] unbound[23872:0] notice: init module 0: validator
[1223458648] unbound[23872:0] notice: init module 1: iterator
[1223458648] unbound[23872:0] notice: openssl has no entropy, seeding
with time and pid
[1223458648] unbound[23872:0] info: start of service (unbound 1.0.2).
[1223458648] unbound[23878:6] error: accept failed: Resource temporarily
unavailable
[1223458648] unbound[23878:6] info: remote address is (inet_ntop error)
port 0
[1223458648] unbound[23879:7] error: accept failed: Resource temporarily
unavailable
[1223458648] unbound[23879:7] info: remote address is (inet_ntop error)
port 0
[1223458658] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458658] unbound[23872:0] info: remote address is 72.249.76.123 port
51400
[1223458659] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458659] unbound[23872:0] info: remote address is 72.249.76.123 port
51400
[1223458659] unbound[23879:7] error: accept failed: Resource temporarily
unavailable
[1223458659] unbound[23879:7] info: remote address is (inet_ntop error)
port 0
[1223458662] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458662] unbound[23872:0] info: remote address is 206.123.115.117
port 50068
[1223458664] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458664] unbound[23872:0] info: remote address is 206.123.64.245
port 53096
[1223458664] unbound[23876:4] error: accept failed: Resource temporarily
unavailable
[1223458664] unbound[23876:4] info: remote address is 72.249.76.123 port
51491
[1223458672] unbound[23878:6] error: accept failed: Resource temporarily
unavailable
[1223458672] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458672] unbound[23878:6] info: remote address is 72.249.76.123 port
51483
[1223458672] unbound[23872:0] info: remote address is 72.249.76.123 port
51605
[1223458672] unbound[23872:0] error: accept failed: Resource temporarily
unavailable
[1223458672] unbound[23872:0] info: remote address is 72.249.76.123 port
51605

Again, I appreciate this. Thank you.

Hi Dave,

Great that it is working better.

You are still configured for more than 1024 file descriptors per
thread; hence the accept failures. Did you turn down the number of TCP
connections as I suggested? It looks like you did not.

You need to provide access to /dev/random from within the chroot
(/usr/local/etc/unbound/dev/random -> /dev/random), to provide entropy
for the random numbers.

Did you make the outgoing-range: 900 change? I think so. And you did
not compile with debugging (especially memory or lock debugging)? What
is happening with the timeouts you experience now?

When unbound exits, can you provide the statistics it prints,
especially the size of the requestlist per thread, the number of
packets dropped, and so on? Those numbers may help find out where the
capacity problem is.
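Those per-thread statistics lines can be pulled out of a logfile mechanically. An illustrative extractor (my sketch, not unbound code; it expects the wrapped log lines rejoined into one line):

```python
import re

# Pull thread number and requestlist max/avg/exceeded out of lines like:
#   ... server stats for thread 0: requestlist max 1031 avg 811.488 exceeded 10926
STATS_RE = re.compile(
    r"thread (?P<thread>\d+): requestlist"
    r" max (?P<max>\d+) avg (?P<avg>[\d.]+) exceeded (?P<exceeded>\d+)"
)

def parse_stats(line):
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "thread": int(m.group("thread")),
        "max": int(m.group("max")),
        "avg": float(m.group("avg")),
        "exceeded": int(m.group("exceeded")),
    }

print(parse_stats(
    "server stats for thread 0: "
    "requestlist max 1031 avg 811.488 exceeded 10926"))
```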

Best regards,
   Wouter

Dave Ellis wrote:

Nice to see Unbound getting used in larger scale environments. I run
it at home on my 500Mhz P3 laptop/router with 256MB of RAM and the
standard cache settings (as well as a fair share of DNSSEC keys).
While I'm sure that I don't put it through a fraction of the stress,
I'm not terribly gentle on it either :-). I have had no performance
issues with it, and don't link it (yet) to an external libevent
either. However, it doesn't run too well on my 486 with 16MB of RAM,
so I may have a project for another day. I do not chroot Unbound, but
I do have a dedicated "unbound" user for it. The chroot issues
definitely sound like a possible culprit to me if /dev/random is not
accessible. I'm not sure if you are using Linux or *BSD, but
/dev/random is generally _very_ slow under Linux unless you have a
hardware random number generator. I would recommend /dev/urandom
instead, unless /dev/random is fast enough. I guess Unbound has its
own fallback internal random number generator?

Cheers,
Teran

Hey Wouter,

The timeouts happen as before, except less often and usually after a
single timeout it'll resolve the host correctly when retrying the 2nd
time. Either way, everything is quite a bit better.

After about 3 minutes from startup the timeouts start. I did notice
that any domain in the cache resolves right away, but if unbound has to
query another NS server it times out.

Thanks!

Hi Dave,

Dave Ellis wrote:

Hey Wouter,

The timeouts happen as before, except less often and usually after a
single timeout it'll resolve the host correctly when retrying the 2nd
time. Either way, everything is quite a bit better.

After about 3 minutes from startup the timeouts start. I did notice
that any domain in the cache resolves right away, but if unbound has to
query another NS server it times out.

  num-threads: 8

And the output of the log when the service is stopped.

[1223529704] unbound[24517:0] info: service stopped (unbound 1.0.2).
[1223529704] unbound[24517:0] info: server stats for thread 0: 42804
queries, 21012 answers from cache, 21792 recursions
[1223529704] unbound[24517:0] info: server stats for thread 0:
requestlist max 1031 avg 811.488 exceeded 10926
[1223529704] unbound[24517:0] info: average recursion processing time
2.075908 sec

The max requestlist was 1031 - which is pretty big - and it was
exceeded 10926 times. Thus, you are still under capacity. I think that
explains the timeouts.

You pasted only the statistics for thread 0. Are the others the same?
Or are there no others?

Did you recompile without threading? In that case, do you have 1.5 GB *
8 = 12 GB of memory for all the threads? If not, scale back the cache
sizes, otherwise you start swapping and slowing down. Divide the cache
sizes by 8, perhaps.
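The arithmetic here: with the forking build each of the 8 processes carries its own copy of the caches (msg cache only in this sketch; the rrset cache would come on top):

```python
# Forked build: each process carries its own caches, so sizes multiply.
processes = 8
msg_cache_mb = 1500   # the msg-cache-size: 1500m from the earlier config
ram_mb = 2048         # the machine has 2 GB

total = processes * msg_cache_mb
print(f"{total} MB of msg cache alone, vs {ram_mb} MB of RAM")
assert total > ram_mb           # far over budget: swapping is certain
print(f"scaled back: about {msg_cache_mb // processes} MB per process")
```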

If you recompile with libevent, then the memory is not the issue. In
that case I would suggest trying --with-libevent, plus outgoing-range:
2048 and num-queries-per-thread: 2048, to get additional capacity.

Best regards,
   Wouter

I do have 8 threads running, and I compiled with the following
configure; is this correct?

./configure --without-pthreads --without-solaris-threads --with-libevent

I also made this change to the config:
outgoing-range: 2048
num-queries-per-thread: 2048

8 threads:
#ps ax | grep unbound
24740 ?  Ss  0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24741 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24742 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24743 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24744 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24745 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24746 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf
24747 ?  S   0:00 unbound -c /usr/local/etc/unbound/unbound.conf

Logs start to finish after config change:
[1223534119] unbound[24662:0] info: start of service (unbound 1.0.2).
[1223534121] unbound[24663:1] error: accept failed: Resource temporarily
unavailable
[1223534121] unbound[24663:1] info: remote address is (inet_ntop error)
port 0
[1223534122] unbound[24663:1] error: accept failed: Resource temporarily
unavailable
[1223534122] unbound[24663:1] info: remote address is (inet_ntop error)
port 0
[1223534122] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534122] unbound[24662:0] info: remote address is 72.249.76.123 port
53468
[1223534122] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534122] unbound[24662:0] info: remote address is 72.249.76.123 port
53475
[1223534124] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534124] unbound[24662:0] info: remote address is 72.249.76.123 port
53730
[1223534124] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534124] unbound[24662:0] info: remote address is 72.249.76.123 port
53761
[1223534167] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534167] unbound[24662:0] info: remote address is 72.249.76.123 port
55893
[1223534185] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534185] unbound[24662:0] info: remote address is 72.249.76.123 port
41715
[1223534192] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534192] unbound[24662:0] info: remote address is 72.249.76.123 port
55950
[1223534196] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534196] unbound[24662:0] info: remote address is 72.249.76.123 port
41823
[1223534196] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534196] unbound[24662:0] info: remote address is 72.249.76.123 port
41823
[1223534226] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534226] unbound[24662:0] info: remote address is 72.249.76.123 port
42083
[1223534234] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534234] unbound[24662:0] info: remote address is 72.249.76.123 port
42105
[1223534240] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534240] unbound[24662:0] info: remote address is 72.249.76.123 port
42119
[1223534266] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534266] unbound[24662:0] info: remote address is 72.249.76.123 port
42347
[1223534267] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534267] unbound[24662:0] info: remote address is 72.249.76.123 port
42355
[1223534280] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534280] unbound[24662:0] info: remote address is 72.249.76.123 port
42413
[1223534286] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534286] unbound[24662:0] info: remote address is 72.249.76.123 port
42433
[1223534286] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534286] unbound[24662:0] info: remote address is 72.249.76.123 port
42433
[1223534311] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534311] unbound[24662:0] info: remote address is 207.210.234.34
port 45114
[1223534311] unbound[24669:7] error: accept failed: Resource temporarily
unavailable
[1223534311] unbound[24669:7] info: remote address is 72.249.19.195 port
42103
[1223534311] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534311] unbound[24662:0] info: remote address is 207.210.234.34
port 45114
[1223534311] unbound[24667:5] error: accept failed: Resource temporarily
unavailable
[1223534311] unbound[24667:5] info: remote address is 72.249.76.123 port
42477
[1223534317] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534317] unbound[24662:0] info: remote address is 72.249.76.123 port
42516
[1223534323] unbound[24662:0] error: accept failed: Resource temporarily
unavailable
[1223534323] unbound[24662:0] info: remote address is 72.249.26.99 port
59362
[1223534347] unbound[24662:0] info: service stopped (unbound 1.0.2).
[1223534347] unbound[24662:0] info: server stats for thread 0: 95858
queries, 57619 answers from cache, 38239 recursions
[1223534347] unbound[24662:0] info: server stats for thread 0:
requestlist max 2056 avg 1508.5 exceeded 15582
[1223534347] unbound[24662:0] info: average recursion processing time
3.076592 sec
[1223534347] unbound[24662:0] info: histogram of recursion processing
times
[1223534347] unbound[24662:0] info: [25%]=0.00555758
median[50%]=0.0232741 [75%]=0.118504
[1223534347] unbound[24662:0] info: lower(secs) upper(secs) recursions
[1223534347] unbound[24662:0] info: 0.000000 0.000001 2021
[1223534347] unbound[24662:0] info: 0.000256 0.000512 3
[1223534347] unbound[24662:0] info: 0.000512 0.001024 40
[1223534347] unbound[24662:0] info: 0.001024 0.002048 263
[1223534347] unbound[24662:0] info: 0.002048 0.004096 347
[1223534347] unbound[24662:0] info: 0.004096 0.008192 329
[1223534347] unbound[24662:0] info: 0.008192 0.016384 474
[1223534347] unbound[24662:0] info: 0.016384 0.032768 684
[1223534347] unbound[24662:0] info: 0.032768 0.065536 4596
[1223534347] unbound[24662:0] info: 0.065536 0.131072 3165
[1223534347] unbound[24662:0] info: 0.131072 0.262144 3207
[1223534347] unbound[24662:0] info: 0.262144 0.524288 2340
[1223534347] unbound[24662:0] info: 0.524288 1.000000 742
[1223534347] unbound[24662:0] info: 1.000000 2.000000 288
[1223534347] unbound[24662:0] info: 2.000000 4.000000 205
[1223534347] unbound[24662:0] info: 4.000000 8.000000 167
[1223534347] unbound[24662:0] info: 8.000000 16.000000 120
[1223534347] unbound[24662:0] info: 16.000000 32.000000 94
[1223534347] unbound[24662:0] info: 32.000000 64.000000 207
[1223534347] unbound[24662:0] info: 64.000000 128.000000 358
[1223534347] unbound[24662:0] info: 128.000000 256.000000 112

Hi Dave,

Oh! I meant you have two choices, compile with

./configure --without-pthreads --without-solaris-threads

or

./configure --with-libevent

Using them together is indeed also an option. Perhaps the first one
(without libevent) does not print the "resource temporarily
unavailable" errors.

Resource unavailable; what is your ulimit -a?
Set open files (file descriptors) to 8*2048 or more, say ulimit -n 100000.
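A quick way to check that limit from code before starting the daemon; a sketch using Python's resource module, where the 8 * 2048 target comes from the figures above:

```python
import resource

# Compare the process file-descriptor limit against 8 threads each
# wanting outgoing-range: 2048 sockets.
threads = 8
outgoing_range = 2048
needed = threads * outgoing_range  # 16384, plus some slack in practice

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"need about {needed}; soft limit is {soft}, hard limit {hard}")
if soft != resource.RLIM_INFINITY and soft < needed:
    print("too low: raise it, e.g. `ulimit -n 100000` before starting")
```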

There are no stats for the other threads - but they do print them to
the file! It is as if your logfile is losing information (some syslog
version?).

You have an 8-processor machine but one thread is doing all the work.
Is your OS unfair in scheduling (a thundering herd problem)?
Getting the other 7 cores to do some work could help a lot; perhaps the
other two compile modes cause the OS to schedule it differently.

Anyway, with your current --with-libevent compile, you could set
outgoing-range: 16384
num-queries-per-thread: 16384
or even more, and perhaps thread 0, which does all the work, will get
enough capacity to do all of it...

Best regards,
   Wouter

Dave Ellis wrote:

Teran McKinney wrote:

Nice to see Unbound getting used in larger scale environments. I run
it at home on my 500Mhz P3 laptop/router with 256MB of RAM and the
standard cache settings (as well as a fair share of DNSSEC keys).
While I'm sure that I don't put it through a fraction of the stress,
I'm not terribly gentle on it either :-). I have had no performance
issues with it, and don't link it (yet) to an external libevent
either. However, it doesn't run too well on my 486 with 16MB of RAM,
so I may have a project for another day. I do not chroot Unbound, but
I do have a dedicated "unbound" user for it. The chroot issues
definitely sound like a possible culprit to me if /dev/random is not
accessible. I'm not sure if you are using Linux or *BSD, but
/dev/random is generally _very_ slow under Linux unless you have a
hardware random number generator. I would recommend /dev/urandom
instead, unless /dev/random is fast enough. I guess Unbound has its
own fallback internal random number generator?

Yes.

And by default the memory can grow to about 30-40 MB.
The unbound.conf(5) manpage has sample minimum-memory configuration
settings for your 486 :-)

Best regards,
   Wouter

I don't believe the other threads are crashing, when I enable
statistics-interval I get 8 threads printed to the log. I went ahead and
enabled statistics-interval, with a 3 minute timer. You can see them
below, it does look like thread0 is taking most of the queries, but not
100% of them.

I also changed the config:
outgoing-range: 16384
num-queries-per-thread: 16384

Dave Ellis wrote:

I don't believe the other threads are crashing, when I enable
statistics-interval I get 8 threads printed to the log. I went ahead and
enabled statistics-interval, with a 3 minute timer. You can see them
below, it does look like thread0 is taking most of the queries, but not
100% of them.

Thanks for the detailed logs :-)

I also changed the config:
outgoing-range: 16384
num-queries-per-thread: 16384

So, you allow only 1024 open files; please increase the open-files
ulimit: ulimit -n 100000.
Again, that may empower your brave thread 0 to do all of the work.

But the biggest problem you have seems to be the OS's distribution of
work across the threads:
thread 2: 1912 queries, 348 answers from cache, 1564 recursions
thread 2: requestlist max 162 avg 92.7954 exceeded 0
thread 1: 1976 queries, 308 answers from cache, 1668 recursions
thread 1: requestlist max 170 avg 96.6511 exceeded 0
thread 4: 2575 queries, 491 answers from cache, 2084 recursions
thread 4: requestlist max 237 avg 117.37 exceeded 0
thread 3: 2003 queries, 366 answers from cache, 1637 recursions
thread 3: requestlist max 183 avg 102.059 exceeded 0
thread 5: 2051 queries, 349 answers from cache, 1702 recursions
thread 5: requestlist max 218 avg 114.461 exceeded 0
thread 6: 2003 queries, 415 answers from cache, 1588 recursions
thread 6: requestlist max 168 avg 91.5101 exceeded 0
thread 7: 3719 queries, 638 answers from cache, 3081 recursions
thread 7: requestlist max 283 avg 153.266 exceeded 0
thread 0: 75846 queries, 25599 answers from cache, 50247 recursions
thread 0: requestlist max 4135 avg 2140.02 exceeded 0

You are correct, thread 0 is doing most of the work.

If you look at /proc/cpuinfo, how many cpus does the OS see? Set
num-threads to that number.

Try compiling in one of the other two modes to see if that causes a
better distribution of work among the threads.

Best regards,
   Wouter

I'm not getting timeouts when setting the open file limit to 100,000,
but we still have the issue of most of the work being done by thread 0.
I'm going to recompile as you suggested and post again in a few minutes.

BTW, /proc/cpuinfo shows 8 processors.

Thanks for the help.

-Dave

cache-ns2:~# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
max nice (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
max rt priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

We have success! I compiled with:

./configure --without-pthreads --without-solaris-threads

And changed the config:
outgoing-range: 900

It seems the threads are much closer to balanced. top shows all 8
threads running at close to the same usage.

The only odd thing is that I am still getting these errors in the logs:
[1223543388] unbound[3440:1] error: accept failed: Resource temporarily
unavailable
[1223543388] unbound[3440:1] info: remote address is 72.249.76.123 port
58890

Hi Dave,

Glad to hear that.

The accept failure is a TCP error. It could be complaining about
network buffer space in the kernel, or about file descriptor limits.
But 900 + 30 + 30 plus some extra is well within 1024 file descriptors...
Also, 30*8 TCP connections should be well within Linux's default capacity.

You could try lowering
   outgoing-num-tcp: 10
   incoming-num-tcp: 10
but I have no clue why the errors are appearing.

At least it is being distributed better. That helps.

For fine-tuning: set the other *-slabs options to 8 and increase the
infrastructure cache size.
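For concreteness, those fine-tuning lines could look like this in unbound.conf; the option names exist in unbound, but the values here are illustrative, not figures from this thread:

```
server:
    # match the slab counts to the number of threads
    rrset-cache-slabs: 8
    infra-cache-slabs: 8
    key-cache-slabs: 8
    # enlarge the infrastructure cache (hosts tracked for RTT/lameness)
    infra-cache-numhosts: 100000
```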

Best regards,
   Wouter

Dave Ellis wrote:

Thanks for the tweaking tips. I'm going to keep an eye on things over
the weekend to see how we do.

I also tried moving outgoing-num-tcp and incoming-num-tcp down to 10,
but the same errors came up. I noticed the errors were coming from one
specific client of ours, and no others. I am curious whether he has a
firewall doing some kind of filtering that causes the issue. I may give
him a call to see if he has anything weird going on.

Other than that, I think we're good for now. Thanks so much for the
assistance. If this works well we will roll Unbound out across our
entire cluster. I do want to mention that Unbound is using 5% of the
CPU that BIND uses on a typical day.

Kudos to everyone working on this project. I'm impressed.

-Dave