Unbound issues

We are trying unbound out. Machine is quad CPU with 16GB ram. I'm
picking on thread 2 because it shows the issue best as it seems to
receive up to 6 times as many queries as the other 3 threads. As you
can see, the time to execute climbs gradually through the period and
starts to "exceed" allowed limits about mid-way down the stats results
shown here.

Config bits.
server:
  verbosity: 1
  statistics-interval: 300
  statistics-cumulative: yes
  extended-statistics: no
  num-threads: 4
  outgoing-range: 2048
  msg-cache-size: 3G
  msg-cache-slabs: 8
  num-queries-per-thread: 2048
  jostle-timeout: 1000
  so-rcvbuf: 8M
  rrset-cache-size: 6G
  rrset-cache-slabs: 8
  cache-min-ttl: 7776000
  harden-glue: yes
  harden-dnssec-stripped: yes
  harden-referral-path: yes
  use-caps-for-id: yes
  unwanted-reply-threshold: 10000000
  val-clean-additional: yes
  val-permissive-mode: no
  key-cache-slabs: 8
  neg-cache-size: 1g

Below are snippets from the statistics taken roughly every 5 minutes
from startup over a period of 7 hours.

server stats for thread 2: 420 queries, 284 answers from cache, 136 recursions
server stats for thread 2: requestlist max 21 avg 16.2647 exceeded 0
server stats for thread 2: 3189 queries, 3053 answers from cache, 136 recursions
server stats for thread 2: requestlist max 21 avg 16.2647 exceeded 0
server stats for thread 2: 3375 queries, 3239 answers from cache, 136 recursions
server stats for thread 2: requestlist max 21 avg 16.2647 exceeded 0
server stats for thread 2: 4560 queries, 4424 answers from cache, 136 recursions
server stats for thread 2: requestlist max 21 avg 16.2647 exceeded 0
server stats for thread 2: 32129 queries, 13211 answers from cache,
18918 recursions
server stats for thread 2: requestlist max 1196 avg 620.017 exceeded 0
server stats for thread 2: 227635 queries, 63600 answers from cache,
164035 recursions
server stats for thread 2: requestlist max 1196 avg 605.375 exceeded 0
server stats for thread 2: 412374 queries, 113701 answers from cache,
298673 recursions
server stats for thread 2: requestlist max 1196 avg 649.591 exceeded 0
server stats for thread 2: 591627 queries, 164560 answers from cache,
427067 recursions
server stats for thread 2: requestlist max 1196 avg 685.99 exceeded 0
server stats for thread 2: 762418 queries, 214876 answers from cache,
547542 recursions
server stats for thread 2: requestlist max 1196 avg 725.279 exceeded 0
server stats for thread 2: 925072 queries, 265281 answers from cache,
659791 recursions
server stats for thread 2: requestlist max 1196 avg 766.182 exceeded 0
server stats for thread 2: 1082217 queries, 315509 answers from cache,
766708 recursions
server stats for thread 2: requestlist max 1213 avg 803.033 exceeded 0
server stats for thread 2: 1240514 queries, 367629 answers from cache,
872885 recursions
server stats for thread 2: requestlist max 1336 avg 841.245 exceeded 0
server stats for thread 2: 1399258 queries, 421640 answers from cache,
977618 recursions
server stats for thread 2: requestlist max 1360 avg 874.701 exceeded 0
server stats for thread 2: 1555372 queries, 476520 answers from cache,
1078852 recursions
server stats for thread 2: requestlist max 1390 avg 908.038 exceeded 0
server stats for thread 2: 1714148 queries, 534429 answers from cache,
1179719 recursions
server stats for thread 2: requestlist max 1478 avg 942.845 exceeded 0
server stats for thread 2: 1877100 queries, 595948 answers from cache,
1281152 recursions
server stats for thread 2: requestlist max 1603 avg 977.634 exceeded 0
server stats for thread 2: 2046507 queries, 661158 answers from cache,
1385349 recursions
server stats for thread 2: requestlist max 1661 avg 1018.34 exceeded 0
server stats for thread 2: 2230528 queries, 734070 answers from cache,
1496458 recursions
server stats for thread 2: requestlist max 1792 avg 1063.32 exceeded 0
server stats for thread 2: 2420948 queries, 811552 answers from cache,
1609396 recursions
server stats for thread 2: requestlist max 1876 avg 1106.96 exceeded 0
server stats for thread 2: 2627825 queries, 898335 answers from cache,
1729490 recursions
server stats for thread 2: requestlist max 1948 avg 1152.9 exceeded 0
server stats for thread 2: 2847864 queries, 993012 answers from cache,
1854852 recursions
server stats for thread 2: requestlist max 2010 avg 1198.99 exceeded 0
server stats for thread 2: 3075730 queries, 1093945 answers from
cache, 1981785 recursions
server stats for thread 2: requestlist max 2088 avg 1244.61 exceeded 0
server stats for thread 2: 3320791 queries, 1204754 answers from
cache, 2116037 recursions
server stats for thread 2: requestlist max 2129 avg 1289.26 exceeded 0
server stats for thread 2: 3593773 queries, 1329840 answers from
cache, 2263933 recursions
server stats for thread 2: requestlist max 2129 avg 1333.68 exceeded 0
server stats for thread 2: 3877901 queries, 1462415 answers from
cache, 2415486 recursions
server stats for thread 2: requestlist max 2340 avg 1381.81 exceeded 0
server stats for thread 2: 4169522 queries, 1601187 answers from
cache, 2568335 recursions
server stats for thread 2: requestlist max 2340 avg 1424.49 exceeded 0
server stats for thread 2: 4458010 queries, 1741370 answers from
cache, 2716640 recursions
server stats for thread 2: requestlist max 2340 avg 1461.19 exceeded 0
server stats for thread 2: 4755954 queries, 1887372 answers from
cache, 2868582 recursions
server stats for thread 2: requestlist max 2340 avg 1497.73 exceeded 0
server stats for thread 2: 5054311 queries, 2035933 answers from
cache, 3018378 recursions
server stats for thread 2: requestlist max 2392 avg 1534.58 exceeded 0
server stats for thread 2: 5362834 queries, 2191665 answers from
cache, 3171169 recursions
server stats for thread 2: requestlist max 2438 avg 1571.12 exceeded 0
server stats for thread 2: 5669256 queries, 2347984 answers from
cache, 3321272 recursions
server stats for thread 2: requestlist max 2463 avg 1604.42 exceeded 0
server stats for thread 2: 5975214 queries, 2506445 answers from
cache, 3468769 recursions
server stats for thread 2: requestlist max 2463 avg 1635.18 exceeded 0
server stats for thread 2: 6279369 queries, 2665381 answers from
cache, 3613988 recursions
server stats for thread 2: requestlist max 2591 avg 1664.99 exceeded 0
server stats for thread 2: 6577860 queries, 2822890 answers from
cache, 3754970 recursions
server stats for thread 2: requestlist max 2591 avg 1691.34 exceeded 0
server stats for thread 2: 6879207 queries, 2983199 answers from
cache, 3896008 recursions
server stats for thread 2: requestlist max 2591 avg 1717.35 exceeded 0
server stats for thread 2: 7175981 queries, 3141895 answers from
cache, 4034086 recursions
server stats for thread 2: requestlist max 2607 avg 1744.31 exceeded 0
server stats for thread 2: 7468573 queries, 3300361 answers from
cache, 4168212 recursions
server stats for thread 2: requestlist max 2607 avg 1768.51 exceeded 0
server stats for thread 2: 7763213 queries, 3459687 answers from
cache, 4303526 recursions
server stats for thread 2: requestlist max 2615 avg 1792.17 exceeded 275
server stats for thread 2: 8049494 queries, 3615698 answers from
cache, 4433796 recursions
server stats for thread 2: requestlist max 2649 avg 1814.33 exceeded 391
server stats for thread 2: 8339874 queries, 3771948 answers from
cache, 4567926 recursions
server stats for thread 2: requestlist max 2673 avg 1836.72 exceeded 9291
server stats for thread 2: 8675486 queries, 3928361 answers from
cache, 4747125 recursions
server stats for thread 2: requestlist max 2673 avg 1865.03 exceeded 79527
server stats for thread 2: 9013544 queries, 4065706 answers from
cache, 4947838 recursions
server stats for thread 2: requestlist max 2673 avg 1894.08 exceeded 197196
server stats for thread 2: 9325951 queries, 4184820 answers from
cache, 5141131 recursions
server stats for thread 2: requestlist max 2673 avg 1919.66 exceeded 320054
server stats for thread 2: 9649570 queries, 4311387 answers from
cache, 5338183 recursions
server stats for thread 2: requestlist max 2673 avg 1943.58 exceeded 442555
server stats for thread 2: 9883086 queries, 4386420 answers from
cache, 5496666 recursions
server stats for thread 2: requestlist max 2673 avg 1961.76 exceeded 567156
server stats for thread 2: 10187789 queries, 4502652 answers from
cache, 5685137 recursions
server stats for thread 2: requestlist max 2673 avg 1982.28 exceeded 690181
server stats for thread 2: 10527437 queries, 4638576 answers from
cache, 5888861 recursions
server stats for thread 2: requestlist max 2673 avg 2002.7 exceeded 818783
server stats for thread 2: 10845189 queries, 4760934 answers from
cache, 6084255 recursions
server stats for thread 2: requestlist max 2673 avg 2020.86 exceeded 946745
server stats for thread 2: 11084550 queries, 4838916 answers from
cache, 6245634 recursions
server stats for thread 2: requestlist max 2673 avg 2034.8 exceeded 1073714
server stats for thread 2: 11402798 queries, 4959856 answers from
cache, 6442942 recursions
server stats for thread 2: requestlist max 2673 avg 2051.26 exceeded 1202308
server stats for thread 2: 11722778 queries, 5082975 answers from
cache, 6639803 recursions
server stats for thread 2: requestlist max 2673 avg 2066.72 exceeded 1330516
server stats for thread 2: 11993865 queries, 5179640 answers from
cache, 6814225 recursions
server stats for thread 2: requestlist max 2673 avg 2079.3 exceeded 1457073
server stats for thread 2: 12286524 queries, 5286754 answers from
cache, 6999770 recursions
server stats for thread 2: requestlist max 2673 avg 2091.88 exceeded 1588787
server stats for thread 2: 12569883 queries, 5390261 answers from
cache, 7179622 recursions
server stats for thread 2: requestlist max 2673 avg 2103.54 exceeded 1717405
server stats for thread 2: 12833143 queries, 5480857 answers from
cache, 7352286 recursions
server stats for thread 2: requestlist max 2673 avg 2114.34 exceeded 1848543
server stats for thread 2: 13088804 queries, 5566827 answers from
cache, 7521977 recursions
server stats for thread 2: requestlist max 2673 avg 2124.56 exceeded 1979982
server stats for thread 2: 13339028 queries, 5649428 answers from
cache, 7689600 recursions
server stats for thread 2: requestlist max 2673 avg 2134.2 exceeded 2112551
server stats for thread 2: 13589261 queries, 5731389 answers from
cache, 7857872 recursions
server stats for thread 2: requestlist max 2673 avg 2143.23 exceeded 2246431
server stats for thread 2: 13831526 queries, 5809218 answers from
cache, 8022308 recursions
server stats for thread 2: requestlist max 2673 avg 2151.68 exceeded 2380025
server stats for thread 2: 14074045 queries, 5885578 answers from
cache, 8188467 recursions
server stats for thread 2: requestlist max 2673 avg 2159.78 exceeded 2517341
server stats for thread 2: 14310994 queries, 5958194 answers from
cache, 8352800 recursions
server stats for thread 2: requestlist max 2673 avg 2167.59 exceeded 2656417
server stats for thread 2: 14542656 queries, 6028031 answers from
cache, 8514625 recursions
server stats for thread 2: requestlist max 2673 avg 2174.91 exceeded 2795338
server stats for thread 2: 14768848 queries, 6093829 answers from
cache, 8675019 recursions
server stats for thread 2: requestlist max 2673 avg 2181.93 exceeded 2936464
server stats for thread 2: 14993464 queries, 6158626 answers from
cache, 8834838 recursions
server stats for thread 2: requestlist max 2673 avg 2188.48 exceeded 3079132
server stats for thread 2: 15216858 queries, 6222283 answers from
cache, 8994575 recursions
server stats for thread 2: requestlist max 2673 avg 2194.84 exceeded 3222538
server stats for thread 2: 15440568 queries, 6282954 answers from
cache, 9157614 recursions
server stats for thread 2: requestlist max 2673 avg 2201.12 exceeded 3372315
server stats for thread 2: 15663324 queries, 6344440 answers from
cache, 9318884 recursions
server stats for thread 2: requestlist max 2673 avg 2207.11 exceeded 3520766
server stats for thread 2: 15888324 queries, 6406005 answers from
cache, 9482319 recursions
server stats for thread 2: requestlist max 2673 avg 2212.97 exceeded 3671367
server stats for thread 2: 16110454 queries, 6465240 answers from
cache, 9645214 recursions
server stats for thread 2: requestlist max 2673 avg 2218.59 exceeded 3824213
server stats for thread 2: 16332045 queries, 6523675 answers from
cache, 9808370 recursions
server stats for thread 2: requestlist max 2673 avg 2224.04 exceeded 3978297
server stats for thread 2: 16552213 queries, 6581148 answers from
cache, 9971065 recursions
server stats for thread 2: requestlist max 2673 avg 2229.27 exceeded 4132981
server stats for thread 2: 16772511 queries, 6639153 answers from
cache, 10133358 recursions
server stats for thread 2: requestlist max 2673 avg 2234.27 exceeded 4287371
server stats for thread 2: 16992902 queries, 6698239 answers from
cache, 10294663 recursions
server stats for thread 2: requestlist max 2673 avg 2239.12 exceeded 4439600
server stats for thread 2: 17213660 queries, 6756507 answers from
cache, 10457153 recursions
server stats for thread 2: requestlist max 2673 avg 2243.8 exceeded 4594324
server stats for thread 2: 17432380 queries, 6812701 answers from
cache, 10619679 recursions
server stats for thread 2: requestlist max 2673 avg 2248.35 exceeded 4750633
server stats for thread 2: 17650889 queries, 6868541 answers from
cache, 10782348 recursions
server stats for thread 2: requestlist max 2673 avg 2252.86 exceeded 4907895

Thanks in advance for any suggestion you can give.

Hi Rod,

With unbound-control dump_requestlist you can get a view of what those
2000 queries are that it is working on (you have then to configure
unbound-control-setup, set remote-control: yes in config). Presumably
unbound is receiving a lot of queries that take very long to resolve.

Without the other elements in this mail I would suggest you could
increase the outgoing range and num-queries-per-thread to 4096.

The jostle-timeout is also very large, meaning that this is causing
unbound to be unable to remove those very long taking queries from its
requestlist when exceeding its limits. jostle-timeout: 100 or 200 could
help unbound remove the 'bad queries' earlier on.

Also, because you have harden-referral-path set, unbound is generating
internal work for those checks. This needs extra outgoing-range (but
not num-queries-per-thread), so try outgoing-range: 4096 and
num-queries-per-thread: 2048 . So that there is plenty of space for
the check queries, otherwise an internal queue could be forming to get
an outgoing-range(source port number), but because you hit the max
queries per thread and all those spawn extra checks, the limit is
reached with certainty, slowing down the server considerably, (and this
slowdown may even trigger a 'death spiral' as it causes even longer
waiting lists - hence perhaps the large 'exceeded' values).

Why thread 2 gets so many queries, it could be that the quad cpu is
really hyperthreaded? Or else, perhaps the OS wants to run the network
stack on some of the CPUs, causing not all unbound threads to get work
(as they are scheduled out in favor of the network stack), in which case
less unbound threads could remove this behaviour. Another reason could
be simply that the load on the machine is so light that the OS decides
thread 2 can handle it...

Another disconcerting fact is that the cache hit rate is 54% when its
still OK but drops to 26% when you get trouble. Take a look at that
dump-requestlist, and also perform a unbound-control stats_noreset >
snapshot and look at it: the number of unwanted replies in particular.
Certain forms of Kaminsky-eeks! would tend to blast lots of different
queries to a resolver...

Best regards,
   Wouter

Another disconcerting fact is that the cache hit rate is 54% when its
still OK but drops to 26% when you get trouble. Take a look at that
dump-requestlist, and also perform a unbound-control stats_noreset >
snapshot and look at it: the number of unwanted replies in particular.
Certain forms of Kaminsky-eeks! would tend to blast lots of different
queries to a resolver...

We are attempting to use Unbound as a DNS resolver/cache for a spider.
In a typical hour we are looking up several million domains but
typically only need to resolve it a handful of times per day.

We boosted outgoing-range to 8192, reduced jostle-timeout to 200, and
eliminated the min-ttl.

We also tried boosting the slabs counters to various values between 4 and 512.

None of these made any noticeable difference.

CPU usage becomes very low when exceeded begins to climb. Does unbound
sleep or spin while waiting for a lock?

Additional thoughts? We are going to attempt using much much smaller
cache values. I tend to believe it is the cache management from
dealing with several million different domains, but of course I have
no proof of that.