; <<>> DiG 9.4.2-P2 <<>> @localhost xjdjtallrd.groupinfra.com
; (1 server found)
;; global options: printcmd
;; connection timed out; no servers could be reached
groupinfra.com's servers, ns1.logica.com and ns2.logica.com, are both
'recursion-lame'. They are configured as a cache (and offer recursion
but not the AA flag on answers). Unbound tries to avoid them, but there
are no alternatives (no AAAA records or anything). Then, unbound tries
a +RD query there (as if it were forwarding) and receives an answer (TTL
51 seconds, yes they really are recursors with TTLs).
Since there are no truly authoritative servers for groupinfra.com, it
could be that their 'semi-caches' cannot find the information all the
time, or have trouble as well. zonecheck says 'it has no nameservers'.
Try to use unbound-control lookup groupinfra.com to get more information.
I see that groupinfra.com says it has different nameservers: its NS
record has 75 entries. This explains why queries stay in unbound for
such a long time: it is trying every server and getting timeouts. I
notice a lot of these entries seem to be on a subnet of some sort
(10.0.0.0/8 and maybe others too), and are perhaps firewalled.
Since it claims to have nameservers that do not answer, it is not going
to get very good service. The official nameservers registered with
.com are not authoritative.
As for the feedback on having 200 subdomains: this is not a resource
problem. Older queries are removed ('jostled' in the statistics counter)
in favor of new queries if there is a resource problem. Your nameserver
has the capacity to hold this query for that extremely long time without
having to remove it for newer ones, so that is not an issue.
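To illustrate what 'jostled' means here, the following is a toy sketch in Python of a fixed-size request list. It is not unbound's actual data structure or algorithm: the class name, the single-timeout eviction rule, and the capacity used in the note below are all assumptions made for illustration only.

```python
from collections import OrderedDict


class RequestList:
    """Toy fixed-size request list (hypothetical, not unbound's real code)."""

    def __init__(self, capacity, jostle_timeout_s):
        self.capacity = capacity
        self.jostle_timeout_s = jostle_timeout_s
        self.queries = OrderedDict()  # query name -> arrival time, oldest first
        self.jostled = 0   # old queries evicted to make room for new ones
        self.exceeded = 0  # new queries dropped because the list was full

    def add(self, name, now):
        """Try to admit a query at time `now`; return True if it was admitted."""
        if name in self.queries:
            return True  # already being worked on
        if len(self.queries) < self.capacity:
            self.queries[name] = now
            return True
        # List is full: only evict the oldest query if it is old enough.
        oldest_name, oldest_time = next(iter(self.queries.items()))
        if now - oldest_time >= self.jostle_timeout_s:
            del self.queries[oldest_name]
            self.queries[name] = now
            self.jostled += 1
            return True
        self.exceeded += 1
        return False
```

In this model a slow query is only pushed out when the list is full and the query has been around longer than the timeout. While the list has free slots, a query that waits on many timeouts simply stays, which matches the point above that the long-running groupinfra.com query is not a capacity problem by itself.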
The reason it works again after a flush and restart is that the first
couple of queries use the nameservers as advertised in the .com
referral, but these are soon replaced by 75 non-working nameservers, all
of which unbound prefers (since the child is authoritative for its
domain!). Working through 75 timeouts before it tries the
parent-also-not-working servers takes a very long time, and your 'dig'
has timed out by that time. It will cache this, but the default of 15
minutes (infra-host-ttl) is probably too low to be able to help.
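To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It is not a measurement: the per-server timeout of 1 second is a made-up illustrative value (unbound's real timeouts adapt per server), the dig figures assume its documented defaults of a 5 second timeout and 3 tries, and both function names are hypothetical.

```python
def worst_case_seconds(n_servers, per_server_timeout_s):
    """Time spent if every server is tried once, in turn, and each times out."""
    return n_servers * per_server_timeout_s


def dig_budget_seconds(timeout_s=5, tries=3):
    """Roughly how long dig waits in total before giving up (assumed defaults)."""
    return timeout_s * tries


if __name__ == "__main__":
    resolver_time = worst_case_seconds(75, 1.0)  # 75 NS entries, assumed 1 s each
    client_time = dig_budget_seconds()
    print("resolver may need about", resolver_time, "seconds")
    print("dig gives up after about", client_time, "seconds")
    print("dig times out first:", resolver_time > client_time)
```

Even with an optimistic per-server timeout, walking the whole list of 75 unreachable servers takes several times longer than the client is willing to wait.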
This brings me to the next issue I have due to this groupinfra behaviour.
My resolver begins to show "requestlist exceeded" counters of up to 3K per sec
once my requestlist hits about 250. My assumption is that it probably only sets up 512 slots for the requestlist at startup, while I configured the value 20480 for num-queries-per-thread.
But it seems that this config entry is somehow ignored.
Is there some way to check in unbound how many slots are actually allocated after startup?
I compiled with libevent, so it should at least have 1024 for num-queries-per-thread.
It could be that OpenBSD has a restrictive ulimit on the number of open
files, and that unbound throttles back its usage to fit in that ulimit
(of 256?). Check with ulimit -n. You can override it as root. Unbound
prints a warning at startup.
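As a quick way to see the limit a process would inherit, here is a minimal sketch in Python using the standard resource module (Unix only). The helper names and the 'wanted' figure of 20480 descriptors are assumptions for illustration; how many descriptors unbound really needs depends on its configuration and is not computed here.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def descriptors_short(wanted, soft_limit):
    """How many descriptors are missing if `wanted` exceeds the soft limit."""
    return max(0, wanted - soft_limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
    # 20480 is only an illustrative figure taken from the question above.
    print("short by:", descriptors_short(20480, soft))
```

If the soft limit turns out to be as low as 256, a daemon that wants thousands of sockets has to either raise the limit (which needs root once the hard limit is reached) or scale itself down.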