Increase of requestlist entries/connection timeout due to groupinfra.com domain

Hi,

I’ve been scratching my head for a few days now, trying to figure out what is happening here.

  1. I noticed that the requestlist dump contains about 200 subdomains of groupinfra.com, some of which have been there for up to 85000 seconds.

  2. One entry in the requestlist is:
    215 A IN xjdjtallrd.groupinfra.com. 25205.720826 iterator wants A IN de-dc002.groupinfra.com. A IN in-dc007.groupinfra.com. A IN uk-dc015.groupinfra.com.

Resolving this domain with dig returns:

dig @localhost xjdjtallrd.groupinfra.com

; <<>> DiG 9.4.2-P2 <<>> @localhost xjdjtallrd.groupinfra.com
; (1 server found)
;; global options: printcmd
;; connection timed out; no servers could be reached

Hi Michael,

groupinfra.com's servers, ns1.logica.com and ns2.logica.com, are both
'recursion-lame'. They are configured as a cache (and offer recursion
but not the AA flag on answers). Unbound tries to avoid them, but there
are no alternatives (no AAAA records or anything). Then, unbound tries
a +RD query there (as if it were forwarding) and receives an answer (TTL
51 seconds, yes they really are recursors with TTLs).
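
For the audience, a rough way to spot a 'recursion-lame' server is to look
at the flags line in the dig header: the answer offers recursion (ra) but
lacks the authoritative answer flag (aa). Below is a minimal sketch in
Python. The function names and the SAMPLE text are made up for
illustration; the sample is not output captured from the logica.com servers.

```python
import re


def header_flags(dig_output: str) -> set[str]:
    """Return the header flags found on dig's ';; flags:' line."""
    match = re.search(r"^;; flags:([^;]*);", dig_output, re.MULTILINE)
    return set(match.group(1).split()) if match else set()


def looks_recursion_lame(dig_output: str) -> bool:
    """True if the reply offers recursion (ra) but is not authoritative (no aa)."""
    flags = header_flags(dig_output)
    return "ra" in flags and "aa" not in flags


# Hypothetical dig header, for illustration only.
SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1234
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
"""

print(looks_recursion_lame(SAMPLE))
```

Feed it the text that dig prints when you query one of the listed
nameservers directly for a name in the zone.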

Since there are not really any authoritative servers for groupinfra.com, it
could be that their 'semi-caches' cannot find the information all the time,
or have trouble as well. zonecheck says 'it has no nameservers'.

Try to use unbound-control lookup groupinfra.com to get more information.

I see that groupinfra.com says it has different nameservers: its NS
record has 75 entries. This explains why queries stay in unbound's
requestlist for so long: it tries every server and gets timeouts. I
notice that a lot of these entries seem to be on private subnets
(10.0.0.0/8 and maybe others too), and are perhaps firewalled.

Since it claims to have nameservers that do not answer, it is not going
to get very good service. The official nameservers registered with
.com are not authoritative.

Best regards,
   Wouter

And feedback on having 200 subdomains: this is not a resource problem.
Older queries are removed ('jostled' in the statistics counters) in favor
of new queries if there is a resource problem. Your nameserver has the
capacity to keep these queries for that extremely long time without having
to remove them for newer ones, so that is not an issue.

The reason it works again after a flush and restart is that the
first couple of queries use the nameservers advertised in the .com
referral, but these are soon replaced by the 75 non-working nameservers,
all of which unbound prefers (since the child is authoritative for its
domain!). Working through 75 timeouts before it tries the
parent-also-not-working servers takes a very long time, and your 'dig'
has timed out by that time. It will cache this, but the default of 15
minutes (infra-host-ttl) is probably too low to be able to help.
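
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. It assumes the dead servers are tried one after another with a
fixed timeout of 1 second each; that figure is purely hypothetical, since
unbound's real timeouts adapt per server. For the client side it uses dig's
default of 3 tries with a 5 second timeout.

```python
def resolver_delay(dead_servers: int, timeout_per_server: float) -> float:
    """Seconds spent timing out on dead servers before any fallback is tried."""
    return dead_servers * timeout_per_server


def client_budget(tries: int, timeout: float) -> float:
    """Seconds a client such as dig waits in total before giving up."""
    return tries * timeout


DEAD_SERVERS = 75        # size of the NS set reported in this thread
ASSUMED_TIMEOUT = 1.0    # hypothetical per-server timeout, in seconds

delay = resolver_delay(DEAD_SERVERS, ASSUMED_TIMEOUT)
budget = client_budget(tries=3, timeout=5.0)

print(f"resolver needs ~{delay:.0f}s, client waits {budget:.0f}s")
print("client gives up first:", delay > budget)
```

Even with this optimistic timeout the client gives up long before the
resolver has worked through the list.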

Best regards,
   Wouter

Hi Wouter,

Thanks for your swift and thorough answer!

This brings me to the next issue I have due to this groupinfra behaviour.

My resolver begins to show "requestlist exceeded" counters of up to 3K per second once my requestlist hits about 250.
My assumption is that it probably only sets up 512 slots for the requestlist at startup, while I configured the value 20480 for num-queries-per-thread.

But it seems that this config entry is somehow ignored.
Is there some way to check in unbound how many slots are actually allocated after startup?

I compiled with libevent so it should at least have 1024 for num-queries-per-thread.

Thanks,
mike

Hi Michael,

You need to configure
outgoing-range: 20480 too, so that it has sockets to service those 20480
requests in the requestlist.

libevent is good. You can check the values in effect with get_option in
unbound-control, for example: unbound-control get_option num-queries-per-thread

I'll point to http://unbound.net/documentation/howto_optimise.html for
the audience.

It could be that OpenBSD has a restrictive ulimit on the number of open
files, and that unbound throttles back its usage to fit in that ulimit
(of 256?). Check with ulimit -n. You can override it as root. Unbound
prints a warning at startup.
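
If you would rather check the limit from a script than with ulimit -n, a
minimal Python sketch follows. The helper names are made up for
illustration, and the comparison is a simplification: unbound needs file
descriptors for more than its outgoing sockets, so treat the result as a
rough indication only.

```python
import resource


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) limit on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_within_limit(outgoing_range: int, soft_limit: int) -> bool:
    """Rough check: can this many sockets be open under the soft limit?"""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return outgoing_range < soft_limit


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
print("outgoing-range 20480 fits:", fits_within_limit(20480, soft))
```

Run it as the user that starts unbound, since limits can differ per user.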

Best regards,
   Wouter