Requestlist filling / automatic cleanup?

Hi,

I don't really understand how old and often wrong queries get flushed from the request list. The man page says that jostle-timeout is triggered when the server is very busy. What defines 'busy'?

For instance we often have a lot of crappy queries towards groupinfra.com (see attachment for a dump_requestlist | grep groupinfra) and they seem to never go away, filling more than half of our request list (e.g. 195/375). Could that impact unbound's responsiveness?

Note: jostle-timeout is still set to the default (see my config below).

I am asking that because sometimes our unbounds have a random hiccup and I am wondering if it could be due to this or not. The 'hiccup' is very hard to debug because it's random (once a month or so) on servers doing something like 500 to 1500 qps each so increasing the verbosity from 1 to 2 is not really possible :)

Cheers,
Thomas

(attachments)

GRPINFRA (78 KB)

Hi Thomas,

What version are you using? Recently the timeout code was changed to
cope with this sort of situation (1.4.7):
http://www.unbound.net/documentation/info_timeout.html

Hi,

I don't really understand how old and often wrong queries get flushed
from the request list. The man page says that jostle-timeout is
triggered when the server is very busy. What defines 'busy'?

The requestlist is full.

Your requestlist size is the default, so about 1000, and 300 entries
do not fill it up. I would recommend a recompile with libevent
because of your somewhat high load (then you can increase the
requestlist and range to several thousand, and in recent versions the
default increases by itself,
http://www.unbound.net/documentation/howto_optimise.html )

For instance we often have a lot of crappy queries towards
groupinfra.com (see attachment for a dump_requestlist | grep
groupinfra) and they seem to never go away, filling more than half of
our request list (e.g. 195/375). Could that impact unbound's
responsiveness?

No, other queries take priority over these older queries.

The requestlist is divided into two halves: run-to-completion and
fast-stuff. The run-to-completion half is exactly that. The
fast-stuff half deletes older queries to make room for new queries
(but only once the jostle-timeout has expired, otherwise everything
that comes in could be deleted immediately under a DoS).

Note: jostle-timeout is still set to the default (see my config below).

Yes that should be OK. If you lower it, it will be more likely to drop
the groupinfra stuff.

I am asking that because sometimes our unbounds have a random hiccup and
I am wondering if it could be due to this or not. The 'hiccup' is very
hard to debug because it's random (once a month or so) on servers doing
something like 500 to 1500 qps each so increasing the verbosity from 1
to 2 is not really possible :)

What seems to happen is that groupinfra has a lot of servers, and
they sometimes experience outages. During an outage, unbound gets
timeouts and tries to fetch the names, but also the other nameserver
names (and there are a lot of them). Given user demand for
groupinfra, unbound starts to explore all the nameservers for
groupinfra, with timeouts, and thus the entries fill up your
requestlist. The dependency structure looks like the log excerpt you
show. Because of the timeouts those entries are necessarily pretty
old, and thus (the ones in the fast-stuff list) would be dropped to
make room for new queries if there were a lack of space. But there is
no lack of space, so these queries are performed: there is interest,
and there is capacity to undertake actions to find the answers.

Best regards,
   Wouter

Hi Wouter,

Excellent explanations and fast reply as usual, many thanks.

What version are you using? Recently the timeout code was changed to
cope with this sort of situation (1.4.7):
http://www.unbound.net/documentation/info_timeout.html

Oops, sorry, I forgot to mention it: I am using the latest, 1.4.8.

It's running on CentOS 5.5 (old 2.6.18 kernel, sadly). We built our own packages, and it should have libevent. I started a thread a while ago where I asked for an explicit way to be sure that we have and are using libevent, and you told me that I was using it, IIRC :)

unbound-libs-1.4.8-2.el5
unbound-1.4.8-2.el5
ldns-1.6.8-1.el5
libevent-1.4.13-1

unbound-control status
version: 1.4.8
verbosity: 1
threads: 1
modules: 2 [ validator iterator ]
uptime: 2773636 seconds
unbound (pid 3952) is running...

Version 1.4.8
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.8, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for i386-redhat-linux-gnu on Wed Feb 16 10:26:27 EST 2011 with options: '--build=i386-koji-linux-gnu' '--host=i386-koji-linux-gnu' '--target=i386-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/usr/com' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-ldns=' '--with-libevent' '--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug' '--disable-static' '--with-conf-file=/etc/unbound/unbound.conf' '--with-pidfile=/var/run/unbound/unbound.pid' '--disable-gost' '--enable-sha2'
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl

[..] jostle-timeout is triggered when the server is very busy. What
defines 'busy'?

The requestlist is full.

Ok. I think this should be clarified in the documentation; I can send you a patch if you want, to save you time.

Your requestlist size is the default, so about 1000, and 300 entries
do not fill it up. I would recommend a recompile with libevent
because of your somewhat high load (then you can increase the
requestlist and range to several thousand, and in recent versions the
default increases by itself,
http://www.unbound.net/documentation/howto_optimise.html )

I have read this document many times since I started using unbound (and I will read it again ;). But what parameter defines the requestlist size, or actually influences it?

[..] Could that impact unbound's responsiveness?

No, other queries take priority over these older queries.

Ok.

The requestlist is divided into two halves: run-to-completion and
fast-stuff. The run-to-completion half is exactly that. The
fast-stuff half deletes older queries to make room for new queries
(but only once the jostle-timeout has expired, otherwise everything
that comes in could be deleted immediately under a DoS).

Thanks for the explanation. Is this also written somewhere in the docs?

Note: jostle-timeout is still set to the default (see my config below).

Yes that should be OK. If you lower it, it will be more likely to drop
the groupinfra stuff.

Ok. I may have some questions about that, but I will first read the docs about jostle-timeout.

I am asking that because sometimes our unbounds have a random hiccup and
I am wondering if it could be due to this or not. The 'hiccup' is very
hard to debug because it's random (once a month or so) on servers doing
something like 500 to 1500 qps each so increasing the verbosity from 1
to 2 is not really possible :)

Ok, so I think I will have to write a script that increases verbosity when unbound seems unable to resolve anymore, and hopefully I will be able to catch this nasty issue (it could be network related).

What seems to happen is that groupinfra has a lot of servers, and
they sometimes experience outages. During an outage, unbound gets
timeouts and tries to fetch the names, but also the other nameserver
names (and there are a lot of them). Given user demand for
groupinfra, unbound starts to explore all the nameservers for
groupinfra, with timeouts, and thus the entries fill up your
requestlist. The dependency structure looks like the log excerpt you
show. Because of the timeouts those entries are necessarily pretty
old, and thus (the ones in the fast-stuff list) would be dropped to
make room for new queries if there were a lack of space. But there is
no lack of space, so these queries are performed: there is interest,
and there is capacity to undertake actions to find the answers.

Yep, ok, I understand, but it is still weird to see unbound trying to resolve something almost forever. For instance 143000 secs, aka 39 hours :) But we have resources, so maybe one day it will work (I reckon this domain just never works ;).

252 AAAA IN uk-dc007.groupinfra.com. 142994.571268 iterator wants AAAA IN au-dc012.groupinfra.com. AAAA IN br-dc003.groupinfra.com. AAAA IN de-dc008.groupinfra.com. AAAA IN my-dc003.groupinfra.com. AAAA IN nl-dc006.groupinfra.com. AAAA IN ph-dc001.groupinfra.com.

-Thomas

Hi Thomas,

Hi Wouter,

Excellent explanations and fast reply as usual, many thanks.

:)

What version are you using? Recently the timeout code was changed to
cope with this sort of situation (1.4.7):
http://www.unbound.net/documentation/info_timeout.html

Oops, sorry, I forgot to mention it: I am using the latest, 1.4.8.

Good to know. It has newer timeout code that should minimise the
impact of the event you report.

It's running on CentOS 5.5 (old 2.6.18 kernel, sadly). We built our
own packages, and it should have libevent. I started a thread a while
ago where I asked for an explicit way to be sure that we have and are
using libevent, and you told me that I was using it, IIRC :)

uptime: 2773636 seconds

One month uptime, nice :)

Ok, you are using libevent, so you can increase your requestlist size
(significantly) above 300.

[..] jostle-timeout is triggered when the server is very busy. What
defines 'busy'?

The requestlist is full.

Ok. I think this should be clarified in the documentation; I can send
you a patch if you want, to save you time.

Yes please, I do not know where to make the edit.

Your requestlist size is the default, so about 1000, and 300 entries
do not fill it up. I would recommend a recompile with libevent
because of your somewhat high load (then you can increase the
requestlist and range to several thousand, and in recent versions the
default increases by itself,
http://www.unbound.net/documentation/howto_optimise.html )

I have read this document many times since I started using unbound
(and I will read it again ;). But what parameter defines the
requestlist size, or actually influences it?

num-queries-per-thread: the requestlist size
outgoing-range: number of sockets

So, with libevent, you can set num-queries-per-thread to, say, 4096,
and you can set the socket count, outgoing-range, to, say, 8192.

It would allow unbound to open 8192 network connections (ports).

It would allow unbound to service 4096 queries that are not answered
directly from cache, of which 2048 are run-to-completion and 2048 are
on the fast-stuff list (whose entries are removed to make space for
newer queries).
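Expressed as an unbound.conf fragment, those illustrative numbers
would look like this (the values are examples to scale to your own
load, not recommendations):

```
server:
    # requestlist size per thread: queries worked on simultaneously
    # (going this far beyond the default needs libevent)
    num-queries-per-thread: 4096
    # number of outgoing sockets (ports) per thread
    outgoing-range: 8192
```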

[..] Could that impact unbound's responsiveness?

No, other queries take priority over these older queries.

Ok.

You can see in the statistics (if you use the munin plugin, cacti, or
another monitoring system) whether unbound jostled or dropped packets.
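The relevant counters in the unbound-control stats output are
total.requestlist.overwritten (queries jostled out) and
total.requestlist.exceeded (queries dropped outright). A minimal
sketch of pulling them out; it is run here against an invented sample
of the stats output rather than a live daemon:

```shell
# Sample lines in the key=value format that unbound-control stats
# prints; the values are invented for illustration.
stats='total.requestlist.avg=120.5
total.requestlist.max=375
total.requestlist.overwritten=12
total.requestlist.exceeded=0'

# overwritten = jostled out early; exceeded = dropped because the
# requestlist was completely full
printf '%s\n' "$stats" | grep -E 'requestlist\.(overwritten|exceeded)'
# prints:
# total.requestlist.overwritten=12
# total.requestlist.exceeded=0
```

Against a live daemon you would pipe unbound-control stats into the
same grep; a nonzero exceeded counter means clients saw drops.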

The requestlist is divided into two halves: run-to-completion and
fast-stuff. The run-to-completion half is exactly that. The
fast-stuff half deletes older queries to make room for new queries
(but only once the jostle-timeout has expired, otherwise everything
that comes in could be deleted immediately under a DoS).

Thanks for the explanation. Is this also written somewhere in the
docs?

The man page explains the settings under the jostle heading, but that
is configuration help, not an operations manual.

Note: jostle-timeout is still set to the default (see my config below).

Yes that should be OK. If you lower it, it will be more likely to drop
the groupinfra stuff.

Ok. I may have some questions about that, but I will first read the
docs about jostle-timeout.

The idea is that older queries are removed to make space for newer
queries. But under a DoS we must not remove everything all the time;
we want each query to be able to do one 'quantum' of work, that is,
make a UDP request upstream and get an answer. This is the
jostle-timeout: about an average UDP ping. For queries that need
multiple UDP packets, intermediate results get stored in the cache,
so this way the queries make progress towards answering clients.

If you are usually on low ping, you could lower the value.
If you are usually on a higher ping (e.g. Australia), a higher value
might more accurately represent 'one useful action'.

The default is 200 msec, so it will often let (somewhat local)
queries complete under a DoS.

This is all explanation for the general audience; in this case the
groupinfra entries are hours old, and thus would get dropped right
away if there were a resource problem.
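As a config sketch of that reasoning (200 is the shipped default;
lowering it is only a suggestion for low-latency networks):

```
server:
    # roughly one average UDP round-trip to an upstream server;
    # lower on low-ping networks, raise for far-away upstreams
    jostle-timeout: 200
```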

I am asking that because sometimes our unbounds have a random hiccup and
I am wondering if it could be due to this or not. The 'hiccup' is very
hard to debug because it's random (once a month or so) on servers doing
something like 500 to 1500 qps each so increasing the verbosity from 1
to 2 is not really possible :)

Ok, so I think I will have to write a script that increases verbosity
when unbound seems unable to resolve anymore, and hopefully I will be
able to catch this nasty issue (it could be network related).

What would be more interesting is 'unbound-control lookup
groupinfra.com'. This lists the nameservers, the TTLs, and so on.

What seems to happen is that groupinfra has a lot of servers, and
they sometimes experience outages. During an outage, unbound gets
timeouts and tries to fetch the names, but also the other nameserver
names (and there are a lot of them). Given user demand for
groupinfra, unbound starts to explore all the nameservers for
groupinfra, with timeouts, and thus the entries fill up your
requestlist. The dependency structure looks like the log excerpt you
show. Because of the timeouts those entries are necessarily pretty
old, and thus (the ones in the fast-stuff list) would be dropped to
make room for new queries if there were a lack of space. But there is
no lack of space, so these queries are performed: there is interest,
and there is capacity to undertake actions to find the answers.

Yep, ok, I understand, but it is still weird to see unbound trying to
resolve something almost forever. For instance 143000 secs, aka 39
hours :) But we have resources, so maybe one day it will work (I
reckon this domain just never works ;).

Yes, it is exactly that. There are resources, unbound thinks that
(potentially) an answer might come back, and thus 'probes' these
servers. You could blacklist 10.0.0.0/8 with the
do-not-query-address: config option (if you do not use those
addresses yourself), as groupinfra.com advertises a lot of servers
from there, and they'll never respond.
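A sketch of that blacklist in unbound.conf (only appropriate if you
do not use RFC 1918 space for your own nameservers):

```
server:
    # never send queries to this netblock; groupinfra.com advertises
    # nameservers in 10/8 that are unreachable from the outside
    do-not-query-address: 10.0.0.0/8
```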

252 AAAA IN uk-dc007.groupinfra.com. 142994.571268 iterator wants AAAA
IN au-dc012.groupinfra.com. AAAA IN br-dc003.groupinfra.com. AAAA IN
de-dc008.groupinfra.com. AAAA IN my-dc003.groupinfra.com. AAAA IN
nl-dc006.groupinfra.com. AAAA IN ph-dc001.groupinfra.com.

Yes that is indeed a very long time.

Best regards,
   Wouter