Unbound stops answering after ADSL-line bounce

Jan-Piet_Mens1 · January 23, 2012, 7:29am

Hello,

I'm running unbound-1.4.14 behind an ADSL line which is cycled once in a
while. This can take anything from several seconds to four/five minutes
(router reboot) before the Internet is visible again. At this point,
I've been experiencing Unbound replies with a SERVFAIL to all queries,
as though it marks the DNS servers as being down and stops sending
requests to them.

Is this a known issue?

(Somebody else has also experienced this problem with a vanilla Unbound
on what I call DAP [1], and he's reported this to me privately with a
pcap and verbosity=5 logfile I can submit privately if required.)

Regards,

-JP

[1]: http://jpmens.net/pages/dns/dnssec-appliance/

Wouter · January 23, 2012, 8:21am

Hi Jan-Piet,

Yes it marks servers as down and stops sending queries. There was a
rewrite for better handling of this case, if you run an older version
an update may be useful. However, all versions step down packets to
hosts that are down.

This has a timeout of x minutes (1-15). After that your service
should re-enable again. If the downtime was under two minutes, you
have to wait that amount of time for service to resume (exponential
backoff used here). If the downtime was longer, you may have to wait
15 minutes (infra-ttl config option) before service resumes to hosts
that were probed to be 'down'.

If you want service to instantly go down and up with the line
downtime, and you can notice the line-bounce via some other method,
then you could use e.g. unbound-control flush_infra to resume traffic.

Best regards,
Wouter

Jan-Piet_Mens1 · January 23, 2012, 12:04pm

Hello Wouter,

> This has a timeout of x minutes (1-15). After that your service
> should re-enable again. If the downtime was under two minutes, you
> have to wait that amount of time for service to resume (exponential
> backoff used here). If the downtime was longer, you may have to wait
> 15 minutes (infra-ttl config option) before service resumes to hosts
> that were probed to be 'down'.

If I understand you (and the man page) correctly, setting
`infra-host-ttl' to something like 10 seconds would mean that at most 10
seconds would elapse before Unbound starts querying dropped servers.
Maybe that would be a tolerable setting for a relatively low-volume SOHO
environment.
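
As a rough worked example of the recovery times described above, the worst-case wait can be modelled as follows. This is only a sketch of the explanation given in this thread, not Unbound's actual backoff code: the function name `worst_case_wait` and the `SHORT_OUTAGE_S` constant are made up for illustration, and 900 seconds is the default `infra-host-ttl`.

```python
# Illustrative model only: it mirrors the description in this thread,
# not Unbound's internal backoff implementation.
SHORT_OUTAGE_S = 120  # "under two minutes", per the explanation quoted above


def worst_case_wait(downtime_s: float, infra_host_ttl_s: float = 900) -> float:
    """Upper bound, in seconds, on how long upstream servers stay marked
    down after the line comes back."""
    if downtime_s < SHORT_OUTAGE_S:
        # Short outage: the backoff has grown to roughly the outage length,
        # but an infra cache entry cannot outlive its TTL.
        return min(downtime_s, infra_host_ttl_s)
    # Long outage: hosts were probed 'down' and stay that way until the
    # infra cache entry expires.
    return infra_host_ttl_s
```

Under this model a 600-second outage means a wait of up to 900 seconds with the default TTL, and up to 10 seconds with `infra-host-ttl: 10`.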

> If you want service to instantly go down and up with the line
> downtime, and you can notice the line-bounce via some other method,
> then you could use e.g. unbound-control flush_infra to resume traffic.

I'll certainly try that next time, although this will be difficult to
automate without continuously monitoring the line status.
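
One possible way to automate this is a small poller that flushes the infrastructure cache whenever the line goes from down to up. The sketch below makes several assumptions: the probe address `192.0.2.53` is a documentation placeholder for some host beyond the ADSL router, the function names are illustrative, and `unbound-control` must already be set up (`control-enable: yes`). The current unbound-control man page documents the command as `flush_infra all|IP`, so `all` is passed here.

```python
import socket
import subprocess
import time

# Hypothetical probe target: replace with a reachable host beyond the router.
PROBE_ADDR = ("192.0.2.53", 53)


def line_is_up(addr=PROBE_ADDR, timeout=3.0) -> bool:
    """Return True if a TCP connection to addr succeeds within timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def should_flush(was_up: bool, is_up: bool) -> bool:
    """Flush only on the down -> up transition."""
    return is_up and not was_up


def flush_infra() -> None:
    """Empty Unbound's infrastructure cache so upstreams are probed again."""
    subprocess.run(["unbound-control", "flush_infra", "all"], check=False)


def watch(check=line_is_up, flush=flush_infra, interval_s=10.0,
          rounds=None, sleep=time.sleep) -> None:
    """Poll the line; call flush() each time it comes back up.

    rounds=None polls forever; a number limits the loop (useful for tests).
    """
    was_up = True  # assume the line is up when the watcher starts
    done = 0
    while rounds is None or done < rounds:
        is_up = check()
        if should_flush(was_up, is_up):
            flush()
        was_up = is_up
        done += 1
        sleep(interval_s)
```

Calling `watch()` from a long-running service would poll every ten seconds. Whether a TCP probe is a good enough signal for "line is back" depends on the setup; a router status page or link-state hook may be more reliable.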

Regards,

-JP

Paul_Taylor · January 23, 2012, 6:40pm

Wouter,

Hi – I’m the DAP user that JP mentioned.

As a side note, I’m extremely impressed with the performance of Unbound. We are looking at using Unbound at my job and have been doing a bit of testing. Using ResPerf to stress test with a cleared cache resulted in a peak of about 23,500 queries per second with Unbound doing DNSSEC. This was on a Dell 2850 server with two dual-core Xeons running at 2.8 GHz under Ubuntu 12.04 alpha. We also tested Unbound with DNSSEC disabled and got over 35,000 queries per second. A third-party Windows DNS server (not performing DNSSEC validation) peaked at around 1250 queries per second under Windows 2003 on similar hardware.

Back to my home issue, though. The first time I experienced this issue, my internet connection had gone down for about an hour around 2 AM. It was about 7 AM before I noticed the problem (sleep has to happen sometime). I restarted Unbound, and it recovered.

The second time this happened, I had about 3 bounces in about 10 minutes during the afternoon. I believe each bounce took a minute or so to recover. I was at work at the time, and my wife and kids couldn’t get anywhere on the Internet. I got home a few hours later and DNS resolution was not working until I restarted Unbound.

So, in these two cases I’ve had outages of various lengths, but hours have passed without DNS resolution working.

Since most people using Unbound are probably using it for the DNSSEC capability, perhaps my configuration has to do with the issue I’m having recovering? In my environment, Unbound isn’t configured to go direct, but rather forward to various DNS servers. I have about 10-12 domains (mostly CDNs) that I’m forwarding to my ISP’s DNS servers so I get DNS replies directing me to close servers. Theoretically, this should help me have a better experience with Netflix at home. After the forwarder definitions for all the CDNs, I have a forwarder defined for “.” to send everything else to OpenDNS. This is to help keep my family from getting to websites I don’t want little eyes to run across.

Is it possible that with this type of config that it might cause Unbound to recover differently?

Thanks,

Paul

Wouter · January 24, 2012, 4:06pm

Hi Paul,

Nice that the performance looks good.

If you are running unbound under windows, there are some things to be
aware of. On windows, unbound has reduced capacity because it cannot
open a lot (thousands) of file descriptors. Windows simply lacks an
API that makes this possible, unless you spawn thousands of threads or
something similar. So, if the performance you see is based on
recursion, then using Linux (or FreeBSD) on a similar box should have
more capacity (you can configure unbound to have extra capacity). If
the performance you see is based on cache-responses, then the move to
Linux makes less of a difference. With capacity I mean recursing a
lot of user queries at the same time, with thousands of sockets open.

The capacity on windows today is more relevant to small workgroups or
desktop environments. With some code changes it could be improved,
e.g. with polling behaviour the number of sockets can be increased to
very large numbers. Today unbound sleeps the process nicely when not
busy in WSAWaitForMultipleEvents.

The easiest way today to get more capacity on windows, by the way, is
to increase the number of workers (num-thread) to 4 (or so).

You note unbound was down for a lengthy time, can you upgrade or if
this was a recent version get me more details? It should really fully
recover after 15 minutes from anything, I believe.

Best regards,
Wouter

Paul_Taylor · January 25, 2012, 1:00pm

Wouter,

Unfortunately, I’ve not had a chance to do further testing. I should be able to test this weekend. I plan to take my connection down for 10 minutes, then bring it back up and wait 20 minutes to see if Unbound will recover. While doing this, should I have a verbosity level 5 log file going? Would a pcap file at my router (filtered on the IP of the Unbound box) be helpful?

Thanks,

Paul

Wouter · January 26, 2012, 10:45am

Hi Paul,

> Wouter,
>
> Unfortunately, I’ve not had a chance to do further testing. I
> should be able to test this weekend. I plan to take my connection
> down for 10 minutes, then bring it back up and wait 20 minutes to
> see if Unbound will recover. While doing this, should I have a
> verbosity level 5 log file going? Would a pcap file at my router
> (filtered on the IP of the Unbound box) be helpful?

Packet info would be included in verbosity 5 already, the pcap dump
may be useful as a different format, but not really needed.

What would be nice and interesting is a look at unbound-control
dump_infra. This prints a text list of the probe status of
IPs. Maybe do it at start (>file1 to store it), when failed, and if
it stays down, afterwards.

Best regards,
Wouter

Paul_Taylor · January 26, 2012, 3:15pm

> Packet info would be included in verbosity 5 already, the pcap dump
> may be useful as a different format, but not really needed.
>
> What would be nice and interesting is a look at unbound-control
> dump_infra. This prints a text list of the probe status of
> IPs. Maybe do it at start (>file1 to store it), when failed, and if
> it stays down, afterwards.

Wouter,

I had an unexpected chance to test last night… When I tested before, I didn’t know that it could take up to 15 minutes to recover, so I assumed that I had successfully recreated the problem after bringing the WAN interface back up and finding that Unbound wasn’t sending the requests out within the next few minutes. Last night after waiting 15-20 minutes after bringing the WAN back up, everything recovered as you had explained that it should. I did this twice, and it recovered fine both times, so the problem I experienced before is not as easy to re-create as I previously thought. I’ve not seen an actual occurrence of this problem since last week. It happened twice within about a week previously.

So, I’ll try to turn on a verbosity level 5 log and then run the dump_infra command at startup, and again when it gets to a failed state (as in, failed for more than 15 minutes after a WAN link recovery). It looks like I’ll have to wait until there’s an actual natural occurrence of the problem again.

Thanks,

Paul

user20 · January 27, 2012, 10:39am

Quoting Paul Taylor <PaulTaylor@winn-dixie.com>:

> Since most people using Unbound are probably using it for the DNSSEC
> capability, perhaps my configuration has to do with the issue I'm having
> recovering? In my environment, Unbound isn't configured to go direct,
> but rather forward to various DNS servers. I have about 10-12 domains
> (mostly CDNs) that I'm forwarding to my ISP's DNS servers so I get DNS
> replies directing me to close servers. Theoretically, this should help
> me have a better experience with Netflix at home. After the forwarder
> definitions for all the CDNs, I have a forwarder defined for "." to send
> everything else to OpenDNS. This is to help keep my family from getting
> to websites I don't want little eyes to run across.
>
> Is it possible that with this type of config that it might cause Unbound
> to recover differently?

This reminds me of the issues we have when using Unbound with DNSSEC validation *and* using a forwarder. For some time it was Unbound using BIND 9.7.4 as the parent, but it also happened with a second Unbound instance as the parent: Unbound stopped resolving any names because of some obscure validation failure. We have "solved" the problem by setting the internal Unbound to not validate and letting the forwarder do the DNSSEC work.

Regards

Andreas

Jan-Piet_Mens1 · January 27, 2012, 10:54am

> We have "solved" the problem by
> setting the internal Unbound to not validate and let the forwarder
> do the DNSSEC work.

That would be a neat feature for DNSSEC-Trigger: detect that the
upstream forwarder is Unbound (version.bind chaos txt) and disable the
validator. Well, maybe not.

-JP

user20 · January 27, 2012, 12:57pm

Quoting Jan-Piet Mens <jpmens.dns@gmail.com>:

>> We have "solved" the problem by
>> setting the internal Unbound to not validate and let the forwarder
>> do the DNSSEC work.
>
> That would be a neat feature for DNSSEC-Trigger: detect that the
> upstream forwarder is Unbound (version.bind chaos txt) and disable the
> validator. Well, maybe not.

In our case it doesn't matter because both resolvers are managed by us, but for sure this should not be done automatically. Basically it looks like there are "rough-edges" when cascaded resolvers all try to do DNSSEC validation.

Regards

Andreas

Wouter · February 10, 2012, 10:05am

Hi Andreas,

> Quoting Jan-Piet Mens <jpmens.dns@gmail.com>:
>
>>> We have "solved" the problem by
>>> setting the internal Unbound to not validate and let the forwarder
>>> do the DNSSEC work.
>>
>> That would be a neat feature for DNSSEC-Trigger: detect that the
>> upstream forwarder is Unbound (version.bind chaos txt) and disable the
>> validator. Well, maybe not.
>
> In our case it doesn't matter because both resolvers are managed by us,
> but for sure this should not be done automatically. Basically it looks
> like there are "rough-edges" when cascaded resolvers all try to do
> DNSSEC validation.

This was with unbound at an older version? In 1.4.11 there has been a
fix that should help cascading validators. The issue is that the
downstream validator sends CD=1 queries to the upstream. Now, suppose
an authority server is outdated but another is not. Then the downstream
validator cannot perform failover to the other authority server, because
it has to talk to the upstream validator. The upstream validator cannot
perform failover to the other authority server because with CD=1 it is
not validating the query. The fix in 1.4.11 is to make the upstream
validator perform failover to the other authority server for CD=1
queries as well.
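
To make the CD=1 point concrete, here is a minimal sketch of what such a query looks like on the wire. It builds a DNS query by hand and sets the "checking disabled" bit in the header flags. The function names are illustrative, and this is not how Unbound builds its packets; a real validator would also add an EDNS OPT record with the DO bit, which is left out here.

```python
import struct

FLAG_RD = 0x0100  # recursion desired
FLAG_CD = 0x0010  # checking disabled: ask upstream not to validate


def build_query(name: str, qtype: int = 1, cd: bool = True,
                query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message, optionally with the CD bit set."""
    flags = FLAG_RD | (FLAG_CD if cd else 0)
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    labels = [label for label in name.split(".") if label]
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question


def has_cd(message: bytes) -> bool:
    """Return True if the CD bit is set in a DNS message header."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & FLAG_CD)
```

With CD set, the upstream resolver is asked to return data without validating it, which is why, before the fix described above, it had no reason to fail over to another authority server on its own.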

Best regards,
Wouter

Paul_Taylor · February 10, 2012, 1:28pm

On the original topic of this thread, I have another incident to report.
After experiencing some strangeness with my NAS (where unbound was
running previously), I moved Unbound to an installation of pfSense
running on an old net4801. I believe pfSense is still on version 1.4.14
of Unbound. I configured it pretty much identically to my NAS
installation of Unbound. By that, I mean that I have numerous
forwarders added for various CDNs, with a "." forwarder pointing to
OpenDNS. DNSSEC validation is disabled. About two weeks had passed
with no further problems, until this morning.

Just before I was about to leave home for work (just after 7 AM), my
daughter told me that the internet was down. I checked my router and
saw that the internet connection went down last night for a little over
an hour. It recovered about 3:15 AM. So, it had been up and
operational for almost 4 hours by the time I started looking at the
issue. A quick nslookup showed SERVFAIL replies. Since I had to leave
for work, I didn't have time to do much in the way of troubleshooting.
I recycled the service via pfSense's Services page (I think it just
kills and restarts the service), and DNS was resolving properly again.

Unfortunately, since it's on an embedded box, I didn't have logging
enabled, and I don't know what commands, if any, I could run that let
you see the "state" Unbound is stuck in when this happens.

Wouter · February 10, 2012, 2:34pm

Hi Paul,

> On the original topic of this thread, I have another incident to report.
> After experiencing some strangeness with my NAS (where unbound was
> running previously), I moved Unbound to an installation of pfSense
> running on an old net4801. I believe pfSense is still on version 1.4.14
> of Unbound. I configured it pretty much identically to my NAS
> installation of Unbound. By that, I mean that I have numerous
> forwarders added for various CDNs, with a "." forwarder pointing to
> OpenDNS. DNSSEC validation is disabled. About two weeks had passed
> with no further problems, until this morning.
>
> Just before I was about to leave home for work (just after 7 AM), my
> daughter told me that the internet was down. I checked my router and
> saw that the internet connection went down last night for a little over
> an hour. It recovered about 3:15 AM. So, it had been up and
> operational for almost 4 hours by the time I started looking at the
> issue. A quick nslookup showed SERVFAIL replies. Since I had to leave
> for work, I didn't have time to do much in the way of troubleshooting.
> I recycled the service via pfSense's Services page (I think it just
> kills and restarts the service), and DNS was resolving properly again.

It should not be down for that long; 15 minutes really.

> Unfortunately, since it's on an embedded box, I didn't have logging
> enabled, and I don't know what commands, if any, I could run that let
> you see the "state" Unbound is stuck in when this happens.

unbound-control verbosity 4 ; then nslookup and capture the logs (which
are then plentiful).

unbound-control dump_infra > tofile.txt
that shows the state of the infrastructure cache.
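
For collecting these dumps at several points in time (at start, when failed, afterwards), a small helper along these lines could save each one to a timestamped file. This is a sketch: the helper name `snapshot_infra` and the file naming scheme are made up for illustration, and it assumes `unbound-control` is on the PATH and allowed to talk to the running daemon.

```python
import subprocess
import time
from pathlib import Path


def snapshot_infra(label: str, out_dir: str = ".", run=subprocess.run) -> Path:
    """Save the output of 'unbound-control dump_infra' to a timestamped file.

    label says when the snapshot was taken, e.g. "start" or "failed".
    run is injectable so the helper can be exercised without Unbound.
    """
    result = run(
        ["unbound-control", "dump_infra"],
        capture_output=True, text=True, check=False,
    )
    stamp = time.strftime("%Y%m%dT%H%M%S")
    path = Path(out_dir) / f"infra-{label}-{stamp}.txt"
    path.write_text(result.stdout)
    return path
```

For example, `snapshot_infra("start")` right after startup and `snapshot_infra("failed")` once SERVFAILs appear would give two files to compare.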

Best regards,
Wouter

Paul_Taylor · February 10, 2012, 3:50pm

Thank you - I'll file these commands away for future reference.

Previously, I tried recreating the problem a few times, but after
waiting 15 minutes (per your previous advice) after the WAN recovery,
DNS has worked. I've not tried leaving my internet connection down for
much more than about 10 minutes, though.

Mark_Picone · August 21, 2012, 6:09am

Hi All,

Continuing from: http://comments.gmane.org/gmane.network.dns.unbound.user/1976

Does anyone know if cause/fix for this behaviour was ever discovered?

I considered playing with the 'infra-host-ttl' setting but we have seen outages of 10 hours+ (with unbound returning NXDOMAIN for all forwarded queries until it is restarted) so I don't think lowering from the default of 900 seconds would help in our case.

I am fairly confident the problem is unbound related as we have BIND resolvers running side-by-side and they have not skipped a beat, over many years.

For now we'll be moving back to BIND (9.8.0b1+, which has the static-stub zone type we need) but I would like to try unbound again at some point in the future.

Nagios check graphs which show the resolvability of 'google.com':
http://www.deakin.edu.au/~markp/unbound/unbound_resolver1.png
http://www.deakin.edu.au/~markp/unbound/unbound_resolver2.png
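
A resolvability check of this kind can be sketched in a few lines. The code below is not the Nagios check used for the graphs above. It is an illustrative stand-in that sends one A query over UDP and maps the response code to Nagios-style exit statuses (0 = OK, 2 = CRITICAL). All function names are assumptions, and it skips details a production check would want, such as matching the response ID or retrying.

```python
import socket
import struct

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def encode_query(name: str, query_id: int = 0x4242) -> bytes:
    """Build a minimal A/IN query with the RD bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(p)]) + p.encode("ascii") for p in name.split(".") if p
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def rcode_of(response: bytes) -> int:
    """Extract the 4-bit RCODE from a DNS message header."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F


def status_for(rcode: int):
    """Map an RCODE to (nagios_status, message)."""
    name = RCODE_NAMES.get(rcode, f"RCODE {rcode}")
    if rcode == 0:
        return 0, f"OK: resolver answered {name}"
    return 2, f"CRITICAL: resolver answered {name}"


def check_resolver(server: str, name: str = "google.com", timeout: float = 3.0):
    """Query server for name and return (nagios_status, message)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(encode_query(name), (server, 53))
            response, _ = sock.recvfrom(4096)
    except OSError as exc:
        return 2, f"CRITICAL: no answer from {server}: {exc}"
    if len(response) < 12:
        return 2, "CRITICAL: truncated DNS header"
    return status_for(rcode_of(response))
```

Calling `check_resolver("127.0.0.1")` against a local resolver and passing the first element of the result to `sys.exit` would give a Nagios-compatible exit status.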

user@host:~ %cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.2 (Santiago)

user@host:~ %uname -a
Linux host 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

user@host:~ %rpm -qa | grep unbound
unbound-1.4.14-1.el6.x86_64
unbound-libs-1.4.14-1.el6.x86_64

Regards,

Mark Picone
Unix Administrator

Deakin University
Geelong Waterfront Campus
1 Gheringhap Street, Geelong, VIC 3220
Phone: +61 3 52278602
Deakin University CRICOS Provider Code 00113B


Wouter · August 21, 2012, 6:58am

Hi Mark,

In 1.4.17 there is a fix called:
Fix timeouts to keep track of query type, A, AAAA and other, if
another has caused timeout blacklist, different type can still probe.

It could likely prevent this issue. You should not (with updated
unbound version) have this issue. It should probe the nameserver.
You are forwarding, unbound basically talks to one upstream machine?

Best regards,
Wouter