There seems to be a problem with the forward-zone: mechanism. If I set up
unbound to forward requests to my ISP's DNS cache server, and my Internet
connection fails for a few minutes while it is running (3-7 minutes on
average), then unbound freezes in a strange state and does not send any
queries at all until a hard restart (restarting with unbound-control does
not help, only the rc.d script does). Meanwhile, unbound continues to answer
for names that remain in its cache, but if I run nslookup for something
that is not cached, it returns SERVFAIL immediately, without any query
timeout. The bad news is that unbound stays in this "freeze condition"
even after the Internet connection has been reestablished.
The Internet connection does not fail physically (Ethernet no-carrier) but
only logically (no response from the gateway or something like that).
How to reproduce:
1) start unbound in ' forward-zone name: "." ' mode
2) block its communication with the forward-addr: DNS server
3) wait for ~5 min and send a lot of resolution queries during this time
4) reconnect the Internet - unbound will stay in the "freeze"
My system is FreeBSD 7.1-PRERELEASE; unbound is compiled from ports with
threads and linked against libevent-1.4.8.
What is happening is that the server has blacklisted the forwarder IP
address because it does not answer any queries (it has to be
unreachable for about 2 minutes or more for that to happen).
This blacklist has a TTL of 15 minutes by default.
You can set it in the config file:
    infra-host-ttl: 900   # default 900 seconds
You could set it to:
    infra-host-ttl: 60
It would then come back up within a minute after the connection is
reestablished.
This config parameter also sets how long roundtrip times and
EDNS support are cached. This cache is not cleared when you do a reload
command.
So, although this explains exactly what is happening to you, and
there is a config setting to work around the problem, I do not know how
I can help to fix it.
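
To make the timing concrete, here is a minimal sketch of the behaviour described above. It is only a model of the explanation in this message, not unbound's actual code: the names InfraEntry, mark_unresponsive and usable are hypothetical, and it assumes the blacklist entry simply expires infra-host-ttl seconds after it was created.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfraEntry:
    """Hypothetical model of one cached host entry (not unbound's real structure)."""
    host_ttl: float                        # plays the role of infra-host-ttl, in seconds
    blacklisted_at: Optional[float] = None

    def mark_unresponsive(self, now: float) -> None:
        # Remember when the host was first blacklisted.
        if self.blacklisted_at is None:
            self.blacklisted_at = now

    def usable(self, now: float) -> bool:
        # Usable if never blacklisted, or if the blacklist entry has expired.
        if self.blacklisted_at is None:
            return True
        return now >= self.blacklisted_at + self.host_ttl


def seconds_until_usable(host_ttl: float, blacklisted_at: float) -> float:
    """Seconds from the start of the outage (t=0) until the host is tried again."""
    return blacklisted_at + host_ttl


if __name__ == "__main__":
    # Assume the forwarder is blacklisted about 2 minutes into the outage.
    for ttl in (900, 60):
        minutes = seconds_until_usable(ttl, blacklisted_at=120) / 60
        print(f"infra-host-ttl {ttl}: tried again after {minutes:.0f} minutes")
```

Under these assumptions, a 900 second TTL means the forwarder is not tried again until 17 minutes after the outage began, while a 60 second TTL brings it back after 3 minutes.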
What is happening is that the server has blacklisted the forwarder IP
address because it does not answer any queries (it has to be
unreachable for about 2 minutes or more for that to happen).
Turning a 2 minute outage into a 17 minute outage by default is awful
behavior. Dmitriy is being hit particularly hard here because he's only
talking to one forwarder, but I assume this will happen just as easily with
the root, .com, etc if my internet connectivity goes down for 2 minutes but
my users are still actively trying to get somewhere new.
Blacklisting a subset of nameservers for a zone for a while is sane, as long
as you have someone left to talk to. But as soon as all possible IPs to
send a query to are marked unresponsive, you can't just decide to not do any
lookups for the zone for an extended period. Is it unreasonable to ask for
either a much shorter blacklist TTL in the all-IPs-unavailable case or to do
some form of low-volume probing (e.g. allow one query through per minute, as
a test)?
It would then come back up within a minute after the connection is
reestablished.
This should be the default.
This config parameter also sets how long roundtrip times and
EDNS support are cached.
It is unfortunate that they are the same parameter. I'd like to increase
the caching of roundtrip times and EDNS-support, and decrease the blacklist
TTL.
Thanks for your answer! After I changed infra-host-ttl to 60 sec, unbound
came back from the "freeze" correctly. It looks like I just did not have
enough patience: with infra-host-ttl: 900 I just couldn't wait for it to
come back.
Now I will remember this feature. It looks like 60 or 120 sec will be good
enough for me.
some form of low-volume probing (e.g. allow one query through per minute,
as a test)?
That sounds reasonable, I'll see what I can do.
Implemented in svn trunk: if all the servers are blacklisted, then
99% of queries are stopped and the rest are let through. So, very busy
domains get polled more often, without being a hindrance in traffic
volume. Quiet domains get the normal 15 minute timeout.
I think this might have changed Dmitriy's situation so that the forwarder
is probed quickly after the connection resumes.
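
The idea can be sketched roughly as follows. This is a hypothetical illustration and not the code in svn trunk: the function names and the probe_fraction parameter are made up for the example, and it assumes each query is let through independently at random.

```python
import random


def should_send_query(all_blacklisted: bool, rng: random.Random,
                      probe_fraction: float = 0.01) -> bool:
    """Decide whether a query is sent upstream (hypothetical sketch).

    When every server for the zone is blacklisted, only about
    `probe_fraction` of the queries are let through as probes.
    """
    if not all_blacklisted:
        return True
    return rng.random() < probe_fraction


def count_probes(num_queries: int, seed: int = 0) -> int:
    """Count how many of `num_queries` get through while all servers are blacklisted."""
    rng = random.Random(seed)
    return sum(should_send_query(True, rng) for _ in range(num_queries))


if __name__ == "__main__":
    print(count_probes(100_000), "of 100000 queries were sent as probes")
```

A busy zone that sees thousands of queries per minute would be probed many times per minute under this scheme, while a zone with only a handful of queries would most likely wait out the full TTL.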
Assuming each query independently has a 1% chance of getting through, this
works out to a 50% chance of unblacklisting an IP within 69 queries,
80% within 161 queries, 90% within 230 queries, and 95% within 299 queries.
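
For anyone who wants to check figures like these, here is a small sketch. It assumes that "99% of queries are stopped" means each query independently has a 1% chance of being let through; the function names are just for illustration.

```python
import math


def queries_needed(target: float, pass_probability: float = 0.01) -> int:
    """Smallest n such that 1 - (1 - pass_probability)**n >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_probability))


def chance_within(n: int, pass_probability: float = 0.01) -> float:
    """Probability that at least one of n queries is let through."""
    return 1.0 - (1.0 - pass_probability) ** n


if __name__ == "__main__":
    for target in (0.5, 0.8, 0.9, 0.95):
        n = queries_needed(target)
        print(f"{target:.0%}: {n} queries (chance {chance_within(n):.3f})")
```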
While I applaud the effort, doesn't it seem kind of strange to have the
behavior of the server be non-deterministic?
If you don't want to bloat the infrastructure cache with TTLs for this, how
about a global or per-thread rate-limit instead?
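
As a rough sketch of what such a rate limit could look like (purely hypothetical, not a proposal for unbound's actual code; the class name, the 60 second default and the injectable clock are my own choices for the example):

```python
import time
from typing import Callable, Optional


class ProbeRateLimiter:
    """Allow at most one probe per `interval` seconds (hypothetical sketch)."""

    def __init__(self, interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock                      # injectable so it can be tested
        self.last_probe: Optional[float] = None

    def allow_probe(self) -> bool:
        # Let a query through only if no probe was sent within the interval.
        now = self.clock()
        if self.last_probe is None or now - self.last_probe >= self.interval:
            self.last_probe = now
            return True
        return False


if __name__ == "__main__":
    limiter = ProbeRateLimiter(interval=60.0)
    print(limiter.allow_probe())   # first query is allowed as a probe
    print(limiter.allow_probe())   # an immediate second query is held back
```

With this approach the time to notice that a server is back is bounded by the interval, regardless of how many queries arrive.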