we are experiencing weird failures for local unbound installations
on FreeBSD. Under circumstances that we are not able to pinpoint
yet, the service answers SERVFAIL for every single request
including e.g. ". ns".
unbound-control flush_negative or flush_bogus does not fix the
problem; only a complete restart does so immediately.
If we do not restart the service, the problem vanishes again on its
own after somewhere between 12 and at most 90 minutes.
The version bundled seems to be 1.5.10.
Is this a known problem in this somewhat older version?
We could replace it with one from FreeBSD ports/packages.
If not, how would one go about getting more diagnostic output?
My bet would be that it's a phenomenon sometimes called "DNSSEC death",
caused by improperly implemented key rollover. Often the new keys,
signed by the old keys, are not published long enough in advance; when
the actual rollover is performed, the trust chain breaks (the new keys
are not trusted by the resolver) and it answers SERVFAIL for every
record in that zone until the old DNSSEC-related records expire from
the cache.
It may be a long shot without an actual diagnosis, but I have
encountered these symptoms before on domains with improperly
implemented DNSSEC key management on their authoritative nameservers.
Then increase the debug level and check the debug log to see how the
queries are processed - if the server fails, the debug log should
contain exact information about why it happens.
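One quick check for this hypothesis, assuming the resolver listens on
127.0.0.1 and the name and record type are placeholders: query once
normally and once with the CD (checking disabled) bit set. If the +cd
query succeeds while the plain query returns SERVFAIL, the data is
resolvable but fails DNSSEC validation.

```shell
# Plain query - returns SERVFAIL if DNSSEC validation is broken.
dig @127.0.0.1 www.example.com A

# Same query with the CD (checking disabled) bit set - if this one
# succeeds, the upstream data is reachable and validation itself
# is what is failing.
dig @127.0.0.1 www.example.com A +cd
```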
Can one increase the debug level for the already running process like
"rndc trace" for a running BIND named?
The problem arises very infrequently, so we did not want to keep
debug logging active permanently on all hosts (about 100 of them).
Yet, once one of the unbound processes shows the problem, our
Icinga will alert us, so we can investigate ...
Yes, unbound-control can modify the running process:
"unbound-control verbosity 5" enables logging at high detail.
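For reference, a possible session, assuming remote control is enabled
(control-enable: yes in unbound.conf). The change is not persistent
and lasts only until the daemon is restarted or reloaded:

```shell
# Raise the running daemon's log verbosity to maximum detail.
unbound-control verbosity 5

# ... reproduce / observe the failing queries, then read the log
# (syslog or the configured logfile) ...

# Restore a low verbosity afterwards to keep log volume sane.
unbound-control verbosity 1
```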
A few things we learned over the last couple of days:
- the actual "lifetime" of these SERVFAIL answers is somewhere
between 12 and 30 minutes, probably 15 minutes. After that, unbound
"magically" works again.
- "unbound-control flush ." fixes it most of the time, but not always.
Sometimes only stopping and starting unbound restores operation
immediately.
- we forward requests to a BIND nameserver via IPv6. Restarting that
nameserver or flushing its cache does not make unbound work again.
- dumping the unbound cache during a failure works, and the result
looks like a normal cache dump, frequently holding the very entries
we just queried but got SERVFAIL for (and will get SERVFAIL for again
when asking during the "failure lifetime").
This sounds like the name server you are forwarding to is sometimes
unreachable and gets marked as such in Unbound. Unbound stores this
information in its infra cache; the TTL for this cache is 15 minutes
by default.
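If that is the case, the infra cache can be inspected on a failing
host; a sketch, where 2001:db8::53 stands in for your forwarder's
actual IPv6 address:

```shell
# Dump the infrastructure cache: one line per upstream server with
# its estimated RTT and timeout state.
unbound-control dump_infra

# Narrow the dump down to the BIND forwarder's address.
unbound-control dump_infra | grep 2001:db8::53
```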
Besides looking into why the upstream is unreachable, you could
consider lowering the infra cache TTL using the infra-host-ttl option.
Flushing the infra cache using "unbound-control flush_infra all"
should also make it work again.
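A minimal unbound.conf sketch lowering the infra TTL (900 seconds,
i.e. 15 minutes, is the default); the 60-second value here is just an
example, not a recommendation:

```
server:
    # Keep reachability information about upstream servers for
    # 60 seconds instead of the default 900 (15 minutes).
    infra-host-ttl: 60
```

After changing the option, apply it with "unbound-control reload";
during an incident, "unbound-control flush_infra all" clears the
cached reachability state immediately without a restart.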