Unbound strange stub_zone behavior?

Hi there,

I have an unbound server that acts as a recursive resolver for clients and also acts as a target for fully delegated DNS (i.e. the NS record for the delegated zone points at unbound).

For the fully-delegated domain it is a simple stub zone with an upstream of localhost on a different port. Let's call it "blah.example.com".

Occasionally, unbound (this has happened on versions 1.10.1 and 1.7.3) will start responding to non-recursive queries with the list of root servers instead of a response from the stub-zone. Clients that set the `rd` (recursion desired) flag are fine: they continue to resolve records in the stub-zone and receive correct records from unbound (via the stub server). Only queries without `rd` are affected. All records in seemingly all stub-zones show this behavior simultaneously.

I don't know what triggers it, but a full restart of unbound is the only thing that fixes it. I've tried flushing the cache, flushing infra, and everything else I could think of; nothing seems to make a difference.

I've seen only 2 things that may point to the issue.

- With verbosity turned up to 10, there's an entry that shows up in the strace output (but not in the actual log, which may be a misconfiguration on my side):

  "unbound[2213085:5] debug: answer from the cache failed"

- Stracing the "broken" unbound process shows a very tight recvmsg() (of the request) followed by sendmsg() (with the root servers), with no syscalls in between.

Again, using dig with +recurse works all the time, even when unbound gets into this state. So this looks like an unbound bug, cache corruption, or something similar?
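For anyone who wants to compare the two cases without dig, here is a minimal sketch in Python that builds the same query with and without the RD bit and summarizes the response header. The function names (`build_query`, `parse_header`, `query_server`) are made up for illustration, and the server address and port are whatever your unbound listens on. It only looks at the header counts: a referral would typically show zero answers and a non-empty authority section.

```python
import socket
import struct

RD_FLAG = 0x0100  # "recursion desired" bit in the DNS header flags word (RFC 1035)


def build_query(qname, rd, qtype=1, txid=0x1234):
    """Build a minimal DNS query in wire format; qtype=1 is an A record."""
    flags = RD_FLAG if rd else 0
    # Header: id, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)
    labels = [label for label in qname.rstrip(".").split(".") if label]
    encoded = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    # Name terminator, then QTYPE and QCLASS=IN
    question = encoded + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def parse_header(message):
    """Return the RD bit, RCODE and section counts from a DNS message."""
    _txid, flags, _qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {
        "rd": bool(flags & RD_FLAG),
        "rcode": flags & 0x000F,
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


def query_server(server, port, qname, rd, timeout=2.0):
    """Send one UDP query and return the parsed response header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(qname, rd), (server, port))
        response, _addr = sock.recvfrom(4096)
    return parse_header(response)
```

Calling `query_server("127.0.0.1", 53, "blah.example.com", rd=False)` and again with `rd=True` (address and port are placeholders) should make the difference between the two cases visible in the `answers` and `authority` counts.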

Any ideas?

If it is a bug, you may want to try a workaround while waiting for a fix. You could try "auth-zone:" instead of "stub-zone:", or as a companion to "stub-zone:". You may need to give the Unbound instance permission to do a wholesale zone transfer from the authoritative server. This may help avoid some undiscovered bug in piecemeal zone recursion.
- Eric

Hi Andrew,

I believe that stub-zones will not work correctly for +norecurse (RD (recursion desired) flag unset) queries. Also, if your blah.example.com has delegations to subzones (even on the same server) and you use a non-standard port, you would need a stub-zone for each sub-zone.

I would follow Eric's advice to use an auth-zone, either as primary or secondary server (depending on your authoritative requirements).

Regards,

Jan.

> Hi Andrew,
>
> I believe that stub-zones will not work correctly for +norecurse (RD (recursion desired) flag unset) queries. Also, if your blah.example.com has delegations to subzones (even on the same server) and you use a non-standard port, you would need a stub-zone for each sub-zone.

After restarting unbound, non-recursive queries work fine for several days, until they don't (I'm not sure why). My understanding is that a stub-zone presents as if it's local data, and the behavior you're describing sounds more like that of a forward-zone.

> I would follow Eric's advice to use an auth-zone, either as primary or secondary server (depending on your authoritative requirements).

Yeah, thanks Eric and Jan. I'll take a look at that. I'm not sure the "proxied" DNS server can send notifies, but it seems to be a good lead.

-Andrew

Just to bump this again, here’s the progress so far: we’ve been able to reproduce this with auth-zones too.

With my limited knowledge of unbound code and gdb it appears that in answer_norec_from_cache:

daemon/worker.c:492 (or so):

    answer_norec_from_cache(…) {

        dp = dns_cache_find_delegation(&worker->env, qinfo->qname,
            qinfo->qname_len, qinfo->qtype, qinfo->qclass,
            worker->scratchpad, &msg, timenow);
        if(!dp) { /* no delegation, need to reprime */
            return 0;
        }

In the happy case, dp is NULL, meaning there’s no delegation (so it hits return 0), and the correct answer is returned.

In the failure case: dp is a delegation point to what looks like the root zones:

Ok, I think we found the issue.

First, when an auth-zone is set to for-downstream: no, the cache is used first. Stub-zones seem to behave the same way as an auth-zone with for-downstream: no.

Somehow, in our environment, someone is triggering a cache of the root zone NS records for “.”, which causes unbound to do a referral to the root zone instead of answering from auth data.

I was able to reproduce this by triggering it with dig -t NS "." @server
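To make the failure mode easier to picture, here is a toy model in Python of the lookup described above. This is not unbound's code: `find_delegation`, `answer_norec` and the dictionary-based cache are invented for illustration, and the real cache lookup does much more. The sketch only assumes what the thread reports, namely that the non-recursive path looks for the closest cached delegation, answers from the stub or auth data when none is found, and returns a referral when one is.

```python
def find_delegation(cache, qname):
    """Return the closest enclosing name that has cached NS records, or None.

    `cache` is a plain dict mapping an owner name (with trailing dot) to a
    list of name server names. It stands in for the resolver's cache.
    """
    labels = [label for label in qname.rstrip(".").split(".") if label]
    # Walk from the full name up to the root: a.b.c. -> b.c. -> c. -> .
    for i in range(len(labels) + 1):
        candidate = ".".join(labels[i:]) + "."
        if candidate in cache:
            return candidate
    return None


def answer_norec(cache, qname):
    """Toy version of the non-recursive answer path discussed in the thread."""
    dp = find_delegation(cache, qname)
    if dp is None:
        # Nothing cached: fall through to the stub/auth data (the happy case).
        return "answer from stub/auth data"
    # A cached delegation exists, so a referral to it is sent instead.
    return "referral to " + dp


cache = {}
before = answer_norec(cache, "blah.example.com.")

# Someone asks for the root NS set (e.g. dig -t NS "." @server) and it gets cached.
cache["."] = ["a.root-servers.net.", "b.root-servers.net."]
after = answer_norec(cache, "blah.example.com.")
```

In this model `before` is the answer from local data and `after` is a referral to ".", which matches the symptom: once the root NS set is in the cache, every non-recursive query under any stub-zone finds the root as its closest delegation.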

More details:
https://github.com/NLnetLabs/unbound/issues/292

-Andrew