"harden-referral-path" returns SERVFAIL on subsequent queries

Hello.

I know referral path hardening is an experimental feature. But when
testing new software, I always like to see what all the knobs and
switches do and how they work, where they help and, especially, what
they break and what the symptoms are, so I can recognize them more easily.

I have a (not-DNSSEC-signed) subdomain with what I think is a mostly
straightforward referral path, except that it shares some but not all
nameservers with its parent zone.

With "harden-referral-path" and DNSSEC validation enabled, I can query
every name in that domain once, and again for as long as that answer
stays in the cache. Subsequent queries after TTL expiry return SERVFAIL.

Disabling *either* "harden-referral-path" or "dlv-anchor-file" fixes the
problem.

This is with Unbound 1.3.0. The authoritative servers for my subdomain
run tinydns, BIND 9.6 and NSD 3, although the result is always the same,
regardless of which server Unbound queries.

Any thoughts on this? Although I probably won't enable
"harden-referral-path" in a production environment, this problem keeps
me wondering because I don't see where my delegation path would be
wrong. Or is it the path to dlv.isc.org that causes it? I can repeatedly
query other low-TTL records, though, without any failures.

The first query looks like this:

info: resolving <frell.ambush.de. A IN>
info: response for <frell.ambush.de. A IN>
info: reply from <ambush.de.> 82.197.159.53#53
info: query response was ANSWER
info: resolving <frell.ambush.de.dlv.isc.org. DLV IN>
info: response for <frell.ambush.de.dlv.isc.org. DLV IN>
info: reply from <dlv.isc.org.> 199.6.1.29#53
info: query response was ANSWER
info: validate(nxdomain): sec_status_secure
info: validation success <frell.ambush.de.dlv.isc.org. DLV IN>

resulting in this response:

;; ANSWER SECTION:
frell.ambush.de. 23 IN A 85.177.253.56

;; AUTHORITY SECTION:
frell.ambush.de. 230042 IN NS ns1.frell.ambush.de.
frell.ambush.de. 230042 IN NS ns3.frell.ambush.de.
frell.ambush.de. 230042 IN NS ns4.frell.ambush.de.
frell.ambush.de. 230042 IN NS a.ns.ambush.de.
frell.ambush.de. 230042 IN NS b.ns.ambush.de.
frell.ambush.de. 230042 IN NS n.ns.ambush.de.

I can repeat the query for the next 23 seconds and get the cached answer.

After the TTL expires, the next lookup looks fine in the log file but
returns SERVFAIL:

info: resolving <frell.ambush.de. A IN>
info: response for <frell.ambush.de. A IN>
info: reply from <frell.ambush.de.> 213.9.73.106#53
info: query response was ANSWER
info: resolving <frell.ambush.de.dlv.isc.org. DLV IN>

What stands out to me is that Unbound always queries for both the A
record and the DLV record. Shouldn't it have cached the NXDOMAIN answer
to the DLV query earlier? It never does this for other records with low
TTL (like frell.ath.cx). Also, "resolving" is actually the last entry;
there are no "reply" and "query response" entries following it. Indeed,
tcpdump shows only the A request going out, no query for the DLV record.

I can resolve other names in the zone (e.g. mail.frell.ambush.de) and
always get the same behaviour. The first query succeeds, subsequent
queries (after the initial TTL has expired) all fail.
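
For anyone who wants to reproduce this without dig, here is a minimal
sketch in Python using only the standard library. It sends the same A
query twice, once before and once after the cached answer should have
expired, and prints the RCODE each time. It assumes Unbound listens on
127.0.0.1 port 53 and that 30 seconds is longer than the remaining TTL
(23 seconds in the answer above); the helper names are my own and purely
illustrative.

```python
import socket
import struct
import time

# Only the RCODEs relevant here; anything else is shown as its number.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: one question, class IN, RD bit set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the RCODE name from the low four bits of the flags field."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(resolver, name, timeout=5.0):
    """Send one A query over UDP and return the RCODE of the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        response, _ = sock.recvfrom(4096)
    return rcode_of(response)


if __name__ == "__main__":
    resolver = "127.0.0.1"  # assumption: local Unbound instance
    name = "frell.ambush.de"
    print("first query: ", ask(resolver, name))
    time.sleep(30)  # assumption: longer than the record's TTL
    print("after expiry:", ask(resolver, name))
```

With the behaviour described above I would expect the first line to show
NOERROR and the second SERVFAIL, but the sleep interval has to be adjusted
to the real TTL of the record being tested.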

Hauke.

Hi Hauke,

After some testing, I reproduced the bug and fixed it in svn r1657.
The bug was that on a second validation pass a message lost its secure
status, which caused the DLV lookup to abort and SERVFAIL to be returned
to the user.

This issue may also have impacted your other question. The fix is in
the subversion trunk of the code.

Hauke Lampe wrote:

Hello.

I know referral path hardening is an experimental feature. But when
testing new software, I always like to see what all the knobs and
switches do and how they work, where they help and, especially, what
they break and what the symptoms are, so I can recognize them more easily.

Yeah!

Disabling *either* "harden-referral-path" or "dlv-anchor-file" fixes the
problem.

Yes, having both enabled changes the order in which recursion is
processed, which makes it more likely that things are picked up from the
cache and shown to the validator again.

This is with Unbound 1.3.0. The authoritative servers for my subdomain
run tinydns, BIND 9.6 and NSD 3, although the result is always the same,
regardless of which server Unbound queries.

Good

What stands out to me is that Unbound always queries for both the A
record and the DLV record. Shouldn't it have cached the NXDOMAIN answer
to the DLV query earlier? It never does this for other records with low
TTL (like frell.ath.cx). Also, "resolving" is actually the last entry;
there are no "reply" and "query response" entries following it. Indeed,
tcpdump shows only the A request going out, no query for the DLV record.

Yes, it says "resolving", but then prints nothing more because it is
processing results from the cache. Maybe 'internal recursion to this
name' would be a better term to print, but that is so long.
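
To make that pattern easier to spot in a log, here is a rough sketch in
Python (not part of Unbound; the function name is hypothetical). It lists
every "resolving" line that is not followed by a "reply from" line before
the next "resolving" line, which in the log above points at lookups that
were handled from the cache. It assumes the log lines belong to a single
client query and appear in order, which will not hold on a busy resolver.

```python
import re

RESOLVING = re.compile(r"info: resolving <(.+)>")
REPLY = re.compile(r"info: reply from ")


def resolving_without_reply(lines):
    """Return queries logged as 'resolving' with no 'reply from' after them."""
    pending = None      # query seen in the latest 'resolving' line, if any
    unanswered = []
    for line in lines:
        match = RESOLVING.search(line)
        if match:
            if pending is not None:
                unanswered.append(pending)
            pending = match.group(1)
        elif REPLY.search(line):
            pending = None  # an upstream server answered this lookup
    if pending is not None:
        unanswered.append(pending)
    return unanswered


# The log lines of the failing lookup, as quoted in the report.
sample = [
    "info: resolving <frell.ambush.de. A IN>",
    "info: response for <frell.ambush.de. A IN>",
    "info: reply from <frell.ambush.de.> 213.9.73.106#53",
    "info: query response was ANSWER",
    "info: resolving <frell.ambush.de.dlv.isc.org. DLV IN>",
]
print(resolving_without_reply(sample))
```

For the failing lookup this reports only the DLV query, which matches
what tcpdump showed: no packet went out for it.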

I can resolve other names in the zone (e.g. mail.frell.ambush.de) and
always get the same behaviour. The first query succeeds, subsequent
queries (after the initial TTL has expired) all fail.

Thanks for the report,
   Wouter

Ahh. I ran into similar issues with DLV at some point. Good thing I didn't
build the 1.3.0 rpms yet!

Paul