unbound replaces CNAME query with A query?

Hi,

we are tracking/debugging [1][2] an issue that results in the failure of
certificate renewal (ACME DNS challenge).

If you send unbound 1.17.1 the query shown below while its cache is empty, you get an NXDOMAIN reply. If you ask again, you get the expected answer (NOERROR). PowerDNS Recursor does not have this issue.

Investigating the DNS traffic has also shown that
the stub -> unbound CNAME query results in an unbound -> authoritative A qtype query instead of a CNAME query.

Can you reproduce this issue and confirm this is unexpected?

thanks!
Christoph

dig _acme-challenge.bender-doh.applied-privacy.net CNAME

; <<>> DiG 9.18.13 <<>> _acme-challenge.bender-doh.applied-privacy.net CNAME
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20502
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;_acme-challenge.bender-doh.applied-privacy.net. IN CNAME

;; ANSWER SECTION:
_acme-challenge.bender-doh.applied-privacy.net. 86400 IN CNAME bender-doh.acme-dns-challenge.applied-privacy.net.

;; AUTHORITY SECTION:
acme-dns-challenge.applied-privacy.net. 300 IN SOA get.desec.io. get.desec.io. 2023035286 86400 3600 2419200 3600

;; Query time: 114 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; MSG SIZE rcvd: 167

Correct me if I misunderstand. Whether you query for a CNAME or an A record should make no difference to the NXDOMAIN status; either way the answer is not there. How does it change the ACME process when the response is NXDOMAIN rather than a no-answer NOERROR?

_acme-challenge.bender-doh.applied-privacy.net exists with a CNAME. Its CNAME target returns NXDOMAIN. So yes, it is a bit confusing what the final result should be. What exactly is the stub in this case? The libresolv library? getaddrinfo() cannot query for a CNAME itself, though it can obtain one via an A query.

What is the point of querying just CNAME? Does it have a specific reason?

Unbound seems to proactively fetch the actually useful record instead of just the intermediate CNAME. I am not sure that has to be strictly wrong. The result it delivers is similar: it says there is a CNAME and its target does not exist. It just seems the stub does not check the actual contents of the message, only the rcode. Can a stub resolver do anything useful with the information that there is a CNAME not leading to a final destination?

Note: it would be much easier if you could share a pcap containing the problem instead of only a text description.

Hi Petr,

thanks for your reply and your questions.

Petr Menšík via Unbound-users:

> Correct me if I misunderstand. Whether you query for a CNAME or an A
> record should make no difference to the NXDOMAIN status; either way
> the answer is not there. How does it change the ACME process when the
> response is NXDOMAIN rather than a no-answer NOERROR?

That CNAME DNS query is used by lego - an ACME client - to find
the DNS record it has to update (the ACME DNS TXT challenge).
Lego's CNAME support used to be experimental and is now enabled by default.

The NXDOMAIN answer results in lego concluding "there is no CNAME".
The impact of that unexpected NXDOMAIN answer is that lego will attempt
to use the provided DNS API key to create a TXT record it has no
permissions for. It only has permissions for the target of the existing
CNAME.
For this reason the NOERROR status and its answer are important, even if
the final record in that CNAME chain does not exist. It is lego's job to
create it.

> _acme-challenge.bender-doh.applied-privacy.net exists with a CNAME.
> Its CNAME target returns NXDOMAIN. So yes, it is a bit confusing what
> the final result should be. What exactly is the stub in this case?
> The libresolv library?

It is running lego on a FreeBSD server.

I hope the text also helps with answering your other questions below, if
it is not clear please let me know and I will try to rephrase.

> What is the point of querying just CNAME? Does it have a specific
> reason?
>
> Unbound seems to proactively fetch the actually useful record instead
> of just the intermediate CNAME. I am not sure that has to be strictly
> wrong. The result it delivers is similar: it says there is a CNAME and
> its target does not exist.

If unbound is just trying to be useful then it should still be consistent and provide the same answer if you ask it twice - which is not the case currently.

> It just seems the stub does not check the actual contents of the
> message, only the rcode. Can a stub resolver do anything useful with
> the information that there is a CNAME not leading to a final
> destination?
>
> Note: it would be much easier if you could share a pcap containing
> the problem instead of only a text description.

I actually was hoping to achieve the opposite, because looking at the
text does not require people to have a pcap parser and open a file from a mailing list, but you got the gist of it anyway.

thanks,
Christoph

There really seems to be an issue in unbound when querying for a CNAME.

I created a test record pointing at a non-existent name in another domain.

kdig cnametest.bleve.fi. CNAME

;; ->>HEADER<<- opcode: QUERY; status: NXDOMAIN; id: 46683
;; Flags: qr rd ra ad; QUERY: 1; ANSWER: 0; AUTHORITY: 1; ADDITIONAL: 0

;; QUESTION SECTION:
;; cnametest.bleve.fi. IN CNAME

;; AUTHORITY SECTION:
bleve.fi. 3462 IN SOA foo-ns.foobar.fi. hostmaster.foobar.fi. 1679142493 28800 7200 864000 28800

;; Received 97 B
;; Time 2023-03-31 11:13:51 EEST
;; From 2001:998:2e::1@53(UDP) in 0.8 ms

If I query the authoritative server directly, I get the correct answer.

It looks like unbound erroneously tries to follow the CNAME to the
non-existent record even when the CNAME itself is queried. A CNAME should
only be followed if a type other than CNAME is queried.
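
The rule stated above matches RFC 1034 section 3.6.2: a query is restarted at the CNAME target unless the query type itself is CNAME. A minimal sketch of that decision follows. The `Record` type and the `next_step` function are hypothetical names for illustration only and are not taken from unbound's source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified resource record: owner name, type, rdata.
    name: str
    rtype: str
    data: str


def next_step(qname: str, qtype: str, rrset: List[Record]) -> Tuple[str, Optional[object]]:
    """Decide what to do with the records found at qname.

    Returns ("answer", records) when the requested type is present,
    ("restart", target) when only a CNAME is present and another type
    was asked for, and ("nodata", None) otherwise.
    """
    matching = [r for r in rrset if r.name == qname and r.rtype == qtype]
    if matching:
        # A CNAME query that finds a CNAME ends here; the target is never looked up.
        return ("answer", matching)
    cnames = [r for r in rrset if r.name == qname and r.rtype == "CNAME"]
    if cnames:
        # Any other query type is restarted at the CNAME target.
        return ("restart", cnames[0].data)
    return ("nodata", None)


zone = [Record("cnametest.bleve.fi.", "CNAME", "nxdomain.foobar.fi.")]
print(next_step("cnametest.bleve.fi.", "CNAME", zone)[0])
print(next_step("cnametest.bleve.fi.", "A", zone))
```

With the test record from this thread, the CNAME query is answered directly, so the existence of nxdomain.foobar.fi. never influences the result; only the A query restarts at the target.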

I am using dnssec-trigger-0.17-7.fc36.x86_64 and unbound-1.17.1-1.fc36.x86_64 on Fedora 36, but I cannot reproduce the behaviour, even if I flush the cache with "unbound-control flush_zone .". It consistently returns the CNAME with NOERROR. Does it happen only when unbound has no forwarders and is iterating itself?

$ kdig cnametest.bleve.fi. CNAME
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 33690
;; Flags: qr rd ra ad; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0

;; QUESTION SECTION:
;; cnametest.bleve.fi. IN CNAME

;; ANSWER SECTION:
cnametest.bleve.fi. 7200 IN CNAME nxdomain.foobar.fi.

;; Received 66 B
;; Time 2023-03-31 12:58:20 CEST
;; From 127.0.0.1@53(UDP) in 0.5 ms

Does it happen only after unbound is freshly started? Are there steps to reproduce on a running instance?

I have tried to reproduce it on my own unbound-1.17.1-1.fc36.x86_64, but it does not behave as you have described after flushing the cache, at least not for me. I guess something else might be required, but I am not sure what. Is there something in the unbound logs that would hint at why it sent an A query instead? Can you try increasing verbosity with unbound-control verbosity <newvalue> and query the name afterwards?

$ unbound-control flush_zone . && dig _acme-challenge.bender-doh.applied-privacy.net CNAME
ok removed 310 rrsets, 218 messages and 16 key entries

; <<>> DiG 9.18.13 <<>> _acme-challenge.bender-doh.applied-privacy.net CNAME
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23092
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;_acme-challenge.bender-doh.applied-privacy.net. IN CNAME

;; ANSWER SECTION:
_acme-challenge.bender-doh.applied-privacy.net. 86400 IN CNAME bender-doh.acme-dns-challenge.applied-privacy.net.

;; Query time: 177 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Fri Mar 31 13:06:33 CEST 2023
;; MSG SIZE rcvd: 119

> That CNAME DNS query is used by lego - an ACME client - to find
> the DNS record it has to update (the ACME DNS TXT challenge).
> Lego's CNAME support used to be experimental and is now enabled by default.
>
> The NXDOMAIN answer results in lego concluding "there is no CNAME".
> The impact of that unexpected NXDOMAIN answer is that lego will attempt
> to use the provided DNS API key to create a TXT record it has no
> permissions for. It only has permissions for the target of the existing
> CNAME.
> For this reason the NOERROR status and its answer are important, even if
> the final record in that CNAME chain does not exist. It is lego's job to
> create it.

Okay, I would have expected a TXT query before the update, but okay. The problem is that I see different behaviour on the same version as you have. So the primary reason you are querying this name is to prepare an UPDATE query, but you want to update only the CNAME target if there is one, not the original name itself.

Anyway, the answer contains the CNAME in the ANSWER section, even if the status is NXDOMAIN. So even with this unusual reply the software should be able to work out where the queried name leads. The only wrong part of the answer is the NXDOMAIN status. I admit NXDOMAIN usually arrives with an empty answer section. I am not sure an answer with a CNAME present is against the RFCs.
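
To illustrate the point about reading the ANSWER section instead of relying on the rcode alone, here is a rough sketch of how a client could pick the name to update. This is not lego's actual implementation; `challenge_target` and its tuple-based input are assumptions made for this example.

```python
from typing import List, Tuple


def challenge_target(qname: str, rcode: str, answers: List[Tuple[str, str, str]]) -> str:
    """Return the name a client should update for the DNS challenge.

    answers holds (owner, type, rdata) tuples from the ANSWER section.
    NXDOMAIN is accepted on purpose: a CNAME in the answer still tells
    us where the queried name leads, even if the target does not exist yet.
    """
    if rcode not in ("NOERROR", "NXDOMAIN"):
        raise ValueError(f"cannot use a response with rcode {rcode}")
    cname_of = {
        owner.lower(): rdata.lower()
        for owner, rtype, rdata in answers
        if rtype.upper() == "CNAME"
    }
    name = qname.lower()
    seen = set()
    # Follow the CNAME chain found in the answer, guarding against loops.
    while name in cname_of and name not in seen:
        seen.add(name)
        name = cname_of[name]
    return name


answer = [(
    "_acme-challenge.bender-doh.applied-privacy.net.",
    "CNAME",
    "bender-doh.acme-dns-challenge.applied-privacy.net.",
)]
print(challenge_target("_acme-challenge.bender-doh.applied-privacy.net.", "NXDOMAIN", answer))
```

Fed with the NXDOMAIN reply from the first message of this thread, the sketch still arrives at the CNAME target, which is the name the API key is allowed to modify.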

> > _acme-challenge.bender-doh.applied-privacy.net exists with cname. Its
> > cname target returns NXDOMAIN. So yes, it is a bit confusing what is
> > the final result. What exactly is the stub in this case? libresolv
> > library?
>
> It is running lego on a FreeBSD server.
>
> I hope the text also helps with answering your other questions below, if
> it is not clear please let me know and I will try to rephrase.
>
> If unbound is just trying to be useful then it should still be consistent and provide the same answer if you ask it twice - which is not the case currently.

I am running it on Fedora 36. I doubt that should give different results.

Yes, I agree it should return the same answer. It could differ only in minor details such as the OPT parameters used, but not in status. Whether you ask the first or the third time, the result should differ only in TTL or something insignificant.
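
The consistency expectation above can be checked mechanically on saved dig or kdig output. The following is only a sketch: `summarize` and `consistent` are hypothetical helper names, and the comparison is deliberately limited to the status and the ANSWER count, so TTLs, message IDs and timing are ignored.

```python
import re

_STATUS = re.compile(r"status:\s*(\w+)")
_ANSWER = re.compile(r"ANSWER:\s*(\d+)")


def summarize(output: str):
    """Extract (status, answer_count) from dig or kdig text output."""
    status = _STATUS.search(output)
    answer = _ANSWER.search(output)
    if status is None or answer is None:
        raise ValueError("no dig/kdig header found in output")
    return (status.group(1), int(answer.group(1)))


def consistent(first: str, second: str) -> bool:
    """True when two responses agree on status and number of answers."""
    return summarize(first) == summarize(second)


# Header lines copied from the two dig runs quoted in this thread.
cold = (
    ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20502\n"
    ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1\n"
)
warm = (
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23092\n"
    ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1\n"
)
print(summarize(cold), summarize(warm), consistent(cold, warm))
```

The same pattern matches the kdig header format, which separates fields with semicolons instead of commas.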

> > Note: it would be much easier if you could share just pcap containing
> > the problem instead of only text description.
>
> I actually was hoping to achieve the opposite, because looking at the
> text does not require people to have a pcap parser and open a file from a mailing list but you got the gist of it anyway.
>
> thanks,
> Christoph

Okay, it might be just my preference. The thing is that those packet descriptions are not compact, which makes the message quite long. dig-like output would be better, but that is more difficult to get from a pcap file.

Try the query I just listed; it should work with bind dig too.
Query the bleve.fi authoritative DNS servers to get the correct answer.

The CNAME query only fails if the CNAME target gives NXDOMAIN.

For example, the following query works correctly because the destination
of the CNAME exists.

kdig _443._tcp.bleve.fi. cname

This is obviously a bug, a very special case which a resolver needs to
handle differently from normal CNAME resolution. The Cloudflare, Quad9,
and Google resolvers also seem to have the same problem. It seems to be
a special case not handled by most DNS resolvers.

dnsmasq and bind seem to be able to handle that query correctly.

> Try the query I just listed; it should work with bind dig too.
> Query the bleve.fi authoritative DNS servers to get the correct answer.
>
> The CNAME query only fails if the CNAME target gives NXDOMAIN.

I have tried on my unbound and it never returns NXDOMAIN to me. The result is the same with kdig or dig, that makes no difference. I get NOERROR, not NXDOMAIN.

$ kdig cnametest.bleve.fi. CNAME | head -2
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 35718
;; Flags: qr rd ra ad; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0

> dnsmasq and bind seem to be able to handle that query correctly.

dnsmasq does not handle CNAMEs at all. It requires an upstream recursive server to do the job and just passes the result to the client. bind, however, can do a proper iteration job from root hints.

If it is a bug, I would suggest creating an issue at https://github.com/NLnetLabs/unbound/

But maybe more precise steps for when it returns NXDOMAIN should be described. Just flushing the cache and running your query does not seem to be enough for me.

> cname query only fails if cname target gives NXDOMAIN.

> I have tried on my unbound and it never returns NXDOMAIN to me. The
> result is the same with kdig or dig, that makes no difference. I get
> NOERROR, not NXDOMAIN.

All unbound instances here run without forwarders set up; is that the difference?

I have tried it inside a Rawhide container.

# unbound-control forward
off (using root hints)

# dig @localhost cnametest.bleve.fi. CNAME

; <<>> DiG 9.18.13 <<>> @localhost cnametest.bleve.fi. CNAME
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55072
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;cnametest.bleve.fi. IN CNAME

;; ANSWER SECTION:
cnametest.bleve.fi. 7118 IN CNAME nxdomain.foobar.fi.

;; Query time: 0 msec
;; SERVER: ::1#53(localhost) (UDP)
;; WHEN: Fri Mar 31 16:20:26 CEST 2023
;; MSG SIZE rcvd: 77

Just after a fresh restart, it is NOERROR, as it is later. Indeed, the query unbound sends for cnametest.bleve.fi is an A? query, but the response delivered to dig is the correct one. Tested with unbound-1.17.1-2.fc38.x86_64.

Frame 641: 89 bytes on wire (712 bits), 89 bytes captured (712 bits) on interface virbr0, id 0
Ethernet II, Src: 7e:85:92:43:88:71 (7e:85:92:43:88:71), Dst: RealtekU_02:bd:85 (52:54:00:02:bd:85)
Internet Protocol Version 4, Src: 192.168.122.184, Dst: 87.239.120.11
User Datagram Protocol, Src Port: 46986, Dst Port: 53
Domain Name System (query)
Transaction ID: 0x4302
Flags: 0x0010 Standard query
Questions: 1
Answer RRs: 0
Authority RRs: 0
Additional RRs: 1
Queries
cnametest.bleve.fi: type A, class IN
Additional records
[Response In: 719]

The server responds to it with the nameservers of bleve.fi. To those servers, however, unbound already sends a CNAME query, not an A? query. I am attaching my pcap.

When I ran dig @localhost ns bleve.fi. before the cnametest query, it returned SERVFAIL the first time; only after that did it respond with NOERROR. So no, I do not know how to get an NXDOMAIN response from unbound. I get similar results for the original query.
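
For readers who prefer to check the query type in raw packets rather than in a dissector, the sketch below encodes and decodes the question section described in RFC 1035 section 4.1.2, where type A is 1 and type CNAME is 5. `build_query` and `read_question` are hypothetical helpers written for this example, and the transaction ID default is simply the one shown in the frame above.

```python
import struct

QTYPES = {1: "A", 5: "CNAME"}


def build_query(name: str, qtype: int, txid: int = 0x4302) -> bytes:
    """Build a minimal DNS query with one question and no EDNS record."""
    # Flags are 0: RD is cleared, as in resolver-to-authoritative queries.
    header = struct.pack("!HHHHHH", txid, 0, 1, 0, 0, 0)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def read_question(packet: bytes):
    """Return (qname, qtype) of the first question in a DNS message."""
    (qdcount,) = struct.unpack("!H", packet[4:6])
    if qdcount < 1:
        raise ValueError("message carries no question")
    pos, labels = 12, []  # the fixed header is 12 bytes long
    while True:
        length = packet[pos]
        pos += 1
        if length == 0:
            break
        if length & 0xC0:
            raise ValueError("unexpected compression pointer in question")
        labels.append(packet[pos:pos + length].decode("ascii"))
        pos += length
    qtype, _qclass = struct.unpack("!HH", packet[pos:pos + 4])
    return ".".join(labels) + ".", QTYPES.get(qtype, str(qtype))


print(read_question(build_query("cnametest.bleve.fi.", 1)))
print(read_question(build_query("cnametest.bleve.fi.", 5)))
```

Applied to the UDP payloads from a capture, `read_question` would make it easy to list which upstream queries were sent as A and which as CNAME.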

(attachments)

cnametest-bleve.fi-filtered.pcapng (6.61 KB)

My understanding is this:

If a dig command is directed to a resolver with type=CNAME specified and the resolver responds with anything other than the requested CNAME information, this may indeed be a bug. I’m not sure what the results are if the CNAME target exists in the cache. Another way to see similar results would be to submit a type=ANY query with the +norecurse switch.

I’d be interested to see the results with the +norecurse switch on.

Bob

Hmm. Now I have more info. This is some kind of issue with the unbound
cache. If I run unbound-control reload, unbound fails the query from time
to time. And even when it has failed, the next query will succeed.

Thanks for confirming what I'm observing.

I've filed a github issue now:

best regards,
Christoph