message is bogus, non secure rrset with Unbound as local caching resolver

Hi,

sorry for the rather longwinded email. In the interest of saving some
time, here is a short summary:

We get the error "message is bogus, non secure rrset" from Unbound in
some cases when resolving a wildcard CNAME record. The cause appears to
be an upstream BIND resolver that in some cases returns an authority
section containing NS-records but no RRSIG-record for those records.

A longer version of the questions is at the end, but in short:

* Is this a bug in Unbound (it should handle those types of responses)
  or in BIND (it should not generate those kinds of responses)?
* Could (and should?) Unbound be extended to deal with this type of
  response (no matter whether they are legal or not)?

Now the longwinded version:

We have an Unbound server running as a local caching resolver on a
server. This instance is configured to forward requests to two
resolvers, one running Unbound and one running BIND.

Both Unbound servers are running 1.4.22 from Debian Jessie. The BIND
server is running version 9.9.8P2.

This error occurs frequently when doing lookups for the domain
"pingapi.paas.uninett.no". This is handled by a wildcard CNAME
pointing at "paas-lb.uninett.no". Since this is a wildcard CNAME, it
must be authenticated with an NSEC3-record in the authority section.

As far as we can tell, the problem occurs because the BIND server is
occasionally returning responses with NS-records in the
authority-section that do not include the RRSIG records.

There are actually two errors from Unbound due to this:

The first is when doing a request for the "pingapi.paas.uninett.no".
In that case, the response from the resolver running BIND looks
something like this:

  ;; ANSWER SECTION:
  pingapi.paas.uninett.no. 85 IN CNAME paas-lb.uninett.no.
  pingapi.paas.uninett.no. 85 IN RRSIG CNAME [...]
  paas-lb.uninett.no. 158 IN A 158.38.213.52
  paas-lb.uninett.no. 158 IN RRSIG A [...]
  
  ;; AUTHORITY SECTION:
  st5rjlutdrm3le5lla3r11bu3qu2qk06.uninett.no. 85 IN NSEC3 [...]
  st5rjlutdrm3le5lla3r11bu3qu2qk06.uninett.no. 85 IN RRSIG NSEC3 [...]
  uninett.no. 3408 IN NS server.nordu.net.
  uninett.no. 3408 IN NS benoni.uit.no.
  uninett.no. 3408 IN NS nac.no.
  uninett.no. 3408 IN NS ns.uninett.no.
  uninett.no. 3408 IN NS nn.uninett.no.

Note that the signature for the NSEC3-record is present, but no
signature for the NS-records. At that point, Unbound rejects the
response, and tries a different server:

  info: validator operate: query pingapi.paas.uninett.no. A IN
  debug: CNAME response was wildcard expansion and did not prove original data did not exist
  info: validate(cname): sec_status_bogus
  debug: iterator[module 1] operate: extstate:module_finished event:module_event_pass
  info: resolving pingapi.paas.uninett.no. A IN
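
The missing signature described above can be spotted mechanically. The
following is a minimal Python sketch, not anything that Unbound or BIND
ships. It takes one section of dig-style output and lists the RRsets
that have no covering RRSIG. The function name `unsigned_rrsets` is made
up for illustration, and the parsing assumes one record per line with
owner, TTL, class and type as the first four fields:

```python
def unsigned_rrsets(section_text):
    """Return sorted (owner, type) pairs that have no covering RRSIG."""
    rrsets = set()   # every (owner, type) seen in the section
    signed = set()   # every (owner, covered type) named by an RRSIG
    for line in section_text.splitlines():
        fields = line.split()
        # Skip blank lines, dig comments (";; ...") and truncated lines.
        if len(fields) < 5 or fields[0].startswith(";"):
            continue
        owner, _ttl, _cls, rtype = fields[:4]
        if rtype == "RRSIG":
            # The first rdata field of an RRSIG is the type it covers.
            signed.add((owner.lower(), fields[4]))
        else:
            rrsets.add((owner.lower(), rtype))
    return sorted(rrsets - signed)
```

Fed the authority section shown above, it reports only the
uninett.no. NS RRset, since the NSEC3 record has its RRSIG.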

Once the query hits the resolver running Unbound, it succeeds, and the
local resolver moves on to resolving the "paas-lb.uninett.no" domain.
At that point, it may query the resolver running BIND, and will get a
response looking like:

  ;; ANSWER SECTION:
  paas-lb.uninett.no. 299 IN A 158.38.213.52
  paas-lb.uninett.no. 299 IN RRSIG A [...]
  
  ;; AUTHORITY SECTION:
  uninett.no. 1118 IN NS ns.uninett.no.
  uninett.no. 1118 IN NS server.nordu.net.
  uninett.no. 1118 IN NS nac.no.
  uninett.no. 1118 IN NS nn.uninett.no.
  uninett.no. 1118 IN NS benoni.uit.no.

Now I am a bit unclear about what happens, but as far as I can tell,
Unbound sort-of accepts this response, since the authority-section
isn't necessary to validate the answer. However, it then fails with
the error from the subject line:

  [...]
  info: reply from <.> 158.38.212.107#53
  info: query response was CNAME
  info: resolving pingapi.paas.uninett.no. A IN
  info: processQueryTargets: pingapi.paas.uninett.no. A IN
  info: sending query: paas-lb.uninett.no. A IN
  debug: sending to target: <.> 2001:700:0:ff00::1#53
  debug: cache memory msg=89289 rrset=121419 infra=3788 val=77277
  debug: iterator[module 1] operate: extstate:module_wait_reply event:module_event_reply
  info: iterator operate: query pingapi.paas.uninett.no. A IN
  info: iterator operate: chased to paas-lb.uninett.no. A IN
  info: response for pingapi.paas.uninett.no. A IN
  info: reply from <.> 2001:700:0:ff00::1#53
  info: query response was ANSWER
  info: finishing processing for pingapi.paas.uninett.no. A IN
  debug: validator[module 0] operate: extstate:module_restart_next event:module_event_moddone
  info: validator operate: query pingapi.paas.uninett.no. A IN
  info: validator operate: chased to . TYPE0 CLASS0
  info: validate(cname): sec_status_secure
  info: validate(positive): sec_status_secure
  info: message is bogus, non secure rrset uninett.no. NS IN

My guess is that it somehow tries to combine the response it got from
the server running BIND, which contains an authority-section with NS
records but no RRSIG, and the response it got from Unbound, containing
the NSEC3-record and its RRSIG record.

At that point, Unbound has a response containing three items in its
authority-section:

* The NSEC3-record for the wildcard CNAME
* The RRSIG-record for the NSEC3-record
* The NS-records, without any RRSIG-record

I see that Unbound has some code for dealing with the case where the
authority-section contains NS-records without RRSIG records, but that
code does not handle this case.

The questions I have are:

* Is this a bug in BIND or in Unbound, or something else? I am not
  clear on what recursive resolvers are supposed to do with RRSIG-records
  in the authority section. The DNSSEC specification is very clear on
  what authoritative servers should do (i.e. always include them; if not
  possible to include them also drop the record they sign). I have
  however not been able to determine what the behavior should be when
  the server is a recursive resolver.

* Could the code in Unbound be extended to deal with this case as well?
  As mentioned, I see that code was added to deal with the case where
  there are only unsigned NS-records in the authority-section. Could
  that code be made generic, so that it will always strip unsigned
  NS-records, even when there are other records present?
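
To make the second question concrete, here is a small Python sketch of
the two behaviours being compared. It is not Unbound's code: the `RRset`
type and both function names are hypothetical, and the "narrow" variant
is only my reading of the existing compatibility code as described
above (it acts when the authority section holds nothing but unsigned
NS-records).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RRset:
    owner: str
    rtype: str
    signed: bool  # True if a covering RRSIG came with the RRset


def strip_only_if_all_ns(authority):
    """Narrow variant: empty the section only if it is all unsigned NS."""
    if authority and all(rr.rtype == "NS" and not rr.signed for rr in authority):
        return []
    return list(authority)


def strip_unsigned_ns(authority):
    """Generic variant: drop unsigned NS RRsets, keep everything else."""
    return [rr for rr in authority if not (rr.rtype == "NS" and not rr.signed)]
```

With a signed NSEC3 RRset plus an unsigned NS RRset in the section, the
narrow variant leaves the section untouched, while the generic one keeps
only the NSEC3 RRset.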

Best regards,
Olav Morken
UNINETT

> sorry for the rather longwinded email. In the interest of saving some
> time, here is a short summary:
>
>
Hi Olav,

Would you mind trying the DNSViz command-line tool [1] against the resolvers to
see if anything shows up? After install, run:

dnsviz probe -s x.x.x.x pingapi.paas.uninett.no | dnsviz grok -plwarning
dnsviz probe -s x.x.x.x pingapi.paas.uninett.no | dnsviz graph -Thtml -O

(substitute x.x.x.x for the BIND and unbound resolvers, in turn)

I'm curious if anything shows up there.

Unfortunately, the BIND server only tends to return responses where the
authority-section has NS-records but no RRSIG-record during the night.
I suspect it has something to do with traffic levels and what other
systems are accessing it. It makes it all a bit hard to troubleshoot.
The main source of information for troubleshooting has been a
combination of PCAP-files and log files.

I have grabbed a capture from the Unbound resolver that I have attached
to this email. If I ever happen to catch the BIND resolver having this
behavior, I'll try to catch the output from it as well, but I won't
make any promises.

The output of `dnsviz grok -plwarning` only contains:

Analyzing pingapi.paas.uninett.no
Analyzing paas.uninett.no
Analyzing uninett.no
Analyzing no
Analyzing .
Analyzing paas-lb.uninett.no

The HTML output from the DNSViz on the Unbound server is available here:

  https://uninett.box.com/s/3uz8fz7055oe788yrf0en3dmx651eyg1

(Changed from an attachment due to size restrictions on the list.)

Best regards,
Olav Morken
UNINETT / Feide

Are you sure this is not the bind wildcard bug? Can you try to resolve
something like pwouters.fedorahosted.org. That's an expanded wildcard.

If so, this is the same bug as:

https://bugzilla.redhat.com/show_bug.cgi?id=824219

We have a test for this, but I don't think dnssec-trigger has included
this test yet.

Paul

I wasn't aware of that bug report, but after having looked at it now, I
don't think it is the same issue. I have no problems resolving
pwouters.fedorahosted.org, even if I limit Unbound to only using the
BIND upstream server.

Looking at the bug report, the Bind version here is more recent
(9.9.8P2). I have also never seen any problems with the
NSEC3-record or its RRSIG-record.

The error message mentioned in comment #4 is also different:

  info: validator: response has failed AUTHORITY rrset: fedorapeople.org. NS IN
  info: validate(positive): sec_status_bogus

The error I get is:

  info: validate(cname): sec_status_secure
  info: validate(positive): sec_status_secure
  info: message is bogus, non secure rrset uninett.no. NS IN

As far as I can tell, the problem here is caused by extra NS-records in
the authority-section that do not include the RRSIG element for the
NS-records, but I can't really say that for certain.

Best regards,
Olav Morken
UNINETT

This sounds a lot like a problem we discussed last year. See
https://unbound.net/pipermail/unbound-users/2015-February/003757.html

As I said back then, I think it's wrong to discard the entire response if
parts of it are bogus. Unbound should keep the valid parts because it
knows there is nothing wrong with them.

Does Unbound use CD=1 when forwarding? If so, it should expect to receive
partially bogus answers and should handle them gracefully.

Tony.

> Unfortunately, the BIND server only tends to return responses where
> the authority-section has NS-records but no RRSIG-record
> during the night. I suspect it has something to do with
> traffic levels and what other systems are accessing it. It
> makes it all a bit hard to troubleshoot. The main source of
> information for troubleshooting has been a combination of
> PCAP-files and log files.
>
> Are you sure this is not the bind wildcard bug? Can you try to resolve
> something like pwouters.fedorahosted.org. That's an expanded wildcard.

A couple of responses to an 'a' query for this name follow,
attached below. In both cases you'll see the Authority section
contains the NS RRSET but not the RRSIG covering the NS RRSET,
something we're not quite sure is "right" (but have not yet found
the scripture on), and which Olav suspects is triggering Unbound
to be unhappy about the response.

> If so, this is the same bug as:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=824219

You mean the ISC RT#21409 which is mentioned in there, or
something else? The recursor Olav's machine is forwarding to
(oliven.uninett.no) is running BIND 9.9.8-P2, and according to
its CHANGES file, that bug was squashed in the run-up to 9.9.3b2:

  3444. [bug] The NOQNAME proof was not being returned from cached
        insecure responses. [RT #21409]

Or is "the bind wildcard bug" something else? If so please
provide more information.

Best regards,

- Håvard

Hi Havard,

> Unfortunately, the BIND server only tends to return responses
> where the authority-section has NS-records but no RRSIG-record
> during the night. I suspect it has something to do with
> traffic levels and what other systems are accessing it. It
> makes it all a bit hard to troubleshoot. The main source of
> information for troubleshooting has been a combination of
> PCAP-files and log files.
>
> Are you sure this is not the bind wildcard bug? Can you try to
> resolve something like pwouters.fedorahosted.org. That's an
> expanded wildcard.
>
> A couple of responses to an 'a' query for this name follows
> attached below. In both cases you'll see the Authority section
> contains the NS RRSET but not the RRSIG covering the NS RRSET,
> something we're not quite sure is "right" (but have not yet found
> the scripture on), and which Olav suspects is triggering Unbound to
> be unhappy about the response.

The "right" thing is to have RRSIGs for all elements of the answer and
authority sections. This is mandated by RFC4034,4035. All the RRsets
in the answer and authority section MUST validate to mark the response
as valid.

That contradicts explicitly your idea to keep valid parts surrounded
by invalid parts.

I think it is a bug in BIND that it transmits the NS set without its
RRSIGs in the authority section (in a reply that is not a referral).

However, I think it is not unreasonable to extend the compatibility
code in Unbound for this. The error that Olav quotes is simply
Unbound enforcing that 'all RRsets MUST validate' rule, telling you
which one failed. The NS set is gratuitous though, in the answer,
hence perhaps compatibility is an option. Not so, for, say, NSEC or
SOA RRs.
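
As an illustration of the rule, and of what a compatibility exception
might look like, here is a hedged Python sketch. It is not the Unbound
validator: `message_status` and its `tolerated_types` parameter are
hypothetical names, and the only thing taken from the thread is the
wording of the log line. Every RRset must be secure, and the first one
that is not decides the verdict, unless its type was explicitly marked
as tolerable.

```python
SECURE = "secure"
BOGUS = "bogus"


def message_status(rrsets, tolerated_types=frozenset()):
    """rrsets: iterable of (owner, type, status) for answer + authority.

    Returns (verdict, reason). Any non-secure RRset makes the whole
    message bogus, unless its type is listed in tolerated_types.
    """
    for owner, rtype, status in rrsets:
        if status != SECURE and rtype not in tolerated_types:
            reason = f"message is bogus, non secure rrset {owner} {rtype} IN"
            return BOGUS, reason
    return SECURE, None
```

Passing `frozenset({"NS"})` models tolerating a gratuitous NS set, while
an unsigned NSEC3 or SOA RRset still makes the message bogus.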

Best regards, Wouter

> The "right" thing is to have RRSIGs for all elements of the
> answer and authority sections. This is mandated by
> RFC4034,4035. All the RRsets in the answer and authority
> section MUST validate to mark the response as valid.

FYI, I've submitted a tentative bug report to the BIND maintainers
based on my message and the one I'm replying to here, RT#41844.

Regards,

- Håvard

> The "right" thing is to have RRSIGs for all elements of the
> answer and authority sections. This is mandated by
> RFC4034,4035. All the RRsets in the answer and authority
> section MUST validate to mark the response as valid.
>
> FYI, I've submitted a tentative bug report to the BIND maintainers
> based on my message and the one I'm replying to here, RT#41844.

And... They're not having it:

  This is not a bug. Section 3.1.1 applies to authoritative nameservers
  not intermediate caching nameservers. In this case you are seeing the
  referral which is unsigned being returned from the cache.

Regards,

- Håvard

>
> info: validate(cname): sec_status_secure
> info: validate(positive): sec_status_secure
> info: message is bogus, non secure rrset uninett.no. NS IN
>
> As far as I can tell, the problem here is caused by extra NS-records in
> the authority-section that do not include the RRSIG element for the
> NS-records, but I can't really say that for certain.

> This sounds a lot like a problem we discussed last year. See
> https://unbound.net/pipermail/unbound-users/2015-February/003757.html

It looks similar, in that it is caused by extra records, but as far as I
know there shouldn't be any DLV involved here. The uninett.no-zone is
properly delegated from the parent zone.

I also tested with the most recent version from subversion trunk, which
includes the fix mentioned in that thread, but got the same result.

> Does Unbound use CD=1 when forwarding? If so, it should expect to receive
> partially bogus answers and should handle them gracefully.

I checked, and it does set the CD-flag. The full dig command line to
simulate the queries that Unbound sends appears to be:

  dig -4 +qr +noadflag +recurse +cdflag +bufsize=4096 +dnssec pingapi.paas.uninett.no @dns-resolver1.uninett.no

I.e. the packets have the RD, CD and DO flags set.
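
For anyone who wants to see those three flags on the wire, here is a
minimal Python sketch that builds an equivalent query packet with only
the standard library. It is an illustration, not a dump of what Unbound
sends: the function name `build_query` and the fixed query ID are
arbitrary choices of mine. RD and CD live in the DNS header flags, while
DO is carried in the TTL field of the EDNS0 OPT pseudo-record, whose
class field holds the UDP buffer size.

```python
import struct

RD = 0x0100  # recursion desired (header flag)
CD = 0x0010  # checking disabled (header flag)
DO = 0x8000  # DNSSEC OK (flag bits of the EDNS0 OPT TTL field)


def build_query(name, qtype=1, bufsize=4096, query_id=0x1234):
    """Build a DNS query with RD=1, CD=1, AD=0 and an OPT record with DO=1."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT RR).
    header = struct.pack("!HHHHHH", query_id, RD | CD, 1, 0, 0, 1)
    # Question name as length-prefixed labels, ending with the root label.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    # OPT RR: root owner, TYPE=41, CLASS=buffer size, TTL=flags, RDLEN=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, DO, 0)
    return header + question + opt
```

The resulting bytes can be sent over UDP to a resolver on port 53; the
default `qtype=1` asks for an A record.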

I grabbed the output from dig yesterday evening. If anyone is curious, I
uploaded it here:

  https://gist.github.com/olavmrk/c62f099736dbc5ef514a

Best regards,
Olav Morken
UNINETT

>
> > sorry for the rather longwinded email. In the interest of saving some
> > time, here is a short summary:
> >
> >
> Hi Olav,
>
> Would mind trying the DNSViz command-line tool [1] against the resolvers to
> see if anything shows up? After install, run:
>
> dnsviz probe -s x.x.x.x pingapi.paas.uninett.no | dnsviz grok -plwarning
> dnsviz probe -s x.x.x.x pingapi.paas.uninett.no | dnsviz graph -Thtml -O
>
> (substitute x.x.x.x for the BIND and unbound resolvers, in turn)
>
> I'm curious if anything shows up there.

[...]

> I have grabbed a capture from the Unbound resolver that I have attached
> to this email. If I ever happen to catch the BIND resolver having this
> behavior, I'll try to catch the output from it as well, but I won't
> make any promises.

I managed to check yesterday evening, and the output between the two
upstream resolvers is identical.

Best regards,
Olav Morken
UNINETT

I forgot to mention this, but I also did a quick test where I patched[1]
Unbound to not set the CD-flag in its queries, and at that point DNS
resolution worked. Checking packet captures shows that BIND does not
include the NS-records in that case.

[1] https://gist.github.com/olavmrk/f9e9c68ec2932e026b4e

Best regards,
Olav Morken
UNINETT

> A couple of responses to an 'a' query for this name follows
> attached below. In both cases you'll see the Authority
> section contains the NS RRSET but not the RRSIG covering the
> NS RRSET, something we're not quite sure is "right" (but have
> not yet found the scripture on), and which Olav suspects is
> triggering Unbound to be unhappy about the response.
>
> The "right" thing is to have RRSIGs for all elements of the
> answer and authority sections. This is mandated by RFC4034,
> 4035.

Following the "not a bug" response from the BIND maintainers
yesterday evening, can you please point to chapter and verse
mandating this behaviour for non-authoritative recursive
resolvers?

The closest I've been able to find is section 3.1.1 in RFC 4035,
and that section only applies to authoritative name servers.

In the absence of any forwarding configuration, unbound always
expects to speak with authoritative name servers, and then it
makes sense to enforce the text from 3.1.1. However, when it
uses forwarding it will send all its requests via one or more
non-authoritative recursive resolvers, and then insisting on
RRSig on all RRsets in the authority section no longer makes
sense. Instead, unbound should adhere to the rule "if you need a
specific RRset (that you've not already been given) to proceed,
you had better ask for it explicitly".

Therefore, in the absence of any contradictory quote from the
standards, I conclude that the current behaviour of unbound in a
forwarding setup is a bug in unbound.

Best regards,

- Håvard

If the compatibility code can be extended, that would be great! The
alternative at the moment seems to be to use less diversity in the
upstream resolvers, but that is unfortunate from a reliability point of
view.

Best regards,
Olav Morken
UNINETT

Hi Havard,

> A couple of responses to an 'a' query for this name follows
> attached below. In both cases you'll see the Authority section
> contains the NS RRSET but not the RRSIG covering the NS RRSET,
> something we're not quite sure is "right" (but have not yet
> found the scripture on), and which Olav suspects is triggering
> Unbound to be unhappy about the response.
>
> The "right" thing is to have RRSIGs for all elements of the
> answer and authority sections. This is mandated by RFC4034,
> 4035.
>
> Following the "not a bug" response from the BIND maintainers
> yesterday evening, can you please point to chapter and verse
> mandating this behaviour for non-authoritative recursive
> resolvers?

RFC4035 3.2.3 for validators, all RRsets in answer and authority
sections should be authentic ...

https://tools.ietf.org/html/rfc4035#section-3.2.3

> The closest I've been able to find is section 3.1.1 in RFC 4035,
> and that section only applies to authoritative name servers.

Yes and I had not counted on getting a partial referral mixed into
another reply from a forwarder.

Best regards, Wouter

>   info: validate(cname): sec_status_secure
>   info: validate(positive): sec_status_secure
>   info: message is bogus, non secure rrset uninett.no. NS IN
>
> As far as I can tell, the problem here is caused by extra NS-records in
> the authority-section that do not include the RRSIG element for the
> NS-records, but I can't really say that for certain.
>
> This sounds a lot like a problem we discussed last year. See
> https://unbound.net/pipermail/unbound-users/2015-February/003757.html

Yep, indeed, this does appear to be exactly the same root cause,
although DLV isn't really relevant to the problem. Unbound appears to
enforce compliance with section 3.1.1 of RFC 4035 even when it's
querying a forwarder, which is a non-authoritative recursive resolver.
In fact, the entire 3.1 section only talks about the behaviour of
authoritative name servers, while section 3.2 talks about recursive
name servers. Thus, unbound enforcing 3.1.1 on responses to forwarded
queries is just Wrong, and a bug in unbound.

> As I said back then, I think it's wrong to discard the entire response if
> parts of it are bogus. Unbound should keep the valid parts because it
> knows there is nothing wrong with them.

Come to think of it, anything you get from a recursive resolver are
possibly cached hints, including what you get in the Answer section.
If a validating resolver needs other RRsets than those supplied in the
answer (all sections) or what it has in the cache, it should
explicitly ask for them.

Granted, modern recursive name servers used for forwarding will set
"DNSSEC OK" in outgoing queries (as mandated by RFC 3225), and will
supply any cached related DNSSEC material in the reply when queried
with the "DNSSEC OK" flag set. However, it does not have to comply with
section 3.1 of RFC 4035 when composing the reply.

> Does Unbound use CD=1 when forwarding? If so, it should expect to receive
> partially bogus answers and should handle them gracefully.

Yep, as Olav replied, and the pcaps I capture on the BIND recursor
agree: CD=1 is set in the forwarded queries.

Regards,

- Håvard

> Come to think of it, anything you get from a recursive resolver are
> possibly cached hints, including what you get in the Answer section.

It isn't quite that bad due to the RFC 2181 trustworthiness ranking.

> Does Unbound use CD=1 when forwarding? If so, it should expect to receive
> partially bogus answers and should handle them gracefully.

> Yep, as Olav replied, and the pcaps I capture on the BIND recursor
> agrees: CD=1 is set in the forwarded queries.

CD=1 is the wrong thing when querying a forwarder. When a domain is partly
broken, queries that work with CD=0 can be forced to fail with CD=1.

Tony.

That's about setting the AD bit. Was it set in this response?

Tony.

> Come to think of it, anything you get from a recursive resolver are
> possibly cached hints, including what you get in the Answer section.
>
> It isn't quite that bad due to the RFC 2181 trustworthiness ranking.

Mm, yes, but that predates DNSSEC (no?) and especially if the
local resolver wants to do its own DNSSEC validation, it can't
really in that context lend more credence to information received
in one section over another. Especially if the information is
received over an unsecured channel (which it typically will be).

> Does Unbound use CD=1 when forwarding? If so, it should expect to receive
> partially bogus answers and should handle them gracefully.

> Yep, as Olav replied, and the pcaps I capture on the BIND recursor
> agrees: CD=1 is set in the forwarded queries.
>
> CD=1 is the wrong thing when querying a forwarder. When a
> domain is partly broken, queries that work with CD=0 can be
> forced to fail with CD=1.

Really? I interpreted the use of CD=1 as "I want to do my own
DNSSEC validation, and therefore don't want or need the
validation service which could be provided by the forwarder",
especially as noted above when the communication isn't secured.
It should not make much of a difference wrt. the validity of the
end result whether the forwarder or the unbound resolver does the
DNSSEC validation?

Regards,

- Håvard

> Following the "not a bug" response from the BIND maintainers
> yesterday evening, can you please point to chapter and verse
> mandating this behaviour for non-authoritative recursive
> resolvers?

> RFC4035 3.2.3 for validators, all RRsets in answer and authority
> sections should be authentic ...
>
> https://tools.ietf.org/html/rfc4035#section-3.2.3
>
> That's about setting the ad bit. Was it set in this response?

Nope, AD=1 was not set in the response, and I would guess that is
as expected, since CD=1 was set in the query.

Regards,

- Håvard