Problem with undead upstreams

Florian_Streibelt · February 27, 2023, 10:24am

Hi all,

I am new to unbound and this list, but was unable to find a solution for my problem in the documentation and by searching.

My issue is a set of authoritative nameservers that host a domain a customer tries to resolve.

Everything works fine, until we try to resolve a DS record within that zone. All queries for DS are being ignored by the authoritatives of that domain and just get dropped without any answer. Thus unbound marks all of the servers unresponsive and will refuse to resolve anything within that zone, although queries for other record types are happily answered by the servers.

I assume there is no way to tell unbound to ignore failing DS queries for the "liveness check" or as an emergency workaround filter DS queries for a set of upstream servers?

Basically, a combination of RPZ matching on the nameserver names and the record type would do the trick, but unfortunately that is not defined in the RPZ syntax and nothing similar seems to be implemented.

Using Knot and its Lua support, I was able to implement a workaround, but ideally I don't want to manually keep lists of broken servers up to date.

A feature or change in how unbound decides that a server is unresponsive would be a good solution in my opinion. For example, when only DS queries are dropped, move on to the next server and skip this one only for DS in the future, answering with SERVFAIL or something similar, but as long as it responds to A/AAAA or other types, don't remove it from the working set...

Happy for any hints how to handle that case. Of course I am already trying to reach out to the operators of the upstream servers.

Florian

subbink-sidn · February 27, 2023, 12:37pm

Hi all,

Hello Florian,

I am new to unbound and this list, but was unable to find a solution
for my problem in the documentation and by searching.

My issue is a set of authoritative nameservers that host a domain a
customer tries to resolve.

Everything works fine, until we try to resolve a DS record within
that zone. All queries for DS are being ignored by the authoritatives
of that domain and just get dropped without any answer. Thus unbound
marks all of the servers unresponsive and will refuse to resolve
anything within that zone, although queries for other record types
are happily answered by the servers.

Did you have a look what https://dnsviz.net thinks of the domain?

I assume there is no way to tell unbound to ignore failing DS queries
for the "liveness check" or as an emergency workaround filter DS
queries for a set of upstream servers?

You could try to add the domain with the domain-insecure option [1].

Basically, a combination of RPZ matching on the nameserver names and
the record type would do the trick, but unfortunately that is not
defined in the RPZ syntax and nothing similar seems to be implemented.

Using Knot and its Lua support, I was able to implement a workaround,
but ideally I don't want to manually keep lists of broken servers up
to date.

A feature or change in the way how unbound decides a server to be
unresponsive would be a good solution in my opinion, e.g. when only
DS is dropped move to the next server and skip this one only for DS
in the future with a SERVFAIL or something, but as long as it responds
with A/AAAA or others don't remove it from the working set...

This seems a bad idea, because it is a workaround for people who are
serving DNS with a broken setup.
In my opinion the domain should be broken when it is broken.

Happy for any hints how to handle that case. Of course I am already
trying to reach out to the operators of the upstream servers.

Without any information about the domain name itself, people can only
give general hints.

[1]
https://nlnetlabs.nl/documentation/unbound/unbound.conf/#domain-insecure

heidnes · February 27, 2023, 1:00pm

I am new to unbound and this list, but was unable to find a solution
for my problem in the documentation and by searching.

My issue is a set of authoritative nameservers that host a domain a
customer tries to resolve.

Everything works fine, until we try to resolve a DS record within that
zone. All queries for DS are being ignored by the authoritatives of
that domain and just get dropped without any answer. Thus unbound
marks all of the servers unresponsive and will refuse to resolve
anything within that zone, although queries for other record types are
happily answered by the servers.

I suspect you are falling victim to one of the more odd and
perhaps unexpected quirks of DNSSEC.

The DS records for a given name are in fact not authoritative in
the zone named by the owner name of the DS record, but are
instead authoritative in the parent (delegating) zone(!)

All the other record types (including DNSKEY) for that name are
authoritative in the zone named by that same name.

Hope this helps you figure out the rest.

Best regards,

- Håvard

Florian_Streibelt · February 27, 2023, 2:45pm

I know that. But that is not my issue, in fact it is completely unrelated to DNSSEC.

It is just being triggered by querying DS records for certain domains via our unbound.

The upstream nameservers will drop DS queries on the network layer and not respond at all.

Our customer for some reason is sending DS queries to our unbound(s) for these domains.

Unbound then tries to query the servers and gets no response.

As a result it marks them all as unresponsive and then will not resolve any other records hosted on these nameservers, as they are internally marked as down, responding with a SERVFAIL until the timer is expired to re-query these servers.

Florian

heidnes · February 27, 2023, 3:22pm

I know that. But that is not my issue, in fact it is completely
unrelated to DNSSEC.

Ah.

It is just being triggered by querying DS records for certain domains
via our unbound.

The upstream nameservers will drop DS queries on the network layer and
not respond at all.

Our customer for some reason is sending DS queries to our unbound(s)
for these domains.

Unbound then tries to query the servers and gets no response.

As a result it marks them all as unresponsive and then will not
resolve any other records hosted on these nameservers, as they are
internally marked as down, responding with a SERVFAIL until the timer
is expired to re-query these servers.

I'm assuming your upstream name servers are providing recursive
service to you. If that's the case, to me it then sounds like
the upstream name servers do not implement DNSSEC; refusing to
look up "unusual" / "new" record types is a violation of the
standard, I would think -- perhaps even irrespective of whether
they implement DNSSEC or not.

"Pick another upstream" would be my suggestion, if that's at all
feasible. Either that, or do your own recursive resolution, and
don't rely on someone else bodging it for you.

Regards,

- Håvard

Florian_Streibelt · February 27, 2023, 3:31pm

No, again that is not my issue.

All of the servers that dns.com operates are dropping queries for the Resource Record Type DS.

They are the authoritative servers for dns.com as well as for the parent zone of the zone our customer wants to resolve and the zone itself.

We are providing recursion for our customer.

Our customer sends us DS queries, we try to query the respective servers but they will drop the queries silently which will make our unbound mark these servers as unresponsive and not query them any further.

When all authoritative servers for these domains are marked unresponsive, our unbound will respond with SERVFAIL to all queries that would be sent to those servers, making it impossible to resolve anything within zones hosted on those servers.
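
To make the failure mode concrete, here is a deliberately simplified toy model in Python. It is not Unbound's actual infra cache logic: the names (`ServerState`, `resolve`), the constants and the backoff rule are all assumptions chosen for illustration. The only point it demonstrates is that when the timeout state is tracked per server rather than per query type, dropped DS queries end up blocking A queries too.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Illustrative constants, NOT taken from Unbound's source or defaults.
INITIAL_RTO_MS = 400
BLOCK_THRESHOLD_MS = 12_000


@dataclass
class ServerState:
    """Toy per-server state: one retransmit timeout shared by all query types."""

    rto_ms: int = INITIAL_RTO_MS

    def record_timeout(self) -> None:
        # Exponential backoff; the query type that timed out is not remembered.
        self.rto_ms = min(self.rto_ms * 2, BLOCK_THRESHOLD_MS)

    def record_reply(self) -> None:
        self.rto_ms = INITIAL_RTO_MS

    @property
    def blocked(self) -> bool:
        return self.rto_ms >= BLOCK_THRESHOLD_MS


def resolve(servers: Dict[str, ServerState], dropped_types: Set[str], qtype: str) -> str:
    """Return 'NOERROR' if any usable server answers, else 'SERVFAIL'."""
    for state in servers.values():
        if state.blocked:
            continue  # server is considered down for every query type
        if qtype in dropped_types:
            state.record_timeout()  # upstream silently drops this type
            continue
        state.record_reply()
        return "NOERROR"
    return "SERVFAIL"
```

With two servers that drop only DS, an A query succeeds at first; after a handful of DS queries both servers reach the threshold, and the next A query gets SERVFAIL even though the servers would have answered it.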

Florian

heidnes · February 27, 2023, 4:24pm

"Pick another upstream" would be my suggestion, if that's at all
feasible. Either that, or do your own recursive resolution, and
don't rely on someone else bodging it for you.

No, again that is not my issue.

Sorry for at least initially not fully comprehending the
situation...

All of the servers that dns.com operates are dropping queries for the
Resource Record Type DS.

That is an error.

If a publishing name server receives a query for an RR type which
doesn't exist at the given name, but other data (other RR types)
exists on the queried-for name, the correct thing is to return an
empty NOERROR response. If the queried-for name doesn't exist,
but the publishing name server is authoritative for the zone
where the name would reside, the correct response is a reply with
an NXDOMAIN error code. If the publishing name server isn't
authoritative for the queried-for name, and doesn't provide
recursive service to you, a valid response would be an empty
reply with the error code REFUSED. Note that in none of these
cases is "failure to respond" a valid behaviour, perhaps modulo
rate limiting.
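
The three cases above can be summarised as a small decision function. This is only a sketch of the rules as stated in this message, assuming a server that offers no recursion; the function name and its parameters are made up for illustration.

```python
def expected_response(authoritative: bool, name_exists: bool, type_exists: bool) -> str:
    """Expected reply from a non-recursive server, per the rules described above.

    Note that none of the branches is "send nothing".
    """
    if not authoritative:
        # Not our zone and no recursion offered: empty reply, REFUSED.
        return "REFUSED"
    if not name_exists:
        # We are authoritative for the zone, but the name is not in it.
        return "NXDOMAIN"
    if not type_exists:
        # The name exists, but holds no records of the requested type.
        return "NOERROR (empty answer)"
    return "NOERROR"
```

For a DS query sent to a server that is authoritative for the name and has other data there but no DS record, this yields `NOERROR (empty answer)`.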

Failing to provide a response for "unusual" or "new" resource
record queries (some might characterize DS records as "new",
others would disagree, me among them...) is not adhering to the
spec.

The affected publishing name servers get what they deserve from
your unbound recursor -- the error is not with unbound, but with
the publishing name servers for dns.com.

With a name such as dns.com, one would have expected that someone
in the owning organization would know better than to use a DNS
name server implementation which has such a basic protocol bug.
Ref.:

$ dig dns.com. ns +short
m2.dns.com.
m1.dns.com.
$ dig @m1.dns.com. dns.com. ds +norec

; <<>> DiG 9.16.33 <<>> @m1.dns.com. dns.com. ds +norec
; (5 servers found)
;; global options: +cmd
;; connection timed out; no servers could be reached

$ dig @m2.dns.com. dns.com. ds +norec

; <<>> DiG 9.16.33 <<>> @m2.dns.com. dns.com. ds +norec
; (5 servers found)
;; global options: +cmd
;; connection timed out; no servers could be reached

$

but

$ dig @m2.dns.com. dns.com. ns +norec +short
m1.dns.com.
m2.dns.com.
$
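
If dig is not at hand, roughly the same check can be sketched with the Python standard library alone. This is a minimal probe under stated assumptions: UDP over IPv4 only, no EDNS, no retries, and the helper names `build_query` and `probe` are mine. Like `+norec`, it sends the query with the RD bit cleared.

```python
import secrets
import socket
import struct
from typing import Optional

QTYPE = {"NS": 2, "DS": 43}  # RR type codes (RFC 1035, RFC 4034)


def build_query(name: str, qtype: str, query_id: Optional[int] = None) -> bytes:
    """Build a minimal DNS query with the RD bit cleared (like dig +norec)."""
    if query_id is None:
        query_id = secrets.randbelow(1 << 16)
    # Header: ID, flags=0 (standard query, RD off), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = [part.encode("ascii") for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPE[qtype], 1)  # class IN


def probe(server: str, name: str, qtype: str, timeout: float = 3.0) -> str:
    """Send one UDP query; return the RCODE as text, or 'timeout'."""
    query = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
    if len(reply) < 12 or reply[:2] != query[:2]:
        return "mismatched reply"
    return "rcode={}".format(reply[3] & 0x0F)  # low 4 bits of the flags word
```

Calling `probe("m1.dns.com", "dns.com.", "DS")` and then the same with `"NS"` lets you compare how a server treats the two types; a server that behaves as in the transcript above would return `timeout` for DS only.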

Regards,

- Håvard

PaulWouters · February 27, 2023, 4:33pm

Then if they do not respond properly for DS records or with proof of
non-existence, then that implementation is broken and there is not much
you can do. But this means they should also fail to work for google dns
on 8.8.8.8, or on quad9 at 9.9.9.9. That is, your customer should really
move their domain elsewhere.

Perhaps you can try a local override, eg:

local-zone: <your-parentzone> ds always_nxdomain
local-zone: <your-customerzone> ds always_nxdomain

But I don't really know if that will work.

Another option might be to run an unbound instance with val-permissive-mode: yes
and then on your regular resolver, use a forward-zone: for your
parentzone and customer zone to that unbound instance.

Paul

George_Yorgos_Thessa · February 28, 2023, 10:02am

Hi Florian,

No, again that is not my issue.

All of the servers that dns.com operates are dropping queries for the Resource Record Type DS.

They are the authoritative servers for dns.com as well as for the parent zone of the zone our customer wants to resolve and the zone itself.

We are providing recursion for our customer.

Then if they do not respond properly for DS records or with proof of
non-existence, then that implementation is broken and there is not much
you can do. But this means they should also fail to work for google dns
on 8.8.8.8, or on quad9 at 9.9.9.9. That is, your customer should really
move their domain elsewhere.

Perhaps you can try a local override, eg:

local-zone: <your-parentzone> ds always_nxdomain
local-zone: <your-customerzone> ds always_nxdomain

But I don't really know if that will work.

What could work would be:
  local-zone: <your-parentzone> typetransparent
  local-data: "<your-customerzone> IN DS <ds-data>"
  domain-insecure: <your-customerzone>

These are DS answers to the clients.
You would need to provide fake DS data though, not sure if that is desirable and how that would break the clients that ask for it.
If the zones do not do DNSSEC in the first place (so there is no chain to break) you could try that.

Best regards,
-- Yorgos

Florian_Streibelt · February 28, 2023, 12:17pm

Perhaps you can try a local override, eg:

local-zone: <your-parentzone> ds always_nxdomain
local-zone: <your-customerzone> ds always_nxdomain

But I don't really know if that will work.

What could work would be:
  local-zone: <your-parentzone> typetransparent
  local-data: "<your-customerzone> IN DS <ds-data>"
  domain-insecure: <your-customerzone>

These are DS answers to the clients.
You would need to provide fake DS data though, not sure if that is
desirable and how that would break the clients that ask for it.
If the zones do not do DNSSEC in the first place (so there is no chain
to break) you could try that.

Something like that looks exactly like what I need/could abuse for that.

DNSSEC with that domain(s) probably is broken anyway, and returning NODATA would be sufficient here.

I really hate manually keeping track of these domains, but nobody seems to be able to fix it, and it's hard to explain that the domains work with any of the open resolvers and break only when using our resolvers.

Thanks for the help!

Another option might be to run an unbound instance with val-permissive-mode: yes
and then on your regular resolver, use a forward-zone: for your
parentzone and customer zone to that unbound instance.

The problem is that "something" at our customer is explicitly creating these DS queries.

Again thanks for all the help and effort trying to understand my problem.

I'll try to reply if the solution works.

Florian

George_Yorgos_Thessa · February 28, 2023, 12:30pm

Hi Florian,

I really hate manually keeping track of these domains, but nobody seems to be able to fix it, and it's hard to explain that the domains work with any of the open resolvers and break only when using our resolvers.

Another thing you can do if you know these auth servers is to
"frequently" issue:
unbound-control flush_infra <IP>
for each upstream server.

Then the list you need to maintain is significantly smaller I guess.
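
A small wrapper for running this from cron or a timer could look like the sketch below. It assumes `unbound-control` is on the PATH and already configured for remote control; the function names are hypothetical, and any addresses you pass in are your own list of affected servers.

```python
import ipaddress
import subprocess
from typing import Iterable, List


def flush_commands(addresses: Iterable[str]) -> List[List[str]]:
    """Build one 'unbound-control flush_infra <IP>' command per address.

    Raises ValueError if an entry is not a valid IPv4 or IPv6 address.
    """
    commands = []
    for address in addresses:
        ip = ipaddress.ip_address(address)  # validates the input
        commands.append(["unbound-control", "flush_infra", str(ip)])
    return commands


def flush_all(addresses: Iterable[str]) -> None:
    """Run the flush for every address, reporting failures without stopping."""
    for command in flush_commands(addresses):
        result = subprocess.run(command, check=False)
        if result.returncode != 0:
            print("failed:", " ".join(command))
```

How often "frequently" needs to be depends on how quickly the servers get blocked again, so the interval is something to find out by observation.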

Best regards,
-- Yorgos

PaulWouters · February 28, 2023, 12:39pm

A standard validating resolver would do this. Any bind, unbound or systemd-resolved would do this.

Paul

Florian_Streibelt · February 28, 2023, 1:15pm

Yep, and it seems that because there is 'no answer', the number of DS queries being sent in our direction is relatively high, leading to constant blocking of the domains. So let me rephrase my last email to 'creating huge amounts of repeated DS queries'. Sorry for being unspecific.

Florian