auth-zones and DNS NOTIFY

Hello,

thanks for the auth-zone feature in 1.7!

Unfortunately, for my setup it's missing DNS NOTIFY recognition.
Are there plans to implement DNS NOTIFY?
If so, for my setup it would be important that notifies are (optionally) also
respected if their source address differs from the master definition(s).

Thanks,

-harry

Hi Harry,

Yes, DNS NOTIFY is implemented in the current code repo version. You
can specify additional sources with allow-notify.

Best regards, Wouter

Great, thanks a lot!
I found time to update some production systems, but unfortunately zone transfers seem to work only initially; then I see these messages logged:
unbound: [14927:0] error: ./services/authzone.c at 6102 could not pthread_mutex_lock(&xfr->lock): Resource deadlock avoided
unbound: [14927:0] error: ./services/authzone.c at 3454 could not pthread_mutex_lock(&xfr->lock): Resource deadlock avoided

Increasing the log level to 3 doesn't show anything more useful.

After the error occurs, unbound returns "error response SERVFAIL" for all queries that match stub-zones:, and all queries matching auth-zones: get the old records (no more xfer).

Any idea where the problem could come from?
Will try to make all stub-zones auth-zones and see if that changes anything....

Thanks,

-Harry

Couldn't find out more, sorry, no config change I made had any effect.

I'm running 1.7.1 on FreeBSD inside a jail and use "allow-notify:", since the transfer takes a different route (via tunnel) than the notify source.
The incoming notify triggers the error(-log) and the stall for stub-zones.

I had to remove auth-zones: for now to get my setup back into working condition.

My intention was to serve auth-zones without using a zonefile, but it doesn't make any difference whether I use one or not.
There seems to be a locking problem when an xfer starts after a notify was received. Unfortunately it's nothing I can easily track down, since I'm not used to debuggers and don't even have a system at hand where I could install one.

I hope someone can take care of that issue.
The deadlock error quoted above corresponds to auth_xfer_timer() for line 6102:

    struct auth_xfer* xfr = (struct auth_xfer*)arg;
    struct module_env* env;
    log_assert(xfr->task_nextprobe);
    lock_basic_lock(&xfr->lock);
    env = xfr->task_nextprobe->env;
    if(env->outnet->want_to_quit) {
        lock_basic_unlock(&xfr->lock);
        return; /* stop on quit */
    }

    /* see if zone has expired, and if so, also set auth_zone expired */

and auth_zones_notify() for line 3454:

    /* see which zone this is */
    lock_rw_rdlock(&az->lock);
    xfr = auth_xfer_find(az, nm, nmlen, dclass);
    if(!xfr) {
        lock_rw_unlock(&az->lock);
        /* no such zone, refuse the notify */
        *refused = 1;
        return 0;
    }
    lock_basic_lock(&xfr->lock);
    lock_rw_unlock(&az->lock);

    /* check access list for notifies */

But no way for me to get any further, sorry.

-harry

Repeating the test with auth-zone as a prefetch for the root zone seems to yield similar results after 12 to 24 hours.

LOG
unbound: [18768:0] error: can't bind socket: Permission denied for ::

CONF
auth-zone:
   name: "."
   master: "lax.xfr.dns.icann.org"
   master: "iad.xfr.dns.icann.org"
   url: "http://www.internic.net/domain/root.zone"
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: "root.zone"

auth-zone:
   name: "arpa"
   master: "lax.xfr.dns.icann.org"
   master: "iad.xfr.dns.icann.org"
   url: "http://www.internic.net/domain/arpa.zone"
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: "arpa.zone"

auth-zone:
   name: "in-addr.arpa"
   master: "lax.xfr.dns.icann.org"
   master: "iad.xfr.dns.icann.org"
   url: "http://www.internic.net/domain/in-addr.arpa.zone"
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: "in-addr.arpa.zone"

auth-zone:
   name: "ip6.arpa"
   master: "lax.xfr.dns.icann.org"
   master: "iad.xfr.dns.icann.org"
   url: "http://www.internic.net/domain/ip6.arpa.zone"
   fallback-enabled: yes
   for-downstream: no
   for-upstream: yes
   zonefile: "ip6.arpa.zone"

Without having looked for such errors, I just checked: my zones (., arpa, in-addr.arpa and ip6.arpa) *are* current.
As far as I can see, the SOA serial for . and arpa changes every day.
The difference: I use the nameservers from the zones' NS records and don't have a url configured.

unbound-1.7.1 + patch https://unbound.nlnetlabs.nl/pipermail/unbound-users/2018-May/005279.html

Andreas

Kudos!
Highly appreciate your work and support! Far better than many commercial competitors…
I will drop a note after testing, which might not happen before next week :-(

One short question: Does this also address/explain the stall/outage (more precisely the "error response SERVFAIL") for stub-zone: matching queries?

Thanks,

-harry

Hi,

Thank you very much for the detailed report. I found the deadlock
problem and fixed it for the upcoming release.

There is a patch as well in case that is useful for you. The routine
simply forgot to unlock in one of the cases for an incoming NOTIFY
message. This explains why the other report did not encounter the
problem.

Index: services/authzone.c

--- services/authzone.c (revision 4703)
+++ services/authzone.c (working copy)
@@ -3425,8 +3425,10 @@
 {
     /* if the serial of notify is older than we have, don't fetch
      * a zone, we already have it */
-    if(has_serial && !xfr_serial_means_update(xfr, serial))
+    if(has_serial && !xfr_serial_means_update(xfr, serial)) {
+        lock_basic_unlock(&xfr->lock);
         return;
+    }
     /* start new probe with this addr src, or note serial */
     if(!xfr_start_probe(xfr, env, fromhost)) {
         /* not started because already in progress, note the serial */
Best regards, Wouter
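
For readers who want to see this failure mode in isolation, here is a minimal sketch. It assumes plain POSIX threads and an error-checking mutex, which returns EDEADLK ("Resource deadlock avoided") when the owning thread tries to lock it again. The names demo_xfr, demo_xfr_init and demo_handle_notify are made up for illustration and are not Unbound functions; the unlock_on_skip flag exists only to switch between the pre-patch and post-patch shape of the early return.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for PTHREAD_MUTEX_ERRORCHECK on strict compilers */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct auth_xfer; only the lock matters here. */
struct demo_xfr {
    pthread_mutex_t lock;
};

/* Set up an error-checking mutex, which returns EDEADLK on a relock by
 * the owning thread instead of blocking forever. Returns 0 on success. */
int demo_xfr_init(struct demo_xfr* x)
{
    pthread_mutexattr_t attr;
    int r = pthread_mutexattr_init(&attr);
    if(r != 0)
        return r;
    r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if(r == 0)
        r = pthread_mutex_init(&x->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return r;
}

/* Lock, possibly skip on a stale serial, unlock. With unlock_on_skip == 0
 * the early return leaves the lock held (the pre-patch shape). Returns the
 * pthread_mutex_lock() result, so a later call reports EDEADLK. */
int demo_handle_notify(struct demo_xfr* x, int serial_is_newer,
    int unlock_on_skip)
{
    int r = pthread_mutex_lock(&x->lock);
    if(r != 0)
        return r;
    if(!serial_is_newer) {
        if(unlock_on_skip)
            pthread_mutex_unlock(&x->lock);
        return 0; /* nothing to fetch */
    }
    /* a real implementation would start the probe here */
    pthread_mutex_unlock(&x->lock);
    return 0;
}
```

Calling demo_handle_notify(&x, 0, 0) once succeeds but leaks the lock. The next call from the same thread then fails with EDEADLK, which matches the pattern in the log lines quoted earlier in the thread: every later lock attempt on the same xfr fails.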

As for the SERVFAIL responses on stub-zone queries: possibly. Not getting
updates would explain the SERVFAIL, because an expired auth zone cannot
deliver the information needed for resolution. But it may not be this;
it could be some other issue.

Best regards, Wouter

Dear all, Wouter,

sorry for bringing it up again, but I'm having real-world problems with this nice new auth-zone: and allow-notify: feature ;-)

My auth-zone: has two master: definitions.
It seems that the second definition is probed first when a NOTIFY comes in (at least if the NOTIFY is not from one of the masters); I haven't verified this yet, neither by code inspection nor by testing beyond the most basic level. As long as it's a static and documented behaviour, everything is fine.

But unfortunately unbound stops its probe/xfer attempts if the first master selected/probed doesn't return a higher serial than the one the NOTIFY posted.
If the NOTIFY matched an allow-notify: definition (not coming from [one of] the masters), it should continue and probe the second (etc.) master, I think. Whether it's sensible to also probe all masters in case the NOTIFY came from one of them is beyond what I can judge at the moment. But in case the NOTIFY came from a non-master, the decision to list it in allow-notify: itself justifies probing all masters if the first one responded with a serial not higher than the one the NOTIFY posted, imho.

Real world: Active Directory, for example, or any other multi-master backend which needs more than 1 ms to replicate upstream.

What do you think?

Thanks,

-harry

P.S.: I still have another severe problem with auth-zone: and CNAME RRs. As soon as I keep for-downstream: yes, CNAMEs pointing to other zones aren't resolved, although unbound is authoritative for the(se) other zone(s) too!
That's unique to unbound afaik.
Is this really intended by design?

Sorry for this stupid and misleading confusion about the serial comparison:
It's not about the probe serial being not higher than the NOTIFY serial, but about the probe serial being _lower_ than the NOTIFY serial.
It was just a mis-explanation; the problem itself should be described correctly, I hope. Only the comparison I wrote is nonsense (since NOTIFY sends the new serial, which in the case of multi-master backends isn't replicated yet, so the serial returned by the probe is _lower_, not equal; sorry).

Hi Harry,

Thank you for the bug report. I have fixed the code, so that it does not
stop the probe when a master replies with the current serial. Instead,
it'll continue and probe the masters until one has an update. If all
of them respond with the current serial, it assumes it is up to date and
waits (the SOA timer).

The first master that gets a query is the same master that sent the
NOTIFY. After that it should scan them in the order they appear in the config.

(The code is in the repository, pick up services/authzone.c and
services/authzone.h if you want to have the update).

Best regards, Wouter
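
The probe order described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from services/authzone.c: demo_serial_newer and demo_pick_master are hypothetical names, the serial comparison follows RFC 1982 serial number arithmetic, and the serials each master would answer with are passed in as a plain array instead of being fetched over the network.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1982 style comparison for 32-bit SOA serials: nonzero if a is
 * newer than b, taking wrap-around into account. */
int demo_serial_newer(uint32_t a, uint32_t b)
{
    return a != b && (uint32_t)(a - b) < 0x80000000u;
}

/* serials[i] is the serial that master i (in config order) reports.
 * first is the index of the master that sent the NOTIFY, or -1 if the
 * NOTIFY came from a non-master listed in allow-notify. have is the
 * serial currently held. Returns the index of the first master probed
 * that has an update, or -1 if every master reports nothing newer. */
int demo_pick_master(const uint32_t* serials, size_t n, int first,
    uint32_t have)
{
    size_t i;
    if(first >= 0 && (size_t)first < n &&
        demo_serial_newer(serials[first], have))
        return first;
    for(i = 0; i < n; i++) {
        if((int)i == first)
            continue; /* already probed */
        if(demo_serial_newer(serials[i], have))
            return (int)i;
    }
    return -1; /* assume up to date, wait for the SOA timer */
}
```

With two masters reporting serials 10 and 12 and a held serial of 10, the sketch skips the first master and picks the second, which is the multi-master replication case raised earlier in the thread.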

Dear Wouter,

thanks a lot for all the nice improvements!

I hadn't found time to start over with my unbound deployments for some time, but did so last weekend.
And now I'm nagging again ;-)

It's again about auth-zone: and notify, respectively TCP transfer.
Without inspecting the code, I guess my issues are closely related:

auth-zone:
   name: "a.b.c.de."
   master: 169.254.0.53
   master: 169.254.0.54
   allow-notify: 172.17.2.231
   allow-notify: 172.17.2.232

a.b.c.de. gets a notify from a non-master that is listed in allow-notify:
Log:
... unbound[68691]: [68691:0] info: received NOTIFY serial 2019031715 for a.b.c.de. from 172.17.2.232 port 57053

_For my test case_, both masters were reachable by UDP, but the first master doesn't respond to TCP (AXFR).
Then the second master never gets asked; just this is logged:
... peleus unbound[68691]: [68691:0] debug: tcp took too long, dropped

So far this is just a not very realistic test case,
but I guess the following problem has exactly the same root cause.
I use the same _zonefile-less_ auth-zone: from above (having two masters defined) and start unbound (without using a zonefile).
Problem:
If the first master is down, the second master never gets any TCP AXFR attempt and unbound will permanently return SERVFAIL, instead of trying the second (available) master for loading the zone at startup!

If the second master is down, this is no issue of course.
But for my planned setup it's crucial that auth-zones get loaded from *any* available master.

So I guess an additional TCP/AXFR timer is needed (after a notify, or at startup unrelated to any notify) to continue asking multiple masters.
Some timer must already be in place, since this line is logged:
... peleus unbound[68691]: [68691:0] debug: tcp took too long, dropped
But afterwards, the other master(s) should be contacted, rather than continuing with the first for AXFR.
I'd highly appreciate it if that timeout were adjustable, or at least reduced. As far as I remember it was in the range of a minute, while 10-20s would be a better fit, I think.

Do you think this is worth adding/fixing?

Thanks,

-harry

Hi Harry,

So, it turns out there was an issue with TCP timeouts; this is fixed now.
Also, logging has been improved for auth zones. And there is a fix for
using an incorrect socket type for SOA probes.

I think this may have fixed your bug. When I hit TCP timeouts, they
worked fine and the next master was attempted. But the logging
now makes that visible.

The timeout is at 10s now for these transfers. The design is to load
from any master that can be contacted. It attempts first to contact the
master with the IP address that the NOTIFY or SOA-probe packet
indicates. Then it tries all of them, in sequence.

Best regards, Wouter
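
The "load from any master that can be contacted" design can be sketched as a simple failover loop. This is a sketch under assumptions, not Unbound's implementation: demo_load_zone, demo_try_fn and demo_first_master_down are hypothetical names, and the actual TCP transfer is replaced by a callback so the ordering logic can be seen on its own. The 10 second timeout from the message above would be passed in as timeout_secs.

```c
#include <stddef.h>

enum demo_result { DEMO_OK = 0, DEMO_TIMEOUT = 1 };

/* Callback that attempts one transfer from master idx, giving up after
 * timeout_secs seconds. Hypothetical; stands in for the real TCP code. */
typedef enum demo_result (*demo_try_fn)(size_t idx, int timeout_secs);

/* Try the preferred master first (the one indicated by the NOTIFY or
 * SOA probe, or -1 for none), then every other master in config order.
 * Returns the index that succeeded, or -1 if none did. */
int demo_load_zone(size_t n, int preferred, int timeout_secs,
    demo_try_fn try_master)
{
    size_t i;
    if(preferred >= 0 && (size_t)preferred < n &&
        try_master((size_t)preferred, timeout_secs) == DEMO_OK)
        return preferred;
    for(i = 0; i < n; i++) {
        if(preferred >= 0 && i == (size_t)preferred)
            continue; /* already tried */
        if(try_master(i, timeout_secs) == DEMO_OK)
            return (int)i;
    }
    return -1;
}

/* Example callback: master 0 never answers, all others succeed. */
enum demo_result demo_first_master_down(size_t idx, int timeout_secs)
{
    (void)timeout_secs;
    return idx == 0 ? DEMO_TIMEOUT : DEMO_OK;
}
```

With two masters and the first one down, demo_load_zone(2, -1, 10, demo_first_master_down) moves on to the second master, which is the startup case reported above.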