Startup race leading to NOT IMPL on AXFR?

Folks, running nsd 4.1.14 as the hidden master for the domain
"exim.org", noticed something weird after rebooting for kernel security
updates. Have checked doc/ChangeLog for 4.1.15 and don't see anything
relevant.

Seems to be a race in startup which can result in returning NOT IMPL?

System is Ubuntu (14.04.5 LTS), has static IPv4 and IPv6 assigned as
fixed values in /etc/network/interfaces but the default route is learnt
via route advertisements. Because I need to wait for some read-only
bind-mounts constructing the chroot to be available, I have
/etc/init/nsd.override anyway to specify the start-up condition and it's
guarded on `... and net-device-up IFACE!=lo`.

We have unbound listening on localhost:53, so nsd is being given
explicit IPs to bind.

Since the last-reboot-before-today I also have a
`/etc/network/if-up.d/010wait-for-ipv6` which is trying to wait on
IPv6 being configured, as Upstart seems to emit as soon as IPv4 is
ready, before even the statically configured IPv6 is ready.

In previous reboots, nsd would fail to start because it couldn't bind
the addresses for serving, because they weren't yet available. Because
the wait-script is buggy (I now know what's needed to fix it) we only
got a "bit" of a delay, so the first attempt nsd failed as before, but
the third attempt succeeded: the IPv6 address was present, so nsd could
start. But routing was not yet present.

[2017-03-09 02:42:02.037] nsd[1081]: notice: nsd starting (NSD 4.1.14)
[2017-03-09 02:42:02.050] nsd[1081]: info: setup SSL certificates
[2017-03-09 02:42:02.277] nsd[1087]: info: zonefile zonefiles/db.exim.org is not modified
[2017-03-09 02:42:02.277] nsd[1087]: info: zonefile zonefiles/db.testdns.exim.org is not modified
[2017-03-09 02:42:02.277] nsd[1087]: info: zonefile zonefiles/unsigned.invalid254.testdns.exim.org is not modified
[2017-03-09 02:42:02.277] nsd[1087]: notice: nsd started (NSD 4.1.14), pid 1081
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: sendto 2a02:898:31::53:0 failed Network is unreachable
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: zone exim.org: could not send notify #1 to 2a02:898:31::53:0
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: sendto 2a02:898:31::53:0 failed Network is unreachable
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: zone testdns.exim.org: could not send notify #1 to 2a02:898:31::53:0
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: sendto 2a02:898:31::53:0 failed Network is unreachable
[2017-03-09 02:42:02.278] nsd[1081]: error: xfrd: zone invalid254.testdns.exim.org: could not send notify #1 to 2a02:898:31::53:0

Seeing this, I check that things are working with a NOTIFY:

[2017-03-09 02:46:57.186] nsd[1081]: info: new control connection from 127.0.0.1
[2017-03-09 02:46:57.210] nsd[1081]: info: remote control connection authenticated
[2017-03-09 02:46:57.247] nsd[1081]: info: control cmd: notify testdns.exim.org
[2017-03-09 02:46:57.248] nsd[1081]: info: remote control operation completed
[2017-03-09 02:46:57.745] nsd[3709]: info: axfr for testdns.exim.org. from 2604:a880:800:a1::419:1001
[2017-03-09 02:46:57.908] nsd[3709]: error: failed reading from 2604:a880:800:a1::419:1001 tcp: Connection reset by peer
[2017-03-09 02:46:58.740] nsd[3709]: info: axfr for testdns.exim.org. from 2604:a880:800:a1::419:1001
[2017-03-09 02:46:58.901] nsd[3709]: error: failed reading from 2604:a880:800:a1::419:1001 tcp: Connection reset by peer
[2017-03-09 02:46:59.743] nsd[3709]: info: axfr for testdns.exim.org. from 2604:a880:800:a1::419:1001
[2017-03-09 02:46:59.906] nsd[3709]: error: failed reading from 2604:a880:800:a1::419:1001 tcp: Connection reset by peer

I check that server's logs; it's "NSD 4.1.7" from OS vendor packages,
and it sees:

[2017-03-09 02:46:57.649] nsd[1692]: error: xfrd: zone testdns.exim.org received error code NOT IMPL from [hidden-master]

... and similarly before that for the startup notifies: they did seem to
make it out in the end, but the secondary had been unable to transfer.

I restart nsd 4.1.14 on the hidden master and the problem disappears.

I can't prove the two are connected in causation, but what I see is
"server experiences issues sending on IPv6 at startup" is followed by
"secondary gets NOT IMPL on transfer".

Are they linked? Is there an issue here?

(Meanwhile, I'm going to defer nsd startup until we also have an IPv6
route).

Thanks,
-Phil

Hi Phil,

There are 2 different things here. See my responses below.

In previous reboots, nsd would fail to start because it couldn't bind
the addresses for serving, because they weren't yet available. Because
the wait-script is buggy (I now know what's needed to fix it) we only
got a "bit" of a delay, so the first attempt nsd failed as before, but
the third attempt succeeded: the IPv6 address was present, so nsd could
start. But routing was not yet present.

For this you may want to use the option "ip-transparent: yes" to allow
NSD to bind to addresses not yet available. As soon as they become
available, NSD can use them.

[2017-03-09 02:46:57.649] nsd[1692]: error: xfrd: zone
testdns.exim.org received error code NOT IMPL from [hidden-master]

This might be a red herring. NSD as a slave tries IXFR first. However,
the master is also NSD, and it cannot provide IXFR (because it doesn't
have any journals), so it returns NOTIMPL. The slave then falls back to
AXFR. If you want to avoid the initial IXFR (because it will never
work), set this on the slave:

request-xfr: AXFR <master> <key>

Regards,
Anand Buddhdev

[ Quoting <anandb@ripe.net> in "Re: [nsd-users] startup race leadin..." ]

[2017-03-09 02:46:57.649] nsd[1692]: error: xfrd: zone
testdns.exim.org received error code NOT IMPL from [hidden-master]

This might be a red herring. NSD as a slave tries IXFR first. However,
the master is also NSD, and it cannot provide IXFR (because it doesn't
have any journals), so it returns NOTIMPL. The slave then falls back to
AXFR. If you want to avoid the initial IXFR (because it will never
work), set this on the slave:

NOTIMPL when doing an IXFR? I thought the proper reponse was just to send
an AXFR at that point. At least RFC 1995 Section 4 says this. Has this been
updated by a more recent doc?

If the master doesn't support IXFR and force-feeds the slave with a full
AXFR, then the slave has choice. Instead, if the master sends NOTIMPL,
then the slave can choose to try another master first, for example. I'm
guessing this was the thinking behind the implementation.

Regards,
Anand