We noticed that one of our slave NSD servers
stopped updating its zones, and are trying to
find out why. The problem we faced is that
there's no understanding why it says what it
says. Even after looking at the code it does
not makes clear
Here are the error messages for one domain:
11:25:35 panda nsd[1094]: xfrd: zone corpit.ru, from 192.168.177.15@54: tsig error (Bad Time)
11:25:35 panda nsd[1094]: xfrd: zone corpit.ru, from 192.168.177.15@54: bad tsig signature
11:37:18 panda nsd[1094]: xfrd: zone corpit.ru received error code SERVER NOT AUTHORITATIVE FOR ZONE from 192.168.177.15@54
(yes we run nsd on a non-standard port, that's not a problem).
I can only guess the main error is "Bad Time", and
the other two are the causes (but again I can be
wrong). But even for the first "BADTIME" error, -
is it coming from the DNSSEC stuff (if yes, what the
problem is?), or from the usage of authorization key
when doing XFR?
11:25:35 panda nsd[1094]: xfrd: zone corpit.ru, from 192.168.177.15@54:
tsig error (Bad Time)
11:25:35 panda nsd[1094]: xfrd: zone corpit.ru, from 192.168.177.15@54:
bad tsig signature
11:37:18 panda nsd[1094]: xfrd: zone corpit.ru received error code
SERVER NOT AUTHORITATIVE FOR ZONE from 192.168.177.15@54
(yes we run nsd on a non-standard port, that's not a problem).
I can only guess the main error is "Bad Time", and
the other two are the causes (but again I can be
wrong). But even for the first "BADTIME" error, -
is it coming from the DNSSEC stuff (if yes, what the
problem is?), or from the usage of authorization key
when doing XFR?
TSIG requires the time on the primary and secondary to be synchronised
to within 5 minutes. Check the system time on your two servers. One of
them has probably drifted more than 5 minutes. If you're not already
running something like ntp or chrony, you should do that to keep the
time accurate on these servers.
The time is syncronized within microseconds. We're running time sync
daemons on both, and both works as expected. This is the first thing
I checked - if the time is okay - yes it is, ticking exactly in line
on both servers.
I'm trying to rebuild NSD with #ifdebug turned on in the place where
BADTIME is returned, but so far it doesn't work, - apparently it tries
to print to a closed filedescriptor
And after actually enabling the debug stuff in there I found out that
the time differs by exactly 3 hours - which is our GMT offset. Which
lead me to realize that the problem apparently is wrong timezone
picked by one of the NSDs, - after restart of NSD on master, the
secondary were finally able to transfer the zones.
There was an issue still, - but this looks more like a prob is with
systemd integration (debian), not in nsd itself:
Aug 01 12:24:15 panda.tls.msk.ru systemd[1]: Starting Name Server Daemon...
Aug 01 12:24:15 panda.tls.msk.ru nsd[3195]: nsd starting (NSD 4.3.5)
Aug 01 12:24:15 panda.tls.msk.ru nsd[3195]: [2021-08-01 12:24:15.242] nsd[3195]: notice: nsd starting (NSD 4.3.5)
Aug 01 12:24:15 panda.tls.msk.ru nsd[3196]: nsd started (NSD 4.3.5), pid 3195
Aug 01 12:24:15 panda.tls.msk.ru nsd[3196]: [2021-08-01 12:24:15.274] nsd[3196]: notice: nsd started (NSD 4.3.5), pid 3195
Aug 01 12:24:16 panda.tls.msk.ru nsd[3195]: zone corpit.ru serial 2020071628 is updated to 2020071632
Aug 01 12:24:16 panda.tls.msk.ru nsd[3195]: [2021-08-01 12:24:16.375] nsd[3195]: info: zone corpit.ru serial 2020071628 is updated to 2020071632
.... a few more zones updated ....
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: nsd.service: start operation timed out. Terminating.
Aug 01 12:25:45 panda.tls.msk.ru nsd[3196]: signal received, shutting down...
Aug 01 12:25:45 panda.tls.msk.ru nsd[3196]: [2021-08-01 12:25:45.228] nsd[3196]: warning: signal received, shutting down...
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: nsd.service: Failed with result 'timeout'.
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: Failed to start Name Server Daemon.
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: nsd.service: Scheduled restart job, restart counter is at 1.
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: Stopped Name Server Daemon.
Aug 01 12:25:45 panda.tls.msk.ru systemd[1]: Starting Name Server Daemon...
Aug 01 12:25:45 panda.tls.msk.ru nsd[3240]: nsd starting (NSD 4.3.5)
Aug 01 12:25:45 panda.tls.msk.ru nsd[3240]: [2021-08-01 12:25:45.512] nsd[3240]: notice: nsd starting (NSD 4.3.5)
Aug 01 12:25:45 panda.tls.msk.ru nsd[3241]: nsd started (NSD 4.3.5), pid 3240
Aug 01 12:25:45 panda.tls.msk.ru nsd[3241]: [2021-08-01 12:25:45.543] nsd[3241]: notice: nsd started (NSD 4.3.5), pid 3240
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: nsd.service: start operation timed out. Terminating.
Aug 01 12:27:15 panda.tls.msk.ru nsd[3241]: signal received, shutting down...
Aug 01 12:27:15 panda.tls.msk.ru nsd[3241]: [2021-08-01 12:27:15.728] nsd[3241]: warning: signal received, shutting down...
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: nsd.service: Failed with result 'timeout'.
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: Failed to start Name Server Daemon.
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: nsd.service: Scheduled restart job, restart counter is at 2.
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: Stopped Name Server Daemon.
Aug 01 12:27:15 panda.tls.msk.ru systemd[1]: Starting Name Server Daemon...
Aug 01 12:27:16 panda.tls.msk.ru nsd[3266]: nsd starting (NSD 4.3.5)
(for some reason some lines are repeated in the log but with different prefix,
this is yet another thing to look at).
Anyway, the initial - main - issue is now solved, and I'm trying to understand
what actually happened.
And after actually enabling the debug stuff in there I found out that
the time differs by exactly 3 hours - which is our GMT offset. Which
lead me to realize that the problem apparently is wrong timezone
picked by one of the NSDs, - after restart of NSD on master, the
secondary were finally able to transfer the zones.
Don't you set all your servers to UTC?
There was an issue still, - but this looks more like a prob is with
systemd integration (debian), not in nsd itself: