NSD serves old serial after restart

Hi!

I have upgraded from nsd 4.3.5 to 4.6. After the restart of the server, it serves an old zone. For example:

NSD 4.3.5:

07:31:13 nsd-pl[811535]: notify for kepno.pl. from X.X.X.20 serial 1660716049

07:31:13 nsd-pl[811535]: notify for kepno.pl. from XXXX:XXXX:9::5 serial 1660716049

07:31:13 nsd-pl[3084]: xfrd: zone kepno.pl committed “received update to serial 1660716049 at 2022-08-17T07:31:13 from X.X.X.20 TSIG verified with key foobar”

07:31:13 nsd-pl[3089]: zone kepno.pl. received update to serial 1660716049 at 2022-08-17T07:31:13 from X.X.X.20 TSIG verified with key foobar of 2403 bytes in 9.8e-05 seconds

07:31:13 nsd-pl[811535]: notify for kepno.pl. from X.X.X.4 serial 1660716049

07:31:13 nsd-pl[811535]: notify for kepno.pl. from XXXX:XXXX:8::5 serial 1660716049

07:31:14 nsd-pl[3084]: zone kepno.pl serial 1660716048 is updated to 1660716049

07:46:24 nsd-pl[3089]: writing zone kepno.pl to file kepno.pl.zone

09:46:22 nsd-pl[1008051]: notify for kepno.pl. from XXXX:XXXX:9::5 serial 1660716050

09:46:22 nsd-pl[1008051]: notify for kepno.pl. from X.X.X.20 serial 1660716050

09:46:22 nsd-pl[3084]: xfrd: zone kepno.pl committed “received update to serial 1660716050 at 2022-08-17T09:46:22 from XXXX:XXXX:9::5 TSIG verified with key foobar”

09:46:22 nsd-pl[1008051]: notify for kepno.pl. from XXXX:XXXX:8::5 serial 1660716050

09:46:22 nsd-pl[1008051]: notify for kepno.pl. from X.X.X.4 serial 1660716050

09:46:27 nsd-pl[3089]: zone kepno.pl. received update to serial 1660716050 at 2022-08-17T09:46:22 from XXXX:XXXX:9::5 TSIG verified with key foobar of 840 bytes in 0.000108 seconds

09:46:28 nsd-pl[3084]: zone kepno.pl serial 1660716049 is updated to 1660716050

→ NSD 4.3.5 serves serial 1660716050

Now, upgrade to 4.6 and restart NSD:

10:32:04 nsd-pl[1072241]: zone kepno.pl read with success

10:32:04 nsd-pl[1072241]: rehash of zone kepno.pl. with parameters 1 0 12 e831662b2ffa02c1

10:32:10 nsd-pl[1072240]: zone kepno.pl serial 1660716050 is updated to 1660716049

→ Why is the serial going backwards?

nsd-control -c /etc/nsd/nsd-pl.conf zonestatus kepno.pl

zone: kepno.pl

state: ok

served-serial: “1660716049 since 2022-08-17T10:32:10”

commit-serial: “1660716050 since 2022-08-17T10:55:43”

wait: “1159 sec between attempts”

I can fix it with force_transfer:

11:14:25 nsd-pl[1072240]: xfrd: zone kepno.pl committed “received update to serial 1660716050 at 2022-08-17T11:14:25 from X.X.X.4 TSIG verified with key foobar”

11:14:26 nsd-pl[1072241]: rehash of zone kepno.pl. with parameters 1 0 12 e831662b2ffa02c1

11:14:26 nsd-pl[1072241]: zone kepno.pl. received update to serial 1660716050 at 2022-08-17T11:14:25 from X.X.X.4 TSIG verified with key foobar of 31937 bytes in 0.005712 seconds

11:14:26 nsd-pl[1072240]: zone kepno.pl serial 1660716049 is updated to 1660716050

nsd-control -c /etc/nsd/nsd-pl.conf zonestatus kepno.pl

zone: kepno.pl

state: ok

served-serial: “1660716050 since 2022-08-17T11:14:25”

commit-serial: “1660716050 since 2022-08-17T11:14:25”

wait: “279 sec between attempts”

Is this a bug or a feature?

Thanks

Klaus

Hi Klaus,

NSD 4.3.5:
07:31:13 nsd-pl[811535]: notify for kepno.pl. from X.X.X.20 serial 1660716049
07:31:13 nsd-pl[811535]: notify for kepno.pl. from XXXX:XXXX:9::5 serial 1660716049
07:31:13 nsd-pl[3084]: xfrd: zone kepno.pl committed "received update to serial 1660716049 at 2022-08-17T07:31:13 from X.X.X.20 TSIG verified with key foobar"
07:31:13 nsd-pl[3089]: zone kepno.pl. received update to serial 1660716049 at 2022-08-17T07:31:13 from X.X.X.20 TSIG verified with key foobar of 2403 bytes in 9.8e-05 seconds
07:31:13 nsd-pl[811535]: notify for kepno.pl. from X.X.X.4 serial 1660716049
07:31:13 nsd-pl[811535]: notify for kepno.pl. from XXXX:XXXX:8::5 serial 1660716049
07:31:14 nsd-pl[3084]: zone kepno.pl serial 1660716048 is updated to 1660716049
07:46:24 nsd-pl[3089]: writing zone kepno.pl to file kepno.pl.zone

Here, the zone kepno.pl has been saved with serial 1660716049.

09:46:22 nsd-pl[1008051]: notify for kepno.pl. from XXXX:XXXX:9::5 serial 1660716050
09:46:22 nsd-pl[1008051]: notify for kepno.pl. from X.X.X.20 serial 1660716050
09:46:22 nsd-pl[3084]: xfrd: zone kepno.pl committed "received update to serial 1660716050 at 2022-08-17T09:46:22 from XXXX:XXXX:9::5 TSIG verified with key foobar"
09:46:22 nsd-pl[1008051]: notify for kepno.pl. from XXXX:XXXX:8::5 serial 1660716050
09:46:22 nsd-pl[1008051]: notify for kepno.pl. from X.X.X.4 serial 1660716050
09:46:27 nsd-pl[3089]: zone kepno.pl. received update to serial 1660716050 at 2022-08-17T09:46:22 from XXXX:XXXX:9::5 TSIG verified with key foobar of 840 bytes in 0.000108 seconds
09:46:28 nsd-pl[3084]: zone kepno.pl serial 1660716049 is updated to 1660716050
-> NSD 4.3.5 serves serial 1660716050

NSD has internally updated to serial 1660716050, but not yet saved it to disk. By default, NSD writes out zone files only once per hour.

Now, upgrade to 4.6 and restart NSD:
10:32:04 nsd-pl[1072241]: zone kepno.pl read with success
10:32:04 nsd-pl[1072241]: rehash of zone kepno.pl. with parameters 1 0 12 e831662b2ffa02c1
10:32:10 nsd-pl[1072240]: zone kepno.pl serial 1660716050 is updated to 1660716049
--> Why is the serial going backwards?

NSD read the zone from disk, and it still had the previous serial number, so that's what got loaded into memory. Eventually, NSD would have noticed that it's outdated and would have done an XFR to update it.

Before restarting NSD, it is good practice to write zones to disk. Or configure it to save an updated zone immediately to disk, by setting "zonefiles-write" to a low value, so that zone files on disk are as up to date as possible.

[snip]

Is this a bug or a feature?

Feature :wink:

Regards,
Anand

Hi Anand!

> -> NSD 4.3.5 serves serial 1660716050

NSD has internally updated to serial 1660716050, but not yet saved it to
disk. By default, NSD writes out zone files only once per hour.

> Now, upgrade to 4.6 and restart NSD:
> 10:32:04 nsd-pl[1072241]: zone kepno.pl read with success
> 10:32:04 nsd-pl[1072241]: rehash of zone kepno.pl. with parameters 1 0 12
e831662b2ffa02c1
> 10:32:10 nsd-pl[1072240]: zone kepno.pl serial 1660716050 is updated to
1660716049
> --> Why is the serial going backwards?

NSD read the zone from disk, and it still had the previous serial
number, so that's what got loaded into memory.

That makes sense.

Eventually, NSD would
have noticed that it's outdated and would have done an XFR to update it.

That's the weird thing. NSD knows that the version on disk is older than the previously served one (serial is goind backwards: "1660716050 is updated to 1660716049"). Hence it could check the serial on the primary and fetch the latest version.

I also tried "nsd-control transfer ....", but that also did not triggered an XFR. Only "force_transfer" triggered an XFR. From my understanding, "transfer: try to update slave zones to newer serial" should also trigger an XFR as the primary has a higher serial then the current served one.

Before restarting NSD, it is good practice to write zones to disk. Or
configure it to save an updated zone immediately to disk, by setting
"zonefiles-write" to a low value, so that zone files on disk are as up
to date as possible.

Thanks for the tip
Klaus

Hi Klaus,

That's the weird thing. NSD knows that the version on disk is older
than the previously served one (serial is goind backwards: "1660716050
is updated to 1660716049"). Hence it could check the serial on the
primary and fetch the latest version.

Except that NSD keeps a state file as well, in which it maintains timers about when next to query the primary. For this zone, that time had not yet arrived.

If you stop NSD, delete the state file, and start NSD, then it will have no memory of the timers. After loading all the zones, it will immediately query their primaries and update them if needed. But this has the downside of causing a thundering herd towards the primaries in the case where you have lots of secondary zones.

I also tried "nsd-control transfer ....", but that also did not
triggered an XFR. Only "force_transfer" triggered an XFR. From my
understanding, "transfer: try to update slave zones to newer serial"
should also trigger an XFR as the primary has a higher serial then the
current served one.

Hmm. So NSD thinks the zone is up to date, because it has a newer serial in the state file, but is actually serving an old serial. Could be a subtle bug. I'll see if I can reproduce it on our test server.

Regards,
Anand

Hi Anand!

Thanks for your help.

> I also tried "nsd-control transfer ....", but that also did not
> triggered an XFR. Only "force_transfer" triggered an XFR. From my
> understanding, "transfer: try to update slave zones to newer serial"
> should also trigger an XFR as the primary has a higher serial then the
> current served one.

Hmm. So NSD thinks the zone is up to date, because it has a newer serial
in the state file, but is actually serving an old serial. Could be a
subtle bug. I'll see if I can reproduce it on our test server.

It also confuses me, that the commited serial is higher than the served serial:
root@tld-all-fam1:/home/darilion# nsd-control -c /etc/nsd/nsd-shared.conf zonestatus cy
zone: cy
        state: ok
        served-serial: "2022081705 since 2022-08-17T12:07:26"
        commit-serial: "2022081706 since 2022-08-17T12:31:40"
        wait: "6845 sec between attempts"

I also have in the zone settings: max-refresh-time: 300
I would expect that at least after 5 minutes NSD should perform a SOA query against the primary, detect the higher serial, and then perform the XFR. But maybe NSD is comparing the "commit-serial" with the primary-serial and this doing nothing,

I checked with tcpdump: on "nsd-control transfer cy" it performs an IXFR request with serial in the SOA=2022081706.

So something is going wrong here. If NSD has the 2022081706 zone local available, then it should serve it. If only the 2022081705 is available on disk, NSD should perform the serial check against the primary (IXFR request) with the serial of the local served zone, which would be 2022081705 and not 2022081706.

regards
Klaus

It also confuses me, that the commited serial is higher than the served serial:
root@tld-all-fam1:/home/darilion# nsd-control -c /etc/nsd/nsd-shared.conf
zonestatus cy
zone: cy
        state: ok
        served-serial: "2022081705 since 2022-08-17T12:07:26"
        commit-serial: "2022081706 since 2022-08-17T12:31:40"
        wait: "6845 sec between attempts"

I also have in the zone settings: max-refresh-time: 300
I would expect that at least after 5 minutes NSD should perform a SOA query
against the primary, detect the higher serial, and then perform the XFR. But
maybe NSD is comparing the "commit-serial" with the primary-serial and this
doing nothing,

I checked with tcpdump: on "nsd-control transfer cy" it performs an IXFR
request with serial in the SOA=2022081706.

So something is going wrong here. If NSD has the 2022081706 zone local
available, then it should serve it. If only the 2022081705 is available on disk,
NSD should perform the serial check against the primary (IXFR request) with
the serial of the local served zone, which would be 2022081705 and not
2022081706.

Now, as the primary had a new serial 2022081707, NSD now serves 2022081707, but again there are "updated failed" log messages.

15:06:36 nsd[2657770]: notify for cy. from X.X.X.20 serial 2022081707
15:06:36 nsd[2657770]: notify for cy. from XXXX:XXXX:9::5 serial 2022081707
15:06:36 nsd[2391509]: xfrd: zone cy committed "received update to serial 2022081707 at 2022-08-17T15:06:36 from X.X.X.20 TSIG verified with key foobar"
15:06:36 nsd[2391509]: xfrd: zone cy: soa serial 2022081707 update failed, restarting transfer (notified zone)
15:08:40 nsd[2659700]: notify for cy. from X.X.X.4 serial 2022081707
15:08:40 nsd[2659700]: notify for cy. from XXXX:XXXX:8::5 serial 2022081707
15:08:40 nsd[2391509]: xfrd: zone cy committed "received update to serial 2022081707 at 2022-08-17T15:08:40 from X.X.X.4 TSIG verified with key foobar"
15:08:40 nsd[2391512]: zone cy. received update to serial 2022081707 at 2022-08-17T15:08:40 from X.X.X.4 TSIG verified with key foobar of 701 bytes in 8.9e-05 seconds
15:08:40 nsd[2391509]: zone cy serial 2022081705 is updated to 2022081707

Further, it is not trustworthy at all, and I suspect that this NSD now serves a broken zone file:
- NSD served serial 2022081705
- NSD request IXFR with serial 2022081706
- the primary has 2022081707, hence sending an IXFR with the differences from 2022081706 to 2022081707

I suspect that only the diff from 2022081706 to 2022081707 was applied on top of 2022081705. Hence, I request now full AXFR to be on the safe side.

This is very scary
Thanks
Klaus

I just created a bug report: https://github.com/NLnetLabs/nsd/issues/227

It seems that this bug does not corrupt the local zone, but prevents that NSD as Secondary fetches newer zones.

regards
Klaus

I presume that's to reduce I/O? Is there any reason why updates aren't
written as soon as the file they are replacing is older than 1 hour?

In other words, write as soon as possible unless we already had an update
written in the last hour.

Sorry if this is a dumb question!

Cheers, Jamie

Hi Jamie!

Von: Jamie Landeg-Jones <jamie@catflap.org>
Gesendet: Mittwoch, 17. August 2022 23:50
An: nsd-users@lists.nlnetlabs.nl; Klaus Darilion <klaus.darilion@nic.at>;
anandb@ripe.net
Betreff: Re: [nsd-users] NSD serves old serial after restart

> NSD has internally updated to serial 1660716050, but not yet saved it to
> disk. By default, NSD writes out zone files only once per hour.

I presume that's to reduce I/O?

Yes. Consider a very huge zone which gets updated with IXFR every few seconds. That would be heavy to write the zone to disk after every update.

Is there any reason why updates aren't
written as soon as the file they are replacing is older than 1 hour?

AFAIK this is actually the case, when default zonefile-write=1h

In my opinion, it would be useful to have a zonefile-write-onshutdown=true option to force a "write" when the daemon is shut down. I think this is what Bind9 for example does.

regards
Klaus

I just created a bug report: https://github.com/NLnetLabs/nsd/issues/227

It seems that this bug does not corrupt the local zone, but prevents that NSD as Secondary fetches newer zones.

Thanks Klaus, we're on it!
Thank you Anand for giving excellent and helpful feedback (in general actually)!

Regards,

-- Willem