An old NSD daemon, stuck with old data?

Stephane_Bortzmeyer · November 22, 2005, 10:14am

We experienced a funny problem with an NSD name server. Serial numbers
seem to oscillate between an old value and the current one:

% check_soa ma
ns2.nic.fr has serial number 2005112202
% check_soa ma
ns2.nic.fr has serial number 2005111902

nsd logs show nothing and the zone did not change between the two tests.

I noticed old daemons on the machine:

[bortzmeyer@ns2 ~]$ ps auxww|grep nsd
nsd 24699 0.0 21.9 459176 455580 ? S Nov21 0:43 /usr/local/nsd/sbin/nsd -a 192.93.0.4 -a 2001:660:3005:1::1:2 -n 15
nsd 24720 4.3 21.9 459752 455964 ? S Nov21 71:25 /usr/local/nsd/sbin/nsd -a 192.93.0.4 -a 2001:660:3005:1::1:2 -n 15
nsd 31290 0.0 0.0 0 0 ? Z Nov21 0:43 [nsd] <defunct>
nsd 18064 1.9 21.9 459844 456164 ? S 10:25 0:44 /usr/local/nsd/sbin/nsd -a 192.93.0.4 -a 2001:660:3005:1::1:2 -n 15
nsd 18074 5.2 21.9 460304 456408 ? S 10:25 1:55 /usr/local/nsd/sbin/nsd -a 192.93.0.4 -a 2001:660:3005:1::1:2 -n 15

Killing them all (they require a -KILL) apparently solved the
problem. Is it possible that the "old" daemons were still receiving
some of the UDP requests (I did not test with TCP, unfortunately) and
replied with old data?

NSD 2.3.0
CentOS release 4.2 (Final)
Linux 2.6.9-22.0.1.ELsmp
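
One way to quantify the oscillation described above is to query the
server many times and tally the serials that come back. The following
is only a minimal sketch: tally_serials is a hypothetical helper name,
and it assumes the input is the one-line SOA format printed by
"dig +short" (serial in the third field). It is not what check_soa
does internally.

```shell
#!/bin/sh
# Hypothetical helper: read SOA records in "dig +short" format on stdin
# (mname rname serial refresh retry expire minimum) and print each
# distinct serial followed by the number of times it was seen.
tally_serials() {
    awk 'NF >= 7 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage sketch (assumes dig is installed; zone and server as in the post):
#   i=0
#   while [ "$i" -lt 50 ]; do
#       dig +short +norecurse SOA ma @ns2.nic.fr
#       i=$((i + 1))
#   done | tally_serials
```

More than one line of output means the server is handing out more than
one version of the zone.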

Robert_Martin-Legene · November 22, 2005, 10:29am

If you had captured a list of which processes had which files open
before killing them, it would have shown whether it was technically
possible for an old process to still receive traffic on the port. I use
lsof for that myself. I know other tools exist, but I forget their
names. Try this next time (lsof compiles on most systems).

This is how it looks in Solaris 9:

robert@thunder (0)$ lsof -Pn|head -1; lsof -Pn|grep UDP.\*:53
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
named 398 root 20u IPv4 0x300038a99b0 0t0 UDP 127.0.0.1:53 (Idle)
named 398 root 24u IPv4 0x300039990d0 0t0 UDP 172.16.11.53:53 (Idle)
named 398 root 26u IPv4 0x30003827b68 0t0 UDP *:53 (Idle)

Whether nsd itself has a feature which enables the behaviour you
experienced, I don't know. I will leave it to others to elaborate on
that.
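
To turn the lsof listing above into a plain list of suspects, the
output can be filtered down to the PIDs that hold a UDP socket on port
53. This is a rough sketch: udp53_pids is a hypothetical helper name,
and it assumes the column layout shown above (PID in column 2, "UDP" in
the NODE column, address:port in the NAME column), which may differ
between lsof versions.

```shell
#!/bin/sh
# Hypothetical helper: read "lsof -nP" output on stdin and print the
# unique PIDs that have a UDP socket bound to port 53.
# Assumes PID is field 2, NODE is field 8 and NAME is field 9.
udp53_pids() {
    awk '$8 == "UDP" && $9 ~ /:53$/ { print $2 }' | sort -un
}

# Usage sketch (assumes lsof is installed):
#   lsof -nP -iUDP:53 | udp53_pids
```

If PIDs of both the old and the new daemons show up, the old ones can
still be handed incoming queries.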

Miek_Gieben1 · November 22, 2005, 10:54am

[On 22 Nov, @11:14, Stephane Bortzmeyer wrote in "An old NSD daemon, stuck with ..."]

> We experienced a funny problem with an NSD name server. Serial numbers
> seem to oscillate between an old value and the current one:
>
> % check_soa ma
> ns2.nic.fr has serial number 2005112202
> % check_soa ma
> ns2.nic.fr has serial number 2005111902
>
> nsd logs show nothing and the zone did not change between the two tests.

We will try to simulate this here, but more details would certainly
be helpful.

What I gathered from Jaap was that something went haywire with a
failed AXFR (tsig related) and that possibly the reload failed?

Stephane_Bortzmeyer · November 22, 2005, 11:15am

a message of 65 lines which said:

> What I gathered from Jaap was that something went haywire with a
> failed AXFR (tsig related) and that possibly the reload failed?

Apparently, there were two different and (probably) unrelated problems:

1) An NTP => TSIG problem, which is now fixed,

2) The oscillation problem, which is also fixed by a different method
(stop ; start). This is the one about which I ask (since the first has
probably nothing to do with NSD).

During the first problem, nsd still reloaded the non-TSIG zones (like
".ma") without trouble.

Miek_Gieben1 · November 22, 2005, 11:39am

[On 22 Nov, @12:15, Stephane Bortzmeyer wrote in "Re: An old NSD daemon, stuck w ..."]

> What I gathered from Jaap was that something went haywire with a
> failed AXFR (tsig related) and that possibly the reload failed?

> Apparently, there were two different and (probably) unrelated problems:
>
> 1) An NTP => TSIG problem, which is now fixed,

Okay, this greatly simplifies any test environment we need to set up.

> 2) The oscillation problem, which is also fixed by a different method
> (stop ; start). This is the one about which I ask (since the first has
> probably nothing to do with NSD).

Ah, okay. So basically we are looking at some race condition in the
SIGHUP handling in NSD...

By (stop ; start), do you mean: nsdc stop ; nsdc start ?

> During the first problem, nsd still reloaded the non-TSIG zones (like
> ".ma") without trouble.

That would be by virtue of the nsdc script.

Miek_Gieben1 · November 22, 2005, 1:17pm

[On 22 Nov, @12:39, Miek Gieben wrote in "Re: An old NSD daemon, stuck w ..."]

> > 1) An NTP => TSIG problem, which is now fixed,
>
> Okay, this greatly simplifies any test environment we need to set up.
>
> > 2) The oscillation problem, which is also fixed by a different method
> > (stop ; start). This is the one about which I ask (since the first has
> > probably nothing to do with NSD).
>
> Ah, okay. So basically we are looking at some race condition in the
> SIGHUP handling in NSD...
>
> By (stop ; start), do you mean: nsdc stop ; nsdc start ?

I can reproduce this by doing the following:

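1. Set up an NSD instance with a zone.
2. Run the following script:
        # NSD_PID must be set to the PID of the running NSD process.
        while true; do
                # Two SIGHUPs back to back, so the second can arrive mid-reload.
                kill -HUP "$NSD_PID" & kill -HUP "$NSD_PID"
                echo "KILL $NSD_PID"
                sleep 1
        done
3. Wait.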

So it appears to happen when NSD is in the NSD_RELOAD state and
then receives /another/ SIGHUP. (This is also on Linux, btw.)

grtz Miek

Miek_Gieben1 · November 24, 2005, 10:27am

[On 22 Nov, @11:14, Stephane Bortzmeyer wrote in "An old NSD daemon, stuck with ..."]

> problem. Is it possible that the "old" daemons were still receiving
> some of the UDP requests (I did not test with TCP, unfortunately) and

You are hit by a tricky race condition in the NSD code. The following
patch minimizes the race window. This patch will be in nsd 2.3.2.

For nsd 2.3.3 I will refactor the sig_handler.

Index: nsd.c