Make nsdc more reliable for restart

Hello,

while running nsd as a secondary nameserver with 1000+
domains we discovered that the default nsdc(8) was
not able to reliably restart nsd.
The reason, I think, is that by using the PID file it sends
its signal to only 1 of the default 3 processes.
Afterwards it only checks against this 1 process, while
the other 2 may still be running, causing trouble on
startup.

The patch below fixes it for us (it was tested in a lab
environment with 10,000 domains).

Alf

--- usr.sbin/nsd/nsdc.sh.in.orig Fri Aug 10 09:37:33 2012
+++ usr.sbin/nsd/nsdc.sh.in Fri Aug 10 09:34:56 2012
@@ -188,18 +188,18 @@
   try=1

   while [ $try -ne 0 ]; do
-    if [ ${try} -gt 50 ]; then
+    if [ ${try} -gt 60 ]; then
       echo "nsdc stop failed"
       return 1
     else
       if [ $try -eq 1 ]; then
         kill -TERM ${pid}
       else
-        kill -TERM ${pid} >/dev/null 2>&1
+        pkill -TERM nsd >/dev/null 2>&1
       fi

       # really stopped?
-      kill -0 ${pid} >/dev/null 2>&1
+      pkill -0 nsd >/dev/null 2>&1
       if [ $? -eq 0 ]; then
         controlled_sleep ${try}
         try=`expr ${try} + 1`

Hi Alf,

The "pkill" command is not available on all systems. Linux distros ship
with it these days, and Mac OS X introduced it with Mountain Lion (10.8),
but it may not be available on other systems. Therefore your patch is
not portable.

Regards,

Anand
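
For reference, a rough sketch of how all nsd processes could be found
without pkill, using only ps and awk. This is an illustration, not part
of the patch: the function names match_pids and all_pids are made up
here, and it assumes a ps that accepts the POSIX -A and -o options.

```sh
#!/bin/sh
# match_pids NAME: read "PID COMMAND" lines on stdin and print the PID of
# every line whose command basename is exactly NAME (hypothetical helper).
# Command names containing spaces are not handled.
match_pids() {
    awk -v name="$1" '
        {
            cmd = $2
            sub(/.*\//, "", cmd)        # "/usr/sbin/nsd" -> "nsd"
            if (cmd == name) print $1
        }'
}

# all_pids NAME: list the PIDs of all running processes called NAME.
# Assumes ps supports -A and -o with empty headers, as POSIX describes.
all_pids() {
    ps -A -o pid= -o comm= | match_pids "$1"
}
```

The output of all_pids nsd could then be passed to kill where the patch
calls pkill, after first checking that the list is not empty.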

Aha! I have run into this as well, especially in combination
with opendnssec. I had filed a bug report, but there were issues
reproducing it. I'm glad I'm not crazy!

Paul

Hi Alf,

I wondered whether there's a particular reason that only the
master is signalled, or is this purely due to lack of a portable
pkill-type program?

Some OSes have a "killall" that does the same as pkill, but other
OSes have a different "killall" that behaves slightly differently ;)

The patch did not address my issue actually.

[root@nohats ~]# pidof nsd
4697 4696 4677
[root@nohats ~]# ls /var/run/nsd
[root@nohats ~]# nsdc stop
nsd is not running

Somehow nsd gets signalled and deletes its pidfile, but doesn't write a new
one. There are two ways my nsd gets signalled. One is via
an hourly cron job that runs (if necessary) an nsdc patch and an nsdc reload.
When doing this manually, it works fine: the reload signals nsd and a
new pidfile is created:

[root@nohats ~]# pidof nsd
1301 1300 1298
[root@nohats ~]# cat /var/run/nsd/nsd.pid
1298
[root@nohats ~]# kill -HUP 1298
[root@nohats ~]# cat /var/run/nsd/nsd.pid
1304
[root@nohats ~]# pidof nsd
1305 1304 1300
[root@nohats ~]# kill -HUP 1304
[root@nohats ~]# pidof nsd
1313 1312 1300
[root@nohats ~]# cat /var/run/nsd/nsd.pid
1312

The second method is by opendnssec, configured to use:

/etc/opendnssec/conf.xml: <NotifyCommand>sudo /sbin/service nsd restart</NotifyCommand>

[root@nohats ~]# su - ods
-bash-4.1$ cat /var/run/nsd/nsd.pid
1312
-bash-4.1$ sudo /sbin/service nsd restart
Stopping nsd: [ OK ]
Starting nsd: [ OK ]
-bash-4.1$ cat /var/run/nsd/nsd.pid
1494
-bash-4.1$ pidof nsd
1497 1496 1494

So it all looks fine, but after a while something happens and the
pidfile is either wrong or gone, and then all of these fail. But
even with the pkill patch applied to /usr/sbin/nsdc, this still
happens.

Paul
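
To spot this state before nsdc gives up with "nsd is not running", a
check along the following lines might help. It is only a sketch:
pidfile_state is a hypothetical helper, not something nsdc provides, and
it assumes the caller is allowed to signal the nsd process (kill -0 also
fails on a permission error, which would be reported as "stale" here).

```sh
#!/bin/sh
# pidfile_state FILE: classify a pidfile (hypothetical helper). Prints:
#   missing - FILE does not exist or is empty
#   stale   - FILE holds something that is not the PID of a live process
#   ok      - FILE holds a PID that answers to signal 0
pidfile_state() {
    [ -s "$1" ] || { echo missing; return; }

    # Take the first line and drop any surrounding whitespace.
    pid=$(head -n 1 "$1" | tr -d '[:space:]')

    case "$pid" in
        ''|*[!0-9]*) echo stale; return ;;   # not a plain number
    esac

    # Signal 0 only tests whether the process can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo ok
    else
        echo stale
    fi
}
```

In the first session above, pidfile_state /var/run/nsd/nsd.pid would
print "missing" while pidof nsd still lists three processes, which is
the case where signalling by process name is the only option left.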

The patch is far from ideal (what would happen if you have more than one
nsd running?). However, we have used this in production for roughly a year
and it has survived 300+ restarts.
Since the secondary domains only exist in memory, I don't see any harm
in killing all instances of nsd at once; pkilling them without even
looking at the pidfile should be fine too.

We run OpenBSD everywhere, so I sent this to sthen@openbsd, who then suggested
posting it here to get more feedback. That's why it's not portable :P

We run the default nsdc on our primary servers where the problem
does not exist.