NSD as slave leaves a child process behind

Hi,

We've run into problems with NSD v3.2.3 (on Red Hat Enterprise Linux
5.4 x86_64) failing to kill one of its child processes while NSD
is reloading after it has received an update to a zone from the master.
Everything seems to be running fine, but 'nsdc stop', 'nsdc patch',
etc. don't really work, because the reload updates the NSD pid file but
the child that was left behind is the one actually handling all the
queries and zone updates.

Our NSD is a slave to some thousands of zones and because of the way we
automatically create NSD configuration (we configure NSD to use every
other NS of the zone as master even though typically only one of the other
NS servers of a zone actually allows an AXFR for us), our NSD typically
has n*10 zone transfers in SYN_SENT state destined to fail after connect
timeout, and for a lot of AXFR requests our NSD gets a REFUSED response.

I'd appreciate help hunting down the cause for this problem; we need to
get 'nsdc stop && nsdc start' working to be able to restart NSD reliably
after generating new nsd.conf automatically. (Of course I could write
a script which kills NSD some other way instead of 'nsdc stop' but I'd
prefer fixing NSD/nsdc/our environment.)

Hi Ville,

It may look like a child process is not killed after a reload, but isn't
the process you are referring to (20933), the xfrd process? By default,
NSD has three processes: a parent process, a child process (answering
queries) and an xfrd process. During an NSD reload, the xfrd process
should not be killed; only the communication channels with the parent
are updated.

Of course, if you run nsdc stop, no processes should be kept alive. Is
this the case?

What does nsdc stop output?

Notice that nsdc stop && nsdc start differs from nsdc restart. The
latter attempts to start nsd after all processes are shut down.

Best regards,

Matthijs Mekking
NLnet Labs

Ville Mattila wrote:

Hi,

It may look like a child process is not killed after a reload, but isn't
the process you are referring to (20933), the xfrd process? By default,
NSD has three processes: a parent process, a child process (answering
queries) and an xfrd process. During an NSD reload, the xfrd process
should not be killed; only the communication channels with the parent
are updated.

I see. The process 20933 sure looks like xfrd, even though its CMD
label in 'ps -ef' output is nsd. (BTW, is there an equivalent of BSD's
setproctitle() on Linux that NSD could perhaps use to indicate the role
of each process?)

Perhaps this is irrelevant, but one thing that stands out still is how
come the xfrd process becomes permanently the owner of all NSD listen
sockets (as shown by 'netstat -anp' in my original message) after reload
and not the main process?

Of course, if you run nsdc stop, no processes should be kept alive. Is
this the case?

Yes. 'nsdc stop' leaves xfrd process running and returns ok.

What does nsdc stop output?

Nothing really:

% sudo /usr/sbin/nsdc -c /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf stop
% echo $?
0

Notice that nsdc stop && nsdc start differs from nsdc restart. The
latter attempts to start nsd after all processes are shut down.

Yes, I've noticed. But using 'nsdc restart' makes no difference in this
case, because it only makes sure the main nsd process (main = the
one whose pid is stored in pidfile and used from there by nsdc) has died.
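
For illustration, the processes that such a stop leaves behind can be
listed by comparing the running nsd pids against the pid stored in the
pidfile. This is only a rough sketch: leftover_pids is a hypothetical
helper (not part of nsdc), and the pidfile path in the usage comment is
an assumption that depends on nsd.conf.

```sh
#!/bin/sh
# leftover_pids MAIN_PID PID...
# Print every PID from the list that is not MAIN_PID, one per line.
# Hypothetical helper for illustration; not part of nsdc.
leftover_pids() {
    main=$1
    shift
    for p in "$@"; do
        if [ "$p" != "$main" ]; then
            echo "$p"
        fi
    done
}

# Possible usage (the pidfile path is an assumption, adjust to nsd.conf):
#   leftover_pids "$(cat /var/run/nsd.pid)" $(pgrep -x nsd)
```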

Could you please consider either of the following patches for nsdc,
attached inline? They could be useful because I think many users want
to run 'nsdc patch' every time NSD is stopped, and that also seems to
cause some problems:
   FreeBSD Ports bug report 'dns/nsd: fix race when stopping nsd'
   http://www.freebsd.org/cgi/query-pr.cgi?pr=130294

Option 1: Make 'nsdc stop' use do_controlled_stop() instead of do_stop():

---8<------8<------8<------8<---
--- nsdc.sh.in.orig 2009-09-09 12:10:55.000000000 +0300
+++ nsdc.sh.in 2009-09-09 12:15:05.000000000 +0300
@@ -356,7 +356,7 @@
         fi
         ;;
  stop)
- do_stop
+ do_controlled_stop
         ;;
  stats)
         signal "USR1"
---8<------8<------8<------8<---

Option 2: Do not change 'nsdc stop' implementation, but let users run
'nsdc controlled-stop' if they wish to make sure nsd main process has
exited before nsdc returns:

---8<------8<------8<------8<---
--- nsdc.sh.in.orig 2009-09-15 14:45:10.000000000 +0300
+++ nsdc.sh.in 2009-09-15 14:47:51.000000000 +0300
@@ -73,6 +73,7 @@
         echo "commands:"
         echo " start Start nsd server."
         echo " stop Stop nsd server."
+ echo " controlled-stop Stop nsd server and try to make sure it really exits."
         echo " reload Nsd server reloads database file."
         echo " rebuild Compile database file from zone files."
         echo " restart Stop the nsd server and start it again."
@@ -358,6 +359,9 @@
  stop)
         do_stop
         ;;
+controlled-stop)
+ do_controlled_stop
+ ;;
  stats)
         signal "USR1"
         ;;
---8<------8<------8<------8<---

Regards,
Ville Mattila

Note that the Fedora/RHEL scripts run nsdc patch before running nsdc stop.
They also run rebuild on start (in case zones got removed from the config
but still live in a db file).

Paul

Hi,

Of course, if you run nsdc stop, no processes should be kept alive. Is
this the case?

Yes. 'nsdc stop' leaves xfrd process running and returns ok.

What does nsdc stop output?

Nothing really:

% sudo /usr/sbin/nsdc -c /v/net/ns-secondary.funet.fi/etc/nsd/nsd-isar.conf stop
% echo $?
0

We've now worked around this problem by modifying the nsd init script
to NOT run 'nsdc patch' before 'nsdc stop' in restarts. No problems
in xfrd shutdown for 3+ weeks and 50+ restarts.

So it seems running 'nsdc patch' (which can take several seconds in our
environment) could prevent nsd/xfrd process termination by 'nsdc stop'.
Can you confirm this? What can we do to help find a real resolution
to this?

Regards,

Hi Ville,

It could be that the shutdown signal was successfully sent to the xfrd
process, but that it is quite busy at the moment.

I have done some testing with 10,000 zones of 7 RRs each. It sometimes
takes a couple of seconds before the xfrd process really terminates. But
eventually, the xfrd process should shut down. This might happen after
the nsdc stop command has returned 0.
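
A caller that has to be sure the xfrd process is gone before starting
NSD again could poll for it after nsdc stop returns. The following is a
minimal sketch, assuming the pid of the xfrd process is already known
(for example from 'ps -ef'); wait_for_exit is a hypothetical helper and
not part of nsdc.

```sh
#!/bin/sh
# wait_for_exit PID TIMEOUT_SECONDS
# Poll once per second until PID no longer exists.
# Returns 0 once the process is gone, 1 if it is still there after
# TIMEOUT_SECONDS. Hypothetical helper for illustration; not part of nsdc.
# Note: 'kill -0' only tests whether a signal could be delivered, so run
# this as root or as the user that owns the process.
wait_for_exit() {
    pid=$1
    timeout=$2
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, 'wait_for_exit 20933 30' would wait up to 30 seconds for
process 20933 to disappear, and its exit status tells the caller whether
it is safe to start NSD again.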

Best regards,

Matthijs

Ville Mattila wrote: