Edge case on nsdc?

Hello,

We have this in our NSD logs occasionally:

[1214740996] nsd[93921]: warning: nsd is already running as 93888, continuing
[1214740996] nsd[93922]: error: can't bind the socket: Address already in use
[1214741027] nsd[94418]: error: can't bind the socket: Address already in use
[1214741057] nsd[94932]: error: can't bind the socket: Address already in use

I think this is because we have a monitoring script that makes sure NSD is running at all times and attempts to start it... even though NSD is already running.

In the nsdc.sh script we see the following:

signal() {
         if [ -s ${pidfile} ]
         then
                 kill -"$1" `cat ${pidfile}` && return 0
         else
                 echo "nsd is not running"
         fi
         return 1
}

But it seems like NSD restarts itself regularly, getting a new process ID when it does so. In this case, we have the possibility for the following to happen:

- nsdc.sh reads the contents of pidfile

- NSD restarts, getting a new PID

- nsdc.sh sends a signal to test NSD using the old PID, which fails, so nsdc claims NSD is not running

Is this possible?

It is possible to work around this with a little more sophistication, I think:

signal() {
  while true
  do
    # if there is no PID file, NSD is not running
    if [ ! -s "${pidfile}" ]
    then
      return 1
    fi

    # if we can send the signal to the PID, then NSD is running
    # (or some other process with that PID, but we hope not...)
    PID=`cat "${pidfile}"`
    if kill -"$1" "$PID"
    then
      return 0
    fi

    # double-check NSD did not restart between the time we read the PID
    # and the time we sent the signal; compare as strings so that an
    # empty read does not make the test itself fail
    CHECK_PID=`cat "${pidfile}"`
    if [ "$PID" = "$CHECK_PID" ]
    then
      echo "nsd is not running"
      return 1
    fi
  done
}

Matthijs,

Hi Shane,

[1214740996] nsd[93921]: warning: nsd is already running as 93888, continuing
[1214740996] nsd[93922]: error: can't bind the socket: Address already in use
[1214741027] nsd[94418]: error: can't bind the socket: Address already in use
[1214741057] nsd[94932]: error: can't bind the socket: Address already in use

This occurs when you call nsd manually (e.g. without nsdc, the NSD
control script). Because NSD is already running, the new process can't
bind the socket, and server initialization for that process fails.
Because server initialization fails, it tries to remove the pidfile.
Hence, later you will only see the socket bind error, and no longer the
'already running' warning (and therefore "nsdc running" will tell you
that NSD is not running).

I changed nsd.c so that the pidfile is written only after server
initialization succeeds.

Cool.

I think this is because we have a monitoring script that makes sure NSD
is running at all times and attempts to start it... even though NSD is
already running.

What script do you use for monitoring NSD? nsdc can also be used for
this: run "nsdc running" to check whether nsd is running, and if it
returns 1 (not running), run "nsdc start".

We use nsdc for this. The script basically does:

while true; do
     if ! nsdc running; then
         nsdc start
     fi
     sleep 15
done
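As an aside, the check-then-start step in that loop could be made a little more tolerant of a single false "not running" answer. This is only a rough sketch, not something nsdc provides: start_if_down and its recheck_delay argument are names made up here, and the check and start commands are passed in as strings so the logic can be tried without NSD installed.

```shell
# Sketch: run the start command only if the check fails twice in a row.
# start_if_down and recheck_delay are hypothetical names, not part of nsdc.
start_if_down() {
    check_cmd=$1
    start_cmd=$2
    recheck_delay=${3:-1}   # assumed pause (seconds) before re-checking

    # First check passed: nothing to do.
    # ($check_cmd is left unquoted on purpose so "nsdc running" splits
    # into a command and its argument.)
    $check_cmd && return 0

    # Give a reload in progress a moment to finish, then check again.
    sleep "$recheck_delay"
    $check_cmd && return 0

    # Both checks failed: run the start command.
    $start_cmd
}
```

In the monitoring loop this would be called as start_if_down "nsdc running" "nsdc start" 2. It narrows the window but does not close it, so it is a complement to fixing signal(), not a substitute.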

In the nsdc.sh script we see the following:

signal() {
       if [ -s ${pidfile} ]
       then
               kill -"$1" `cat ${pidfile}` && return 0
       else
               echo "nsd is not running"
       fi
       return 1
}

But it seems like NSD restarts itself regularly, getting a new process
ID when it does so. In this case, we have the possibility for the
following to happen:

- nsdc.sh reads the contents of pidfile

- NSD restarts, getting a new PID

- nsdc.sh sends a signal to test NSD using the old PID, which fails, so
nsdc claims NSD is not running

Is this possible?

As far as I know, when NSD restarts (because it received a dedicated
signal), it takes care of updating the pidfile.

When you use "nsdc patch", you get an implicit "nsdc reload". We run this from a cron job.

nsdc reload issues a SIGHUP to NSD.

This eventually ends up in the server_main() function in server.c, which calls fork(), and therefore gets a new pid, which it then writes into the pidfile.

So, the scenario is:

Time 1: NSD, running as PID A, writes into pidfile
Time 2: nsdc reads PID A from pidfile
Time 3: NSD gets a SIGHUP, forks a new process with PID B, and exits the old process
Time 4: nsdc sends a signal to PID A, which no longer exists
Time 5: nsdc returns "server not running" even though the server is running.
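The sequence above can be reproduced without NSD at all. The following is only an illustration under assumptions: simulate_reload_race is a made-up helper, "sleep" stands in for the server, and the "reload" is faked by starting a second process and killing the first.

```shell
# Sketch: replay Time 1..5 with `sleep` standing in for NSD.
# simulate_reload_race and all variable names are hypothetical.
simulate_reload_race() {
    demo_pidfile=$(mktemp) || return 2

    # Time 1: "server" A starts and records its PID.
    sleep 30 >/dev/null 2>&1 &
    pid_a=$!
    echo "$pid_a" > "$demo_pidfile"

    # Time 2: the control script reads PID A.
    seen=$(cat "$demo_pidfile")

    # Time 3: "reload": process B takes over the pidfile, A exits.
    sleep 30 >/dev/null 2>&1 &
    pid_b=$!
    echo "$pid_b" > "$demo_pidfile"
    kill "$pid_a"
    wait "$pid_a" 2>/dev/null || true   # reap A so its PID is really gone

    # Time 4: the signal goes to the stale PID and fails.
    if kill -0 "$seen" 2>/dev/null; then old=alive; else old=dead; fi

    # Time 5: the PID now in the pidfile is nevertheless alive.
    if kill -0 "$(cat "$demo_pidfile")" 2>/dev/null; then new=alive; else new=dead; fi

    # Clean up the stand-in server and the temporary pidfile.
    kill "$pid_b"
    wait "$pid_b" 2>/dev/null || true
    rm -f "$demo_pidfile"

    echo "stale=$old current=$new"
}
```

Calling simulate_reload_race should print "stale=dead current=alive", which is exactly the situation where a single read of the pidfile reports the server as down while it is up.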

It is possible to work around this with a little more sophistication, I
think:

signal() {
   while true
   do
       # if there is no PID file, NSD is not running
       if [ ! -s "${pidfile}" ]
       then
           return 1
       fi

       # if we can send the signal to the PID, then NSD is running
       # (or some other process with that PID, but we hope not...)
       PID=`cat "${pidfile}"`
       if kill -"$1" "$PID"
       then
           return 0
       fi

       # double-check NSD did not restart between the time we read the PID
       # and the time we sent the signal; compare as strings so that an
       # empty read does not make the test itself fail
       CHECK_PID=`cat "${pidfile}"`
       if [ "$PID" = "$CHECK_PID" ]
       then
           echo "nsd is not running"
           return 1
       fi
   done
}

Could you try the trunk release? I think it already fixes this issue.
Make sure your control script first checks whether nsd is running (nsdc
running) and, if not, starts it (nsdc start).

The fix you made makes sense, and should be included.

But I am reasonably sure there is nothing that the server can do to fix this problem (mind you I am a bit sleep-deprived right now, so no promises). ;)

I think the script needs to work like I coded it here, where it checks that the PID of the server did not change while it was checking.
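For what it is worth, here is a variant of that re-check that cannot spin forever and takes the pidfile as an argument so it can be tested in isolation. It is a sketch, not the nsdc.sh code: signal_with_recheck and the max_tries cap of 5 are assumptions made for this example.

```shell
# Sketch: send signal $1 to the PID stored in pidfile $2, and re-read the
# pidfile if delivery fails. signal_with_recheck and max_tries are
# hypothetical; nsdc.sh itself uses a global ${pidfile} instead.
signal_with_recheck() {
    sig=$1
    file=$2
    max_tries=5     # assumed safety cap on the number of retries
    tries=0

    while [ "$tries" -lt "$max_tries" ]
    do
        tries=$((tries + 1))

        # No pidfile, or an empty one: treat the server as not running.
        [ -s "$file" ] || return 1

        # Signal delivered: a process with that PID exists.
        pid=$(cat "$file" 2>/dev/null)
        if kill -"$sig" "$pid" 2>/dev/null
        then
            return 0
        fi

        # Same PID still recorded: the failure was not caused by a restart.
        check=$(cat "$file" 2>/dev/null)
        if [ "$pid" = "$check" ]
        then
            return 1
        fi
        # Otherwise the pidfile changed under us, so loop and try the new PID.
    done
    return 1
}
```

For example, signal_with_recheck 0 /path/to/nsd.pid (the path is a placeholder) only tests for existence, since signal 0 performs the permission and existence checks without delivering anything.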