I just thought I would try and load test nsd 2.3.4 on a Sun T2000 running
Solaris 10 01/06 but I am having a few problems.
If I specify the number of servers to be more than 12 I get no response
from nsd when I try to query it. With 12 or fewer servers it works fine. I
am specifying the number of servers by editing nsdc and adding -N 12 to
the flags line. The server I am using has 32 virtual processors so in the
end I want to use -N 32. For example, after starting with -N 13, dig gives
the following:
dig @localhost jadjadjad.example.com soa
; <<>> DiG 9.2.4 <<>> @localhost jadjadjad.example.com soa
;; global options: printcmd
;; connection timed out; no servers could be reached
with -N 12 I get a response.
Has anyone else tried running nsd with so many servers?
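For repeated runs with different -N values, the dig check above can be scripted. What follows is only a minimal sketch, not anything shipped with nsd: the function names build_soa_query and server_answers are made up for illustration, and it treats any timeout or socket error as "no response".

```python
import socket
import struct


def build_soa_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the SOA record of name."""
    # Header: id, flags (only RD set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 6 is SOA, QCLASS 1 is IN.
    return header + qname + struct.pack("!HH", 6, 1)


def server_answers(host, port, name, timeout=2.0):
    """Return True if any reply with a matching query id arrives in time."""
    query = build_soa_query(name)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(query, (host, port))
        reply, _ = sock.recvfrom(4096)
        return reply[:2] == query[:2]
    except OSError:
        # Covers timeouts as well as ICMP "port unreachable" errors.
        return False
    finally:
        sock.close()
```

Calling server_answers("127.0.0.1", 53, "jadjadjad.example.com") after each restart gives a yes/no answer per -N setting; it does not validate the content of the reply.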
BTW - I also noticed that if I start nsd using nsdc start and then try to
start it again, nsdc correctly reports that nsd is already running. However,
for some reason the pid file gets removed. For example:
bash-3.00# /opt/nsd/sbin/nsdc start
bash-3.00# ls -l /opt/nsd/etc/nsd/nsd.pid
-rw-r--r-- 1 nsd other 6 May 31 15:27 /opt/nsd/etc/nsd/nsd.pid
bash-3.00# /opt/nsd/sbin/nsdc start
[1149085664] nsd[26337]: warning: nsd is already running as 26320, continuing
bash-3.00# ls -l /opt/nsd/etc/nsd/nsd.pid
/opt/nsd/etc/nsd/nsd.pid: No such file or directory
bash-3.00# /opt/nsd/sbin/nsdc stop
nsd is not running
bash-3.00# ps -ef | grep nsd
nsd 26324 26320 0 15:27:35 ? 0:00 /opt/nsd/sbin/nsd -f /opt/nsd/etc/nsd/nsd.db -P /opt/nsd/etc/nsd/nsd.pid
root 26343 26314 0 15:27:59 pts/2 0:00 grep nsd
nsd 26326 26320 0 15:27:35 ? 0:00 /opt/nsd/sbin/nsd -f /opt/nsd/etc/nsd/nsd.db -P /opt/nsd/etc/nsd/nsd.pid
...
I've tried to reproduce this, and on an AMD Linux 2.6.16 system I get
replies using 9 servers, but not with 10 servers. (with the
freshly-released nsd 2.3.5 by the way).
The code does nothing special with the number of servers it forks. Each
server select()s on the port. With 10 servers, none of the servers come
out of select(). With 9 or fewer, one comes out of select() and handles
the udp message and goes back into select().
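The model described above (fork N servers, each select()ing on the same UDP socket) can be sketched for experimenting outside of nsd. This is an illustrative stand-in, not nsd's code: the function name query_with_workers, the "ping" payload and the reply format are all assumptions, and the socket is set nonblocking here only so that a woken child that loses the race goes straight back into select().

```python
import os
import select
import signal
import socket


def query_with_workers(num_workers, timeout=2.0):
    """Fork num_workers children that all select() on one UDP socket.

    Sends one datagram and returns the pid of the child that answered,
    or None if no reply arrived within timeout seconds.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
    server.setblocking(False)
    addr = server.getsockname()

    children = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child: wait in select(), answer one datagram, wait again.
            try:
                while True:
                    select.select([server], [], [])
                    try:
                        _, peer = server.recvfrom(512)
                    except BlockingIOError:
                        continue  # another child already read the datagram
                    server.sendto(str(os.getpid()).encode(), peer)
            finally:
                os._exit(0)
        children.append(pid)

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(timeout)
    try:
        client.sendto(b"ping", addr)
        reply, _ = client.recvfrom(512)
        answered_by = int(reply.decode())
    except socket.timeout:
        answered_by = None
    finally:
        client.close()
        server.close()
        for pid in children:
            os.kill(pid, signal.SIGTERM)
            os.waitpid(pid, 0)
    return answered_by
```

Running it for a range of worker counts, e.g. query_with_workers(9) and query_with_workers(10), shows whether a given kernel wakes any of the waiting processes. Whether it reproduces the failure seen here depends on the kernel in use.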
There is no immediate solution; I cannot unblock select(). Both
Solaris and Linux show this behaviour.
As for your problem with killing them off: when you start it twice,
i.e. you start a new one while the old one is still running, NSD detects
the old NSD but continues its attempt to start anyway. It overwrites
the pidfile with its own pid, but then fails because it cannot bind to
the port (it is in use by the old NSD, and you start on the same port).
It then exits and unlinks the pidfile, which is how the pidfile gets removed.
I cannot easily fix this, as there is only one pid file, and two NSDs
running.
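The sequence described above can be replayed in a small simulation. This is a sketch under stated assumptions, not NSD's startup code: the functions start and demo and the bind_first switch are hypothetical, the pids are the ones from the transcript, and the bind_first=True ordering is shown only for comparison. Whether such a reordering is practical inside NSD is not established in this thread.

```python
import os
import socket
import tempfile


def start(pidfile, addr, pid, bind_first):
    """Simulate one daemon start; return the bound socket or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if not bind_first:
        # Order described in the thread: the pidfile is overwritten
        # before the bind is attempted.
        with open(pidfile, "w") as f:
            f.write(f"{pid}\n")
    try:
        sock.bind(addr)
    except OSError:
        sock.close()
        if not bind_first:
            # Cleanup on exit removes the only pidfile, although the
            # first daemon is still running.
            os.unlink(pidfile)
        return None
    if bind_first:
        # Hypothetical alternative: write the pidfile only once the
        # port is known to be ours.
        with open(pidfile, "w") as f:
            f.write(f"{pid}\n")
    return sock


def demo(bind_first):
    """Start twice on one port; report (second_failed, pidfile_survived)."""
    with tempfile.TemporaryDirectory() as d:
        pidfile = os.path.join(d, "nsd.pid")
        first = start(pidfile, ("127.0.0.1", 0), 26320, bind_first)
        second = start(pidfile, first.getsockname(), 26337, bind_first)
        survived = os.path.exists(pidfile)
        first.close()
        return second is None, survived
```

With demo(False) the second start fails and the pidfile is gone afterwards, matching the transcript; with demo(True) the second start still fails but the first daemon's pidfile is left alone.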
[On 02 Jun, @13:10, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]
Hi Jad,
I've tried to reproduce this, and on an AMD Linux 2.6.16 system I get
replies using 9 servers, but not with 10 servers. (with the
freshly-released nsd 2.3.5 by the way).
just curious, does nsd 3-dev exhibit the same problem?
Yes it has the exact same problem, 9 servers give answer, 10 don't. I
can see all 10 servers respond to NSD3 IPC pipes with zone state
information, but none unblock when I send a query to the udp port.
[On 02 Jun, @13:51, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]
Yes it has the exact same problem, 9 servers give answer, 10 don't. I
can see all 10 servers respond to NSD3 IPC pipes with zone state
information, but none unblock when I send a query to the udp port.
Setting the fds nonblocking does not help either.
hmm, *nasty*
Is this related to the thundering herd problem? [*]. If so, you
have some (linux only I thought) options to let the kernel
know you only need one process to wake up.
Are any other processes listening on the same port, according to netstat?
The kernel will wake one of the waiting processes. If that one process
isn't one of the waiting nsd processes, the packet may never be read.
No. Apart from testing using some random port > 10000, netstat shows no
others for it. I think it is the number of processes with a select() on
the port that causes it. It reliably succeeds with num=9, it reliably
fails with num=10.
[On 02 Jun, @14:18, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]
others for it. I think it is the number of processes with a select() on
the port that causes it. It reliably succeeds with num=9, it reliably
fails with num=10.
doing some testing on my home machine (AMD X2 64, 2 GB memory):