no response from nsd 2.3.4 with more than 12 server processes

Hi,

I just thought I would try and load test nsd 2.3.4 on a Sun T2000 running
Solaris 10 01/06 but I am having a few problems.

If I set the number of servers to more than 12, I get no response
from nsd when I query it. With 12 or fewer servers it works fine. I
am setting the number of servers by editing nsdc and adding -N 12 to
the flags line. The server I am using has 32 virtual processors, so in the
end I want to use -N 32. For example, after starting with -N 13, dig gives
the following:

dig @localhost jadjadjad.example.com soa
; <<>> DiG 9.2.4 <<>> @localhost jadjadjad.example.com soa
;; global options: printcmd
;; connection timed out; no servers could be reached

with -N 12 I get a response.

Has anyone else tried running nsd with so many servers?

BTW - I also noticed that if I start nsd using nsdc start and then try and
start it again nsdc correctly reports that nsd is already running. However
for some reason the pid file gets removed. For example
bash-3.00# /opt/nsd/sbin/nsdc start
bash-3.00# ls -l /opt/nsd/etc/nsd/nsd.pid
-rw-r--r-- 1 nsd other 6 May 31 15:27 /opt/nsd/etc/nsd/nsd.pid
bash-3.00# /opt/nsd/sbin/nsdc start
[1149085664] nsd[26337]: warning: nsd is already running as 26320, continuing
bash-3.00# ls -l /opt/nsd/etc/nsd/nsd.pid
/opt/nsd/etc/nsd/nsd.pid: No such file or directory
bash-3.00# /opt/nsd/sbin/nsdc stop
nsd is not running
bash-3.00# ps -ef | grep nsd
     nsd 26324 26320 0 15:27:35 ? 0:00 /opt/nsd/sbin/nsd -f /opt/nsd/etc/nsd/nsd.db -P /opt/nsd/etc/nsd/nsd.pid
    root 26343 26314 0 15:27:59 pts/2 0:00 grep nsd
     nsd 26326 26320 0 15:27:35 ? 0:00 /opt/nsd/sbin/nsd -f /opt/nsd/etc/nsd/nsd.db -P /opt/nsd/etc/nsd/nsd.pid
...

Thanks
John

Hi Jad,

I've tried to reproduce this, and on an AMD Linux 2.6.16 system I get
replies using 9 servers, but not with 10 servers (with the
freshly released nsd 2.3.5, by the way).

The code does nothing special with the number of servers it forks. Each
server select()s on the port. With 10 servers, none of the servers come
out of select(). With 9 or fewer, one comes out of select() and handles
the udp message and goes back into select().
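
The pattern described above (one UDP socket, several forked servers all waiting in select() on it) can be sketched roughly as follows. This is a minimal illustration and not NSD's code: the names serve, prefork_echo_probe and MAX_SERVERS are hypothetical, the socket is made non-blocking as an assumption so that children which lose the race for a datagram go back to select(), and the "server" simply echoes the datagram instead of answering DNS.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_SERVERS 64 /* hypothetical cap for this sketch */

/* Child loop: block in select() on the shared socket, then echo one datagram. */
static void serve(int fd)
{
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR)
                continue;
            _exit(1);
        }
        char buf[512];
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n < 0)
            continue; /* another child read the datagram first; wait again */
        sendto(fd, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen);
    }
}

/* Fork nservers children on one loopback UDP socket and send a single probe.
 * Returns 1 if some child answered, 0 if not, -1 on setup failure. */
int prefork_echo_probe(int nservers)
{
    pid_t kids[MAX_SERVERS];
    if (nservers < 1 || nservers > MAX_SERVERS)
        return -1;

    int srv = socket(AF_INET, SOCK_DGRAM, 0);
    if (srv < 0)
        return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    socklen_t alen = sizeof(addr);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &alen) < 0 ||
        fcntl(srv, F_SETFL, O_NONBLOCK) < 0) {
        close(srv);
        return -1;
    }

    int started = 0;
    for (int i = 0; i < nservers; i++) {
        pid_t pid = fork();
        if (pid == 0)
            serve(srv); /* child: never returns */
        if (pid > 0)
            kids[started++] = pid;
    }

    int ok = 0;
    int cli = socket(AF_INET, SOCK_DGRAM, 0);
    if (cli >= 0) {
        struct timeval tv = { 2, 0 }; /* give up after two seconds */
        char reply[16];
        setsockopt(cli, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
        if (sendto(cli, "ping", 4, 0, (struct sockaddr *)&addr,
                   sizeof(addr)) == 4 &&
            recv(cli, reply, sizeof(reply), 0) == 4 &&
            memcmp(reply, "ping", 4) == 0)
            ok = 1;
        close(cli);
    }
    for (int i = 0; i < started; i++) {
        kill(kids[i], SIGTERM);
        waitpid(kids[i], NULL, 0);
    }
    close(srv);
    return ok;
}
```

Calling prefork_echo_probe(n) for increasing n is one way to check whether the number of processes waiting in select() matters by itself on a given kernel, independent of anything else the real server does.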

There is no immediate solution; I cannot unblock select(). Both
Solaris and Linux show this behaviour.

So, I can reproduce your problem, and I will be looking into it. Entered
as http://www.nlnetlabs.nl/bugs/show_bug.cgi?id=134
Thank you for the report.

As for your problem with killing them off: when you start it twice,
i.e. you start one while the old one is still running, the new NSD detects
the old NSD but continues its attempt to start anyway. It overwrites
the pidfile with its own pid, but then fails because it cannot bind to
the port (it is in use by the old NSD and you start on the same port).
It then exits and unlinks the pidfile, which removes it.
I cannot easily fix this, as there is only one pid file and two NSDs
running.
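
For illustration only, one common way to avoid clobbering a pidfile is to check whether the recorded pid still belongs to a live process before writing, and to unlink the file only if it still names the current process. The sketch below is not a patch for NSD and not how NSD handles its pidfile; the function names are hypothetical, and the check-then-write step is not atomic, so a real implementation would also need locking.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the pid recorded in path if that process still exists, else 0. */
pid_t pidfile_live_owner(const char *path)
{
    FILE *fp = fopen(path, "r");
    long pid = 0;
    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = 0;
    fclose(fp);
    if (pid <= 0)
        return 0;
    /* Signal 0 delivers nothing; it only checks that the process exists.
     * EPERM means it exists but belongs to another user. */
    if (kill((pid_t)pid, 0) == 0 || errno == EPERM)
        return (pid_t)pid;
    return 0;
}

/* Write our pid unless a live process already owns the file.
 * Returns 0 on success, -1 if the file is owned or cannot be written. */
int pidfile_claim(const char *path)
{
    if (pidfile_live_owner(path) != 0)
        return -1;
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%ld\n", (long)getpid());
    return fclose(fp) == 0 ? 0 : -1;
}

/* Unlink the pidfile only if it still names this process. */
void pidfile_release(const char *path)
{
    if (pidfile_live_owner(path) == getpid())
        unlink(path);
}
```

With a guard like this, a second instance that fails to bind would leave the first instance's pidfile alone, because pidfile_claim refuses to overwrite it and pidfile_release finds a different pid recorded.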

Thank you for the report,
  Wouter

[On 02 Jun, @13:10, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]

Hi Jad,

I've tried to reproduce this, and on an AMD Linux 2.6.16 system I get
replies using 9 servers, but not with 10 servers (with the
freshly released nsd 2.3.5, by the way).

just curious, does nsd 3-dev exhibit the same problem?

Miek Gieben wrote:

just curious, does nsd 3-dev exhibit the same problem?

Yes it has the exact same problem, 9 servers give answer, 10 don't. I
can see all 10 servers respond to NSD3 IPC pipes with zone state
information, but none unblock when I send a query to the udp port.

Setting the fds nonblocking does not help either.

ciao,
   Wouter

[On 02 Jun, @13:51, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]

Yes it has the exact same problem, 9 servers give answer, 10 don't. I
can see all 10 servers respond to NSD3 IPC pipes with zone state
information, but none unblock when I send a query to the udp port.

Setting the fds nonblocking does not help either.

hmm, *nasty* :(
Is this related to the thundering herd problem [*]? If so, you
have some (Linux-only, I thought) options to let the kernel
know you only need one process to wake up.

[*]
http://www.catb.org/jargon/html/T/thundering-herd-problem.html

Are any other processes listening on the same port, according to netstat?

The kernel will wake one of the waiting processes. If that one process isn't one of the waiting nsd processes, the packet may never be read.

Arnt

Arnt Gulbrandsen wrote:

Are any other processes listening on the same port, according to netstat?

The kernel will wake one of the waiting processes. If that one process
isn't one of the waiting nsd processes, the packet may never be read.

No. Apart from testing using some random port > 10000, netstat shows no
others for it. I think it is the number of processes with a select() on
the port that causes it. It reliably succeeds with num=9, it reliably
fails with num=10.

ciao,
   Wouter

[On 02 Jun, @14:18, Wouter Wijngaards wrote in "Re: no response from nsd 2.3.4 ..."]

others for it. I think it is the number of processes with a select() on
the port that causes it. It reliably succeeds with num=9, it reliably
fails with num=10.

doing some testing on my home machine (AMD X2 64, 2 GB memory):

% ../../nsd -u miekg -N 50 -p 5353 -f soa.db
% ps x |grep nsd |wc
     52 674 3641

answers just fine:

;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 31403

It indeed stops answering when I use 150 instances,

Hi,

The bug has been found, thanks to Miek heroically running -N 300 on his
machine: an array index out of bounds. The fix is to change i to 0 on line 608:

#ifdef INET6
               if (hints[i].ai_family == AF_UNSPEC) {
# ifdef IPV6_V6ONLY
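
To illustrate why an index like this can go wrong only at higher server counts, here is a small sketch. It rests on an assumption the thread does not confirm: that i is a counter left over from an earlier loop over the number of servers, and that hints is a small fixed-size array. N_HINTS and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N_HINTS 8 /* hypothetical size of the hints array */

/* Mimics a loop that sets up nservers children and returns the
 * value the loop counter holds once the loop has finished. */
size_t counter_after_server_loop(size_t nservers)
{
    size_t i;
    for (i = 0; i < nservers; ++i) {
        /* per-server setup would go here */
    }
    return i; /* equals nservers, not a valid "first element" index */
}

/* Returns 1 if idx is a valid index into an N_HINTS-element array. */
int hint_index_ok(size_t idx)
{
    return idx < N_HINTS;
}
```

Under these assumptions a small -N leaves the stale counter inside the array, while a larger -N pushes it past the end. Reading past the end of an array is undefined behaviour in C, which would be consistent with the failing server count differing between machines. Using a constant index of 0, as in the fix above, removes the dependence on the server count.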

Hi,

That fixed it. Thanks for all your help. When I have done the testing I
will post and let you know how it performs.

Thanks
John