NSD CPU time consumption in slave config

hi there,

I am running an NSD test server with about 100,000 zones as master and
another as slave in my test environment. The master works pretty well, but
the slave for all these zones doesn't.
The startup takes about an hour, during which nsd answers all requests
with SERVFAIL instead of refusing the connection. During this time it is
fetching all serials from the master, according to some straces I made of
the master process. But the master process keeps consuming 100% CPU time
after that. Here is a short strace clip of what it is doing:

gettimeofday({1183623774, 898647}, NULL) = 0
pselect6(10, [9], [9], [], {0, 101353000}, {0, 8}) = 1 (out [9], left {0, 101353000})
write(9, "\10\0\0\0\0\21\1\v\3\n\6\0\5domain1\3com\0", 23) = 23
gettimeofday({1183623774, 923962}, NULL) = 0
pselect6(10, [9], [9], [], {0, 76038000}, {0, 8}) = 1 (out [9], left {0, 76038000})
write(9, "\10\0\0\0\0\27\1\21\3\20\f\0\vdomain34\3com\0", 29) = 29
gettimeofday({1183623774, 949287}, NULL) = 0
pselect6(10, [9], [9], [], {0, 50713000}, {0, 8}) = 1 (out [9], left {0, 50713000})
write(9, "\10\0\0\0\0\27\1\21\3\20\f\0\vdomainxy\3com\0", 29) = 29
gettimeofday({1183623774, 974594}, NULL) = 0
pselect6(10, [9], [9], [], {0, 25406000}, {0, 8}) = 1 (out [9], left {0, 25406000})
write(9, "\10\0\0\0\0\34\1\26\3\25\21\0\20domsdy\3co"..., 34) = 34
gettimeofday({1183623774, 999916}, NULL) = 0
pselect6(10, [9], [9], [], {0, 84000}, {0, 8}) = 1 (out [9], left {0, 84000})
write(9, "\10\0\0\0\0\33\1\25\3\24\20\0\17asdddd23489jfkl\3com"..., 33) = 33
gettimeofday({1183623775, 25220}, NULL) = 0
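
As a rough reading of the clip above: the gettimeofday() timestamps are about 25 ms apart, and each pselect6() returns with its timeout untouched, so that time does not appear to be spent waiting on the descriptor. The sketch below only does the arithmetic on the timestamps shown. It assumes (this is not confirmed anywhere in the thread) that each write() corresponds to one zone and that the pace stays constant over all zones.

```python
# (seconds, microseconds) pairs copied from the gettimeofday() calls in the clip
STAMPS = [
    (1183623774, 898647),
    (1183623774, 923962),
    (1183623774, 949287),
    (1183623774, 974594),
    (1183623774, 999916),
    (1183623775, 25220),
]
ZONES = 100_000  # "about 100,000 zones" from the report; one write per zone is an assumption


def to_us(sec, usec):
    """Convert a (sec, usec) timestamp to microseconds."""
    return sec * 1_000_000 + usec


times = [to_us(s, u) for s, u in STAMPS]
gaps = [b - a for a, b in zip(times, times[1:])]      # time between loop iterations
mean_gap_us = sum(gaps) / len(gaps)
full_pass_minutes = ZONES * mean_gap_us / 1_000_000 / 60

print(f"gaps (us): {gaps}")
print(f"mean gap: {mean_gap_us / 1000:.1f} ms")
print(f"one pass over {ZONES} zones: ~{full_pass_minutes:.0f} minutes")
```

Under those assumptions a single pass over all zones would take on the order of 40 minutes, which is in the same range as the startup time reported above.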

It looks like it is nonstop trying to refresh zones.
Any suggestions?

Jan

Hi Jan,

Thank you for sharing your experiences.

Jan Boysen - servage.net wrote:

I am running an NSD test server with about 100,000 zones as master and
another as slave in my test environment. The master works pretty well, but
the slave for all these zones doesn't.
The startup takes about an hour, during which nsd answers all requests
with SERVFAIL instead of refusing the connection.

The startup is slow for both the master and the slave?

During this time it is
fetching all serials from the master, according to some straces I made of
the master process. But the master process keeps consuming 100% CPU time
after that. Here is a short strace clip of what it is doing:

When the master is at 100% CPU, does it still answer queries? (And does
the slave answer any queries at all?)

Can you share some info about hardware setup and memory usage
statistics, etc?

Regards,

Mark

--
Mark Santcroos
NLnet Labs
http://www.nlnetlabs.nl/

Mark Santcroos wrote:

Hi Jan,

Thank you for sharing your experiences.

hi,

Jan Boysen - servage.net wrote:

I am running an NSD test server with about 100,000 zones as master and
another as slave in my test environment. The master works pretty well, but
the slave for all these zones doesn't.
The startup takes about an hour, during which nsd answers all requests
with SERVFAIL instead of refusing the connection.

The startup is slow for both the master and the slave?

No, the master starts up within a few seconds.
Only the server configured as slave is so slow.

During this time it is
fetching all serials from the master, according to some straces I made of
the master process. But the master process keeps consuming 100% CPU time
after that. Here is a short strace clip of what it is doing:

When the master is at 100% CPU, does it still answer queries? (And does
the slave answer any queries at all?)

I have taken a look.
For the first hour or so the slave answers with SERVFAIL, which is annoying
too, I think. Would it not be better if it refused the connection
instead?

After it seems to have finished the first run of serial syncing (I don't
get any new XFER notices in the logfile), it answers requests with the
correct information but is still at 100% CPU load.
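
One way to watch the two phases described above from the outside is to send the slave an SOA query and look at the response code. The following is a minimal sketch using only the Python standard library; it is not part of NSD, the function names are made up for illustration, and the server address in the usage note is a placeholder from the documentation range. The header layout, the SOA type value (6), and the RCODE values come from RFC 1035.

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 5: "REFUSED"}  # subset of the RFC 1035 codes


def encode_name(name):
    """Encode a domain name in DNS wire format (length-prefixed labels)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_soa_query(zone, query_id=0x1234):
    """Build a non-recursive SOA query for `zone` (hypothetical helper)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)  # one question
    question = encode_name(zone) + struct.pack("!HH", 6, 1)        # QTYPE=SOA, QCLASS=IN
    return header + question


def rcode_of(response):
    """Return the RCODE name from a raw DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F  # RCODE is the low four bits of the flags word
    return RCODES.get(code, str(code))


def probe(server, zone, timeout=2.0):
    """Send one SOA query over UDP and return the RCODE name, or 'TIMEOUT'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_soa_query(zone), (server, 53))
        try:
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_of(response)
```

Calling something like `probe("192.0.2.1", "domain1.com")` in a loop (address and zone are placeholders) would be expected to return SERVFAIL during the startup phase and NOERROR once the zone is being served.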

Can you share some info about hardware setup and memory usage
statistics, etc?

Sure...

Here is some info:

-Master-

CPU : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz (dual core)
MemTotal: 2046920 kB
OS: Linux / Fedora Core6 x86_64
Kernel: 2.6.20-1.2944.fc6 SMP
DNS: NSD 3.0.5 - compiled from official sources.

Mem: 2046920k total, 1760232k used, 28668k free, 166768k buffers
Swap: 1052248k total, 5836k used, 1046412k free, 916348k cached

-Slave-

CPU : Intel(R) Xeon(R) CPU 5110 @ 1.60GHz (dual core)
MemTotal: 1034120 kB
OS: Linux / Fedora Core7 x86_64
Kernel: 2.6.21-1.3228.fc7 SMP
DNS: NSD 3.0.5 - compiled from official sources.

Mem: 1034120k total, 1013440k used, 20680k free, 195392k buffers
Swap: 1052152k total, 68k used, 1052084k free, 305696k cached

When I looked at it (perhaps the developers will correct me if I'm
wrong), I concluded that it worked this way because of the way nsd
drops privilege after binding to the low-numbered port. It seemed to
me that because it wanted to drop privs as soon as possible, it had to
adopt some strategy of handling the requests, so SERVFAIL was the
answer. I agree it's sort of ugly, though.

A

Andrew Sullivan wrote:

For the first hour or so the slave answers with SERVFAIL, which is annoying
too, I think. Would it not be better if it refused the connection
instead?

When I looked at it (perhaps the developers will correct me if I'm
wrong), I concluded that it worked this way because of the way nsd
drops privilege after binding to the low-numbered port. It seemed to
me that because it wanted to drop privs as soon as possible, it had to
adopt some strategy of handling the requests, so SERVFAIL was the
answer. I agree it's sort of ugly, though.

Yes, we drop privileges as soon as possible, so ports can't be bound
later on. We might be able to do this later, but that would make the
problem only slightly less annoying.
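
The general bind-then-drop pattern referred to here can be sketched as follows. This is not NSD's actual code (NSD is written in C); it is a small Python illustration of why the socket has to exist before privileges are given up. The function name and its parameters are hypothetical.

```python
import os
import socket


def bind_then_drop(port, uid=None, gid=None, host="127.0.0.1"):
    """Bind a UDP socket first, then (optionally) give up root privileges.

    Ports below 1024 normally require elevated privileges on Unix, so the
    bind has to come before the drop; afterwards the process can keep using
    the socket but can no longer open a new low-numbered port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))      # needs privilege when port < 1024
    if uid is not None and gid is not None:
        os.setgroups([])         # clear supplementary groups first
        os.setgid(gid)           # group before user: setgid fails once uid is dropped
        os.setuid(uid)           # cannot be undone for this process
    return sock
```

Because the socket is already open while the zones are still being loaded, queries arriving in that window reach the server and have to be answered somehow, which matches the SERVFAIL behaviour discussed above.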

Although I'm not yet sure, I have the feeling there's a bug causing this
performance problem (I think it shouldn't be that slow to synchronize),
and I think we should be trying to find a solution for that rather than
ameliorate the behaviour when it's not yet finished.

Jelte