All,
We had NSD misconfigured: it was checking TSIG keys on the notify messages we received from our masters. This was a misconfiguration because the masters are BIND, which does not support securing notify messages with TSIG.
We changed the configuration to use NOKEY, and now we accept notifies. Yay!
But, NSD now leaks memory. 
For our large zones, we get updates once a minute or so. It is possible NSD leaks memory even without notifies, but since it is 15x slower (or whatever the refresh time is on the zone) we don't notice it so much. It is also possible that it has something to do with starting a notify when another is in progress. Or, something else. (I have no clue, just guessing.)
We see this on both FreeBSD (with 64-bit NSD) and Linux (with both 32-bit and 64-bit NSD). Eventually the processes run out of memory and die. Usually they first stop accepting new IXFR because they don't have enough memory, like this:
[1214352780] nsd[12973]: warning: signal received, reloading...
[1214352780] nsd[14463]: info: memory recyclebin holds 571968 bytes
[1214352780] nsd[14463]: error: malloc failed: Cannot allocate memory
[1214352781] nsd[12973]: error: handle_reload_cmd: reload closed cmd channel
[1214352781] nsd[12973]: warning: Reload process 14463 failed with status 256, continuing with old database
[1214352781] nsd[20195]: error: xfrd: zone org: soa serial 2008220593 update failed restarting transfer (notified zone)
So, my questions are:
1. Is this a known bug?
2. If it is not a previously known bug, has anybody else seen this?
If this is something new, I welcome advice on how to debug it. My current thinking is to simply try valgrind.
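Before reaching for valgrind, a rough way to confirm the growth is to log the process size at fixed intervals. The following is only a sketch under some assumptions: that `ps` accepts `-o rss= -o vsz= -p PID` and reports both sizes in KiB (which should hold on common Linux and FreeBSD systems), and the helper names `parse_ps_line`, `sample` and `monitor` are made up for illustration.

```python
import subprocess
import time


def parse_ps_line(line):
    """Turn 'RSS VSZ' in KiB (as printed by ps) into a (rss_mb, vsz_mb) tuple."""
    rss_kib, vsz_kib = (int(field) for field in line.split())
    return rss_kib / 1024, vsz_kib / 1024


def sample(pid):
    """Ask ps for the resident and virtual size of one process."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-o", "vsz=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_line(out)


def monitor(pid, interval_s=600):
    """Print one line per sample until the process goes away."""
    while True:
        try:
            rss_mb, vsz_mb = sample(pid)
        except (subprocess.CalledProcessError, ValueError):
            break  # ps failed or printed nothing: the process has exited
        print(f"{int(time.time())} rss={rss_mb:.0f}MB vsz={vsz_mb:.0f}MB", flush=True)
        time.sleep(interval_s)
```

Calling `monitor(pid)` with the PID of the NSD process of interest would then print one sample every 10 minutes until that process exits.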
Cheers,
Hi Shane,
> We see this on both FreeBSD (with 64-bit NSD) and Linux (with both 32-bit
> and 64-bit NSD).
You forgot to mention which version(s) you use.
Ondrej.
Sorry, this is with 3.0.7 (in production, answering external queries) and 3.1.0 (lab testing only, no queries at all now).
Hmm, you may be right, though it doesn't seem to be in the notification
code itself; just a low refresh rate seems to make memory usage grow by
four bytes if a zone is actually changed.
We're working on it.
Jelte
Ok, as it turns out, the big problem was that nsd-patch wasn't called
often enough. However, I actually did find a small memory leak in a part
of the newer code.
If nsd is compiled with NSEC3 support (the default in the latest release,
3.1.0), there is a small memory leak in the slave when a non-NSEC3 zone
is updated.
Attached is a patch for people running into this. Alternatively, you
could use --disable-nsec3 as a build-time configure option. The patch is
also in trunk revision 2744 for people who are using subversion, and
will be in the next patch release (3.1.1).
Jelte
(attachments)
nsd_nsec3_memory_leak.patch (544 Bytes)
Jelte,
This leak is something like 4K per update, so most users can ignore it. In our case we get an update every minute, so this is about 6M per day; we'll either use this patch or disable NSEC3 until we actually use it.
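For anyone wanting to plug in their own update rate, the arithmetic behind that estimate can be sketched as below. The 4 KiB per update and one update per minute are the rough figures from this thread, not measured constants, and the function name is just for illustration.

```python
LEAK_PER_UPDATE_KIB = 4      # "something like 4K per update" (rough figure)
UPDATES_PER_DAY = 24 * 60    # one zone update per minute


def leak_per_day_mib(leak_kib, updates_per_day):
    """Memory leaked per day in MiB, given the leak per update in KiB."""
    return leak_kib * updates_per_day / 1024


print(f"{leak_per_day_mib(LEAK_PER_UPDATE_KIB, UPDATES_PER_DAY):.1f} MiB/day")
```

With these inputs the result is a little under 6 MiB per day; a zone updated once an hour would leak about 1/60 of that.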
FYI, I went ahead and tracked our server's memory growth with 3.1.0 and with the Subversion version 2476 and found the following:
            3.1.0             svn-r2476
NS      RSS MB    VSZ MB    RSS MB    VSZ MB
----  --------  --------  --------  --------
ns0a        85       143        49       109
ns0a      1904      1911      1905      1911
ns0a      2038      2045      2053      2060
ns0a      2187      2194      2187      2194
ns0a      2321      2327      2336      2342
ns0a      2469      2476      2469      2476
ns0a      2603      2610      2618      2625
ns0a      2752      2758      2752      2758
ns0a      2886      2892      2900      2907
This is the size of the process, sampled every 10 minutes. The growth is basically identical (slightly more for the Subversion version, but that probably has to do with the actual changes to the zone that we are getting).
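To put a number on "basically identical", the average growth can be worked out from the RSS columns. This is a sketch that assumes the rows are consecutive 10-minute samples and leaves out the first row, which I am treating as a start-up reading; the function name is made up.

```python
# RSS in MB from the table above, without the first (start-up) row.
RSS_310 = [1904, 2038, 2187, 2321, 2469, 2603, 2752, 2886]
RSS_SVN = [1905, 2053, 2187, 2336, 2469, 2618, 2752, 2900]
SAMPLE_MINUTES = 10  # assumed spacing between rows


def mean_growth_per_hour(samples, minutes_between):
    """Average growth in MB per hour between the first and last sample."""
    per_sample = (samples[-1] - samples[0]) / (len(samples) - 1)
    return per_sample * 60 / minutes_between


for name, series in (("3.1.0", RSS_310), ("svn", RSS_SVN)):
    print(f"{name}: {mean_growth_per_hour(series, SAMPLE_MINUTES):.0f} MB/hour")
```

Under those assumptions both builds grow at roughly 840 to 850 MB per hour, a difference of about 1%.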
Apologies for the unnecessary worries! 