Fork-failed only on certain servers

Hello!

We use NSD as a slave on ~20 servers. Once in a while, when there is a huge
IXFR, the fork fails. Oddly, it fails on only 4 of these 20 identical
servers.

Those VMs are really identical: same dom0, same amount of RAM, CPUs,
disk space, kernel, sysctl settings, and NSD settings.

When I compare a failed server with a good server, RAM usage before the
IXFR was 10.5GB on both. Both have 25GB of RAM installed, so there
should be sufficient RAM available; the IXFR was ~2GB.

NSD logs look identical, except that the fork failed on one (see below).

Do you have any hints why the fork fails on some VMs?

thanks
Klaus

Good Server:
05:44:58 vie nsd[22157]: notify for xxx. from 1.2.3.4 serial 2019092314
05:44:58 vie nsd[22157]: notify for xxx. from 1234::5 serial 2019092314
05:49:45 vie nsd[669]: xfrd: zone xxx committed "received update to
serial 2019092314 at 2019-09-23T05:49:45 from 1.2.3.20 TSIG verified
with key mykey"
05:51:14 vie nsd[672]: rehash of zone xxx. with parameters 1 0 5
939fffb0948cbf34
05:51:27 vie nsd[672]: nsec3 xxx 1 %
05:51:33 vie nsd[672]: nsec3 xxx 17 %
05:51:39 vie nsd[672]: nsec3 xxx 25 %
05:51:45 vie nsd[672]: nsec3 xxx 31 %
05:51:49 vie nsd[22157]: notify for xxx. from 1.2.3.4 serial 2019092315
05:51:49 vie nsd[22157]: notify for xxx. from 1234::5 serial 2019092315
05:51:49 vie nsd[669]: xfrd: zone xxx committed "received update to
serial 2019092315 at 2019-09-23T05:51:49 from 1.2.3.4 TSIG verified with
key mykey"
05:51:49 vie nsd[22157]: notify for xxx. from 1.2.3.20 serial 2019092315
05:51:49 vie nsd[22157]: notify for xxx. from 2345::5 serial 2019092315
05:51:51 vie nsd[672]: nsec3 xxx 39 %
05:51:57 vie nsd[672]: nsec3 xxx 45 %
05:52:03 vie nsd[672]: nsec3 xxx 54 %
05:52:09 vie nsd[672]: nsec3 xxx 61 %
05:52:15 vie nsd[672]: nsec3 xxx 68 %
05:52:21 vie nsd[672]: nsec3 xxx 77 %
05:52:27 vie nsd[672]: nsec3 xxx 84 %
05:52:33 vie nsd[672]: nsec3 xxx 91 %
05:52:39 vie nsd[672]: nsec3 xxx 98 %
05:52:41 vie nsd[672]: zone xxx. received update to serial 2019092314 at
2019-09-23T05:49:45 from 1.2.3.20 TSIG verified with key mykey of
1815276647 bytes in 411.196 seconds
05:52:45 vie nsd[669]: xfrd: zone xxx committed "received update to
serial 2019092315 at 2019-09-23T05:52:45 from 2345::5 TSIG verified with
key mykey"
05:52:57 vie nsd[672]: zone xxx. received update to serial 2019092315 at
2019-09-23T05:51:49 from 1.2.3.4 TSIG verified with key mykey of 792413
bytes in 0.03947 seconds
05:53:05 vie nsd[669]: zone xxx serial 2019092314 is updated to 2019092315.

Failed Server:
05:44:59 nyc nsd[344]: notify for xxx. from 1234::5 serial 2019092314
05:44:59 nyc nsd[344]: notify for xxx. from 1.2.3.4 serial 2019092314
05:49:54 nyc nsd[10937]: xfrd: zone xxx committed "received update to
serial 2019092314 at 2019-09-23T05:49:54 from 2345::5 TSIG verified with
key mykey"
05:51:14 nyc nsd[10939]: rehash of zone xxx. with parameters 1 0 5
939fffb0948cbf34
05:51:25 nyc nsd[10939]: nsec3 xxx 1 %
05:51:31 nyc nsd[10939]: nsec3 xxx 17 %
05:51:37 nyc nsd[10939]: nsec3 xxx 25 %
05:51:43 nyc nsd[10939]: nsec3 xxx 31 %
05:51:49 nyc nsd[10939]: nsec3 xxx 38 %
05:51:49 nyc nsd[344]: notify for xxx. from 1.2.3.4 serial 2019092315
05:51:49 nyc nsd[344]: notify for xxx. from 1234::5 serial 2019092315
05:51:50 nyc nsd[344]: notify for xxx. from 2345::5 serial 2019092315
05:51:50 nyc nsd[344]: notify for xxx. from 1.2.3.20 serial 2019092315
05:51:50 nyc nsd[10937]: xfrd: zone xxx committed "received update to
serial 2019092315 at 2019-09-23T05:51:50 from 1.2.3.4 TSIG verified with
key mykey"
05:51:55 nyc nsd[10939]: nsec3 xxx 45 %
05:52:02 nyc nsd[10939]: nsec3 xxx 54 %
05:52:09 nyc nsd[10939]: nsec3 xxx 62 %
05:52:15 nyc nsd[10939]: nsec3 xxx 71 %
05:52:21 nyc nsd[10939]: nsec3 xxx 78 %
05:52:27 nyc nsd[10939]: nsec3 xxx 84 %
05:52:33 nyc nsd[10939]: nsec3 xxx 90 %
05:52:39 nyc nsd[10939]: nsec3 xxx 97 %
05:52:42 nyc nsd[10939]: zone xxx. received update to serial 2019092314
at 2019-09-23T05:49:54 from 2345::5 TSIG verified with key mykey of
1815276647 bytes in 418.798 seconds
05:52:43 nyc nsd[10939]: fork failed: Cannot allocate memory
05:52:45 nyc nsd[10937]: process 10939 exited with status 256
05:52:45 nyc nsd[4570]: handle_reload_cmd: reload closed cmd channel
05:52:45 nyc nsd[4570]: Reload process 10939 failed, continuing with old
database
05:52:46 nyc nsd[10937]: xfrd: zone xxx committed "received update to
serial 2019092315 at 2019-09-23T05:52:46 from 1.2.3.20 TSIG verified
with key mykey"
05:53:10 nyc nsd[10937]: xfrd: zone xxx: soa serial 2019092315 update
failed, restarting transfer (notified zone)
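
To see whether the failures keep hitting the same hosts, log lines in the
format quoted above could be tallied per host. This is only a rough sketch:
the function name count_fork_failures is made up for illustration, and the
pattern assumes the "time host nsd[pid]: message" layout shown in these
excerpts (a full syslog line with a date prefix would need a different
pattern).

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, e.g.
# "05:52:43 nyc nsd[10939]: fork failed: Cannot allocate memory"
FORK_FAILED = re.compile(r"^\S+\s+(?P<host>\S+)\s+nsd\[\d+\]:\s+fork failed:")


def count_fork_failures(lines):
    """Count 'fork failed' log lines per host name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FORK_FAILED.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Feeding it the combined logs of all servers gives one counter per host, so
it is easy to see whether the same 4 machines show up every time.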

Hi Klaus,

This is an interesting problem. The fact that it happens only with huge
IXFRs suggests it really is a memory allocation problem. However, the
reload itself seems to be successful. The instantiation of the new
children seems to be where NSD fails, but the new children make use of
copy-on-write memory to serve the new data and therefore shouldn't
require a lot of additional memory.

Is it the same 4 servers where forking fails? Also, does the next
reload work after the zone data has been transferred again?
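
One thing that might be worth comparing between a good and a failing host
is the kernel's commit accounting at the time of the reload. This is an
assumption on my side, nothing in the logs confirms it: on Linux, fork()
can return ENOMEM even with free RAM when the memory the kernel has
promised (Committed_AS in /proc/meminfo) gets close to CommitLimit, most
strictly with vm.overcommit_memory=2. Even with identical sysctl settings,
Committed_AS can differ per host. A minimal sketch, with made-up helper
names:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(meminfo):
    """Return CommitLimit - Committed_AS in kB, or None if a field is missing."""
    try:
        return meminfo["CommitLimit"] - meminfo["Committed_AS"]
    except KeyError:
        return None


def read_commit_state(proc_root="/proc"):
    """Read the overcommit mode and commit headroom from a Linux /proc tree."""
    with open(f"{proc_root}/sys/vm/overcommit_memory") as f:
        mode = int(f.read().strip())
    with open(f"{proc_root}/meminfo") as f:
        info = parse_meminfo(f.read())
    return mode, commit_headroom_kb(info)
```

Calling read_commit_state() on both kinds of host shortly before a large
IXFR is applied would show whether the failing ones have noticeably less
headroom. Note that CommitLimit is only enforced in mode 2; in the default
heuristic mode it is merely an indication.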

Best regards,
Jeroen Koekkoek