I've been contacted to evaluate NSD performance and I've identified a
little strangeness in the TCP chatter with NSD. NSD always sends the
two-byte response size as a separate TCP packet (causing the requestor
to send a separate ACK) from the main body of the response.
You might expect a TCP DNS request (omitting the possible UDP request
that fails) to go something like:
SYN -->
<-- SYN,ACK
ACK -->
DNS QRY -->
<-- ACK
<-- DNS RSP
ACK -->
<-- FIN, ACK
FIN, ACK -->
... plus or minus the optimization of combining ACKs with data (which seems
to require that you send(2) before your host receives the first ACK).
But NSD always does the following:
SYN -->
<-- SYN,ACK
ACK -->
DNS QRY -->
<-- ACK
<-- DNS length (2 bytes)
ACK -->
<-- DNS RSP
ACK -->
<-- FIN, ACK
FIN, ACK -->
Here's a very brief binary dump of the above conversation:
I reported this in May 2006. Found in nsd-3.0.6/doc/TODO:
- From Aaron Hopkins: write tcp length and tcp data in one write operation,
instead of multiple calls to write. Avoids Nagle algo delay in this case.
preallocate 2 bytes in front of buffer to put them into.
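The TODO suggestion could be sketched with writev(2), which gets the length prefix and the body into one TCP segment without the preallocated-buffer trick. This is a minimal illustration, not NSD's code; the function name is hypothetical.

```c
/* Sketch of the TODO item: send the 2-byte length prefix and the DNS
 * response body with a single writev(2) call, so both land in one TCP
 * segment.  send_dns_response() is an illustrative name, not NSD's. */
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t send_dns_response(int fd, const void *body, uint16_t body_len)
{
    uint16_t len_prefix = htons(body_len);   /* network byte order */
    struct iovec iov[2] = {
        { .iov_base = &len_prefix, .iov_len = sizeof(len_prefix) },
        { .iov_base = (void *)body, .iov_len = body_len },
    };
    return writev(fd, iov, 2);               /* one syscall, one segment */
}
```

The gather write avoids both the extra copy the TODO's preallocation trick needs and the Nagle interaction of two separate write() calls.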
There are a few other performance-related things I reported at the same time
still in the TODO that you might be interested in.
A quick and simple, not very portable, fix in the meantime might be to use the
TCP_CORK/TCP_NOPUSH socket option (depending on platform) on the socket.
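On Linux that interim workaround might look roughly like the following; on the BSDs the same shape applies with TCP_NOPUSH substituted. This is a sketch around NSD's existing pair of writes, not a patch against its source.

```c
/* Interim workaround sketched for Linux: cork the socket, issue the two
 * write(2) calls as before, then uncork so the kernel flushes the length
 * prefix and body as one segment.  (On *BSD, substitute TCP_NOPUSH.) */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void cork(int fd, int on)
{
    /* Best-effort: on error the fallback is simply the current two packets. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical usage around the existing writes:
 *   cork(fd, 1);
 *   write(fd, &len_prefix, 2);
 *   write(fd, body, body_len);
 *   cork(fd, 0);   // flush; costs two extra syscalls per response
 */
```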
> NSD always sends the two byte response size as a separate TCP packet
> (causing the requestor to send a separate ACK) to the main body of the
> request.
> I reported this in May 2006. Found in nsd-3.0.6/doc/TODO:
> - From Aaron Hopkins: write tcp length and tcp data in one write operation,
>   instead of multiple calls to write. Avoids Nagle algo delay in this case.
>   preallocate 2 bytes in front of buffer to put them into.
> There are a few other performance-related things I reported at the same time
> still in the TODO that you might be interested in.
> A quick and simple, not very portable, fix in the meantime might be to use the
> TCP_CORK/TCP_NOPUSH socket option (depending on platform) on the socket.
Yes, this is what NSD does. Simple code.
TCP performance is not considered important, as it should rarely be used
with DNS. UDP performance is the goal. Therefore the (otherwise
excellent) patches from Aaron did not get applied: they introduce extra
code and portability concerns where they are not needed.
The non-blocking patch is (in some form) in NSD now, Aaron, because it
fixed some race conditions for secondary zones.
>> NSD always sends the two byte response size as a separate TCP
>> packet (causing the requestor to send a separate ACK) to the main
>> body of the request.
>> I reported this in May 2006. Found in nsd-3.0.6/doc/TODO:
>> - From Aaron Hopkins: write tcp length and tcp data in one write
>>   operation, instead of multiple calls to write. Avoids Nagle algo
>>   delay in this case. preallocate 2 bytes in front of buffer to put
>>   them into.
> Yes, this is what NSD does. Simple code.
> TCP performance is not considered important, as it should rarely be
> used with DNS. UDP performance is the goal. Therefore the
> (otherwise excellent) patches from Aaron did not get applied, they
> introduce extra code, portability attention, where it is not
> needed.
I'd like to have a look at this patch. Maybe the patch can be reworked
in a more acceptable manner. My client is very concerned about TCP
performance because DNSSEC is on the horizon.
Of course, TCP may not be the only way DNSSEC responses will be
delivered, but it's a concern for sure. It also seems that this TCP
issue is a DoS waiting to happen, since it imposes rather more overhead,
AFAICT. Unless I've misunderstood something (and it wouldn't be the
first time).
Hrm. Hadn't thought of this, but by my count, I can send three
packets (SYN, ACK, and the request) and get the host to occupy a TCP
slot and send at least 5 packets, possibly more (SYN-ACK, ACK, 2-byte
size, reply, FIN; probably multiple FINs if I ignore them).
That's not a large multiplier of packets, but it does occupy resources
while this is happening.
The byte multiplier is not bad (TCP has a lot of overhead). Under
IPv4, I send 120 bytes + request and I cause the host to send 200
bytes plus the reply (or more). Under IPv6, I send 180 bytes +
request and I cause the host to send 300 bytes plus the reply (or
more).
Fixing this bug takes the attack from 5+ packets to 4+ packets. It
probably doesn't change the length of time that the TCP stack keeps
the slot open (for FINs) in any measurable way.
But from a byte count perspective it reduces the overhead transmission
from 200 bytes to 160 bytes (vs. 120 bytes from the attacker) for v4
and from 300 bytes to 240 (vs. 180 bytes from the attacker) for v6.
Is this in something newer than 3.0.6? At least in the 3.0.6 source
grepping for CORK or NOPUSH results in no hits. Using them would solve the
problem of an extra round-trip to answer every TCP query at the expense of
two extra system calls per response.
- The use of blocking sockets in NSD had some funky race conditions that
could've resulted in the server freezing or the extra servers started by the
-N flag not being used. This has been fixed in newer NSDs.
- Only one request is processed per socket per select(). And select() is
kind of expensive, particularly when you start handling more sockets (like
lots of TCP sockets open). Wrapping a loop around the UDP recvfrom() allows
us to amortize the cost of the select() over many requests, leading to a 23%
throughput improvement in my tests. (This will obviously vary with the
hardware, OS, etc.) The downside is that if it concentrates on one socket,
it can starve others at least briefly. The current version of the patch
limits this to 100 requests, which at 50000 requests/sec works out to up to
2ms of extra latency for the other sockets.
> Is this in something newer than 3.0.6? At least in the 3.0.6 source
> grepping for CORK or NOPUSH results in no hits. Using them would solve the
> problem of an extra round-trip to answer every TCP query at the expense of
> two extra system calls per response.
I am sorry you misunderstood my post.
I mean that the 5+ packet sequence is what NSD does; this includes the extra
round trip. It is simpler code. NSD does not use TCP_CORK/NOPUSH.
EDNS0 is a better way to handle DNSSEC sizes than TCP. TCP is not used
much in DNS, mostly for zone transfers or 'very large replies'.