Two TCP segments to answer a query

This question is aimed more at the NSD developers, but of course if
anyone knows the answer, feel free to chime in.

While writing some code to process DNS queries and responses over TCP,
one of my colleagues noticed something strange about NSD's TCP
responses. Here's what we have observed:

client: syn
server: syn + ack
client: ack
client: push + ack + query
server: ack
server: ack + 2 bytes indicating size of following dns message
client: ack
server: push + ack + response

I'm omitting the closing sequence of FINs and ACKs here.

Comparing this to a BIND server, we see:

client: syn
server: syn + ack
client: ack
client: push + ack + query
server: push + ack + 2 bytes + response

Notice how NSD uses an extra TCP segment to send just the 2 bytes
indicating the length of the response packet, whereas BIND does it all
in the same TCP segment. BIND's behaviour seems logical to me, whereas
NSD's seems... strange.

Is there any reason NSD does it this way? TCP performance isn't really
an issue for us, so I don't see any immediate need to fix this, if
indeed a fix is even needed. We'd just like to understand this
difference in behaviour.

Regards,

Anand Buddhdev
RIPE NCC

There is no strong reason why NSD _should_ do this the way BIND
does it.

TCP is a STREAM of bytes; the "packetizing" of TCP is not specified
in any standard at all. An application can use various
system-specific methods to express its preference (like TCP_CORK on
Linux), but even with these set, the TCP stack is allowed to divide
the stream into packets more or less arbitrarily.

So the client should be prepared for even the worst-case scenario,
i.e., it should be able to read the whole thing one byte at a time.
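
To illustrate what "prepared for the worst case" means on the client
side, here is a minimal sketch in Python. The helper names
recv_exactly() and read_dns_message() are made up for this example;
the only thing taken from the DNS-over-TCP format is the 2-byte
big-endian length prefix. It assumes a blocking socket.

```python
import socket
import struct


def recv_exactly(sock, n):
    """Read exactly n bytes, however the stream was split into segments."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # recv() returning b"" means the peer closed the connection.
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_dns_message(sock):
    """Read one length-prefixed DNS message from a TCP stream."""
    # The 2-byte length may arrive alone, split in half, or together
    # with the message; recv_exactly() hides the difference.
    (length,) = struct.unpack("!H", recv_exactly(sock, 2))
    return recv_exactly(sock, length)
```

A reader written this way sees no difference between the NSD and
BIND traces above.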

The way BIND does it is merely an optimization, in an attempt to
minimize network round trips, and, again, it is in no way mandatory.

As for the optimization itself: NSD should be prepared for a write
to fail with EAGAIN at any time (which means the kernel send buffer
is full), or to be accepted only partially, so NSD will have to
repeat the write from the position where it stopped. That is easy if
we're writing just one buffer of data. But we have two: the size and
the data itself.

There are at least 2 ways to deal with it.

First, there's the already mentioned TCP_CORK, which can be used on
Linux (if it isn't already). It is relatively easy: turn it on
before attempting to send the size, and turn it off when done
sending the reply.
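
A rough sketch of the TCP_CORK idea, in Python for brevity. This is
not NSD code: send_reply_corked() is a hypothetical name, the socket
is assumed to be blocking (so the EAGAIN handling is left out), and
on systems without TCP_CORK the function simply falls back to two
plain writes.

```python
import socket
import struct


def send_reply_corked(sock, reply):
    """Send the 2-byte length and the reply with TCP_CORK set in between."""
    # TCP_CORK is Linux-specific; the constant is absent elsewhere.
    cork = getattr(socket, "TCP_CORK", None)
    if cork is not None:
        # Corked: the kernel holds back partial segments.
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        sock.sendall(struct.pack("!H", len(reply)))  # the size
        sock.sendall(reply)                          # the data itself
    finally:
        if cork is not None:
            # Uncorked: anything still queued is sent now.
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)
```

Even with the cork in place, the receiver must not rely on getting
everything in one segment; this only states a preference.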

Another option is to use writev() and make it restartable from an
arbitrary position. For example, the way it is done in QEMU:

http://git.qemu.org/?p=qemu.git;a=blob;f=cutils.c;hb=HEAD

(there, see the do_sendv_recvv() function, and note that it works
like writev() but has an extra argument, "offset").
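
The restartable writev() idea might look roughly like this in Python.
This is a sketch of the general technique, not a port of the QEMU
function; writev_from() and its "offset" convention are assumptions
made for the example. The caller keeps the returned offset and calls
again with it once the descriptor is writable.

```python
import os


def writev_from(fd, buffers, offset):
    """Write the concatenation of buffers, starting at byte `offset`.

    Returns the new offset. The write may be short; on a non-blocking
    descriptor with a full send buffer, os.writev() raises
    BlockingIOError (EAGAIN) and the caller retries later with the
    same offset.
    """
    pending = []
    skip = offset
    for buf in buffers:
        if skip >= len(buf):
            # This buffer has been sent completely already.
            skip -= len(buf)
            continue
        # First unsent buffer: drop the bytes that already went out.
        pending.append(memoryview(buf)[skip:])
        skip = 0
    if not pending:
        return offset  # nothing left to write
    return offset + os.writev(fd, pending)
```

For a DNS reply, buffers would be the 2-byte size followed by the
message, and the reply is done once the returned offset reaches
2 + len(message).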

But these are possible implementation details of an _optimization_,
not of a bugfix.

Thanks,

/mjt

Hi Anand,

Michael is precisely right.

For portability you cannot assume that writev, TCP_CORK or other
exotic options are available. Thus the current method must work.
No optimization for TCP traffic has been implemented.

Other optimizations also exist, possibly some malloc() tricks that are
also backwards compatible with older operating systems.
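
One possible reading of such a portable trick (an assumption, not
something NSD is known to do) is to copy the size and the message
into a single buffer, so that only one buffer and one offset need to
be tracked across partial writes. A sketch in Python, with the
hypothetical helpers frame_reply() and send_from():

```python
import socket
import struct


def frame_reply(reply):
    """Return the 2-byte length prefix and the reply as one buffer."""
    if len(reply) > 0xFFFF:
        raise ValueError("reply does not fit a 16-bit length prefix")
    return struct.pack("!H", len(reply)) + reply


def send_from(sock, framed, offset):
    """Send as much of framed[offset:] as the socket accepts right now.

    Returns the new offset; it is unchanged if a non-blocking socket
    reports a full send buffer (EAGAIN).
    """
    try:
        return offset + sock.send(memoryview(framed)[offset:])
    except BlockingIOError:
        return offset
```

The cost is one extra copy of the reply; the benefit is that nothing
beyond an ordinary send is required.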

Unbound does have support for writev(). This does result in some code
duplication in the TCP write implementation.

Best regards,
   Wouter

Hi Michael,

> There is no strong reason why NSD _should_ do this the way BIND
> does it.
>
> TCP is a STREAM of bytes; the "packetizing" of TCP is not specified
> in any standard at all. An application can use various
> system-specific methods to express its preference (like TCP_CORK on
> Linux), but even with these set, the TCP stack is allowed to divide
> the stream into packets more or less arbitrarily.

Thanks for your long response. I am aware of how TCP works, and didn't
really say that this was a bug in NSD. I was just trying to understand
the difference in behaviour.

Wouter confirmed what I suspected anyway, that there was no TCP
optimisation done for NSD. That's enough of an explanation.

Regards,

Anand