No answers from unbound occasionally

Hello,

FreeBSD 7-STABLE, unbound r1584, single threaded.

It seems that occasionally it does not answer queries (or perhaps doesn't even receive them, I don't know).

With tcpdump it seems that the machine gets the query, but there is no answer from unbound.
Its statistics counters seem OK: according to them there is no full queue and there are no drops.

However netstat shows a high amount of "dropped due to no socket" cases:
netstat -s -p udp
udp:
    17958542831 datagrams received
    0 with incomplete header
    0 with bad data length field
    25928 with bad checksum
    16209476 with no checksum
    46927070 dropped due to no socket
    33 broadcast/multicast datagrams undelivered
    72146984 dropped due to full socket buffers
    0 not for hashed pcb
    17839442816 delivered
    18155004206 datagrams output
    0 times multicast source filter matched

I've already tried raising the related sysctls, without any effect.

Any ideas?

Would it be possible to selectively enable verbose logging? I can reproduce the no-answer case in about an hour with a query tester sending about 100 queries per second, but I can't enable verbose logging, because it would kill the machine.

Thanks,

Hi Attila,

Are those replies from authority servers that arrive just after unbound
times out and closes the socket?

Some sort of selective verbose logging is an idea on my TODO.

Unbound will indeed not respond to particular queries. These queries
end up getting counted as 'cache hits', but really they were malformed.
There are some malformed queries unbound does not reply to, such as
queries with the QR=1 flag set, or queries shorter than the 12-byte
header. Since you are sending them yourself, it seems unlikely they are
this malformed.
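
For illustration, a rough sketch (my paraphrase, not unbound's actual code) of the kind of sanity check meant here, in Python:

```python
import struct

def is_plausible_query(packet: bytes) -> bool:
    """Return False for datagrams a resolver may silently drop:
    shorter than the 12-byte DNS header, or with the QR bit set
    (i.e. a response rather than a query)."""
    if len(packet) < 12:
        return False
    flags = struct.unpack("!H", packet[2:4])[0]
    return not (flags & 0x8000)  # QR is the top bit of the flags field

# Bare 12-byte headers, just for illustration:
query_hdr = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)     # RD=1 query
response_hdr = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)  # QR=1 response
```

A capture filtered on such packets would show whether any of the "bad" queries trip these checks.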

Attila Nagy wrote:

FreeBSD 7-STABLE, unbound r1584, single threaded.

It seems that occasionally it does not answer queries (or perhaps
doesn't even receive them, I don't know).

With tcpdump it seems that the machine gets the query, but there is no
answer from unbound.
Its statistics counters seem OK: according to them there is no full
queue and there are no drops.

Is it a query to port 53? Queries to other ports are not answered.

Are there 'jostled' queries? They also create dropped replies by
replacing an existing (old) one.

   72146984 dropped due to full socket buffers

Could this explain the 1 in 100qps-for-3600s that are dropped? Could
they be dropped at the query-sender (seems unlikely)?

I've already tried to raise the related sysctls, without any effects.

You tried to increase socket buffers already, I presume. Weird.

Best regards,
   Wouter

Hello,

W.C.A. Wijngaards wrote:

Are those replies from authority servers that arrive just after unbound
times out and closes the socket?

I’m not sure I understand this correctly. What we do is the following:

  • we have some loaded (some thousand queries per sec) recursive nameservers (behind a load balancer) on which the clients report occasional loss of answers
  • we run a query tester against the servers; it sends a configurable number of queries for the same name (so the queries should be answerable from the cache) and waits 5 seconds for the answers
  • we sometimes find timeouts on the client
  • investigating this yields a capture (made on the nameservers), which has the client’s request (so there is no packet loss involved between the client and the server), but no answer from the server

The cache is quite big (the machines have 8 GiB of RAM), the TTL is high, so answers all should come from it.
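
A tester like that can be sketched with just the Python standard library; this is a simplified illustration (plain-UDP A-type queries, placeholder names), not our exact tool:

```python
import socket
import struct

def build_query(name: str, qid: int) -> bytes:
    """Build a minimal DNS query packet: 12-byte header, QNAME, QTYPE=A, QCLASS=IN."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)  # RD=1, 1 question
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.split("."))
    return header + qname + b"\x00" + struct.pack("!HH", 1, 1)

def run_test(server: str, name: str = "example.com",
             count: int = 100, timeout: float = 5.0) -> int:
    """Send `count` identical queries and return how many got no answer in time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    lost = 0
    for qid in range(count):
        sock.sendto(build_query(name, qid), (server, 53))
        try:
            sock.recvfrom(4096)
        except socket.timeout:
            lost += 1
    sock.close()
    return lost
```

Calling run_test() against one of the nameservers then reports how many of the cached-name queries timed out.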

Some sort of selective verbose logging is an idea on my TODO.

Great, it would ease debugging issues like this.

Unbound will indeed not respond to particular queries. These queries
end up getting counted as 'cache hits', but really they were malformed.
There are some malformed queries unbound does not reply to, such as
queries with the QR=1 flag set, or queries shorter than the 12-byte
header. Since you are sending them yourself, it seems unlikely they are
this malformed.

The queries should be the same. Or at least, I haven’t seen any differences between the “good” and the “bad” ones in Wireshark.

With tcpdump it seems that the machine gets the query, but there is no
answer from unbound.
Its statistics counters seem OK: according to them there is no full
queue and there are no drops.

Is it a query to port 53?  Queries to other ports are not answered.

Yes.

Are there 'jostled' queries? They also create dropped replies by
replacing an existing (old) one.

unbound-control currently reports this:
thread0.requestlist.avg=757.391
thread0.requestlist.max=867
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
I have graphs from this (munin) and I haven’t seen a single overwritten or exceeded value on any of our servers.

   72146984 dropped due to full socket buffers

Could this explain the 1 in 100qps-for-3600s that are dropped?  Could
they be dropped at the query-sender (seems unlikely)?

We monitor the traffic on three points: before, and after the load balancer and on the DNS servers. The queries for which we haven’t got an answer are in all three captures, so no, it gets to the server.

I've already tried raising the related sysctls, without any effect.

You tried to increase socket buffers already, I presume. Weird.

Yes, I’ve tried increasing net.inet.udp.recvspace and kern.ipc.maxsockbuf.

According to netstat -m, there is no mbuf shortage:

netstat -m

5148/5097/10245 mbufs in use (current/cache/total)
4080/2470/6550/25600 mbuf clusters in use (current/cache/total/max)
4080/1424 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
9447K/6214K/15661K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

We have pf active, but its state table is also monitored; it doesn’t reach the maximum, and the UDP-related timeouts are high:
udp.first 60s
udp.single 30s
udp.multiple 60s

Thanks,

Attila Nagy wrote:

W.C.A. Wijngaards wrote:

Are those replies from authority servers that arrive just after unbound
times out and closes the socket?

I'm not sure I understand this correctly. What we do is the following:

While trying to figure out why the "dropped due to no socket" counter grows, I turned on net.inet.udp.log_in_vain, which logs datagrams that couldn't find a socket.
And I quickly understood the above. :)
Yes, they are the answers from authoritative servers.

I don't know how long unbound waits for an answer, but I see about 10-30 lines per second per server from this. Do you think it's normal? (there are about 4k qps currently)

May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:42849 from 193.230.161.4:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:29188 from 202.103.224.70:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:6351 from 202.67.10.90:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:10611 from 202.67.10.90:53

Thanks,

Hi Attila,

Attila Nagy wrote:

Are those replies from authority servers that arrive just after unbound
times out and closes the socket?

I'm not sure I understand this correctly. What we do is the following:

While trying to figure out why the "dropped due to no socket"
counter grows, I turned on net.inet.udp.log_in_vain, which logs
datagrams that couldn't find a socket.
And I quickly understood the above. :)
Yes, they are the answers from authoritative servers.

Ok.

I don't know how long unbound waits for an answer, but I see about 10-30
lines per second per server from this. Do you think it's normal? (there
are about 4k qps currently)

Well, unbound normally uses the smoothed round-trip time it has
calculated for the server as the timeout, and when that timer expires,
it closes the socket. It can then re-use the socket to open another
(random) port to send the next query on. This next query could well be
the retry for the just-failed one, with exponential backoff on the timer.

So, it is normal that some authority servers are slow and thus produce
timeouts. 20 / 4000 is 0.5% of cases for you, which looks very
reasonable to me.
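
The idea can be sketched roughly like this in Python; the constants and update rule below are illustrative (RFC 6298 style), not unbound's actual values:

```python
class RttEstimator:
    """Sketch of a smoothed-RTT timeout with exponential backoff.
    Illustrative constants, not unbound's real implementation."""

    def __init__(self, initial_timeout=0.376):
        self.srtt = None        # smoothed round-trip time (seconds)
        self.rttvar = None      # round-trip time variance estimate
        self.timeout = initial_timeout

    def on_reply(self, measured_rtt):
        # Classic SRTT/RTTVAR update on every answered probe
        if self.srtt is None:
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - measured_rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * measured_rtt
        self.timeout = self.srtt + 4 * self.rttvar

    def on_timeout(self):
        # Exponential backoff: double the timeout for the next probe
        self.timeout *= 2
```

A server that has always answered in 20 ms thus gets a timeout in the tens of milliseconds, so a sudden 150 ms answer arrives after the socket is closed.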

May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:42849 from
193.230.161.4:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:29188 from
202.103.224.70:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:6351 from
202.67.10.90:53
May 30 14:25:21 ns Connection attempt to UDP m.y.i.p:10611 from
202.67.10.90:53

Thanks,

Best regards,
   Wouter

Hello,

W.C.A. Wijngaards wrote:

I don't know how long unbound waits for an answer, but I see about 10-30
lines per second per server from this. Do you think it's normal? (there
are about 4k qps currently)
Well, unbound normally uses the smoothed round-trip time it has
calculated for the server as the timeout, and when that timer expires,
it closes the socket. It can then re-use the socket to open another
(random) port to send the next query on. This next query could well be
the retry for the just-failed one, with exponential backoff on the timer.

I've made a capture and according to that, it was about 150 ms between unbound's query and the response. I think it's pretty normal for nameservers on the other side of the globe.

So, it is normal that some authority servers are slow and thus produce
timeouts. 20 / 4000 is 0.5% of cases for you, which looks very
reasonable to me.

Yes, maybe; but my problem is that I can see unanswered queries. If I understand it correctly, the situation above should not result in lost answers towards the resolver client.

And this is my primary concern, because I don't understand why those queries are left unanswered.

Thanks,

Attila Nagy wrote:

Hello,
I've made a capture and according to that, it was about 150 ms between
unbound's query and the response. I think it's pretty normal for
nameservers on the other side of the globe.

Yes, however unbound keeps an rtt profile for each server. So if the
server has a very fast RTT then unbound will consider 150 msec to be
very late (for that server), and retry before that time is up.

So, it is normal that some authority servers are slow and thus produce
timeouts. 20 / 4000 is 0.5% of cases for you, which looks very
reasonable to me.

Yes, maybe; but my problem is that I can see unanswered queries. If I
understand it correctly, the situation above should not result in lost
answers towards the resolver client.

No, it should not.

And this is my primary concern, because I don't understand why those
queries are left unanswered.

Hmm, yes, that is bad. Can you look at the total query counts that
unbound produces and see where they end up?

Best regards,
   Wouter

This suggests to me that for large, anycast-distributed servers there is an opportunity for unbound to give up too early whenever services are moved between far-distant anycast nodes.

What is the actual algorithm?

For example, if the RTT to an anycast server suddenly changes from reliably <20ms to something substantially higher, what is the threshold RTT at which we can expect unbound to stop being able to get answers?

Joe

Joe Abley wrote:

This suggests to me that for large, anycast-distributed servers there is
an opportunity for unbound to give up too early whenever services are
moved between far-distant anycast nodes.

What is the actual algorithm?

Exponential backoff.
(I think internet software should do exponential backoff because that is
good behaviour.)

For example, if the RTT to an anycast server suddenly changes from
reliably <20ms to something substantially higher, what is the threshold
RTT at which we can expect unbound to stop being able to get answers?

If it becomes more than 16x or 32x slower, then you can expect a first
servfail as lots of exponential backoff probes are failing. And then
things work again. If there is some other server in the NS set that
does not move like this, then the (1 second of) servfail does not
happen. In that case, unbound finds out the new rtt for the moved
server gradually, using the other server if the slow server does not
respond quickly enough.
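
To make the 16x/32x figure concrete, a toy model (my sketch, not unbound's code): with the timeout doubling on each failed probe, the number of probes that fail before the timeout outgrows a suddenly higher RTT is:

```python
def probes_until_success(start_timeout_ms: int, new_rtt_ms: int) -> int:
    """Count failed probes before a doubling timeout first exceeds the new RTT.
    A probe fails whenever the reply would arrive after the current timeout."""
    timeout, failures = start_timeout_ms, 0
    while timeout <= new_rtt_ms:
        failures += 1
        timeout *= 2
    return failures
```

With a 20 ms learned timeout and a server that suddenly answers in 320 ms (16x slower), five probes fail, which is 20+40+80+160+320 = 620 ms spent waiting before the timeout outgrows the RTT, on the order of the second of servfail described above.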

Best regards,
   Wouter