I am encountering some TCP issues on a RHEL5.5 with unbound 1.4.6 (homemade), ldns 1.6.4 (EPEL5) and libevent-1.4.13 (stock:).
TCP stops working after a random amount of time (after a day or two). As a result we have timeouts or sometimes an answer after many seconds. To fix this odd behaviour I must restart unbound. I can see interesting stuff in the logs which doesn't sound good :
Could this be linked to the fact I compiled Unbound 1.4.6 with ldns 1.6.4 (saying that because Paul Wouters opened a thread about possible dangerous linking when building RPM - something you've just fixed in 1.4.7).
Any thoughts ? Except this small issue unbound is quite awesome. Many thanks for this great piece of software.
I am encountering some TCP issues on a RHEL5.5 with unbound 1.4.6
2010-11-14T20:58:50.266456+11:00 ns2 unbound: [11863:0] error: tcp
writev: Broken pipe
This indicates the remote end closed read the read end of the tcp
channel. The error printout is maybe a bit overly dramatic, perhaps
this specific error, since it only indicates another host on the
internet closed the connection, should not be printed at low verbosity.
But this does not seem to explain the tcp issues you say you have. What
is the observed problem again? What queries are giving resolution
problems and timeouts - are you sure they even use tcp - and can you do
a unbound-control dump_requestlist if the query times out (it may still
be in the requestlist and we can then see what is going on) and
unbound-control lookup (that says what unbound would probably do if it
were in the requestlist). And possibly enable a higher verbosity (which
you can do very easily without a restart; unbound-control verbosity 4;
(wait) ; unbound-control verbosity 1 ) but you would want to actually
catch the offending action in the verbosity which may get tricky (higher
verbosity can slow down the server significantly and produce Gbs of logs
so on high traffic servers you really cannot leave it on).
Could this be linked to the fact I compiled Unbound 1.4.6 with ldns
1.6.4 (saying that because Paul Wouters opened a thread about possible
dangerous linking when building RPM - something you've just fixed in
1.4.7).
No, unbound mostly uses ldns to read-write DNS to ascii, tests and other
stuff, but not for TCP.
This indicates the remote end closed read the read end of the tcp
channel. [..]
Ok. I think they happen more when unbound seems to struggle to resolve TCP queries. Indeed if recursion takes too long, the client closes the connection after a defined timeout.
But this does not seem to explain the tcp issues you say you have. What
is the observed problem again? What queries are giving resolution
problems and timeouts - are you sure they even use tcp [..]
My problem is that sometimes any TCP queries timeout because they seem stuck somewhere in unbound. It's _not_ due to a specific query. I have the problem randomly and it affects all TCP queries. My loadbalancers test my unbound clusters by doing some : "dig +tcp google.com @UNBOUND_IP". I tried locally when it seems stuck and I have the same behaviour. How can I distinguish the TCP from the UDP queries in a unbound dump_requestlist ?
For the moment, my problem hasn't occurred again but I know it will. I will try to do more debugging next time to see if it's unbound or something else.
Also I forgot to paste my current config. Each server has 1G of memory and unbound used ~760M once the cache is full. Servers are not heavily used yet (see stats below over a ~5min period of time).
This indicates the remote end closed read the read end of the tcp
channel. [..]
Ok. I think they happen more when unbound seems to struggle to resolve
TCP queries. Indeed if recursion takes too long, the client closes the
connection after a defined timeout.
Could it be that the 10 connections that unbound offers simultaneously
are simply swamped with traffic, and thus the new query cannot get in?
Since unbound is not accepting new tcp connections, the client times out
trying to establish a tcp connection?
In the config there is the setting:
# outgoing-num-tcp: 10
# incoming-num-tcp: 10
Perhaps if you increase them, say to 50, you have more capacity to
handle tcp connections?
Since you did not compile with libevent, you do have a limit there,
certainly as compared to your 950 outgoing-range. If you used libevent,
you could simply state the number you want (it takes about 65k for every
buffer); so 100 or 200 would be a lot.
@UNBOUND_IP". I tried locally when it seems stuck and I have the same
behaviour. How can I distinguish the TCP from the UDP queries in a
unbound dump_requestlist ?
It does not show that.
For the moment, my problem hasn't occurred again but I know it will. I
will try to do more debugging next time to see if it's unbound or
something else.
thanks
Also I forgot to paste my current config. Each server has 1G of memory
and unbound used ~760M once the cache is full. Servers are not heavily
used yet (see stats below over a ~5min period of time).
You seem to have more TCP traffic than others, but unbound would print a
num_tcp statistic as well; but you did not include it in your excerpt.
Or worse, there was none (it was zero?), and unbound reports it has
received no TCP queries, but you tried to? Since the connection
establishment could be the issue, a tcpdump of the tcpconnection tryout
(from the client) could be nice; to see if it managed to connect.
Since you did not compile with libevent, you do have a limit there,
certainly as compared to your 950 outgoing-range. If you used libevent,
you could simply state the number you want (it takes about 65k for every
buffer); so 100 or 200 would be a lot.
How did you know/notice that libevent wasn't working ? Because I thought I was using it...
Indeed when I increased num-tcp unbound complained about the 1024 files limit so obviously libevent is not used... But based on unbound -h and ldd output it looks like it is linked with libevent.
Did I miss the documentation part "how to enable libevent in unbound.conf ?" Should I use more than one thread or is it a bug with old EL5 libraries ?
Version 1.4.6
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.6, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for i386-redhat-linux-gnu on Tue Aug 24 16:08:57 EST 2010 with options: '--build=i386-koji-linux-gnu' '--host=i386-koji-linux-gnu' '--target=i386-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/usr/com' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-ldns=' '--with-libevent' '--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug' '--disable-static' '--with-conf-file=/etc/unbound/unbound.conf' '--with-pidfile=/var/run/unbound/unbound.pid' '--enable-sha2'
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl
Since you did not compile with libevent, you do have a limit there,
certainly as compared to your 950 outgoing-range. If you used libevent,
you could simply state the number you want (it takes about 65k for every
buffer); so 100 or 200 would be a lot.
How did you know/notice that libevent wasn't working ? Because I thought
I was using it...
Then this should not be a problem, just increase your incoming-num-tcp.
Indeed when I increased num-tcp unbound complained about the 1024 files
limit so obviously libevent is not used... But based on unbound -h and
ldd output it looks like it is linked with libevent.
What was the complaint? Perhaps it complained about having to increase
the ulimit? Otherwise, it should not complain if you have libevent (and
it looks like it works). ulimit -n shows the number of file descriptors
that you allow; unbound tries to increase it (if it has root powers) if
you configure it to use lots.
Did I miss the documentation part "how to enable libevent in
unbound.conf ?" Should I use more than one thread or is it a bug with
old EL5 libraries ?
What was the complaint? Perhaps it complained about having to increase
the ulimit? Otherwise, it should not complain if you have libevent (and
it looks like it works). ulimit -n shows the number of file descriptors
that you allow; unbound tries to increase it (if it has root powers) if
you configure it to use lots.
Yes it was an ulimit complaint about not having enough "open files". So I finally decided to change the limits.conf to nofile=4096 to root and unbound. Then the warning disappeared. It appeared because I increased num-tcp from 10 to 50 and obviously I used more than 1024 - before it was 950 + 10 + 10 so it was ok. But I thought that with libevent I didn't have to change ulimit. Something seems wrong.
It's confusing and hard to know in which mode we are, maybe it would be nice to print it when unbound starts (actually it somehow already says that with unbound -h).
To summarize, it looks like I have libevent support but it does not use it
What was the complaint? Perhaps it complained about having to increase
the ulimit? Otherwise, it should not complain if you have libevent (and
it looks like it works). ulimit -n shows the number of file descriptors
that you allow; unbound tries to increase it (if it has root powers) if
you configure it to use lots.
Yes it was an ulimit complaint about not having enough "open files". So
I finally decided to change the limits.conf to nofile=4096 to root and
unbound. Then the warning disappeared. It appeared because I increased
num-tcp from 10 to 50 and obviously I used more than 1024 - before it
was 950 + 10 + 10 so it was ok. But I thought that with libevent I
didn't have to change ulimit. Something seems wrong.
That fixed your problem.
You can also increase to 2048 + 100 + 100 or so and have more headroom.
It's confusing and hard to know in which mode we are, maybe it would be
nice to print it when unbound starts (actually it somehow already says
that with unbound -h).
It is certainly using libevent, it prints that with -h, and if you start
with high verbosity it will print this again in syslog.
So, the libevent is working fine, only your ulimit was in the way of
using many sockets. If your nofile is low, it stops unbound from
opening network sockets (as every network connection counts as an open
file for that limit). This is why unbound checks it on start and
reports if there is trouble: it first tries to adjust the ulimit, if
that fails it downsizes its config so that it can run without problems.
Unbound must open 'many network connections' to perform port
randomisation, for anti-spoof resistance.
To summarize, it looks like I have libevent support but it does not use
it
libevent is fine, its the ulimit that bothered you