Wanted: troubleshooting cluestick

Joe_Abley · September 17, 2008, 2:04am

Hi all,

Whilst watching exim logs scroll past on an MTA of my acquaintance, I saw this:

2008-09-17 01:48:03 H=(felix.hopcount.ca) [204.152.186.101] sender verify defer for <nanog-bounces@nanog.org>: host lookup did not complete
2008-09-17 01:48:03 H=(felix.hopcount.ca) [204.152.186.101] F=<nanog-bounces@nanog.org > temporarily rejected RCPT <mjabhop@monster.hopcount.ca>: Could not complete sender verify

which made me scratch my chin. The box in question is running unbound 1.0.2, installed on FreeBSD 7.0 from ports, and /etc/resolv.conf contains "nameserver 127.0.0.1". On the box in question, on an empty cache:

[monster:~]% dig @127.0.0.1 nanog.org soa +short
dns.merit.net. dns.merit.net. 2008090300 28800 14400 2419200 14400
[monster:~]%

and

[monster:~]% dig @127.0.0.1 nanog.org ns +short
dns1.merit.net.
dns2.merit.net.
dns3.merit.net.
[monster:~]%

and other such things are wonderful and as expected, but

[monster:~]% dig @127.0.0.1 nanog.org mx

; <<>> DiG 9.4.2 <<>> @127.0.0.1 nanog.org mx
; (1 server found)
;; global options: printcmd
;; connection timed out; no servers could be reached
[monster:~]%

fails, consistently. BIND9 resolvers on the same network have no such problem:

[calamari:~]% dig @127.0.0.1 version.bind ch txt +short
"9.4.2-P2"
[calamari:~]% dig @127.0.0.1 nanog.org mx

; <<>> DiG 9.4.2-P2 <<>> @127.0.0.1 nanog.org mx
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6129
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 5

;; QUESTION SECTION:
;nanog.org. IN MX

;; ANSWER SECTION:
nanog.org. 1800 IN MX 0 s0.nanog.org.

;; AUTHORITY SECTION:
nanog.org. 9454 IN NS dns2.merit.net.
nanog.org. 9454 IN NS dns3.merit.net.
nanog.org. 9454 IN NS dns1.merit.net.

;; ADDITIONAL SECTION:
s0.nanog.org. 12559 IN A 198.108.95.20
s0.nanog.org. 12559 IN AAAA 2001:48a8:6880:95::20
dns2.merit.net. 167854 IN A 198.109.64.250
dns3.merit.net. 167854 IN A 207.72.112.10
dns1.merit.net. 167854 IN A 198.108.1.43

;; Query time: 1565 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Sep 16 22:01:03 2008
;; MSG SIZE rcvd: 204

[calamari:~]%

Something seems to be going on. Unbound is logging nothing to syslog (but daemon.* is definitely being recorded; I've tested using logger(1)). I am short on ideas of other ways to gather data and figure out what is going on.

Any ideas?

Joe

[monster:~]% uname -a
FreeBSD monster.hopcount.ca 7.0-RELEASE-p4 FreeBSD 7.0-RELEASE-p4 #8: Sat Sep 6 01:12:37 UTC 2008 root@monster.hopcount.ca:/usr/obj/usr/src/sys/GENERIC i386
[monster:~]%
[monster:~]% diff -u /usr/local/etc/unbound/unbound.conf.sample \ > /usr/local/etc/unbound/unbound.conf
--- /usr/local/etc/unbound/unbound.conf.sample 2008-08-15 00:49:47.000000000 +0000
+++ /usr/local/etc/unbound/unbound.conf 2008-09-17 01:33:08.000000000 +0000
@@ -33,6 +33,8 @@
    # interface: 192.0.2.153
    # interface: 192.0.2.154
    # interface: 2001:DB8::5
+ interface: 127.0.0.1
+ interface: 199.212.90.6

    # enable this feature to copy the source address of queries to reply.
    # Socket options not be supported on all platforms. experimental.
@@ -47,6 +49,7 @@
    # outgoing-interface: 192.0.2.153
    # outgoing-interface: 2001:DB8::5
    # outgoing-interface: 2001:DB8::6
+ outgoing-interface: 199.212.90.4

    # number of ports to allocate per thread, determines the size of the
    # port range that can be open simultaneously.
@@ -140,6 +143,7 @@
    # access-control: ::0/0 refuse
    # access-control: ::1 allow
    # access-control: ::ffff:127.0.0.1 allow
+ access-control: 199.212.90.0/24 allow

    # if given, a chroot(2) is done to the given directory.
    # i.e. you can chroot to the working directory, for example,
@@ -183,7 +187,7 @@
    # use-syslog: yes

    # the pid file.
- # pidfile: "/usr/local/etc/unbound/unbound.pid"
+ # pidfile: "/var/run/unbound/unbound.pid"

    # file to read root hints from.
    # get one from ftp://FTP.INTERNIC.NET/domain/named.cache
[monster:~]%

Joe_Abley · September 17, 2008, 2:12am

And now it's working again:

[monster:~]% dig @127.0.0.1 nanog.org mx

; <<>> DiG 9.4.2 <<>> @127.0.0.1 nanog.org mx
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59427
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 2

;; QUESTION SECTION:
;nanog.org. IN MX

;; ANSWER SECTION:
nanog.org. 1550 IN MX 0 s0.nanog.org.

;; AUTHORITY SECTION:
nanog.org. 13747 IN NS dns1.merit.net.
nanog.org. 13747 IN NS dns2.merit.net.
nanog.org. 13747 IN NS dns3.merit.net.

;; ADDITIONAL SECTION:
s0.nanog.org. 14150 IN A 198.108.95.20
s0.nanog.org. 14150 IN AAAA 2001:48a8:6880:95::20

;; Query time: 49 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Sep 17 02:10:43 2008
;; MSG SIZE rcvd: 156

[monster:~]%

Arrgh. So, perhaps I should rephrase the question. Since I appear to have intermittent periods during which MX queries are timing out for no apparent reason, what instrumentation can I usefully put in place to try and figure out what is going on? I could easily trigger data collection from a process watching the exim log for lookup failures, for example, but I don't really know what to collect.

Joe

Wouter · September 17, 2008, 6:52am

Joe Abley wrote:

[monster:~]% dig @127.0.0.1 nanog.org mx

; <<>> DiG 9.4.2 <<>> @127.0.0.1 nanog.org mx
; (1 server found)
;; global options: printcmd
;; connection timed out; no servers could be reached
[monster:~]%

fails, consistently.

Can you increase the verbosity level to, say, 4 after the first failure?

one way to do this, is to upgrade to svn trunk; setup the remote control
feature (you need to set the control-interface to not only localhost, to
generate keys with unbound-control-setup, and put those keys and the
remote-control: part of the config on your MTA host to give the MTA host
unbound-control privileges) and use
  $ unbound-control verbosity 4
  $ dig ... nanog.org MX
  $ unbound-control verbosity 1
do this query, then lower verbosity to your preferred value again. This
should trigger a lot of debug info into the logfile, just for this query.

You could also increase verbosity in the logfile to 4, but prepare for
several Gb of logs, and possibly slowdown due to logging.

Another way is to try unbound-host -d -d -d -d -v nanog.org -t MX

blablabla 2>&1

The unbound-host command starts with a clean cache and so on, so may not
see the same thing.

You could try to contact the local resolver with a query for localhost.
to see if unbound itself is still responsive.
dig ... localhost. IN A
which should work fine, without a timeout. (this checks if unbound is
DoSed or something).

If you are not sure if things are logged, the unbound server always logs
the message: start of service (unbound 1.1.0). at startup. (your version
number may differ )

Best regards,
Wouter

Jaap · September 17, 2008, 8:45am

Something seems to be going on. Unbound is logging nothing to syslog
    (but daemon.* is definitely being recorded; I've tested using
    logger(1)). I am short on ideas of other ways to gather data and
    figure out what is going on.

    Any ideas?

You have dev/syslog setup in the chroot & syslog flags?

jaap