Strange SERVFAIL from unbound

Unbound-users,

While experimenting with replacing dnscache with unbound, roughly once a
day I've run into a strange situation where unbound 1.0.2 answers requests
for zen.spamhaus.org RBL lookups with nothing but SERVFAIL for about half
an hour, then goes back to working normally.

I upped the verbosity and caught this in action. Hopefully this is the
right log section, as this is a fairly active mail server.

I noticed that unbound only logs IPv6 addresses here, yet this machine isn't
IPv6-capable. Are all the IPv4 addresses expiring while unbound refuses to
fetch new ones for some reason? Is this some interesting interaction between
"do-ip6: no" and "harden-glue: yes", maybe?

Syslog output (all in the same second, so I stripped the timestamps, etc.):

     info: validator operate: query <2.0.0.127.zen.spamhaus.org. TXT IN>
     info: resolving <2.0.0.127.zen.spamhaus.org. TXT IN>
     info: DelegationPoint<zen.spamhaus.org.>: 22 names (0 missing), 22 addrs (0 result, 22 avail)
     info: 8.ns.spamhaus.org.*
     info: 3.ns.spamhaus.org.*
     info: 1.ns.spamhaus.org.*
     info: 0.ns.spamhaus.org.*
     info: y.ns.spamhaus.org.*
     info: x.ns.spamhaus.org.*
     info: t.ns.spamhaus.org.*
     info: s.ns.spamhaus.org.*
     info: r.ns.spamhaus.org.*
     info: q.ns.spamhaus.org.*
     info: o.ns.spamhaus.org.*
     info: m.ns.spamhaus.org.*
     info: l.ns.spamhaus.org.*
     info: k.ns.spamhaus.org.*
     info: i.ns.spamhaus.org.*
     info: h.ns.spamhaus.org.*
     info: g.ns.spamhaus.org.*
     info: f.ns.spamhaus.org.*
     info: d.ns.spamhaus.org.*
     info: c.ns.spamhaus.org.*
     info: b.ns.spamhaus.org.*
     info: a.ns.spamhaus.org.*
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: resolving (init part 2): <2.0.0.127.zen.spamhaus.org. TXT IN>
     info: resolving (init part 3): <2.0.0.127.zen.spamhaus.org. TXT IN>
     info: processQueryTargets: <2.0.0.127.zen.spamhaus.org. TXT IN>
     info: DelegationPoint<zen.spamhaus.org.>: 22 names (0 missing), 22 addrs (0 result, 22 avail)
     info: 8.ns.spamhaus.org.*
     info: 3.ns.spamhaus.org.*
     info: 1.ns.spamhaus.org.*
     info: 0.ns.spamhaus.org.*
     info: y.ns.spamhaus.org.*
     info: x.ns.spamhaus.org.*
     info: t.ns.spamhaus.org.*
     info: s.ns.spamhaus.org.*
     info: r.ns.spamhaus.org.*
     info: q.ns.spamhaus.org.*
     info: o.ns.spamhaus.org.*
     info: m.ns.spamhaus.org.*
     info: l.ns.spamhaus.org.*
     info: k.ns.spamhaus.org.*
     info: i.ns.spamhaus.org.*
     info: h.ns.spamhaus.org.*
     info: g.ns.spamhaus.org.*
     info: f.ns.spamhaus.org.*
     info: d.ns.spamhaus.org.*
     info: c.ns.spamhaus.org.*
     info: b.ns.spamhaus.org.*
     info: a.ns.spamhaus.org.*
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:1 port 53 (len 28)
     info: ip6 2001:7b8:3:1f:0:2:53:2 port 53 (len 28)

Here's the dig that produced that query:

     ; <<>> DiG 9.2.4 <<>> 2.0.0.127.zen.spamhaus.org txt
     ;; global options: printcmd
     ;; Got answer:
     ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16072
     ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

     ;; QUESTION SECTION:
     ;2.0.0.127.zen.spamhaus.org. IN TXT

"dig zen.spamhaus.org ns" also produces SERVFAIL.

And here's the non-comment portion of my config (with some IPs replaced):

   server:
           verbosity: 2
           statistics-interval: 3600
           num-threads: 2
           interface: 1.2.3.4
           interface: 127.0.0.1
           outgoing-range: 256
           do-ip4: yes
           do-ip6: no
           do-udp: yes
           do-tcp: yes
           access-control: 127.0.0.0/8 allow
           access-control: 1.2.3.0/26 allow
           access-control: 192.168.84.0/24 allow
           chroot: "/var/unbound"
           username: "unbound"
           directory: "/var/unbound"
           pidfile: "/var/unbound/unbound.pid"
           hide-version: yes
           target-fetch-policy: "3 2 1 0 0"
           harden-glue: yes
           do-not-query-address: 127.0.0.0/8
           do-not-query-address: 10.0.0.0/8
           do-not-query-address: 172.16.0.0/12
           do-not-query-address: 192.168.0.0/16

Is there something I'm obviously doing wrong here? If not, is there any
more information I can provide?

Thanks!
                                     -- Aaron

Hi Aaron,

Looked at zen.spamhaus.org, but I could not see how harden-glue would
make an impact. Or how A records could time out while the AAAA do not.
They have the same timeout value (4 hours). I think it is possible to
change the code to deal with this situation, but I would like to know
how it happened.

The 30 minutes sounds close to the 15 minute (900 second) default
timeout on lameness detections. Various kinds of badness are detected
and stored in the infrastructure cache. Again, with 22 servers, it
seems unlikely all their A records are lame.
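
To make the lameness mechanism concrete, here is a toy sketch (Python,
not unbound's actual C code; names and structure are illustrative) of a
timed lameness verdict gating server selection, using the 900-second
default mentioned above:

    import time

    LAME_TTL = 900  # default lameness timeout in seconds, as above

    class InfraCache:
        """Toy infrastructure cache: remembers which (server, zone)
        pairs were judged lame and until when that verdict holds."""
        def __init__(self):
            self.lame = {}  # (server_ip, zone) -> expiry timestamp

        def mark_lame(self, server_ip, zone):
            self.lame[(server_ip, zone)] = time.time() + LAME_TTL

        def is_lame(self, server_ip, zone):
            expiry = self.lame.get((server_ip, zone))
            if expiry is None:
                return False
            if time.time() >= expiry:  # verdict expired: retry server
                del self.lame[(server_ip, zone)]
                return False
            return True

    def pick_server(cache, server_ips, zone):
        """First server not currently marked lame; None means there is
        nobody left to ask and the query ends in SERVFAIL."""
        for ip in server_ips:
            if not cache.is_lame(ip, zone):
                return ip
        return None

All 22 servers would have to be marked lame inside the same window for
this path alone to yield SERVFAIL.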

Do you have more log information you can share with me (you can send
this off list)? Ideally lots of data before this point that deals with
*.spamhaus.org. If you gzip it, 100M of log is only about 1 MB of email.

If it happens again can you query with dig +norec a.ns.spamhaus.org ?
And dig +norec +cdflag +dnssec a.ns.spamhaus.org ?

Best regards,
   Wouter

* W. C. A. Wijngaards:

Looked at zen.spamhaus.org, but I could not see how harden-glue would
make an impact. Or how A records could time out while the AAAA do not.
They have the same timeout value (4 hours).

Are you sure they are cached at the same time?

We had lots of fun with additional section processing for MX records
and A and AAAA records being cached at different times. It meant that
some of the time, only AAAA records ended up in the additional
section. Exim assumed that additional section processing has been
performed properly (because the AAAA RRset was present), and if the
host had no working IPv6 connection, it could not deliver the message.

In response to that, Exim was changed not to look at the additional
section. Considering the cache manipulation risk, I think additional
section processing is a failed approach anyway.
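
To picture that failure mode, here is a toy model (Python; the host name
and timestamps below are invented) of an additional section assembled
from a shared cache in which the A and AAAA RRsets entered at different
times, so they expire at different times even with identical TTLs:

    def additional_section(cache, now, target):
        """Copy whatever address RRsets are still live for target into
        the additional section; cache maps (name, rrtype) to a pair
        (records, expiry_timestamp)."""
        extra = []
        for rrtype in ("A", "AAAA"):
            entry = cache.get((target, rrtype))
            if entry and now < entry[1]:
                extra.extend((target, rrtype, r) for r in entry[0])
        return extra

    TTL = 14400  # 4 hours, like the spamhaus records discussed here
    cache = {
        # A fetched at t=1000, AAAA refetched later at t=4600:
        ("mail.example.net.", "A"):    (["192.0.2.25"],   1000 + TTL),
        ("mail.example.net.", "AAAA"): (["2001:db8::25"], 4600 + TTL),
    }

    # Between t=15400 (A expiry) and t=19000 (AAAA expiry) the reply
    # carries only AAAA glue, which Exim took to be the complete set:
    print(additional_section(cache, now=16000, target="mail.example.net."))
    # [('mail.example.net.', 'AAAA', '2001:db8::25')]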

Hi,

Looked at zen.spamhaus.org, but I could not see how harden-glue would
make an impact. Or how A records could time out while the AAAA do not.
They have the same timeout value (4 hours). I think it is possible to
change the code to deal with this situation, but I would like to know
how it happened.

We've been running into the same problem - running Version 1.0.2 with libevent 1.4.8-stable, libldns 1.3.0.

Something common between the two records (zen.spamhaus.org and
www.askmeofflistfortthename.net) is the fact that both of them have both
an A and an AAAA record. I perform a dig against our resolvers once a
minute for 3 records, and only the www.askmeofflistfortthename.net record
seems to be failing regularly - at least twice a day. I fix this by
*drumroll* restarting Unbound, which is not great, but I'd have to do the
same with BIND 9, so... ;)

The 30 minutes sounds close to the 15 minute (900 second) default
timeout on lameness detections. Various kinds of badness are detected
and stored in the infrastructure cache. Again, with 22 servers, it
seems unlikely all their A records are lame.

I run unbound with 4 threads, and I've noticed that only 1 thread seems to be returning the SERVFAIL.

Do you maintain an infrastructure cache per thread?

Do you have more log information you can share with me (you can send
this off list)? Ideally lots of data before this point that deals with
*.spamhaus.org. If you gzip it, 100M of log is only about 1 MB of email.

What level of verbosity should I set to provide meaningful information that you can use to debug the issue?

If it happens again can you query with dig +norec a.ns.spamhaus.org ?
And dig +norec +cdflag +dnssec a.ns.spamhaus.org ?

I'll do this for our record as well, and mail you with my findings.

Cheers,
Jaco

Jaco Engelbrecht wrote:

We've been running into the same problem - running Version 1.0.2 with
libevent 1.4.8-stable, libldns 1.3.0.

I have heard very good reports of unbound with libevent 1.4.8-stable.

I run unbound with 4 threads, and I've noticed that only 1 thread seems
to be returning the SERVFAIL.

Do you have 4 cores? Otherwise your OS may schedule mostly to one core.

Do you maintain an infrastructure cache per thread?

No, one cache per process.

What level of verbosity should I set to provide meaningful information
that you can use to debug the issue?

Emailed offlist.

I'll do this for our record as well, and mail you with my findings.

Thanks!

Best regards,
   Wouter

Or how A records could time out while the AAAA do not.
They have the same timeout value (4 hours).

So unbound just cares that it has a valid address for a given nameserver,
not that any of them are A, even if it wants to use IPv4? If so, this seems
problematic.

The A and AAAA aren't looked up atomically, right? You might get one or the
other in additionals if there's room left in the packet, otherwise you have
to query for A and AAAA separately? Isn't there a race condition here?
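
To illustrate the suspected race (a Python sketch, not unbound's code;
the addresses are the ones from the log above): once the A RRsets have
expired while the AAAA RRsets still have time left, the delegation still
looks fully provisioned even though an IPv4-only resolver can use none
of it:

    def usable_addresses(addrs, do_ip4=True, do_ip6=True):
        """Addresses the resolver may actually send queries to, given
        its transport configuration."""
        return [a for a in addrs
                if (":" in a and do_ip6) or (":" not in a and do_ip4)]

    # After the A records expired, only AAAA glue remains cached
    # (compare the "22 addrs ... 22 avail" log lines above):
    addrs = ["2001:7b8:3:1f:0:2:53:1", "2001:7b8:3:1f:0:2:53:2"]

    print(len(addrs) > 0)                         # True: "have addresses"
    print(usable_addresses(addrs, do_ip6=False))  # []: none we may use

If so, a check that only asks "do we have addresses for this delegation?"
would conclude there is nothing left to fetch, and every query would die
with SERVFAIL until the AAAA records expire as well.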

If I have "ip6: no" in the config, is there a reason it is handling AAAA at
all?

And if it matters, zen.spamhaus.org is a strange zone, in that it is served
by rbldnsd in lazy/minimal-answers mode that doesn't bother to fill out the
authority section. This apparently saves a lot of bandwidth, and the
only claimed operational difference is that they have to wait longer for
recursive servers to notice nameserver changes.

The 30 minutes sounds close to the 15 minute (900 second) default
timeout on lameness detections.

I had it happen again and the outage lasted almost 4 hours, which more
closely matches the A/AAAA TTL. I didn't manage to do any manual lookups at
the time, and I didn't leave logging enabled over the weekend, as the logs
grow way too quickly on this active nameserver.

I'll set up a testbed to try and reproduce.

If it happens again can you query with dig +norec a.ns.spamhaus.org ?
And dig +norec +cdflag +dnssec a.ns.spamhaus.org ?

I tried this out while it was operating normally, and it showed different
TTLs on A and AAAA:

a.ns.spamhaus.org. 8871 IN A 194.109.9.7
a.ns.spamhaus.org. 8871 IN A 192.150.94.204
a.ns.spamhaus.org. 10299 IN AAAA 2001:7b8:3:1f:0:2:53:1

Also, only 3 of the nameservers offer either A or AAAA results with just
+norec. I have to add +cdflag +dnssec to get As for all 22 nameservers. And for some reason, the AAAAs all have longer TTLs than the As.

                                     -- Aaron

Hi Aaron,

Aaron Hopkins wrote:

So unbound just cares that it has a valid address for a given nameserver,
not that any of them are A, even if it wants to use IPv4? If so, this seems
problematic.

Yes, this is a bug. I have fixed it in svn trunk right now, and it is
ready for the 1.1.0 release.

If I have "ip6: no" in the config, is there a reason it is handling AAAA at
all?

Yes, because if there are only AAAA addresses for the nameservers, but
no A addresses, then the zone is unreachable and we should not go back
to the TLD servers to get a new delegation. Also, more importantly I
guess, the reverse holds for do-ip4: no configurations.

I have fixed it in trunk so that it fetches A records when it needs them,
and also the inverted bug for people in IPv6-only environments.
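
In rough terms the repaired decision might look like this (a Python
sketch of the behavior just described, not the actual C change in trunk):

    def next_step(names, addrs, do_ip4=True, do_ip6=False):
        """An address family we may not use still proves the zone is
        alive, but a lack of usable addresses now triggers a fetch of
        the missing family instead of a SERVFAIL or a re-query of the
        TLD servers for the delegation."""
        usable = [a for a in addrs
                  if (":" in a and do_ip6) or (":" not in a and do_ip4)]
        if usable:
            return ("query", usable)
        if addrs:
            want = "A" if do_ip4 else "AAAA"
            return ("fetch", [(name, want) for name in names])
        return ("servfail", None)

    # With only AAAA glue cached and do-ip6: no, this now fetches the
    # missing A records for the NS names instead of failing:
    print(next_step(["a.ns.spamhaus.org."], ["2001:7b8:3:1f:0:2:53:1"]))
    # ('fetch', [('a.ns.spamhaus.org.', 'A')])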

And if it matters, zen.spamhaus.org is a strange zone, in that it is served
by rbldnsd in lazy/minimal-answers mode that doesn't bother to fill out the
authority section. This apparently saves a lot of bandwidth, and the
only claimed operational difference is that they have to wait longer for
recursive servers to notice nameserver changes.

Interesting.

the time, and I didn't leave logging enabled over the weekend, as the logs
grow way too quickly on this active nameserver.

Thanks for the information!

I'll set up a testbed to try and reproduce.

You can run svn trunk or the 1.1.0 release on your production server, so
it does not suffer from the problem.

If it happens again can you query with dig +norec a.ns.spamhaus.org ?
And dig +norec +cdflag +dnssec a.ns.spamhaus.org ?

I tried this out while it was operating normally, and it showed different
TTLs on A and AAAA:

a.ns.spamhaus.org. 8871 IN A 194.109.9.7
a.ns.spamhaus.org. 8871 IN A 192.150.94.204
a.ns.spamhaus.org. 10299 IN AAAA 2001:7b8:3:1f:0:2:53:1

Also, only 3 of the nameservers offer either A or AAAA results with just
+norec. I have to add +cdflag +dnssec to get As for all 22 nameservers.
And for some reason, the AAAAs all have longer TTLs than the As.

So, do I understand correctly that you only did +norec queries (any
query without +norec would have updated the cache and could invalidate
further results), and that you only got replies for most of the
nameservers with +norec +cdflag +dnssec?

Seems like they are marked as bogus. But spamhaus is not DNSSEC signed?

If they are really marked bogus for some mysterious reason, then the
1.1.0 version will reject it even more strongly; 1.1.0 does not query
nameservers with bogus addresses. But 1.0.2 did query bogus addresses,
so it isn't really the reason things are wrong for you.

The reason that things are going wrong is that the TTL for the A records
is much lower than for the AAAA records. This is also exactly what I
fixed. I do not really understand this massive difference in A and AAAA
time values. It could be 'coincidence' in the lookups. Jaco told me a
smaller zone had the problem more often, which supports the randomness
(more likely with fewer records).

I hope the fix works for you, let me know if you get new troubles.

By the way, about the huge log files: the upcoming 1.1.0 release has a
remote control tool with which you can change the verbosity online, so
once the problem has happened you can turn it very high, log a couple of
queries, and then turn the verbosity back down again.

Best regards,
   Wouter

* Wouter Wijngaards:

So, do I understand correctly that you only did +norec queries (any
query without +norec would have updated the cache and could invalidate
further results), and that you only got replies for most of the
nameservers with +norec +cdflag +dnssec?

+dnssec implies +bufsize=4096, so it's likely that it's just a side
effect.

The reason that things are going wrong is that the TTL for the A records
is much lower than for the AAAA records. This is also exactly what I
fixed. I do not really understand this massive difference in A and AAAA
time values.

Uhm, why should they expire at the same time from an LRU cache?