Unbound 1.4.1 cache problems

Hello there,

I have upgraded from Unbound 1.3.0 to 1.4.1 recently on a couple of
servers. The servers are doing about 3-7k requests per second.
I have noticed that after some time of operation, servers
intermittently stop responding to queries to certain domains from
cache (return SERVFAIL). It particularly doesn't like webex.com,
sometimes vmware.com, dell.com.

# unbound-host webex.com
webex.com has address 64.68.120.99
webex.com mail is handled by 50 mx3.webex.com.
webex.com mail is handled by 10 mx1.webex2.iphmx.com.
webex.com mail is handled by 10 mx2.webex2.iphmx.com.
webex.com mail is handled by 50 mx2.webex.com.

But the A record doesnt seem to get cached:

# dig webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> webex.com @localhost
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49004
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com. IN A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 19:16:01 2010
;; MSG SIZE rcvd: 27

# unbound-control dump_cache | grep ^webex.com
webex.com. 4589 IN NS ns01.webex.com.
webex.com. 4589 IN NS ns05.webex.com.
webex.com. 4589 IN NS ns07.webex.com.
webex.com. 4589 IN NS ns03.webex.com.
webex.com. 4589 IN NS ns08.webex.com.
webex.com. 4589 IN NS ns06.webex.com.

It doesnt respond to NS queries either (No A records for
nsxx.webex.com in cache)

# dig ns webex.com @localhost

; <<>> DiG 9.3.3rc2 <<>> ns webex.com @localhost
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 54946
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;webex.com. IN NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 8 20:53:18 2010
;; MSG SIZE rcvd: 27

After flushing webex.com I can query the cache again:

# unbound-control flush webex.com
ok
# dig +short webex.com @localhost
64.68.120.99

Log with increased verbosity:

an 8 21:05:32 cns01j unbound: [25052:3] info: validator operate:
query <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info: resolving <webex.com. A IN>
Jan 8 21:05:32 cns01j unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns08.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns05.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns01.webex.com. A
Jan 8 21:05:32 cns01j unbound: [25052:3] info: ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 2):
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: resolving (init part 3):
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: processQueryTargets:
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info:
DelegationPoint<webex.com.>: 6 names (6 missing), 6 addrs (0 result, 6
avail)
Jan 8 21:05:32 cns unbound: [25052:3] info: ns08.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns05.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns01.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns06.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns07.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: ns03.webex.com. A
Jan 8 21:05:32 cns unbound: [25052:3] info: validator operate: query
<webex.com. A IN>
Jan 8 21:05:32 cns unbound: [25052:3] info: mesh_run: end 8 recursion
states (8 with reply, 0 detached), 28 waiting replies, 84 recursion
replies sent, 0 replies dropped, 0 states jostled out
Jan 8 21:05:32 cns unbound: [25052:3] info: average recursion
processing time 4.172319 sec
Jan 8 21:05:32 cns unbound: [25052:3] info: histogram of recursion
processing times
Jan 8 21:05:32 cns unbound: [25052:3] info: [25%]=0.00107789
median[50%]=0.00744727 [75%]=0.0524288
Jan 8 21:05:32 cns unbound: [25052:3] info: lower(secs) upper(secs) recursions
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000000 0.000001 9
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.000256 0.000512 1
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.001024 0.002048 1
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.002048 0.004096 19
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.004096 0.008192 2
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.008192 0.016384 11
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.016384 0.032768 3
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.032768 0.065536 5
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.065536 0.131072 15
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.131072 0.262144 10
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.262144 0.524288 5
Jan 8 21:05:32 cns unbound: [25052:3] info: 0.524288 1.000000 2
Jan 8 21:05:32 cns unbound: [25052:3] info: 256.000000 512.000000 1

I have downgraded one of the servers back to 1.3.0 (libev 3.53), where
I don't see this problem.

Unbound 1.4.1 compiled with libev-3.9, without threads - 4 processes running

num-threads: 4
outgoing-range: 10000
msg-cache-size: 500m
msg-cache-slabs: 1
num-queries-per-thread: 10000
jostle-timeout: 4000
so-rcvbuf: 8m
rrset-cache-size: 1024m
rrset-cache-slabs: 1
cache-max-ttl: 86400

Any help would be greatly appreciated :slight_smile:

Thanks,

Hi Jakub,

I cannot know for sure since the log seems to be shortened by syslog,
but I think you may have the same problem as Haw Loeung reported. I
have since corresponded with him and I think have resolved the issue. A
patch is in svn for unbound.

The issue seems to be caused by the parent-child disagreement fix in
1.4.1, worsened by no IPv6 available (either not in the kernel or
do-ip6:no statement) - just turning ipv6 back on (also if you have no
ipv6 connectivity) may stop this bug.

Sorry for the inconvenience,
   Wouter

Thanks Wouter, the workaround with ip6 is working.

Best regards,
Jakub