Once TTL has gone, don't fetch again

Hello list,

A colleague in the Japan Unbound Users Group reported the
following problem, and I was able to reproduce it.

Environment:
- his
- unbound-1.3.3 + ldns-1.6.1 + libevent-1.4.2-stable
- mine
- vanilla built unbound-1.3.3

Problem:
On the first attempt, unbound resolves "www.rurubu.com"
correctly. After 1 hour (the record's TTL), a query to unbound
for "www.rurubu.com" ends with SERVFAIL.

Some conditions observed:

  # The authoritative answer for "www.rurubu.com" doesn't
  # contain an ADDITIONAL SECTION.

  # I know that RFC 1035 describes the additional section as
  # "a possibly empty list", so this is notable, but
  # not a problem.

kohi@guest1[19]% dig @ns1.visualjapan.co.jp www.rurubu.com

; <<>> DiG 9.5.0-P2 <<>> @ns1.visualjapan.co.jp www.rurubu.com
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45680
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.rurubu.com. IN A

;; ANSWER SECTION:
www.rurubu.com. 3600 IN A 202.143.76.167

;; Query time: 13 msec
;; SERVER: 210.225.98.1#53(210.225.98.1)
;; WHEN: Mon Sep 7 23:35:46 2009
;; MSG SIZE rcvd: 48

  # Authority for visualjapan.co.jp is delegated to
  # these 3 servers.

kohi@guest1[22]% dig +norec @a.dns.jp visualjapan.co.jp NS
     :
  (snip)
     :
;; AUTHORITY SECTION:
visualjapan.co.jp. 86400 IN NS ns2.visualjapan.co.jp.
visualjapan.co.jp. 86400 IN NS ns1.visualjapan.co.jp.
visualjapan.co.jp. 86400 IN NS ns-tk022.ocn.ad.jp.

  # but one (ns-tk022.ocn.ad.jp) is a lame delegation.

kohi@guest1[23]% dig +norec @ns-tk022.ocn.ad.jp ns1.visualjapan.co.jp

; <<>> DiG 9.5.0-P2 <<>> +norec @ns-tk022.ocn.ad.jp ns1.visualjapan.co.jp
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 28896
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;ns1.visualjapan.co.jp. IN A

;; Query time: 13 msec
;; SERVER: 203.139.160.104#53(203.139.160.104)
;; WHEN: Mon Sep 7 23:41:15 2009
;; MSG SIZE rcvd: 39

I can provide a log with verbosity: 5 and the output of
unbound-control dump_cache, but they are too large to post
to this list. I can send them off-list if you need them, or
let me know an appropriate keyword to grep for, or an
appropriate verbosity setting.

Thanks

          Koh-ichi Ito

Hi Koh-ichi,

Yes I could reproduce this.

The trouble is this:
the TTL for ns1.visualjapan.co.jp and ns2.visualjapan.co.jp
is 3600 when visualjapan.co.jp servers are queried, not
86400 like the other IP addresses have - like the lame
server has.

So after one hour, the working server addresses have timed
out and only the lame server address is still in cache.

Unbound tries to fetch the missing, timed-out addresses, but
it needs to ask ... the lame server for them. So it cannot fetch
them. It looks like the domain is a lame domain.

Unbound cannot ask co.jp again because this looks like a
lame domain as in BCP 123 about resolver misbehavior.
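
The timing described above can be sketched with a toy model. This is only an illustration under stated assumptions: the TTL values are the ones reported in this thread, while the dictionary layout and the helper name `usable_servers` are made up for the example and are not Unbound's internals.

```python
# Toy model of cached nameserver addresses for visualjapan.co.jp.
# TTLs are the values reported in this thread; everything else
# (names, structure) is hypothetical and not Unbound's real code.
ADDRESS_TTL = {
    "ns1.visualjapan.co.jp.": 3600,
    "ns2.visualjapan.co.jp.": 3600,
    "ns-tk022.ocn.ad.jp.": 86400,
}
LAME = {"ns-tk022.ocn.ad.jp."}


def usable_servers(seconds_since_cached):
    """Return servers whose address is still cached and that are not lame."""
    still_cached = [
        name for name, ttl in ADDRESS_TTL.items()
        if seconds_since_cached < ttl
    ]
    return [name for name in still_cached if name not in LAME]


# Shortly after the first lookup both working servers are usable.
print(usable_servers(60))

# Just past one hour only the lame server's address remains cached,
# so there is nothing usable left to ask.
print(usable_servers(3601))

# Length of the trouble window: from the expiry of the working
# addresses until the lame server's address also expires.
trouble_hours = (86400 - 3600) / 3600
print(trouble_hours)  # 23.0
```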

Fixes:
- Change the TTL for ns1.visualjapan and ns2.visualjapan
  to 86400 when you query them for themselves. Likely there
  is still a second or two of trouble, but not 23 hours of trouble.
- Remove the lame server from the NS record of visualjapan.co.jp.
- Fix the lame server.

Something else that caught my eye (but is not the problem here)
is that ns1.visualjapan has two IP addresses when
given in the delegation from co.jp, but only one IP address when
queried directly.

Best regards,
   Wouter

Hello, Wouter

I appreciate your quick reply as usual. Thanks a lot.

Hello again, Wouter

> Yes I could reproduce this.
>
> The trouble is this:
> the TTL for ns1.visualjapan.co.jp and ns2.visualjapan.co.jp
> is 3600 when visualjapan.co.jp servers are queried, not
> 86400 like the other IP addresses have - like the lame
> server has.

The original poster in our forum agrees with your description. But he
also points out that the problem doesn't occur with BIND9.

Though we understand that fixing the lame delegation is the proper
way, I guess this is a disadvantage when deploying unbound. Is there
any plan for a workaround?

For example, if unbound finds that ns-tk022.ocn.ad.jp is lame,
it could fetch the A records of ns[12].visualjapan.co.jp again and
continue to work.

Thanks in advance

          Koh-ichi Ito

Hi Koh-ichi Ito,

There is a fix in unbound svn trunk.
It detects that it recently saw glue for other nameservers
and attempts to fetch it by querying the parent again.
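
As a rough illustration of that idea only (this is not the code in svn trunk), the decision could be sketched as below. The function name `choose_action`, the returned strings, and the `RECENT_GLUE_WINDOW` value are all assumptions made up for the example.

```python
# Hypothetical sketch of the fallback described above; not Unbound's code.
# RECENT_GLUE_WINDOW is an assumed value chosen for the example.
RECENT_GLUE_WINDOW = 86400  # seconds


def choose_action(usable_addresses, seconds_since_glue_seen):
    """Pick the next step when resolving a name below a delegation.

    usable_addresses: addresses of non-lame servers still in cache.
    seconds_since_glue_seen: age of the last glue seen for other
        nameservers of the zone, or None if none was ever seen.
    """
    if usable_addresses:
        return "query child servers"
    if (seconds_since_glue_seen is not None
            and seconds_since_glue_seen < RECENT_GLUE_WINDOW):
        # Glue for working servers was seen recently, so ask the
        # parent (co.jp in this thread) for it again.
        return "requery parent for glue"
    return "SERVFAIL"


# One hour after the first lookup: no usable address is left, but the
# glue for ns1/ns2 was seen about an hour ago.
print(choose_action([], 3601))
```

The point of the sketch is that the parent is only asked again when there is evidence that working nameserver glue existed recently, so a domain that is lame on every server still ends in SERVFAIL.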

Best regards,
   Wouter