We have received a few reports where domains have moved from one hosting
provider to another and our resolvers (all running Unbound) have been returning
old/incorrect information about these domains.
The 2 most recent reports are for the domains supre.com.au and ozcelebs.net. I
have included dig results that one of our staff members ran to show what's
happening.
The TTL on the A record seems to be originally 86400 (24h).
Thus if unbound sees the record just before it is changed, the
old data stays around for 24 hours. Unbound has a built-in
cap that limits caching to a 24-hour term (by coincidence
exactly the same value as the TTL on supre.com.au). You see
it with a 5h TTL, so unbound saw it 19h before. This is
exactly according to the DNS spec.
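The arithmetic described above can be sketched in a couple of lines of shell, using the values from the report (a 24h original TTL, observed again 19 hours after caching):

```shell
# Remaining TTL for an RRset cached with an original TTL of 86400s (24h),
# observed again 19 hours after unbound cached it.
original_ttl=86400
seconds_elapsed=$((19 * 3600))
remaining=$((original_ttl - seconds_elapsed))
echo "$((remaining / 3600))h remaining"
```

Which gives the 5h TTL seen in the dig output.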
If you want entries in the unbound cache to be flushed out earlier
than the owner intended, you can set cache-max-ttl (default 86400)
to a lower value instead of restarting every day.
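For illustration, lowering that cap in unbound.conf might look like this; the 14400 value (4 hours) is only an example, not a recommendation:

```
server:
    # default is 86400 (1 day); a lower cap bounds how long stale data
    # can survive a redelegation, at the cost of more upstream queries
    cache-max-ttl: 14400
```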
It could also be a bug where, due to a miscalculation inside
the resolver, the TTL becomes -1 (or infinite); but although
such a bug was fixed recently (in svn trunk) for DNSSEC-bogus
messages, my guess is you are not doing DNSSEC validation.
The TTL has already lapsed but our resolvers still show that the domain is
delegated to the old hosting provider nextgen.net when it should be
cpanelhost.net.au and hyperservers.com.au, as shown below:
The domain is still hosted on nextgen.net, with the old zone contents,
including the old NS RRset. For a successful transition, the new zone
contents need to be served from those servers, too.
But the NS records returned are still those of the old hosting providers. Let me try to explain it
better.
For both domains, the owners have changed hosting providers and have redelegated their
domains to the new providers. They are not our customers but have noticed that our users
are having problems accessing their websites because our resolvers are still returning old,
and incorrect, information, so our users are not hitting their new webservers.
They then contact us asking why this is the case and complain that other ISPs are returning the
new, and _correct_, information about their domain.
Our staff member does a dig, then waits a day to make sure that the TTL reaches 0 and our
resolvers *should* look up the latest information. But somewhere old NS records are being cached.
So for supre.com.au, it has already been delegated away to hyperservers.com.au and
cpanelhost.net.au as shown below:
Thinking about this further, if the TTL becomes -1, shouldn't it consider that cache entry stale and
look it up again? I mean, is there any reason for entries to be in the cache forever?
The domain is still hosted on nextgen.net, with the old zone contents,
including the old NS RRset.
No, it should be as follows:
======
$ dig +trace supre.com.au
<snip>
com.au. 259200 IN NS udns2.ausregistry.net.au.
com.au. 259200 IN NS udns3.ausregistry.net.au.
com.au. 259200 IN NS udns4.ausregistry.net.au.
;; Received 429 bytes from 2001:dc0:2001:a:4608::59#53(A1.AUDNS.NET.au) in 689 ms
supre.com.au. 14400 IN NS ns1.hyperservers.com.au.
supre.com.au. 14400 IN NS ns2.hyperservers.com.au.
supre.com.au. 14400 IN NS ns1.cpanelhost.net.au.
supre.com.au. 14400 IN NS ns2.cpanelhost.net.au.
;; Received 162 bytes from 211.29.133.32#53(audns.optus.net) in 3 ms
Yes, but if you ask the nextgen.net servers due to the existing cache
contents, you keep getting the information that the nextgen.net
servers are still authoritative, so it's not necessary to ask the
com.au servers for fresh data.
I don't know the industry consensus regarding appropriate resolver
behaviour here. Re-validating glue has obvious pros (it would help in your
scenario), but also cons (more load on large, delegation-centric
zones).
That's a classic problem: the cached NS records get refreshed when the
cached A records expire and get re-fetched from the cached (old)
nameservers, and as long as the old nameservers answer authoritatively,
they'll never get removed from your caching resolvers. The only
workaround we found for this behaviour is simply to restart the caches
every morning. This happened with djb's dnscache and BIND; with unbound we
simply assumed it was the same, which your case seems to confirm.
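For completeness, the daily-restart workaround described above usually amounts to a cron entry along these lines (the path and time here are illustrative, not taken from the report):

```
# restart the resolver every morning at 06:00 (illustrative)
0 6 * * * /usr/sbin/service unbound restart
```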
This is a bug and should not happen. Actually, -1 is a misnomer; I mean a miscalculation with a too-large TTL as the result. It is fairly rare, though it hits DNS resolvers from time to time. If this really happens, you can spot it with the unbound-control dump_cache command: the domain in question will show insanely large TTL values (either RR or msg entries).
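As an illustration, such a dump can be scanned mechanically. The two RR lines below are fabricated sample data, but `unbound-control dump_cache` prints cache entries in a similar one-RR-per-line form, so the same awk filter could be pointed at a real dump:

```shell
# Flag any RR whose TTL (field 2) exceeds the 86400s cap; a hit would
# point at the kind of TTL miscalculation described above.
printf '%s\n' \
  'supre.com.au. 4294967295 IN NS ns1.nextgen.net.' \
  'supre.com.au. 18000 IN NS ns1.hyperservers.com.au.' |
awk '$2 > 86400 { print "suspect TTL:", $1, $2 }'
```

Only the fabricated 4294967295 entry is flagged; the healthy 18000s (5h) entry passes silently.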
Unbound does not lock on to old NS servers even if they keep serving the old zone. After the NS record times out it fetches the delegation again and uses that until it expires. This can mean that if it fetches an NS record that holds for 24h, and at the end of that fetches an A record that holds for 24h, then 48h after the redelegation the A record is still the old one. This is the worst case: the NS record is refetched at the last moment the old delegation is still being handed out, and the A record is refetched just before that bad NS record expires. After that the TTLs expire and the new data is used.
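The worst-case timeline above reduces to simple addition; a quick shell check with the 24h TTLs from the report:

```shell
# NS refetched from the old delegation just before redelegation (holds 24h),
# A refetched just before that stale NS RRset expires (holds another 24h).
ns_ttl=86400
a_ttl=86400
echo "worst case: $(( (ns_ttl + a_ttl) / 3600 ))h until the new data is served"
```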
The phrase 'all the others do not have the problem, please reload your cache' might have been used with all the other ISPs too.
No, with newer unbound it is probably more likely to be different in any case.
From your data they waited for one day. Could it be this that triggered the complaint? Could the way the cache is kept, with the old servers still serving the old zone, stretch the time until the new data appears beyond the one day they expected? (Or did they expect even faster convergence, even though they used a 24h TTL?)
"they waited for one day" referred to one of our staff members who was
handling the report. Our staff member did a dig, waited 24 hours and did the
same dig against the same resolvers and still received incorrect information.
It's almost as if unbound was still caching the old delegation.