Unbound and newegg.com

newegg.com's NS is hosted by ultradns:

    $ unbound-host -t ns newegg.com
    newegg.com has NS record dns1.magnellmail.net.
    newegg.com has NS record pdns6.ultradns.co.uk.
    newegg.com has NS record pdns5.ultradns.info.
    newegg.com has NS record pdns4.ultradns.org.
    newegg.com has NS record pdns3.ultradns.org.
    newegg.com has NS record pdns2.ultradns.net.
    newegg.com has NS record pdns1.ultradns.net.
    $

but interestingly ultradns delegates www.newegg.com and
secure.newegg.com to other servers.

    ;; QUESTION SECTION:
    ;secure.newegg.com. IN A

    ;; AUTHORITY SECTION:
    secure.newegg.com. 30000 IN NS ns14b.newegg.com.
    secure.newegg.com. 30000 IN NS ns13b.newegg.com.

    ;; ADDITIONAL SECTION:
    ns14b.newegg.com. 30000 IN A 204.14.213.149
    ns13b.newegg.com. 30000 IN A 216.52.208.149

these servers will answer authoritatively for the A records of www and
secure, but provide root referrals when asked about the AAAA records.

    $ dnsq aaaa www.newegg.com ns14b.newegg.com
    28 www.newegg.com:
    512 bytes, 1+0+13+4 records, response, noerror
    query: 28 www.newegg.com
    authority: . 3600000 NS a.root-servers.net
    authority: . 3600000 NS b.root-servers.net
    authority: . 3600000 NS c.root-servers.net
    authority: . 3600000 NS d.root-servers.net
    authority: . 3600000 NS e.root-servers.net
    authority: . 3600000 NS f.root-servers.net
    authority: . 3600000 NS g.root-servers.net
    authority: . 3600000 NS h.root-servers.net
    authority: . 3600000 NS i.root-servers.net
    authority: . 3600000 NS j.root-servers.net
    authority: . 3600000 NS k.root-servers.net
    authority: . 3600000 NS l.root-servers.net
    authority: . 3600000 NS m.root-servers.net
    additional: a.root-servers.net 3600000 A 198.41.0.4
    additional: b.root-servers.net 3600000 A 128.9.0.107
    additional: c.root-servers.net 3600000 A 192.33.4.12
    additional: d.root-servers.net 3600000 A 128.8.10.90
    $ dnsq a www.newegg.com ns14b.newegg.com
    1 www.newegg.com:
    48 bytes, 1+1+0+0 records, response, authoritative, noerror
    query: 1 www.newegg.com
    answer: www.newegg.com 120 A 204.14.213.185
    $
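The difference between the two dnsq responses above can be sketched in code. This is a minimal illustration, not how unbound, bind, or dnscache actually classify responses: the `Response` type and both helper functions are hypothetical names, and the rule used (a non-authoritative, answerless NOERROR response whose NS records sit outside the delegation point) is an assumed simplification of what "upward referral" means.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical, minimal stand-in for a parsed DNS response."""
    rcode: str
    authoritative: bool
    answers: list = field(default_factory=list)    # (owner, rtype) pairs
    authority: list = field(default_factory=list)  # (owner, rtype) pairs


def is_subdomain(name: str, zone: str) -> bool:
    """Return True if name is at or below zone (case-insensitive)."""
    name = name.lower().rstrip(".")
    zone = zone.lower().rstrip(".")
    if zone == "":  # the root contains every name
        return True
    return name == zone or name.endswith("." + zone)


def is_upward_referral(resp: Response, delegation_point: str) -> bool:
    """Assumed rule: an answerless, non-authoritative NOERROR response whose
    NS records all belong to names outside the delegation point."""
    if resp.rcode != "noerror" or resp.authoritative or resp.answers:
        return False
    ns_owners = [owner for owner, rtype in resp.authority if rtype == "NS"]
    if not ns_owners:
        return False
    return all(not is_subdomain(owner, delegation_point) for owner in ns_owners)


# The AAAA response above: 0 answers, 13 NS records for ".", not authoritative.
aaaa_resp = Response("noerror", False, authority=[(".", "NS")] * 13)
# The A response above: 1 answer, authoritative.
a_resp = Response("noerror", True, answers=[("www.newegg.com.", "A")])

print(is_upward_referral(aaaa_resp, "www.newegg.com."))  # True
print(is_upward_referral(a_resp, "www.newegg.com."))     # False
```

A correctly behaving server would instead answer the AAAA query with an authoritative NOERROR response and an empty answer section (NODATA).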

when asked for the AAAA record and then the A record, as a typical
resolver(3) client will do, unbound responds with SERVFAIL. it seems the
root referral returned for the AAAA query somehow poisons unbound's view
of the server (see attached newegg-fail.log). when asked for only the A
record, unbound doesn't receive any bad data and returns the record (see
attached newegg-success.log).

bind and dnscache handle this lameness, so include the usual
new-kid-on-the-block / abuse-of-the-robustness-principle arguments.

(attachments)

newegg-fail.log.gz (12.2 KB)
newegg-success.log.gz (12.6 KB)

edmonds@debian.org (Robert Edmonds) wrote:

these servers will answer authoritatively for the A records of www and
secure, but provide root referrals when asked about the AAAA records.

I've come across the same bad behaviour from the servers for
www.usps.com: they report that they're lame for the AAAA RR rather than
providing a NOERROR/NODATA response. (Note: fpdns can't id the DNS server
implementation involved.) Here are dnscap traces from Unbound and BIND:

Unbound (r1126):

        http://www.panix.com/~geoff/unbound_trace.txt

BIND (9.5.0):

        http://www.panix.com/~geoff/bind_trace.txt

The second trace shows that BIND goes on to query for the A RR even
though the servers are lame for the AAAA RR. I suspect the BIND
developers had to add this as a workaround at some point. (Mark,
are you on this list?)

I've worked around this with a local-data statement in unbound.conf, but
the danger is that others deploying Unbound will quickly revert to BIND
the first time they come across this behaviour. www.usps.com is the
main web site for the US Postal Service, so this will happen quickly for
users in the US. I suspect that Unbound will have to be made resilient
to this sort of failure -- perhaps as an option which defaults to "yes".

Geoff

Hi Robert,

Thank you for the detailed bug report.

In svn trunk rev1137 a fix is in.

The fix, for the curious, is only to mark the newegg server lame if it
responds lame to both the A and AAAA queries. So the users can intermix
A and AAAA queries, with AAAA failing and A working.
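The rule described above can be sketched as follows. This is only an illustration of the stated rule (lame for both A and AAAA before the server is written off), not the code that went into rev1137; the `ServerLameness` class and its method names are hypothetical.

```python
class ServerLameness:
    """Tracks, per server address, which of the A/AAAA query types drew a
    lame response. A server counts as lame only once both types have."""

    ADDRESS_TYPES = frozenset({"A", "AAAA"})

    def __init__(self):
        self._lame_types = {}  # server address -> set of lame qtypes

    def record_lame(self, server: str, qtype: str) -> None:
        """Note that the server gave a lame response for this query type."""
        self._lame_types.setdefault(server, set()).add(qtype.upper())

    def record_ok(self, server: str, qtype: str) -> None:
        """Note that the server answered this query type properly."""
        self._lame_types.get(server, set()).discard(qtype.upper())

    def is_lame(self, server: str) -> bool:
        """True only if the server was lame for both A and AAAA."""
        return self.ADDRESS_TYPES <= self._lame_types.get(server, set())


tracker = ServerLameness()
tracker.record_lame("204.14.213.149", "AAAA")  # root referral for AAAA
print(tracker.is_lame("204.14.213.149"))       # False: A may still work
tracker.record_lame("204.14.213.149", "A")
print(tracker.is_lame("204.14.213.149"))       # True: lame for both types
```

With this rule a failed AAAA lookup no longer prevents the following A lookup from being sent to the same server.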

Man, those newegg servers / load balancers are bad stuff. I just noticed
they completely drop class CH (e.g., version.bind CH TXT) queries too.
Luckily this is not a particular problem.

This also fixes www.usps.com, by the way, which seems to be running a
different setup, as version.bind CH TXT is answered with a neat NOTIMP
answer (good!).

Best regards,
~ Wouter

Robert Edmonds wrote:

"W.C.A. Wijngaards" <wouter@NLnetLabs.nl> writes:

In svn trunk rev1137 a fix is in.

Installed. Works fine now for www.usps.com. Thanks Wouter!

The fix, for the curious, is only to mark the newegg server lame if it
responds lame to both the A and AAAA queries. So the users can intermix
A and AAAA queries, with AAAA failing and A working.

Makes sense.

Geoff

This "server" will only answer A queries, even though it is the target
of a delegation, i.e. usps.com gives out:

    ;; AUTHORITY SECTION:
    www.usps.com. 3600 IN NS nssam.usps.com.
    www.usps.com. 3600 IN NS nseag.usps.com.

Then if you ask the server you get:

    ; <<>> DiG 9.4.0b2 <<>> @nseag.usps.com. www.usps.com. NS
    ; (1 server found)
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63452
    ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 4
    ;; WARNING: recursion requested but not available

    ;; QUESTION SECTION:
    ;www.usps.com. IN NS

    ;; AUTHORITY SECTION:
    . 3600000 IN NS a.root-servers.net.
    . 3600000 IN NS b.root-servers.net.

This server does not even have the SOA or NS records that are required to
exist at the top of a zone; it only answers queries for A correctly.

IMHO it is wrong to add a fix in the resolver for such a badly behaving
load balancer.

Please do not do it; tell people to report the error to the site
and instruct them to report that their equipment has a broken DNS
server.

         Olafur

Olafur Gudmundsson <ogud@ogud.com> writes:

This server does not even have the SOA or NS records that are required to
exist at the top of a zone; it only answers queries for A correctly.

IMHO it is wrong to add a fix in the resolver for such a badly behaving
load balancer.

Please do not do it; tell people to report the error to the site
and instruct them to report that their equipment has a broken DNS
server.

I agree that the server behaviour should be corrected. The question
is: how many name servers out there exhibit this error? If only
the servers for www.newegg.com and www.usps.com are broken, then
I agree that putting this work-around in the resolver is unnecessary
and perhaps even harmful. If they are just the tip of an iceberg,
then the work-around is needed. Otherwise sites that try to deploy
Unbound will find themselves dealing with user complaints for which
the convenient solution will be to revert to BIND.

I don't know how often this misconfiguration occurs. It would be
interesting to obtain the logs from a high-traffic resolver that
hasn't blackholed lame server logging. One clue, though: PowerDNS
Recursor appears to have the same work-around as BIND:

  http://www.panix.com/~geoff/pdns.out

Geoff

This server does not even have the SOA or NS records that are required to
exist at the top of a zone; it only answers queries for A correctly.

this server also doesn't use DNS compression, doesn't understand ANY,
and, and, ...

IMHO it is wrong to add a fix in the resolver for such a badly behaving
load balancer.

Re-read the last paragraph of the original post and join me on the
theatre's balcony.

Please do not do it; tell people to report the error to the site
and instruct them to report that their equipment has a broken DNS
server.

Well, BIND made the change a while ago ...

    1880. [func] The lame cache is now done on a <qname,qclass,qtype>
                 basis as some servers only appear to be lame for
                 certain query types. [RT #14916]

in spite of being a widely used resolver implementation. Now, how should
Unbound choose between following the spec(*) or the leading implementation?

(*) I'm also not sure that the specs actually encourage, even less dictate,
a lameness memory. From the perspective of a name server operator who
occasionally receives lame delegations, I'd of course appreciate a less
exhaustive resolver behaviour.
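A lame cache kept on a <qname,qclass,qtype> basis, as in the BIND change quoted above, might look roughly like this. It is a sketch under assumptions, not BIND's implementation: the `LameCache` class, the inclusion of the server address in the key, and the 600-second default lifetime are all invented for illustration.

```python
import time


class LameCache:
    """Remembers lameness per (server, qname, qclass, qtype) for a limited
    time, so a server that is lame for AAAA can still be asked for A."""

    def __init__(self, ttl: float = 600.0, clock=time.monotonic):
        self._ttl = ttl        # assumed lifetime of an entry, in seconds
        self._clock = clock    # injectable clock, handy for testing
        self._expiry = {}      # key -> expiry time

    @staticmethod
    def _key(server, qname, qclass, qtype):
        # Names, classes and types compare case-insensitively.
        return (server, qname.lower().rstrip("."), qclass.upper(), qtype.upper())

    def mark_lame(self, server, qname, qclass, qtype) -> None:
        """Record a lame response for exactly this question."""
        key = self._key(server, qname, qclass, qtype)
        self._expiry[key] = self._clock() + self._ttl

    def is_lame(self, server, qname, qclass, qtype) -> bool:
        """True if an unexpired lame entry exists for exactly this question."""
        key = self._key(server, qname, qclass, qtype)
        expires = self._expiry.get(key)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expiry[key]  # drop the stale entry
            return False
        return True


cache = LameCache()
cache.mark_lame("204.14.213.149", "www.newegg.com.", "IN", "AAAA")
print(cache.is_lame("204.14.213.149", "www.newegg.com.", "IN", "AAAA"))  # True
print(cache.is_lame("204.14.213.149", "www.newegg.com.", "IN", "A"))     # False
```

Because the query type is part of the key, remembering the AAAA failure has no effect on whether the same server is asked for the A record.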

-Peter

* Geoffrey Sisson:

I agree that the server behaviour should be corrected. The question
is: how many name servers out there exhibit this error?

Lameness specific to QNAME and QTYPE is fairly widespread among
DNS-based load balancing solutions.