Can't resolve m.facebook.com

Hi,

Running unbound r2580, I can't resolve m.facebook.com; I get SERVFAIL back.
The server had been running for some time, so it is not in a fresh state. The
problem seems to be that the facebook DNS servers time out on AAAA queries, so
unbound falsely assumes that they are unavailable.
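
From a client, the failure looks something like this (a hypothetical
reproduction, assuming the resolver listens on 127.0.0.1; the exact dig
output will differ):

    $ dig @127.0.0.1 m.facebook.com A
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1234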

Here are the verbose (level 4) logs while trying to resolve the name:
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: answer from the cache failed
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: start
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: validator[module 0] operate: extstate:module_state_initial event:module_event_new
Feb 9 08:24:33 cns04k unbound: [32693:3] info: validator operate: query m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: validator: pass to next module
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: validator module exit state is module_wait_module
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: python module exit state is module_wait_module
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: process_request: new external request event
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state INIT REQUEST STATE
Feb 9 08:24:33 cns04k unbound: [32693:3] info: resolving m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: request has dependency depth of 0
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: cache delegation returns delegpt
Feb 9 08:24:33 cns04k unbound: [32693:3] info: DelegationPoint<facebook.com.>: 5 names (0 missing), 5 addrs (0 result, 5 avail) cacheNS
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns1.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns2.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns3.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns5.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns4.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.63.186.49 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 66.220.145.65 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 66.220.151.20 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 204.74.67.132 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 204.74.66.132 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state INIT REQUEST STATE (stage 2)
Feb 9 08:24:33 cns04k unbound: [32693:3] info: resolving (init part 2): m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state INIT REQUEST STATE (stage 3)
Feb 9 08:24:33 cns04k unbound: [32693:3] info: resolving (init part 3): m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state QUERY TARGETS STATE
Feb 9 08:24:33 cns04k unbound: [32693:3] info: processQueryTargets: m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
Feb 9 08:24:33 cns04k unbound: [32693:3] info: DelegationPoint<facebook.com.>: 5 names (0 missing), 5 addrs (0 result, 5 avail) cacheNS
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns1.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns2.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns3.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns5.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: ns4.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.63.186.49 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 66.220.145.65 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 66.220.151.20 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 204.74.67.132 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 204.74.66.132 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: attempt to get extra 3 targets
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: selrtt 376
Feb 9 08:24:33 cns04k unbound: [32693:3] info: sending query: m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: sending to target: <facebook.com.> 69.63.186.49#53
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: dnssec status: not expected
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: EDNS lookup known=0 vs=0
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: serviced query UDP timeout=376 msec
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: inserted new pending reply id=6743
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: opened UDP if=0 port=50044
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: comm point start listening 13
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: iterator module exit state is module_wait_reply
Feb 9 08:24:33 cns04k unbound: [32693:3] info: mesh_run: end 1 recursion states (1 with reply, 0 detached), 1 waiting replies, 0 recursion replies sent, 0 replies dropped, 0 states jostled out
Feb 9 08:24:33 cns04k unbound: [32693:3] info: 0RDd mod2 rep m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: cache memory msg=223346769 rrset=505413687 infra=2802115 val=1072421
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: answer cb
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: Incoming reply id = 6743
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: Incoming reply addr = ip4 69.63.186.49 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: lookup size is 1 entries
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: received udp reply.
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: udp message[113:0] 674380000001000000020003016D0866616365626F6F6B03636F6D0000010001C00C0002000100000E10000704676C6231C00EC00C0002000100000E10000704676C6232C00EC02C0001000100000E10000445ABEF0AC03F0001000100000E10000445ABFF0A0000291000000080000000
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: outnet handle udp reply
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: serviced query: EDNS works for ip4 69.63.186.49 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: measured roundtrip at 113 msec
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: svcd callbacks start
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: worker svcd callback for qstate 0x8025ff080
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: start
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply
Feb 9 08:24:33 cns04k unbound: [32693:3] info: iterator operate: query m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: process_response: new external response event
Feb 9 08:24:33 cns04k unbound: [32693:3] info: scrub for facebook.com. NS IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: response for m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: reply from <facebook.com.> 69.63.186.49#53
Feb 9 08:24:33 cns04k unbound: [32693:3] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
Feb 9 08:24:33 cns04k ;; flags: qr ; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 2
Feb 9 08:24:33 cns04k ;; QUESTION SECTION:
Feb 9 08:24:33 cns04k ;; m.facebook.com. IN A
Feb 9 08:24:33 cns04k ;; ANSWER SECTION:
Feb 9 08:24:33 cns04k ;; AUTHORITY SECTION:
Feb 9 08:24:33 cns04k m.facebook.com. 3600 IN NS glb1.facebook.com.
Feb 9 08:24:33 cns04k m.facebook.com. 3600 IN NS glb2.facebook.com.
Feb 9 08:24:33 cns04k ;; ADDITIONAL SECTION:
Feb 9 08:24:33 cns04k glb1.facebook.com. 3600 IN A 69.171.239.10
Feb 9 08:24:33 cns04k glb2.facebook.com. 3600 IN A 69.171.255.10
Feb 9 08:24:33 cns04k ;; Query time: 0 msec
Feb 9 08:24:33 cns04k ;; WHEN: Thu Jan 1 01:00:00 1970
Feb 9 08:24:33 cns04k ;; MSG SIZE rcvd: 102
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state QUERY RESPONSE STATE
Feb 9 08:24:33 cns04k unbound: [32693:3] info: query response was REFERRAL
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found in cache glb2.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found in cache glb1.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: DelegationPoint<m.facebook.com.>: 2 names (0 missing), 2 addrs (0 result, 2 avail) parentNS
Feb 9 08:24:33 cns04k unbound: [32693:3] info: glb2.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: glb1.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.171.255.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.171.239.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: cleared outbound list for next round
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: iter_handle processing q with state QUERY TARGETS STATE
Feb 9 08:24:33 cns04k unbound: [32693:3] info: processQueryTargets: m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
Feb 9 08:24:33 cns04k unbound: [32693:3] info: DelegationPoint<m.facebook.com.>: 2 names (0 missing), 2 addrs (0 result, 2 avail) parentNS
Feb 9 08:24:33 cns04k unbound: [32693:3] info: glb2.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] info: glb1.facebook.com. * A AAAA
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.171.255.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: ip4 69.171.239.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: attempt to get extra 3 targets
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: servselect ip4 69.171.239.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: rtt=119000 LAME DNSSEC_LAME REC_LAME
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: servselect ip4 69.171.255.10 port 53 (len 16)
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: rtt=119000 LAME DNSSEC_LAME REC_LAME
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: No more query targets, attempting last resort
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found in cache glb2.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found in cache glb1.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found parent-side glb2.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found parent-side glb2.facebook.com. AAAA IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found parent-side glb1.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] info: found parent-side glb1.facebook.com. AAAA IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: out of query targets -- returning SERVFAIL
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: store error response in message cache
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: return error response SERVFAIL
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: iterator module exit state is module_finished
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: python module exit state is module_finished
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: validator[module 0] operate: extstate:module_wait_module event:module_event_moddone
Feb 9 08:24:33 cns04k unbound: [32693:3] info: validator operate: query m.facebook.com. A IN
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: validator: nextmodule returned
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: cannot validate non-answer, rcode SERVFAIL
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: mesh_run: validator module exit state is module_finished
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: query took 0.113802 sec
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: pythonmod: clear, id: 1, pq:80936C4E8
Feb 9 08:24:33 cns04k unbound: [32693:3] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
Feb 9 08:24:33 cns04k unbound: [32693:3] info: average recursion processing time 0.113802 sec
Feb 9 08:24:33 cns04k unbound: [32693:3] info: histogram of recursion processing times
Feb 9 08:24:33 cns04k unbound: [32693:3] info: [25%]=0 median[50%]=0 [75%]=0
Feb 9 08:24:33 cns04k unbound: [32693:3] info: lower(secs) upper(secs) recursions
Feb 9 08:24:33 cns04k unbound: [32693:3] info: 0.065536 0.131072 1
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: cache memory msg=223346769 rrset=505413687 infra=2802115 val=1072421
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: svcd callbacks end
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: close of port 50044
Feb 9 08:24:33 cns04k unbound: [32693:3] debug: close fd 13

"Thanks for the report, the feature itself could complement the local-data
facility nicely, but it does not seem to be necessary now and we have a strict
no-features guideline."
Well, here is my +1 for http://www.nlnetlabs.nl/bugs-script/show_bug.cgi?format=multiple&id=356 :-)

Hi Attila,

> Hi,
>
> Running unbound r2580, I can't resolve m.facebook.com; I get
> SERVFAIL back. The server had been running for some time, so it is
> not in a fresh state. The problem seems to be that the facebook DNS
> servers time out on AAAA queries, so unbound falsely assumes that
> they are unavailable.

Well, if you do not respond to queries, you deserve what you get. DNS
has NOERROR/NODATA answers for this case, and that is what should be
used. They do not implement RFC 1034, and for that facebook deserves
to be offline.

That said, you want your users to be able to connect to sites that
run broken software (or, more likely, sit behind a bad firewall). The
feature you name would not actually stop unbound's internal lookups
for the AAAA records of the nameservers. As a workaround you would
need to configure a stub-zone in the config file with the IPv4
addresses of the nameservers, as sketched below.
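
Something like this (a minimal sketch using the facebook.com server
addresses from the log above; standard unbound.conf stub-zone syntax):

    stub-zone:
        name: "facebook.com"
        stub-addr: 69.63.186.49
        stub-addr: 66.220.145.65
        stub-addr: 66.220.151.20

This pins the zone to the listed IPv4 addresses, so unbound never needs
to look up the nameservers' AAAA records itself.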

A workaround specific to one name is not the right thing, and I am not
sure how to fix this in a more general way. Store timeout information
per query type and per query name (it is already per zone)? That would
make the timeout information useless for new queries.

I am not sure how to fix this because, on the other hand, very similar
situations would then result in continuous probes to a server that is
down, and that also adds load to unbound.

> Here are the verbose (level 4) logs while trying to resolve the
> name:

Thanks, yes, it is doing a lot of AAAA lookups, and those timeouts have
added up until unbound considers the zone offline.

Best regards,
   Wouter

I think I know of a hack: try a SOA or NS lookup on the apex at the
same nameservers?

Then you know the server is still up and running.
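
For example (hypothetical probes against one of the facebook.com
servers from the log above):

    $ dig @69.63.186.49 facebook.com SOA
    $ dig @69.63.186.49 facebook.com NS

If either of these answers, the server is reachable and the AAAA
timeouts are a per-type problem, not a dead host.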

I don't know if it is possible to know the apex at all times, and you
probably have to keep more state. :-(

It is an incredibly stupid hack, I know.

Lucky you! I had to put local-zone: "facebook.com." refuse in
unbound.conf to keep that spyware off my network :-)
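
That is (standard unbound.conf syntax, inside the server clause):

    server:
        local-zone: "facebook.com." refuse

Every query for the zone then gets REFUSED straight from local data,
without any recursion.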

a message of 269 lines which said:

> The problem seems to be that the facebook DNS servers time out on
> AAAA queries, so unbound falsely assumes that they are unavailable.

I would say "unbound correctly assumes that they are out of order".

http://status.aa.net.uk/apost.cgi?incident=1392

Hi Stephane,

An update after looking at unbound's code: the log excerpt posted
earlier shows some code paths being taken that are impossible; from the
code, that output should not occur. This suggests trouble at compile
time, but there is no resolution for it yet.

Additionally, unbound's svn trunk now contains a patch that unlinks the
timeouts for A, AAAA, and other query types. Thus, if AAAA lookups time
out, A lookups will still cause a query to be sent out (up to a small
limit: enough to recognize these cases as working, while keeping the
query rate to actually offline servers very small).
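
To illustrate the idea only (a minimal sketch in C; the names, limits,
and data layout are hypothetical and not unbound's actual code):

    /* Track timeout state per query type: AAAA timeouts no longer
     * black out A lookups for the same server, and a small probe
     * budget keeps the query rate to truly dead servers low. */
    #include <stdio.h>

    #define BLACKOUT_RTO 120000  /* msec; the type counts as timed out */
    #define PROBE_BUDGET 3       /* hypothetical small probe allowance */

    struct infra_entry {
        int rto[2];       /* [0] = A, [1] = AAAA retransmit timeout, msec */
        int probes_left;  /* probes remaining for blacked-out types */
    };

    /* May a query of the given type (0 = A, 1 = AAAA) still be sent? */
    static int may_query(struct infra_entry *e, int qtype)
    {
        if (e->rto[qtype] < BLACKOUT_RTO)
            return 1;             /* this type is healthy: send */
        if (e->probes_left > 0) {
            e->probes_left--;     /* spend one of the few probes */
            return 1;
        }
        return 0;                 /* give up on this type for now */
    }

    int main(void)
    {
        /* AAAA timed out repeatedly, A is fine (cf. rtt=119000 above). */
        struct infra_entry fb = { { 376, BLACKOUT_RTO }, PROBE_BUDGET };
        printf("send A? %d, send AAAA? %d\n",
               may_query(&fb, 0), may_query(&fb, 1));
        return 0;
    }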

Best regards,
   Wouter

Given that the load balancer in question is not DNS-compliant, why do
extra work to get around the brokenness? If enough resolvers say it is
broken, the load balancer will get fixed.

  Olafur (who has been arguing this for 10 years now, and we still have
  broken load balancers because resolvers are too accommodating)

If everybody, or at least the majority, does this, then yes, the load
balancer will be fixed. But in this case everybody could resolve
m.facebook.com, including Google, OpenDNS, and others. Unbound
couldn't. "Who's the incompetent one?" asks the customer…
And unbound will resolve m.facebook.com, or unbound will be replaced
with BIND/Google/OpenDNS/you name it, because they work in real-world
scenarios.

Of course you are right, but the world works the opposite way. :-(

You're preaching to the choir; most (all?) people here agree with you.

But we're mostly providing service to others, who can override our decisions and do not share our commitment to standards:

"I don't care. Change to software that works."

It's a classic prisoner's dilemma, with all the standard causes and
solutions. The most effective solutions to the prisoner's dilemma
require coercion or binding agreements, which are (thankfully) absent
on the internet.

However, this particular issue (AAAA lookups failing) will eventually
die. The paucity of IPv4 addresses will provide the coercion required
to move to IPv6; facebook will then be forced to fix their load
balancer, and AAAA lookups will work.

Postel's Law: "be conservative in what you do, be liberal in what you
accept from others"

a message of 6 lines which said:

> Postel's Law: "be conservative in what you do, be liberal in what you
> accept from others"

A very bad engineering idea. It is responsible for the vast amount of
brokenness in today's Internet. And, worse, for the amount of
workaround code which makes so many Internet programs brittle and full
of bugs.

+1

Postel had it right when the Internet was a research project. Now that
it is global telecommunications infrastructure, however, this
philosophy is proving quite problematic.

Unfortunately, pragmatically speaking, software writers (if they wish
their software to be used) tend to have to bend to the bogosities and
obscenities inflicted upon them by the "big guys", resulting in code
that gets ever more complicated. So it goes.

Regards,
-drc

We ran into exactly the same problem with AAAA for m.facebook.com last
week, and contacted them about it. They fixed it fairly quickly, at
some point on Feb 10th, so their servers now return an authoritative
NOERROR with an empty answer.
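
In other words, a check against one of their servers (a hypothetical
probe using glb1's address from the log above; exact output will vary)
now shows something like:

    $ dig @69.171.239.10 m.facebook.com AAAA
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1234
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0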

While the issue was ongoing, we used PowerDNS recursors running
version 3.2. Whereas v3.1 behaves the same way as Unbound in that it
doesn't "cache the timeout", v3.2 times out on the first try, registers
the result as SERVFAIL, and then caches that result for a configurable
time (packetcache-servfail-ttl, default 60 sec).
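
For reference, that knob is set in the recursor configuration, e.g.
(value in seconds; setting name as documented for the 3.2 recursor):

    # recursor.conf
    packetcache-servfail-ttl=60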

Would it make sense to have a similar, configurable behaviour in
Unbound for handling servers that time out, in a way that minimizes
client lookup latency?

There was a similar thread about this in 2010 [1].

Sven

[1]: "[Unbound-users] Setting max-time before servfail",
     Message-ID: <9a0178111001150607x6572b94cw44b93c0ff6a28178@mail.gmail.com>,
     http://www.mail-archive.com/unbound-users@unbound.net/msg00793.html