Ability to detect when queries are being blocked at the network level

We run a robust carrier-grade e-mail service in the cloud and have a dedicated DNS infrastructure that has undergone extensive tuning to work in AWS, see

https://www.sparkpost.com/blog/undocumented-limit-dns-aws/
https://www.sparkpost.com/blog/dns-aws-network-lessons/
https://www.usenix.org/sites/default/files/conference/protected-files/srecon18americas_slides_blosser.pdf

We occasionally have issues where we are unable to perform MX lookups for what appears to be a perfectly valid domain. I tracked down one such incident yesterday, and in this case the authoritative name servers for the domain were deliberately blocking our queries (something I confirmed from the NS box using dnstracer). The MX query works fine with the Google 8.8.8.8 resolver or indeed the AWS VPC default resolver (which we cannot rely on, see above).

I can’t find a way to monitor for this condition (which manifests as SERVFAIL ultimately). I’ve read the docs about how Unbound handles probes and backoff, but I don’t see any metric exposed that would tell me the domains where this is happening. If I could have a way to get the list of domains that display SERVFAIL, I could write an out-of-band script that attempts to resolve them via alternate paths and adds them to a whitelist config.

Thanks in advance for any suggestions

John

I'm having the same issue here with my servers. Several queries fail when sent from my server's source IP, but Google Public DNS returns an answer.
My workaround was to forward those queries to 8.8.8.8 using a forward zone for each affected domain.
I wonder if there's a way to find out what's causing those SERVFAILs.

Hi John,

  If all authoritative servers for a particular domain (silently) discard
queries from your Unbound resolver,
you could detect it with 'unbound-control dump_infra'.

$ unbound-control dump_infra | grep nsec3.net
133.242.130.108 nsec3.net. ttl 571 ping 0 var 94 rtt 376 rto 120000 (snip)
2401:2500:102:1102:133:242:130:108 nsec3.net. ttl 571 ping 0 var 94 rtt 376 rto 120000 (snip)

  Note that the 'rto' of every nameserver serving 'nsec3.net' is 120000
(milliseconds).
As the 'Unbound Timeout Information' document describes, 'rto 120000' indicates that
the Unbound resolver has determined the nameserver to be unresponsive.
  Of course, we cannot distinguish between nameservers that are down (or
unreachable over the network) and nameservers that discard queries.
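
  To get the list of affected domains that John asked for, the dump_infra
output could be post-processed by a small script. The following is only a
rough sketch, not a tested tool: it assumes each line has the address in
the first column, the zone in the second, and an 'rto' keyword followed by
its value, as in the sample output above. The function names and the
120000 threshold constant are made up for illustration.

```python
from collections import defaultdict

# Assumption: 120000 ms is the rto value that marks a server as
# unresponsive, as seen in the sample dump_infra output above.
RTO_UNRESPONSIVE_MS = 120000


def parse_infra_line(line):
    """Return (address, zone, rto) for one dump_infra line, or None."""
    fields = line.split()
    if len(fields) < 4 or "rto" not in fields:
        return None
    idx = fields.index("rto")
    if idx + 1 >= len(fields):
        return None
    try:
        rto = int(fields[idx + 1])
    except ValueError:
        return None
    return fields[0], fields[1], rto


def unresponsive_zones(dump_text, threshold=RTO_UNRESPONSIVE_MS):
    """List zones for which every cached nameserver has rto >= threshold."""
    rtos_by_zone = defaultdict(list)
    for line in dump_text.splitlines():
        parsed = parse_infra_line(line)
        if parsed is not None:
            _address, zone, rto = parsed
            rtos_by_zone[zone].append(rto)
    # A zone is reported only if none of its servers is still answering.
    return sorted(
        zone
        for zone, rtos in rtos_by_zone.items()
        if all(rto >= threshold for rto in rtos)
    )
```

  The text passed to unresponsive_zones() would be the captured output of
'unbound-control dump_infra' (for example via subprocess.run). The
resulting zone list could then feed the out-of-band script that retries
the lookups via alternate paths. Only entries currently in the infra
cache appear, so the list is a snapshot rather than a history.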