Unbound can be made unresponsive when using DoT

Hi,

I have DoT & DNSSEC all set up and working, and was carrying out some tests to ensure that the server and the forward servers (Cloudflare) were working as I expected.

To that end I was using this test:

https://www.grc.com/dns/dns.htm

Further down the page you will see a button:

“Initiate standard DNS spoofability test”

When run, it carries out the test and returns results. If, however, you try using dig or even a browser while the test is running, nothing will resolve: Unbound is unresponsive.

After the test returns you still have to wait some time before Unbound recovers and is once again useable.

I am on Windows 10/64 (B18363.900-V1909) with an Intel Core i7 4930K @ 3.40GHz Ivy Bridge-E 22nm with 32GB Memory. Using Unbound v1.10.1

When I run the same test without DoT to the same forward servers everything seems to be OK and there is no hang or unresponsiveness.
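
For context, the two setups being compared here would look roughly like the sketch below. This is an assumption, since the actual unbound.conf is not shown in this thread. The 1.0.0.1 and 2606:4700:4700::1111 addresses are the Cloudflare ones that appear in the logs later on, and the tls-win-cert line assumes the Windows certificate store is used to verify the upstream certificates.

```
server:
    # Assumption: verify upstream TLS certificates against the Windows
    # certificate store.
    tls-win-cert: yes

forward-zone:
    name: "."
    # With DoT: all upstream queries go over TLS on port 853.
    forward-tls-upstream: yes
    forward-addr: 1.1.1.1@853#cloudflare-dns.com
    forward-addr: 1.0.0.1@853#cloudflare-dns.com
    forward-addr: 2606:4700:4700::1111@853#cloudflare-dns.com

    # Without DoT (the case that behaves well), forward-tls-upstream is
    # removed and plain forward-addr entries on port 53 are used, e.g.:
    # forward-addr: 1.1.1.1
```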

I appreciate that there is much more TCP traffic when using DoT but should Unbound become unresponsive?

Is this an Unbound problem or something that I can resolve in the configuration?

Thanks

Ray

There are quite a few Unbound resource settings, including the number of TCP and UDP ports allowed to be open at the same time. It is probably best to read the "unbound.conf" page in the documentation. Also, Windows home-style editions often have some of these resources tuned down compared with Windows professional-style editions.
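
As a rough illustration of the kind of settings meant here, the following sketch shows the main unbound.conf options that bound the number of open UDP and TCP sockets. The values are placeholders for experimentation, not recommendations, and the platform build may cap some of them (an assumption worth checking against the unbound.conf documentation for the Windows build).

```
server:
    # UDP: number of ports each thread may have open for outgoing queries.
    # Illustrative value; the build may reduce it.
    outgoing-range: 256

    # Number of queries each thread will service at the same time.
    num-queries-per-thread: 128

    # TCP: outgoing TCP buffers per thread (default 10). With DoT every
    # upstream query uses TCP, so this is the first candidate to raise.
    outgoing-num-tcp: 32

    # TCP: incoming TCP buffers per thread (default 10).
    incoming-num-tcp: 32
```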

- Eric

Hi Eric,

Thanks for your thoughts - did you have any suggestions as to which
parameters should be adjusted to what sort of value?

It seems that a lot of the issues I am seeing revolve around these entries:

27/06/2020 16:55:33 C:\Program Files\Unbound\unbound.exe[1756:0] info: Capsforid: reply is equal. go to next fallback
27/06/2020 16:55:33 C:\Program Files\Unbound\unbound.exe[1756:0] info: processQueryTargets: cid-d42a2173fbacf7ce.users.storage.live.com. AAAA IN
27/06/2020 16:55:33 C:\Program Files\Unbound\unbound.exe[1756:0] debug: request cid-d42a2173fbacf7ce.users.storage.live.com. has exceeded the maximum number of glue fetches 17 to a single delegation point
27/06/2020 16:55:33 C:\Program Files\Unbound\unbound.exe[1756:0] debug: return error response SERVFAIL

I see many of them, and there seems to be a limit of 17. I have to admit I
am not sure which parameter to tweak; I have tried many of the more obvious
ones, but to no avail. Apart from the unresponsiveness, the errors above are
intermittent: the same queries work sometimes but not every time. This causes
processes to fail because they think they can no longer reach the resource
they are after on the internet; some retry, but others just give up and exit.

With respect to Capsforid: this changes the query name to a random mix of
upper- and lowercase characters (0x20 encoding) in order to thwart spoofing.
That said, Unbound does not, as far as I can see, show you what was sent and
what was received, so it's difficult to ascertain whether the problem is a
specific server or something else. The query in the example above goes around
a number of servers, each attempt looks like the above, and then the whole
thing gives up. I really am not sure what is going on here.
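
One way to narrow this down would be to switch the 0x20 feature off temporarily and see whether the Capsforid fallback and glue-fetch errors disappear. A minimal sketch, meant as a diagnostic step rather than a recommended permanent setting:

```
server:
    # Diagnostic only: send queries without randomised upper/lowercase
    # (0x20) encoding, so the Capsforid fallback path is never entered.
    use-caps-for-id: no
```

If the SERVFAIL responses stop with this in place, that would point at the way the forwarder's replies interact with the 0x20 check rather than at a resource limit.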

I saw this bug:
https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=4243
which is still unresolved I think but I have 'qname-minimisation:' set to no
anyway.

Any further suggestions willingly accepted and tried out.

Thanks

Ray

Hi Renaud,

Thanks for that suggestion - there is a definite improvement, and it is now possible to use dig etc. to carry out other queries while that DNS spoofability test is running. The test itself runs MUCH quicker and the results are excellent (which is good).

From that I can see that the Quad9 servers are not as well set up as Cloudflare.

I am still looking at the performance side along with testing some other parameters that may (or may not) improve things.

I will let you know the results if there is interest.

So far that one single change has made a world of difference - thanks.

Ray

So, it seems that so-reuseport is causing issues, at least on OpenBSD and Windows, when using TCP. Maybe this should be investigated further on other platforms.
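
Assuming the change referred to here is switching the option off (the original suggestion is not quoted in this thread), the relevant unbound.conf fragment would be along these lines:

```
server:
    # Do not set SO_REUSEPORT on the listening sockets.
    so-reuseport: no
```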

Hi Renaud,

I think the performance is OK now with your suggestions - thanks.

That said, I still see errors in the log file. Those errors, however, are not easy to decipher to see what is failing.

e.g.
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] debug: tcp error for address 2606:4700:4700::1111 port 853
And
29/06/2020 15:44:36 C:\Program Files\Unbound\unbound.exe[1776:0] debug: request db3pap001.storage.live.com. has exceeded the maximum number of glue fetches 17 to a single delegation point
29/06/2020 15:44:36 C:\Program Files\Unbound\unbound.exe[1776:0] debug: return error response SERVFAIL

And yet this works OK if I have read things correctly:
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: sending query: db3pap001.storage.live.com. AAAA IN
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] debug: sending to target: <.> 1.0.0.1#853
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] debug: cache memory msg=78220 rrset=89465 infra=8804 val=71316
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] debug: iterator[module 1] operate: extstate:module_wait_reply event:module_event_reply
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: iterator operate: query db3pap001.storage.live.com. A IN
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: iterator operate: chased to l-0003.l-msedge.net. A IN
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: response for db3pap001.storage.live.com. A IN
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: reply from <.> 1.0.0.1#853
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: query response was ANSWER
29/06/2020 15:44:35 C:\Program Files\Unbound\unbound.exe[1776:0] info: finishing processing for db3pap001.storage.live.com. A IN

But I also see these:

29/06/2020 15:45:12 C:\Program Files\Unbound\unbound.exe[1776:0] error: SERVFAIL <cid-d42a2173fbacf7ce.users.storage.live.com. A IN>: could not fetch nameservers for 0x20 fallback
29/06/2020 15:45:12 C:\Program Files\Unbound\unbound.exe[1776:0] reply: ::1 cid-d42a2173fbacf7ce.users.storage.live.com. A IN SERVFAIL 0.937464 0 61

And you might expect that queries for this name would not fail:
29/06/2020 16:19:25 C:\Program Files\Unbound\unbound.exe[1776:0] info: Capsforid: reply is equal. go to next fallback
29/06/2020 16:19:25 C:\Program Files\Unbound\unbound.exe[1776:0] info: processQueryTargets: www.internic.net. AAAA IN
29/06/2020 16:19:25 C:\Program Files\Unbound\unbound.exe[1776:0] debug: request www.internic.net. has exceeded the maximum number of glue fetches 17 to a single delegation point
29/06/2020 16:19:25 C:\Program Files\Unbound\unbound.exe[1776:0] debug: return error response SERVFAIL

Changing from Cloudflare to Google as the forward server, I still get:
29/06/2020 16:24:21 C:\Program Files\Unbound\unbound.exe[9440:0] debug: tcp error for address 8.8.4.4 port 853
29/06/2020 16:24:21 C:\Program Files\Unbound\unbound.exe[9440:0] debug: tcp error for address 8.8.8.8 port 853

BUT

I no longer see the 0x20 or glue errors. It would be nice to know what is going on, as I cannot tell whether Unbound is getting it wrong or Cloudflare is not doing what it should.
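
The "query:", "reply:" and "error: SERVFAIL <...>" lines quoted in this message look like the output of the logging options below. The settings actually in use are not shown, so this is an assumed sketch of a configuration that would give this level of detail when comparing forwarders:

```
server:
    # High verbosity gives the iterator "info:" and "debug:" lines.
    verbosity: 4

    # One line per client query and per reply (with rcode, time, size).
    log-queries: yes
    log-replies: yes

    # Log the reason for each SERVFAIL sent to a client.
    log-servfail: yes

    # Log validation failures with the reason and the offending server.
    val-log-level: 2
```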

One last point: I saw a lot of failed lookups for these names (with both Cloudflare and Google):

29/06/2020 16:25:45 C:\Program Files\Unbound\unbound.exe[9440:0] query: ::1 wpad.home. A IN
29/06/2020 16:25:45 C:\Program Files\Unbound\unbound.exe[9440:0] reply: ::1 wpad.home. A IN NXDOMAIN 0.000000 1 27
29/06/2020 16:25:45 C:\Program Files\Unbound\unbound.exe[9440:0] query: ::1 wpad.home. AAAA IN
29/06/2020 16:25:45 C:\Program Files\Unbound\unbound.exe[9440:0] reply: ::1 wpad.home. AAAA IN NXDOMAIN 0.000000 1 27

There were others like:

ahbgrtoputryz.home.

with several random-looking sets of characters before the .home. As they always failed with either NXDOMAIN or SERVFAIL, I added this entry:

local-zone: home always_nxdomain

So now there is no need for Unbound to go any further. I was, however, unable to ascertain what Windows was trying to do or which process was attempting the lookups. I have no "home" zone in the configuration, and I have no proxy set up either.

Just FYI - the Google servers are set up completely differently to Cloudflare, with interesting results in the spoofability test.

If anyone else has ideas on the above...

Thanks

Ray