Cascading timeouts on forwarders leading to DoS

I have a weird problem where excessive lookup timeouts end up blocking the upstream forwarders.

Here’s a quick outline of the setup:

Core DNS (Unbound) → Edge Forwarder DNS (Unbound) → Internet

When a request goes from the core to the internet (e.g. for zone . ) and the destination DNS server is unresponsive for whatever reason, the Core DNS times out BEFORE the edge forwarder does. This increases the RTO for the edge forwarder, and eventually, under enough load and unresponsiveness from the offending off-network DNS, the edge forwarder is blocked, even though there is actually no problem with it. And with no DNS, the network stops.

The forwarding zone is . so it will still eventually time out. Adjusting infra-cache-min-rtt and infra-cache-max-rtt seems only to delay the onset of this issue.

Any clues to prevent forwarders from ever being blocked (or even moving into ‘probing’) would be greatly appreciated.

Kind regards,
Graeme

Hi Graeme,

I understand that in this case you forward all Core outbound traffic to the Edge Forwarder.

Since you know your network (between core and edge), there is little value in Unbound tracking response behaviour and adapting to the network environment.

You could try bringing 'infra-host-ttl' down from the default 900 seconds. Maybe even to a couple of seconds for your case.
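
As a rough sketch of what that could look like in the Core's unbound.conf (the forwarder address 192.0.2.53 is only a placeholder from the documentation range, and 5 is an illustrative value, not a tested recommendation):

```conf
server:
    # How long infra cache entries (RTT/timeout state per upstream host)
    # are kept, in seconds. Default is 900; a low value lets a forwarder
    # that was marked as unresponsive be retried again quickly.
    infra-host-ttl: 5

forward-zone:
    name: "."
    # Placeholder address for the Edge Forwarder; substitute your own.
    forward-addr: 192.0.2.53
```

You would want to check the value under your real load; a shorter TTL means Unbound forgets what it learned about the upstream sooner, which is the point here but may not suit every setup.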

Best regards,
-- Yorgos

Thanks for the response. I came to this exact same conclusion earlier today. It’s now backed off to 30 seconds, and seems to be much better behaved.

My concern is that this could also be experienced if we were using public resolvers configured as forwarders, such as 1.1.1.1, especially when doing a lot of reverse lookups (this was the main source of issues for us).

But thank you for the suggestion. It’s good to know I was reasoning along the correct lines.

Kind regards,
Graeme