DNS-over-TLS IPv4 interface ceases to respond

Hello,

We are using Unbound 1.7.3 to test the DNS-over-TLS service and advanced
options (see specifications and config file below).

The server generally sees very low use (avg. 2 queries/s) but is
configured following the optimization guide[1] in order to test options
and perform stress tests.

During two low-use periods (07/15 and 07/23), we experienced an outage
of the IPv4 DNS-over-TLS service. The server was still responsive over
IPv6 and when queried locally using "traditional" DNS.

Restarting the server temporarily solved the issue.

As it happened during my holidays, I had little opportunity to
investigate. I only confirmed the (un)responsiveness using ncat:

$ ncat -6 --ssl -v <IPv6_ADDRESS> 853
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: SSL connection to <IPv6_ADDRESS>:853. Fondation RESTENA
Ncat: SHA-1 fingerprint: ...
^C

$ ncat -4 --ssl -v <IPv4_ADDRESS> 853
Ncat: Version 7.70 ( https://nmap.org/ncat )
^C
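
To rule out ncat itself, a cross-check with openssl s_client (assuming
the OpenSSL command-line tool is available on the client machine) would
have been useful; I will try that next time:

$ openssl s_client -connect <IPv4_ADDRESS>:853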

Unfortunately, log verbosity was set to 1 and I didn't see anything
suspicious. It looked like Unbound was not even receiving the queries on
IPv4.
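
Next time it happens, I also plan to check on the server itself whether
the IPv4 listener still exists and whether connection attempts even
reach the host; something along these lines (suggested commands, not
something I ran during the outages):

# is a socket still listening on port 853?
$ ss -tln 'sport = :853'
# do the TCP connection attempts reach the machine at all?
$ tcpdump -ni any tcp port 853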

Has anyone already noticed such a problem? I wonder whether it is
related to Unbound or the underlying OpenSSL.

$ unbound -h
Version 1.7.3
linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips
26 Jan 2017
linked modules: dns64 respip validator iterator

Unbound configuration:
server:
  directory: "/usr/local/unbound"
  chroot: "/usr/local/unbound"
  username: unbound
  pidfile: "/var/run/unbound.pid"

  auto-trust-anchor-file: "/var/lib/root.key"

  num-threads: 1

  msg-cache-slabs: 2
  rrset-cache-slabs: 2
  infra-cache-slabs: 2
  key-cache-slabs: 2

  rrset-cache-size: 100m
  msg-cache-size: 50m

  outgoing-range: 8192
  num-queries-per-thread: 4096

  so-rcvbuf: 4m
  so-sndbuf: 4m
  so-reuseport: yes

  interface: 127.0.0.1
  interface: ::1

  # DNS-over-TLS
  tls-service-key: "/usr/local/unbound/etc/dns_over_tls.key"
  tls-service-pem: "/usr/local/unbound/etc/dns_over_tls.pem"
  incoming-num-tcp: 100

  tls-port: 853
  interface: 127.0.0.1@853
  interface: ::1@853
  interface: <IPv4_ADDRESS>@853
  interface: <IPv6_ADDRESS>@853

  access-control: 0.0.0.0/0 allow
  access-control: ::/0 allow

  prefer-ip6: yes

  hide-identity: yes
  hide-version: yes
  hide-trustanchor: yes

  use-caps-for-id: yes
  qname-minimisation: yes

  harden-below-nxdomain: yes
  harden-dnssec-stripped: yes

  aggressive-nsec: yes

  prefetch: yes
  prefetch-key: yes

  rrset-roundrobin: yes

  ratelimit: 1000
  ratelimit-slabs: 2

  logfile: "/var/log/unbound.log"
  verbosity: 1
  log-time-ascii: yes
  log-queries: yes
  log-replies: yes
  val-log-level: 2
  unwanted-reply-threshold: 10000000

  statistics-interval: 0
  extended-statistics: yes

# RFC 7706
# masters are sorted by increasing AXFR response time
auth-zone:
  name: "."
  for-downstream: no
  for-upstream: yes
  fallback-enabled: yes
  master: c.root-servers.net
  master: f.root-servers.net
  master: k.root-servers.net
  master: iad.xfr.dns.icann.org
  master: b.root-servers.net
  master: d.root-servers.net
  master: lax.xfr.dns.icann.org
  master: g.root-servers.net

remote-control:
  control-enable: yes

[1] https://nlnetlabs.nl/documentation/unbound/howto-optimise/

Hi,

The so-reuseport option has in the past, for certain kernel versions
(I don't know which), given trouble with TCP connections, specifically
with IPv6 connections. Since you have both IPv6 and IPv4 and it is IPv4
that ceases, it looks like a similar issue, even though it is the other
protocol and it was in NSD rather than Unbound. Back then, the solution
was to turn off so-reuseport (and those affected were so burned by the
trouble that they haven't dared enable it again). It may be the same
for you; if that solves the problem, perhaps a kernel upgrade can fix
it? Perhaps the server should test something (a kernel version?) before
it enables so-reuseport?
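
For reference, turning it off is a one-line change in the server:
section of unbound.conf:

server:
  # stop using the SO_REUSEPORT socket option
  so-reuseport: no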

It could also be something else, of course, but I don't know what.

Best regards, Wouter

Thanks for the quick reply.

We build our own kernels for those servers (currently 4.16.13), so it
is rather recent (compared to the 3.10 kernel shipped with CentOS).

I will disable so-reuseport, keep extra logging on this server (see the
change below) and see if it happens again.
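
For the record, the immediate changes to unbound.conf are (the
verbosity value is my choice and may be raised further if level 2 does
not capture enough detail):

server:
  so-reuseport: no
  # was 1; level 2 logs detailed operational information
  verbosity: 2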