Strange behaviour unbound server

Hi everyone,

I've run into an interesting issue with an Unbound server and can't figure it out without your help.
We use Unbound as the primary DNS resolver in our AWS infrastructure, but from time to time the server does not respond to queries from our clients.
Using tcpdump and Wireshark, I also found a lot of retransmitted DNS requests from clients in the subnets.
The issue is intermittent: our clients get timeouts throughout the day.
Out of 100 queries, 3-8 time out.

To debug, I ran:
perf trace -p $(pidof unbound) --duration=10
and got the following:
13.285 (599.741 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = -1 EINTR Interrupted system call
616.016 (94.403 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
710.662 (130.206 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
616.649 (224.502 ms): unbound/15952 epoll_pwait(epfd: 42<anon_inode:[eventpoll]>, events: 0x7faea89ea7f0, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
850.606 (112.947 ms): unbound/15952 epoll_pwait(epfd: 42<anon_inode:[eventpoll]>, events: 0x7faea89ea7f0, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
13.453 (1160.129 ms): unbound/15951 epoll_pwait(epfd: 37<anon_inode:[eventpoll]>, events: 0x7faea47ca3e0, maxevents: 64, timeout: -1, sigsetsize: 8) = 1
840.904 (335.113 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
710.891 (465.469 ms): unbound/15950 epoll_pwait(epfd: 36<anon_inode:[eventpoll]>, events: 0x7faeac8b2680, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
13.769 (1174.857 ms): unbound/15954 epoll_pwait(epfd: 48<anon_inode:[eventpoll]>, events: 0x7fae98747c20, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
1176.048 (17.121 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = -1 EINTR Interrupted system call
1175.740 (21.495 ms): unbound/15951 epoll_pwait(epfd: 37<anon_inode:[eventpoll]>, events: 0x7faea47ca3e0, maxevents: 64, timeout: -1, sigsetsize: 8) = 1
1177.587 (19.955 ms): unbound/15950 epoll_pwait(epfd: 36<anon_inode:[eventpoll]>, events: 0x7faeac8b2680, maxevents: 128, timeout: 264, sigsetsize: 8) = 1
1196.914 (11.097 ms): unbound/15954 epoll_pwait(epfd: 48<anon_inode:[eventpoll]>, events: 0x7fae98747c20, maxevents: 128, timeout: -1, sigsetsize: 8) = 1
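
The trace above is easier to read when grouped by thread. Below is a minimal Python sketch that does this, assuming the `perf trace` line layout shown in this message; the function names (`parse_line`, `longest_wait_per_thread`) are made up for illustration and are not part of perf or Unbound. Note that an `epoll_pwait` call with `timeout: -1` blocks until an event arrives, so a long duration here may only mean that the thread had nothing to do for that long, not that it was stalled.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 13.285 (599.741 ms): unbound/15943 epoll_pwait(epfd: 54<...>, ..., timeout: -1, sigsetsize: 8) = 1
LINE_RE = re.compile(
    r"^\s*(?P<start>\d+\.\d+) \(\s*(?P<dur>\d+\.\d+) ms\): "
    r"(?P<comm>[^/\s]+)/(?P<tid>\d+) (?P<syscall>\w+)\((?P<args>.*)\)"
    r"\s*=\s*(?P<ret>.*)$"
)
TIMEOUT_RE = re.compile(r"timeout: (-?\d+)")


def parse_line(line):
    """Return a dict describing one trace line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    t = TIMEOUT_RE.search(m.group("args"))
    return {
        "start_ms": float(m.group("start")),
        "dur_ms": float(m.group("dur")),
        "tid": int(m.group("tid")),
        "syscall": m.group("syscall"),
        "timeout": int(t.group(1)) if t else None,
        "interrupted": "EINTR" in m.group("ret"),
    }


def longest_wait_per_thread(lines):
    """Map thread id -> longest epoll_pwait duration in milliseconds."""
    longest = defaultdict(float)
    for line in lines:
        rec = parse_line(line)
        if rec and rec["syscall"] == "epoll_pwait":
            longest[rec["tid"]] = max(longest[rec["tid"]], rec["dur_ms"])
    return dict(longest)


# Two lines copied from the trace above.
SAMPLE = [
    "13.285 (599.741 ms): unbound/15943 epoll_pwait(epfd: 54<anon_inode:[eventpoll]>, events: 0x564955c6ae10, maxevents: 128, timeout: -1, sigsetsize: 8) = -1 EINTR Interrupted system call",
    "1177.587 (19.955 ms): unbound/15950 epoll_pwait(epfd: 36<anon_inode:[eventpoll]>, events: 0x7faeac8b2680, maxevents: 128, timeout: 264, sigsetsize: 8) = 1",
]
```

Calling `longest_wait_per_thread(SAMPLE)` gives one entry per thread id; feeding it the full trace shows which threads sat idle longest during the 10-second capture.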

Our infra:
ec2: c5.2xlarge (16 GB mem, 8 cores, 60 GB gp2)
dist: Amazon Linux 2

unbound-libs-1.6.6-1.amzn2.0.2.x86_64
unbound-python-1.6.6-1.amzn2.0.2.x86_64
unbound-1.6.6-1.amzn2.0.2.x86_64

conf:
server:
verbosity: 1
num-threads: 8
statistics-interval: 0
extended-statistics: yes
statistics-cumulative: no
msg-cache-slabs: 4
rrset-cache-slabs: 4
infra-cache-slabs: 4
key-cache-slabs: 4
rrset-cache-size: 100m
msg-cache-size: 50m
so-rcvbuf: 4m
so-sndbuf: 4m
so-reuseport: yes
outgoing-range: 8192
num-queries-per-thread: 4096
do-daemonize: no
prefetch: yes
rrset-roundrobin: yes
logfile: ""
use-syslog: no
directory: "/etc/unbound"
chroot: ""
log-queries: no
access-control: 0.0.0.0/0 allow
interface: 0.0.0.0
interface-automatic: yes
port: 53
do-ip4: yes
do-ip6: no
do-udp: yes
do-tcp: yes
username: "unbound"
pidfile: "/var/run/unbound/unbound.pid"
root-hints: /etc/unbound/root.hints
key-cache-size: 32m
local-zone: "10.in-addr.arpa." nodefault

remote-control:
control-enable: yes

Any ideas?

Hi Eduard,

Hard to say why this happens periodically to you. Do you see an increase
in the incoming queries when this happens? Maybe running out of some
buffer space? Or do you by any chance periodically perform an expensive
operation on unbound, like doing a dump_cache from cron? Are there any
errors written to the log?
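
One way to check the buffer-space question on Linux is to watch the kernel's UDP counters in /proc/net/snmp while the timeouts occur. The following is a minimal Python sketch, not an Unbound feature; the helper names are made up for illustration, and the counters are system-wide, so they only point at Unbound if it is the main UDP consumer on the box.

```python
def parse_udp_counters(snmp_text):
    """Parse the two 'Udp:' lines of /proc/net/snmp into {counter_name: value}."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]  # 'UdpLite:' lines do not match
    if len(rows) < 2:
        return {}
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read the current UDP counters from the kernel."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def counter_delta(before, after):
    """Difference between two snapshots, for counters present in both."""
    return {name: after[name] - before[name] for name in after if name in before}
```

Taking one snapshot with `read_udp_counters()`, waiting a minute, taking another, and passing both to `counter_delta` shows whether `RcvbufErrors` or `InErrors` grew in that window. A growing `RcvbufErrors` would suggest that datagrams are being dropped because a socket receive buffer was full.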

-- Ralph

Hi Ralph,

Thank you for the response. Running dump_cache from cron is a good idea; I can probably combine it with a command that fetches the request list, and tie both to my cron job that collects tcpdump traffic.
Here is what I know so far:
Amazon doesn't handle NXDOMAIN answers well: when a query arrives for a nonexistent domain, Unbound forwards it to the AWS VPC DNS server, and AWS takes a long time to return the answer.
That may be our issue, but I am not 100% sure.
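
The NXDOMAIN hypothesis can be tested by timing an existing and a nonexistent name against the same resolver. Below is a minimal Python sketch using only the standard library; `build_query` and `probe` are illustrative helpers, and the resolver address and both names in the usage comment are placeholders to be replaced with real values.

```python
import socket
import struct
import time


def build_query(name, qid, qtype=1):
    """Build a minimal DNS query packet (class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=2.0, port=53, qid=0x1234):
    """Send one A query over UDP.

    Returns (rcode, seconds); rcode is None if no matching reply arrived
    before the timeout. RCODE 0 is NOERROR, RCODE 3 is NXDOMAIN.
    """
    query = build_query(name, qid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(query, (server, port))
        try:
            while True:
                data, _ = sock.recvfrom(4096)
                # Accept only a reply carrying our query id.
                if len(data) >= 4 and data[:2] == query[:2]:
                    return data[3] & 0x0F, time.monotonic() - start
        except socket.timeout:
            return None, time.monotonic() - start


# Usage (placeholder address and names, not taken from this thread):
#   print(probe("10.0.0.2", "example.com"))
#   print(probe("10.0.0.2", "no-such-host.example.invalid"))
```

Running both probes repeatedly against Unbound, and then directly against the VPC resolver, should show whether the slow or lost answers are concentrated on nonexistent names.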

On Thu, 11 Jul 2019 at 12:36, Ralph Dolmans via Unbound-users <unbound-users@nlnetlabs.nl> wrote:

Amazon's DNS caches don't like a lot of things, and break DNSSEC in my
past experience. I don't know if they've solved that, but I long since
switched to only using them for the zones which need to be delegated to
them.

So when my Packer scripts build an OS image for use in AWS, they create
/etc/unbound/unbound.conf.d/ec2.conf containing:
-----------------------------8< ec2.conf >8-----------------------------
# http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html
# The 169.254.169.253 IP is guaranteed as long as DNS service is available for
# the VPC. However, it breaks DNSSEC, so we don't use it for "."

server:
        domain-insecure: "internal"
        private-domain: "amazonaws.com"
        private-domain: "internal"

forward-zone:
        name: "amazonaws.com."
        forward-addr: 169.254.169.253
forward-zone:
        name: "internal."
        forward-addr: 169.254.169.253
-----------------------------8< ec2.conf >8-----------------------------

This assumes a Debian-derived setup using /etc/unbound/unbound.conf.d/
files automatically.

You'll want to add extra config files for any domains which you register
in R53 for internal resolution. The above config covers the default
`internal`, and adds in `amazonaws.com` because using VPC Endpoints will
require you to go through their servers to get the overlaid DNS entries.

Your main configuration would then just resolve DNS normally, instead of
being a forwarder. This does mean that your security groups and subnet
ACLs will need to permit the boxes running Unbound to make DNS queries
out to the Internet.
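
To illustrate the split described above, here is a minimal Python sketch of how a query name maps to an upstream under the ec2.conf shown earlier: names under a forward zone go to the VPC resolver, everything else is resolved normally. This is only an illustration of longest-suffix matching on label boundaries, not Unbound's actual code, and `pick_forwarder` is a made-up name.

```python
# Forward zones taken from the ec2.conf above.
FORWARD_ZONES = {
    "amazonaws.com.": "169.254.169.253",
    "internal.": "169.254.169.253",
}


def pick_forwarder(qname, zones=FORWARD_ZONES):
    """Return the forwarder for the most specific matching zone, or None.

    None means no forward zone applies and the name is resolved by
    normal recursion from the root.
    """
    name = qname.lower().rstrip(".") + "."
    labels = name.split(".")[:-1]  # drop the empty string after the final dot
    # Try the full name first, then strip one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:]) + "."
        if candidate in zones:
            return zones[candidate]
    return None
```

For example, `pick_forwarder("s3.amazonaws.com")` returns the VPC resolver address, while `pick_forwarder("example.org")` returns `None`. Any private R53 zone added later would need its own entry in the same way.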

-Phil