Benchmarking Unbound

I am currently testing unbound as part of a project, but I'm seeing
wildly different results when comparing unbound-control stats with what
queryperf reports. For example, I logged the output of
total.num.queries to a file every second and calculated the average,
which comes to 9474 queries per second, whereas queryperf shows 6945
qps. Both results are from the same test load.

What is the proper way to monitor and benchmark the queries per second
on an unbound server?

Hi Michael,

This may surprise you, but this is actually a frequently asked question
about the unbound server from people who try to measure its speed.
(Boast: this is because of its speed.)

> I am currently testing unbound as part of a project, but I'm seeing
> wildly different results when comparing unbound-control stats with what
> queryperf reports. For example, I logged the output of
> total.num.queries to a file every second and calculated the average,
> which comes to 9474 queries per second, whereas queryperf shows 6945
> qps. Both results are from the same test load.

I see two explanations. The first, and my experience with others doing
benchmarks, is that unbound outperforms queryperf. The second is a
buffer overrun.

Unbound outperforms queryperf. Thus, it sends back more than queryperf
can handle. The measurement you got from queryperf is the speed of
queryperf itself.
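For comparing the two figures on equal footing, here is a minimal Python sketch of how a once-per-second log of total.num.queries might be reduced to a rate. Whether the logged value is cumulative or already per-interval depends on how the statistics are collected (for example unbound-control stats versus stats_noreset, or the statistics-cumulative setting), so the `cumulative` flag is an assumption you have to set yourself. The function name and the sample numbers are made up for illustration.

```python
from statistics import mean

def rates_from_counter(samples, interval=1.0, cumulative=True):
    """Turn logged counter readings into queries-per-second values.

    samples    -- readings of total.num.queries, one per interval
    interval   -- seconds between readings
    cumulative -- True if the counter keeps growing between readings,
                  False if each reading already covers one interval
    """
    if cumulative:
        # Use differences between consecutive readings; skip any step
        # where the counter went backwards (e.g. a restart or reset).
        deltas = [b - a for a, b in zip(samples, samples[1:]) if b >= a]
    else:
        deltas = list(samples)
    return [d / interval for d in deltas]

# Made-up readings, not measurements from this thread.
example = [0, 9000, 18500, 28000]
rates = rates_from_counter(example)
print("average qps:", mean(rates))
```

Averaging a cumulative counter directly, without taking differences, would give a number that has nothing to do with the per-second rate, so it is worth checking which kind of value was logged.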

Buffer overruns can be seen with netstat -su. The options so-rcvbuf and
so-sndbuf can help you stop the buffer overruns for unbound.
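To watch for overruns during a test run, the UDP section of the netstat -su output can be picked apart with a few lines of Python. This is only a sketch: it assumes the Linux net-tools wording ("receive buffer errors", "send buffer errors"), which may differ on other systems, and the sample text below is invented for illustration rather than captured from a real server.

```python
import re

def udp_buffer_errors(netstat_output):
    """Extract UDP buffer error counters from `netstat -su` style text."""
    counters = {}
    in_udp = False
    for line in netstat_output.splitlines():
        if not line.startswith((" ", "\t")):
            # Unindented lines are section headers such as "Udp:".
            in_udp = line.strip() == "Udp:"
            continue
        if in_udp:
            m = re.match(
                r"\s*(\d+)\s+(receive buffer errors|send buffer errors)", line
            )
            if m:
                counters[m.group(2)] = int(m.group(1))
    return counters

# Invented sample in the assumed format, not real output.
SAMPLE = """Udp:
    1000 packets received
    5 packets to unknown port received
    42 packet receive errors
    900 packets sent
    42 receive buffer errors
    0 send buffer errors
UdpLite:
"""
print(udp_buffer_errors(SAMPLE))
```

Taking one reading before the test and one after, and comparing the two, shows whether the counters grew while the load was running.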

If you want faster results, see
http://unbound.net/documentation/howto_optimise.html

> What is the proper way to monitor and benchmark the queries per second
> on an unbound server?

Since unbound outperforms your sender, you must make your sender
stronger. Start by giving the sender its own computer. Do not run it on
the server itself: that machine is busy with unbound (you have set
num-threads to the number of CPUs listed in /proc/cpuinfo so that it
uses all of them, right?), and you also want the speed of the IP stack
included in the result. Then add another computer that runs queryperf;
you can add up the qps results from the computers that run queryperf.
Keep adding computers until the total qps no longer goes up, or goes
down a bit (at which point it is a denial of service on the unbound
server). That is the measured speed.
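The procedure above can be sketched in Python as follows. The function name, the 2% tolerance, and all the numbers are hypothetical; the only idea taken from the text is to sum the per-client qps and stop when adding a client no longer raises the total.

```python
def saturation_point(total_qps_by_clients, tolerance=0.02):
    """Find where adding another queryperf client stops helping.

    total_qps_by_clients[i] is the summed qps measured with i + 1
    clients (at least one entry is expected). Returns a tuple
    (clients, qps) for the last step that still improved the total by
    more than `tolerance`, or None if the total was still rising at
    the end of the list (so more clients are needed).
    """
    best = total_qps_by_clients[0]
    for n, qps in enumerate(total_qps_by_clients[1:], start=2):
        if qps <= best * (1 + tolerance):
            # Flat or falling total: the server, not the senders,
            # is now the limit.
            return n - 1, best
        best = qps
    return None

# Hypothetical per-client results for one run with three senders.
per_client_qps = [6900, 6800, 6700]
print("summed qps:", sum(per_client_qps))

# Hypothetical summed totals for runs with 1..5 senders.
totals = [7000, 13500, 19000, 20500, 20400]
print(saturation_point(totals))
```

A small tolerance keeps run-to-run noise from being mistaken for real improvement; what value is sensible depends on how repeatable the measurements are.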

From Jan-Piet Mens (who wrote a book about it), I heard he ended up with
a whole classroom of PCs running queryperf :-)

Best regards,
   Wouter

Thanks.

We ended up tweaking unbound.conf a bit. Changing the following
settings more than doubled the performance, from 8.8k qps to over
20k.

# outgoing-range: 4096
outgoing-range: 32768

# so-rcvbuf: 0
so-rcvbuf: 32m

# msg-cache-size: 4m
msg-cache-size: 256m

num-queries-per-thread: 4096

# rrset-cache-size: 4m
rrset-cache-size: 256m

# infra-cache-numhosts: 10000
infra-cache-numhosts: 100000
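As a quick sanity check on settings like these, the configured cache sizes can be added up and compared with the memory available on the server. The sketch below assumes the size suffixes follow the unbound.conf convention (k, m, g as powers of 1024); the helper name is made up, and the real memory use of the process can be higher than the configured cache sizes, so treat the total as a lower bound.

```python
def parse_size(value):
    """Convert a size such as '256m' or '4096' to a number of bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)  # plain number: already in bytes

# Cache sizes from the configuration above. so-rcvbuf is left out
# because it is a kernel socket buffer, not part of the caches.
caches = {"msg-cache-size": "256m", "rrset-cache-size": "256m"}

total_bytes = sum(parse_size(v) for v in caches.values())
print("configured caches:", total_bytes // 1024 ** 2, "MiB")
```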

> We ended up tweaking unbound.conf a bit. Changing the following
> settings more than doubled the performance, from 8.8k qps to over
> 20k.

Just out of curiosity, would it be possible to get some more details
about that benchmarking? Such as:
- Were the queries sent from the server itself? From one external
client? From many external clients? And were the external clients on the
same network as the server, perhaps connected to the same switch the
server was using? 100 Mbps? 1 Gbps?
- Was it asking for locally cached data, or did the server need to go out
and query for external data? And what was the typical size of the
replies?
- What kind of hardware (especially CPU and speed) and OS was the server
running?

Regards
Eivind Olsen

The servers I've been testing are Intel E4500s with 2 GB of RAM and
Intel gigabit NICs. However, they are only on 100 Mbit ports and are
routed to through Cisco IP SLA.

The test client is on an entirely separate subnet and is an 8-core
Opteron box with an Intel NIC. I've also run queryperf from multiple
sources at once using virtual machines.

The actual queries in my data file are captured from real DNS traffic
in our data center; about 50% of the queries end up hitting the cache.
I can see caching effects in my graphs as well.