Unbound multithread performance: an investigation into scaling of cache response qps

Hi,

Because we haven't measured unbound's multithreaded performance scaling
before, I decided to try it myself. Also, I was bored waiting late at
night for an audio broadcast from the IETF :-) The study is below.

Using a Solaris 5.11 quad-core machine[*], with four CPUs at 200 MHz, I
have tested unbound cache performance in various configurations. In
this test setup the Solaris machine runs flat out on all four CPUs (no
hyperthreading), while two other hosts (BSD and Linux) at 3 GHz run
perf and send queries at a high rate for the cached answer for
www.nlnetlabs.nl. We count the number of queries per second returned.

The various configurations use the built-in mini-event (select(2)
based) and libevent-1.4.12-stable (using evport). Also pthreads,
Solaris threads and non-threaded (fork(2)) operation are used. The
unbound config file contains the minimum statements needed to make it
accessible from the network - an access-control list and interface
statements - plus num-threads, which is set to 1, 2, 3 and 4.
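
For reference, a minimal unbound.conf along these lines could look as
follows (the interface address and allowed subnet are placeholders, not
the ones used in this test):

    server:
        interface: 192.0.2.1
        access-control: 192.0.2.0/24 allow
        num-threads: 4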

It was observed that the threads all handle about an even share of the
load in the tests, so real multi-threading is happening. In this setup
it is very easy to saturate the machine using the two senders;
otherwise this test would become a lot trickier.

Table: qps in total for all threads together.

Configuration           1 core   2 cores   3 cores   4 cores
select and pthreads       8450     14100     16100     18600
select and solaristhr     8600     13800     15800     17500
select and no threads    10000     17800     19800     22800
evport and pthreads       8400     13600     15900     18100
evport and solaristhr     8500     14100     16000     18600
evport and no threads     9700     17300     19600     22300

The performance scales up fairly neatly as more threads are added. For
every configuration a slower-than-linear speedup is observed, indicating
locks in the underlying operating system network stack. There is only
one network card, after all, and the CPUs have to lock and synchronise
with it. Solaris threads are a little faster than pthreads when
combined with evport (a Solaris-specific socket management system
call). No threads is faster still (but of course fragments the cache),
by about 20%, and its advantage grows slightly as the number of cores
increases (from 15% at 1 core to 23% at 4 cores). The evport call is a
little slower than select, but since it breaks the 1024 file descriptor
limit of select, it remains useful for high-capacity configurations.

To increase performance further, it seems the place to work is the
network driver or the network stack.

Best regards,
   Wouter

[*] This machine was donated by RIPE NCC and has mostly been used for
system/OS interoperation testing. It turned out to be a good machine
to expose certain race conditions that did not show up on regular
Intel/Linux or BSD systems. If you happen to have somewhat exotic
machinery around, we would welcome your donation.

Very interesting! Thanks for sharing this.

I have one question: why does "no threads" go faster as the number of cores increases?

Thanks,
Simon

Hi Simon,

> The performance scales up fairly neatly as more threads are added. For
> every configuration a slower-than-linear speedup is observed, indicating
> locks in the underlying operating system network stack.

There was no lock contention within unbound? I don't know how to measure
this on Solaris, but did you?

> There is only one network card, after all, and the CPUs have to lock and
> synchronise with it.

This should be true even with multiple processes, however.

This may not be true for Solaris, but you might try having unbound
listen on multiple ports, spreading requests across them, and see if it
matters.

The last time I looked, recent-ish Linux 2.6 still had per-socket
locking even in the face of multiple network cards. This means that
multiple threads or even multiple processes sharing a UDP socket can't
really exceed one CPU's worth of raw sendto() performance sourced from
the same socket. You can get much closer to linear scalability by
binding to a different port or IP per CPU.
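
As an illustration, per-worker sockets could be set up along these
lines (a sketch, not unbound's actual code; the base port and the
helper name are made up for the example):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Give worker n its own UDP socket on port base+n, so that the
     * kernel's per-socket lock is never shared between CPUs. */
    static int worker_socket(int n, int base)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sa;
        if (fd == -1)
            return -1;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons((uint16_t)(base + n));
        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == -1) {
            close(fd);
            return -1;
        }
        return fd;
    }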

                                     -- Aaron

Hi Aaron,

>> The performance scales up fairly neatly as more threads are added. For
>> every configuration a slower-than-linear speedup is observed, indicating
>> locks in the underlying operating system network stack.

> There was no lock contention within unbound? I don't know how to measure
> this on Solaris, but did you?

Yes, it is visible. The no-threads version of unbound has no lock code
in it (the lock macros expand to nothing), and thus has no lock
contention. It has a slightly better graph than the versions with locks
(maybe a 5% difference at 4 cores). So there is contention in unbound.
In this test, with all queries for the same cache element, the
contention should be as high as it gets, I think.
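
The pattern is roughly the following (an illustrative sketch, not
unbound's exact lock headers):

    #ifdef HAVE_PTHREAD
    #include <pthread.h>
    typedef pthread_mutex_t lock_basic_t;
    #define lock_basic_init(l)   pthread_mutex_init((l), NULL)
    #define lock_basic_lock(l)   pthread_mutex_lock(l)
    #define lock_basic_unlock(l) pthread_mutex_unlock(l)
    #else
    /* forked (no-threads) build: each process has its own cache copy,
     * so the lock operations compile away to nothing */
    typedef int lock_basic_t;
    #define lock_basic_init(l)
    #define lock_basic_lock(l)
    #define lock_basic_unlock(l)
    #endif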

>> There is only one network card, after all, and the CPUs have to lock and
>> synchronise with it.

> This should be true even with multiple processes, however.

Yes, this is what we see in the no-threads results. Those use
processes. But they still bind to the same port 53 socket.

> This may not be true for Solaris, but you might try having unbound
> listen on multiple ports, spreading requests across them, and see if
> it matters.

Yes, I have tried this. I got 2 more test machines to send queries
from, and modified unbound to open num-threads UDP ports, with worker N
listening on UDP port N.

A control check, with four perfs running towards unbound (qps at 1, 2,
3 and 4 cores):
evport, forked, 4 senders:  9619 15860 19010 21979
evport, forked, 2 senders:  9700 17300 19600 22300
Similar, slightly slower.

The special version, where every process listens on its own UDP port
and each perf runs towards its own port: evport, forked, process0 and
perf0 use port 30053, process1 and perf1 use port 30054, process2 and
perf2 use port 30055, process3 and perf3 use port 30056.
evport, forked, special:    10000 18783 23461 25797

This is faster. It is not linear.

In this test unbound has forked processes that do not lock mutexes or
use any pthread machinery. They all have a copy of the same
file-descriptor table, but the list of fds passed to evport is
different (same TCP, different UDP) for every process. There are also
some pipes in the background for inter-process communication, but those
are silent during the test.
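
For reference, the Solaris event ports interface behind evport looks
roughly like this (a sketch; error handling omitted and the function
name is made up):

    #include <poll.h>
    #include <port.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Each process watches its own UDP fd through a private event port. */
    static void event_loop(int udp_fd)
    {
        int p = port_create();
        port_associate(p, PORT_SOURCE_FD, (uintptr_t)udp_fd, POLLIN, NULL);
        for (;;) {
            port_event_t ev;
            if (port_get(p, &ev, NULL) == 0) {
                /* ... read and answer the query on ev.portev_object ... */
                /* a delivered fd is dissociated; re-arm it for the
                 * next event */
                port_associate(p, PORT_SOURCE_FD, (uintptr_t)udp_fd,
                    POLLIN, NULL);
            }
        }
    }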

> The last time I looked, recent-ish Linux 2.6 still had per-socket locking
> even in the face of multiple network cards. This means that multiple
> threads or even multiple processes sharing a UDP socket can't really
> exceed one CPU's worth of raw sendto() performance sourced from the same
> socket. You can get much closer to linear scalability by binding to a
> different port or IP per CPU.

Not sure it is worth it. Maybe some modifications can be made to the
UDP stack to make it more linear, but I do not know how.

Best regards,
   Wouter

Hi Aaron,

Another check, with do-tcp: no, so that the evport system calls do not
have to lock that file descriptor to check whether events happened on
the TCP fd.
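
That is, with this in the server section of the config:

    server:
        do-tcp: no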

> A control check, with four perfs running towards unbound:
> evport, forked, 4 senders:  9619 15860 19010 21979
> evport, forked, 2 senders:  9700 17300 19600 22300
>
> evport, forked, special:    10000 18783 23461 25797

evport, forked, special, no-tcp: 10200 18552 20226 27161
  run again (3 and 4 cores):                 23646 27149
  and again:                                 19350 27792

The 3-core output is very bouncy. The four-core throughput is more stable.

So, that removes another lock, and speeds things up a bit more :-)

Best regards,
   Wouter

> evport, forked, 4 senders:        9619 15860 19010 21979
> evport, forked, special:         10000 18783 23461 25797
> evport, forked, special, no-tcp: 10200 18552 20226 27161

At 4 cores, there's 25%-ish overhead due to the shared sockets between
processes, and this will likely increase as the number of cores increases.

I don't know how much you want to continue to play with this, but as an
experiment, comparing the forked 4-senders case above with one that sent
replies from a socket unique to each process (even with the wrong source
IP/port) would show the socket locking overhead for sending vs. receiving.

Generally you need to reply from the shared socket to keep the source port.
If sending is most of the overhead, there might be other options,
like binding multiple sockets with SO_REUSEADDR set to the same port. On
most Unix flavors, the last one to bind gets all incoming unicast traffic,
but I think you can still send from the others.
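
A sketch of that idea (assuming, as described above, a platform where
several SO_REUSEADDR sockets may bind the same port; error handling
omitted and the helper name is made up):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Bind one of several UDP sockets to port 53 with SO_REUSEADDR.
     * Only the last socket bound receives the unicast queries, but each
     * process can still sendto() replies from its own socket, keeping
     * the correct source address and port. */
    static int bind_port53_shared(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        struct sockaddr_in sa;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(53);
        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        return fd;
    }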

> evport and pthreads     8400 13600 15900 18100
> evport and solaristhr   8500 14100 16000 18600

It looks like there's also a 20% overhead for having threading enabled,
regardless of the number of CPUs. Hopefully this shouldn't be true on
modern Linux, where uncontended mutexes are basically free.

                                     -- Aaron