Performance tuning tips?

Does anybody have some additional performance tuning tips for Unbound, specifically on Solaris 11?

I have followed the recommended settings in the "HowTo Optimise", but still seem to hit a ceiling of roughly 3600 queries per second. On top of that, the platform/OS becomes a bit sluggish when logging in via SSH.

Our install details:
OS: Solaris 11
CPU: SPARC T3
Unbound ver: 1.4.18-2 (64-bit)
LDNS ver: 1.6.15 (64-bit)
Libevent ver: 2.0.20 (64-bit)

Any insight will be appreciated.

Regards

Hello,

You should clarify the following questions:

Which part of the "HowTo Optimise" do you use and how?
What is the load on your system in terms of CPU usage?
Have you checked the upstream connection or forwarders used?
Do you use DNSSEC? What kind of data do you use for testing?

Without this, no one will be able to tell why your system maxes out at 3600 qps.

Regards

Andreas

Hi Jaco, Andreas,

The SPARC T3 has special hardware threads. Unbound has an option to
use the Solaris thread library: configure --with-solaris-threads,
perhaps together with the Sun compiler (CC=/opt/.../cc). If the hardware
threads really work, then with num-threads up to 64 (or however many
hardware threads you have), you could potentially (this has not been
tried) get up to 64x your previous performance.

Best regards,
   Wouter

Are the "software" (POSIX??) threads that slow with the T3 processor?

Just curious.

Andreas

Andreas,

Here is some info regarding your questions:

The portion of unbound.conf pertaining to the "HowTo Optimise":
         # Optimise settings
         num-threads: 112

Wouter,

I had a look at how I compiled Unbound; the details are below:

CC="/opt/Studio12.3/solarisstudio12.3/bin/cc"
CFLAGS="-m64 -Qoption cg -xregs=no%appl -xmemalign=8s -mt"
LDFLAGS="-L/opt/local/lib/64"

export CC CFLAGS LDFLAGS

./configure --prefix=/opt/local \
         --libdir=/opt/local/lib/sparcv9 \
         --sysconfdir=/etc \
         --with-username=dnsadmin \
         --with-ldns \
         --with-libevent \
         --disable-gost --disable-ecdsa

OK, I was not aware that the Solaris thread option could make a big difference. I will recompile and test.

The SPARC T3 supports up to 128 threads, so can I take the thread count above 64? Or should I just cap the thread count at 64?

Thanks for the feedback so far.

Regards

Hi Jaco,

Wouter

Had a look at how I compiled unbound, below the details:

CC="/opt/Studio12.3/solarisstudio12.3/bin/cc"
CFLAGS="-m64 -Qoption cg -xregs=no%appl -xmemalign=8s -mt"
LDFLAGS="-L/opt/local/lib/64"

export CC CFLAGS LDFLAGS

./configure --prefix=/opt/local \
         --libdir=/opt/local/lib/sparcv9 \
         --sysconfdir=/etc \
         --with-username=dnsadmin \
         --with-ldns \
         --with-libevent \
         --disable-gost --disable-ecdsa

OK, I was not aware that the Solaris thread option could make a
big difference. I will recompile and test.

Yes, with pthreads it will not be able to use the special hardware
support in the T3.

The SPARC T3 supports up to 128 threads, so can I take the thread
count above 64? Or should I just cap the thread count at 64?

112 may be a good choice, but I do not know; I think you are the first
to try this.

If the T3 threads do not work faster for Unbound, you may have to use
the CPU count instead of the thread count.

Best regards,
   Wouter
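
For reference, a minimal sketch of what the rebuild discussed above might look like. It simply adds --with-solaris-threads to the configure options posted earlier in this thread; the function name build_with_solaris_threads is made up for illustration, and nothing here has been run on a T3. If configure detects pthreads first it may not switch thread libraries, so it is worth checking ./configure --help and the configure output to see which thread library was actually selected.

```shell
#!/bin/sh
# Sketch only: rebuild Unbound against the Solaris thread library.
# Compiler, flags and paths are copied from the build details posted in
# this thread; only --with-solaris-threads is new.
build_with_solaris_threads() {
    CC="/opt/Studio12.3/solarisstudio12.3/bin/cc"
    CFLAGS="-m64 -Qoption cg -xregs=no%appl -xmemalign=8s -mt"
    LDFLAGS="-L/opt/local/lib/64"
    export CC CFLAGS LDFLAGS

    ./configure --prefix=/opt/local \
        --libdir=/opt/local/lib/sparcv9 \
        --sysconfdir=/etc \
        --with-username=dnsadmin \
        --with-ldns \
        --with-libevent \
        --with-solaris-threads \
        --disable-gost --disable-ecdsa
}

# Run from the top of the Unbound source tree, for example:
#   build_with_solaris_threads && make
```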

Quoting Jaco Lesch <jacol@saix.net>:

Andreas

Here is some info regarding your questions:-

The portion of unbound.conf pertaining to the "HowTo Optimise":
        # Optimise settings
        num-threads: 112
        #
        msg-cache-slabs: 64
        rrset-cache-slabs: 64
        infra-cache-slabs: 64
        key-cache-slabs: 64
        rrset-cache-size: 1024m
        msg-cache-size: 512m
        infra-cache-numhosts: 100000
        #
        # Larger socket buffer
        # For Solaris 11 set the following UDP parameter 1st:
        # 'ipadm set-prop -p max_buf=8388608 udp'
        so-rcvbuf: 8m
        so-sndbuf: 8m
        #
        outgoing-range: 8192
        num-queries-per-thread: 4096

Hmm, I would lower "num-queries-per-thread" (or leave it at the default) and raise "outgoing-range" if the OS permits. As far as I know, outgoing-range limits the maximum number of open sockets for upstream queries and should be roughly num-threads * num-queries-per-thread, no?

You could also try lowering num-threads; maybe you have already passed the sweet spot that best balances concurrency against maxing out the hardware.

Regards

Andreas
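
To put numbers on the rule of thumb above, here is a minimal sketch that multiplies the posted num-threads and num-queries-per-thread values and prints the product next to the posted outgoing-range. The function name in_flight is made up for illustration, and whether this product is the right yardstick for outgoing-range is the assumption stated in the message above, not something verified here.

```shell
#!/bin/sh
# in_flight THREADS QUERIES_PER_THREAD
# Upper bound on queries being serviced at once across all threads.
in_flight() {
    echo $(( $1 * $2 ))
}

# Values taken from the posted unbound.conf.
threads=112
queries_per_thread=4096
outgoing_range=8192

total=$(in_flight "$threads" "$queries_per_thread")
echo "possible in-flight queries: $total"
echo "outgoing-range:             $outgoing_range"
```

With the posted values the product is 458752, far above the configured outgoing-range of 8192, which is the mismatch the message above is pointing at.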

Andreas/Wouter

Thanks for your insight and feedback.

With the info/ideas from Andreas I have amended the parameters in the unbound.conf as follows:
         # Optimise settings
         num-threads: 64

This is likely to work against your goal, as thread management / query
distribution will start using more cpu than necessary. I'd rather tune
it down to a reasonable number that can still comfortably serve all
requests.

I did the same sort of optimization as you, and set threads to 16.
While it worked fine, I later switched to 4 threads after figuring out
a bit more about threads and how things work in the OS:

http://www.unbound.net/pipermail/unbound-users/2012-July/002452.html
https://unbound.net/pipermail/unbound-users/2012-February/002240.html
http://serverfault.com/questions/411868/multithreading-with-multi-queue-nic-on-smp-system

It may not apply directly to your Solaris setup, but it's worth
investigating.

My resolvers consistently serve 30k-40k+ reqs/sec (some threads are up
to around 12k reqs/sec each, due to uneven balance), with load around
0.6 and CPU use around 2.0-10.0% sys and ~2.0% user, depending on the
hardware it runs on (Intel Xeon L5520 or L5640, 2x socket, quad/hex
cores with hyperthreading). I run them on Linux 3.2.

sven
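
A rough way to compare the two setups is the average throughput per thread. The sketch below uses the 3600 qps ceiling reported at the start of this thread with 112 threads, and sven's 30k-40k figures under the assumption that they were measured with his 4-thread configuration (the message does not say so explicitly). The function name per_thread is made up for illustration.

```shell
#!/bin/sh
# per_thread QPS THREADS
# Average queries per second per thread (integer division, rounded down).
per_thread() {
    echo $(( $1 / $2 ))
}

per_thread 3600 112    # reported ceiling with num-threads: 112
per_thread 30000 4     # sven's lower figure, assuming 4 threads
per_thread 40000 4     # sven's upper figure, assuming 4 threads
```

This prints 32, 7500 and 10000: on these assumptions each of sven's threads does a few hundred times the work of one thread in the 112-thread setup, which fits the suggestion that the extra threads are not contributing.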

Quoting Jaco Lesch <jacol@saix.net>:

Andreas/Wouter

Thanks for your insight and feedback.

With the info/ideas from Andreas I have amended the parameters in the unbound.conf as follows:
        # Optimise settings
        num-threads: 64
        #
        msg-cache-slabs: 64
        rrset-cache-slabs: 64
        infra-cache-slabs: 64
        key-cache-slabs: 64
        rrset-cache-size: 1024m
        msg-cache-size: 512m
        infra-cache-numhosts: 100000
        #
        # Larger socket buffer
        # For Solaris 11 set the following UDP parameter 1st:
        # 'ipadm set-prop -p max_buf=8388608 udp'
        so-rcvbuf: 8m
        so-sndbuf: 8m
        #
        outgoing-range: 32768
        num-queries-per-thread: 1024

This improved performance significantly, with a jump from 3600 qps to 6200 qps, and the system was not sluggish under this "workload" at all. The magic seems to have been the combination of "outgoing-range" and "num-queries-per-thread"; the system load dropped from 28 to around 21 directly after this change.

I also recompiled with Wouter's suggestion of using only Solaris threads, but the gain was very small, if any. Perhaps this is something to keep in mind for the future? I have also stayed with 64 threads for now, but at a later stage will test again at a full 128 threads, when I feel brave again.

Hello,

As said, I doubt you will gain much with a high (>>4) number of threads. Threads are used to prevent the resolver from stalling when all slots (num-queries-per-thread) are busy with queries in flight. With 64 threads you have the potential of 64k not-yet-answered queries hammering your upstream. I would suggest you start with a really low value like 4 threads and see whether raising it raises the measured qps.

Regards

Andreas
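
The stepwise test suggested above could be prepared along these lines. This is only a sketch: the helper names set_threads and write_sweep_configs, the list of thread counts and the /tmp output paths are all made up for illustration, and the benchmark itself is left out because the thread does not say which load generator is in use.

```shell
#!/bin/sh
# set_threads N
# Copy an unbound.conf from stdin to stdout with num-threads set to N.
set_threads() {
    sed "s/^\( *num-threads:\).*/\1 $1/"
}

# write_sweep_configs CONF
# Write one copy of CONF per thread count to /tmp (hypothetical paths).
write_sweep_configs() {
    for n in 4 8 16 32 64; do
        set_threads "$n" < "$1" > "/tmp/unbound-threads-$n.conf"
    done
}
```

One would then start Unbound with each generated file in turn (unbound -c FILE), apply the same query load every time, and note the measured qps next to the thread count to see where it stops improving.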

Jaco Lesch wrote:

This improved performance significantly with a jump from 3600 qps to
6200 qps and the system was not sluggish with this "workload" at
all. The magic seems to have been the combination of
"outgoing-range" and "num-queries-per-thread", the system load did
drop from 28 to around 21 directly after this change.

can you clarify, is the workload a "production" resolver (so, a real
user base that is perhaps really only offering ~6k qps of load), or is
this a "benchmark" where the goal is to determine the maximum number of
qps that the server can handle?

either way, a load average of 21-28 sounds way too high. perhaps
solaris has some sort of tool (dtrace?) comparable to the linux "perf"
tool that would help pinpoint the cause of the poor performance?
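
As a possible starting point for that kind of investigation on Solaris, the sketch below wraps two standard tools: prstat with per-thread microstate accounting, and a DTrace profile probe that samples user stacks. The function names thread_states and hot_stacks are made up for illustration, the process name "unbound" is an assumption, and neither command has been run against the system discussed here.

```shell
#!/bin/sh
# Hypothetical helper names; prstat, pgrep and dtrace are standard Solaris tools.

# Per-thread microstates every 5 seconds: high LCK suggests lock contention,
# high LAT suggests threads waiting for a CPU, high SYS points at the kernel.
thread_states() {
    prstat -mL -p "$(pgrep -d, -x unbound)" 5
}

# Sample user-level stacks of unbound at 997 Hz for 30 seconds, then print
# the stacks seen most often (needs DTrace privileges).
hot_stacks() {
    dtrace -n 'profile-997 /execname == "unbound"/ { @[ustack()] = count(); }
               tick-30s { exit(0); }'
}
```

Running thread_states first is cheap and shows whether the 21-28 load average comes from threads burning CPU or from threads queueing for locks and run-queue slots; hot_stacks can then narrow down where the time goes.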