~10% performance increase in worker_handle_request()

Hi,

I've been testing Unbound 1.4.21's performance with the trafgen tool
from the netsniff-ng toolkit, with a stream of UDP queries for the same
qname/qtype in order to test Unbound's 100% cache hit rate performance,
and using Linux's traffic control facility to vary the query load
offered to the Unbound server. Then I use Linux's perf utility to
record performance data about the system under different query loads.

I read the Unbound documentation on performance tuning and
configured the following settings in my Unbound config file:

    num-threads: 4
    so-rcvbuf: 8m
    so-sndbuf: 8m
    msg-cache-slabs: 8
    rrset-cache-slabs: 8
    infra-cache-slabs: 8
    key-cache-slabs: 8

Under high load (but not under low load), the local_zone lookup in
worker_handle_request() appears to be quite costly. Here are the top 10
profiled functions, with the call chains of the top 2 expanded:

- 7.22% unbound unbound [.] dname_lab_cmp
   - dname_lab_cmp
      - 78.04% local_zone_cmp
           rbtree_find_less_equal
           local_zones_lookup
           worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      - 19.46% compress_tree_lookup.15131
           packed_rrset_encode.15189
           reply_info_encode.3179
           reply_info_answer_encode
           worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      + 0.88% rbtree_find_less_equal
      + 0.53% packed_rrset_encode.15189
- 5.60% unbound libpthread-2.17.so [.] pthread_spin_lock
   - pthread_spin_lock
      - 53.14% worker_handle_request.part.19.13072.2480
           comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
      - 46.86% comm_point_udp_callback.4353
           event_base_loop
         + comm_base_dispatch
+ 4.87% unbound [kernel.kallsyms] [k] _raw_spin_lock
+ 3.44% unbound [kernel.kallsyms] [k] _raw_spin_lock_bh
+ 3.35% unbound [kernel.kallsyms] [k] __ip_select_ident
+ 3.26% unbound [kernel.kallsyms] [k] fib_table_lookup
+ 3.12% unbound libpthread-2.17.so [.] pthread_rwlock_rdlock
+ 2.84% unbound [kernel.kallsyms] [k] fget_light
+ 2.80% unbound libpthread-2.17.so [.] pthread_rwlock_unlock
+ 2.78% unbound unbound [.] packed_rrset_encode.15189

I believe the calls to dname_lab_cmp() (#1) and pthread_spin_lock() (#2)
can ultimately be attributed to the call to local_zones_lookup() in
worker_handle_request().

The benchmark run that the above data is from resulted in Unbound
utilizing 1.938 CPUs, according to the "perf stat" command. With the
following patch, disabling local_zone lookups entirely, Unbound only
utilized 1.693 CPUs.

diff -Npru unbound-1.4.21.callgraph/daemon/worker.c unbound-1.4.21.callgraph2/daemon/worker.c
--- unbound-1.4.21.callgraph/daemon/worker.c 2013-09-19 06:51:33.000000000 -0400
+++ unbound-1.4.21.callgraph2/daemon/worker.c 2013-12-27 17:53:29.589149356 -0500
@@ -862,6 +862,7 @@ worker_handle_request(struct comm_point*
     server_stats_insrcode(&worker->stats, c->buffer);
     return 1;
   }
+ /*
   if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns,
     c->buffer, worker->scratchpad)) {
     regional_free_all(worker->scratchpad);
@@ -872,6 +873,7 @@ worker_handle_request(struct comm_point*
     server_stats_insrcode(&worker->stats, c->buffer);
     return 1;
   }
+ */
   if(!(LDNS_RD_WIRE(ldns_buffer_begin(c->buffer))) &&
     acl != acl_allow_snoop ) {
     ldns_buffer_set_limit(c->buffer, LDNS_HEADER_SIZE);

With the local_zone lookup moved to occur after the cache lookup, as in
the following patch, Unbound utilized only 1.712 CPUs.

diff -Npru unbound-1.4.21.callgraph/daemon/worker.c unbound-1.4.21.callgraph3/daemon/worker.c
--- unbound-1.4.21.callgraph/daemon/worker.c 2013-09-19 06:51:33.000000000 -0400
+++ unbound-1.4.21.callgraph3/daemon/worker.c 2013-12-27 18:32:01.916016428 -0500
@@ -862,16 +862,6 @@ worker_handle_request(struct comm_point*
     server_stats_insrcode(&worker->stats, c->buffer);
     return 1;
   }
- if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns,
- c->buffer, worker->scratchpad)) {
- regional_free_all(worker->scratchpad);
- if(ldns_buffer_limit(c->buffer) == 0) {
- comm_point_drop_reply(repinfo);
- return 0;
- }
- server_stats_insrcode(&worker->stats, c->buffer);
- return 1;
- }
   if(!(LDNS_RD_WIRE(ldns_buffer_begin(c->buffer))) &&
     acl != acl_allow_snoop ) {
     ldns_buffer_set_limit(c->buffer, LDNS_HEADER_SIZE);
@@ -924,6 +914,17 @@ worker_handle_request(struct comm_point*
   ldns_buffer_rewind(c->buffer);
   server_stats_querymiss(&worker->stats, worker);

+ if(local_zones_answer(worker->daemon->local_zones, &qinfo, &edns,
+ c->buffer, worker->scratchpad)) {
+ regional_free_all(worker->scratchpad);
+ if(ldns_buffer_limit(c->buffer) == 0) {
+ comm_point_drop_reply(repinfo);
+ return 0;
+ }
+ server_stats_insrcode(&worker->stats, c->buffer);
+ return 1;
+ }

(attachments)

unbound-1.4.21.callgraph2.diff (816 Bytes)
unbound-1.4.21.callgraph3.diff (1.38 KB)

Hi Robert,

Making the local_zone lookup occur after the cache lookup has
little effect under low load, but improves performance by ~10%
under high load. This has the side effect of making runtime updates
to the local_zone data non-instantaneous if there is a conflicting
entry in the cache, because the cache would be consulted first.
This side effect might be fixable by having updates to the
local_zone database do an implicit flush, which adds some
additional complexity, but it's probably worth doing for up to a
~10% performance benefit under certain loads.

Very interesting; some optimizations can be made by improving the local
zone lookup (and its locking).

However, moving the check after the cache check is not a good idea.
The localzone and localdata statements are supposed to be able to
override the DNS contents from the internet. Checking it earlier
means that this always happens. Messing with cache flushes sounds
very bad, a query may create a recursive query that would then
override the configuration and so on, lots of complexity as well as
worries about giving the 'wrong' answers (as compared to the
configuration).

It would be better to optimise the localzone lookup itself somehow.
Not sure what the best way is, it is visible that it uses a rbtree now
and that this is slower than the hashtable that the other cache employs.

Best regards,
   Wouter

W.C.A. Wijngaards wrote:

However, moving the check after the cache check is not a good idea.
The localzone and localdata statements are supposed to be able to
override the DNS contents from the internet. Checking it earlier
means that this always happens. Messing with cache flushes sounds
very bad, a query may create a recursive query that would then
override the configuration and so on, lots of complexity as well as
worries about giving the 'wrong' answers (as compared to the
configuration).

OK, that makes sense.

It would be better to optimise the localzone lookup itself somehow.
Not sure what the best way is, it is visible that it uses a rbtree now
and that this is slower than the hashtable that the other cache employs

Yes, the problem is not that the localzone lookup occurs first, per se,
but that the localzone lookup causes measurable contention with other
threads.

Am I wrong in assuming this would be a non-issue if localzone and
localdata entries also entered the cache?

Best regards, jo

Hi Robert,

W.C.A. Wijngaards wrote:

Hi Robert,

W.C.A. Wijngaards wrote:

W.C.A. Wijngaards wrote:

However, moving the check after the cache check is not a
good idea. The localzone and localdata statements are
supposed to be able to override the DNS contents from the
internet. Checking it earlier means that this always
happens. Messing with cache flushes sounds very bad, a query
may create a recursive query that would then override the
configuration and so on, lots of complexity as well as
worries about giving the 'wrong' answers (as compared to the
configuration).

OK, that makes sense.

It would be better to optimise the localzone lookup itself
somehow. Not sure what the best way is, it is visible that
it uses a rbtree now and that this is slower than the
hashtable that the other cache employs

Yes, the problem is not that the localzone lookup occurs first,
per se, but that the localzone lookup causes measurable
contention with other threads.

The lock is changed to a rwlock, so that the threads can all
acquire the readlock to answer queries. That should reduce
contention? The call order is not changed.

Hi Wouter,

I benchmarked svn r3047 (spinlock?) against svn r3048 (rwlock) and
the difference is not as dramatic. Maybe 0.5% - 1% speedup. It
doesn't appear to harm performance at any level, but it's also not
as dramatic as eliding the localzone lookup (~10%).

Too bad, I do not know how else to fix this. Duplicating the
localzone data for every worker is a lot of memory overhead (as well
as code and maintenance overhead whilst manipulating the localzone
data), and I think less desirable. It does reduce that lookup even
more, because there are no locks on it in that case. (You can already
compile Unbound today without locking and without threading, and run
Unbound with processes that duplicate the entire cache for every
worker, and then also duplicate the localzone data without locks for
every worker.)

Still, it is good to know this is a performance (or CPU-) heavy-point
at this time.

Best regards,
   Wouter