Improve avg response times

I am using an Amazon EC2 large instance (4 ECUs, 2 cores) for my Unbound server, configured as below. I am seeing a 150ms+ average response time, as reported by a namebench Alexa 2K run. To reduce lookup times, I run an hourly scan of these 35K sites (from the namebench data files) so that my clients get a cached response whenever possible. On average, my cache-miss rate is 6%, as shown below. My cache-min-ttl is 1 hour, so these entries should be cached at all times. I am guessing the cache misses come from sites that my pythonmod looks up and answers in a special way:

6.5Mbytes of free RAM

total.num.cachehits=3185
total.num.cachemiss=188
mem.cache.rrset=8319405
mem.cache.message=8729827
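The ~6% miss rate quoted above can be recomputed from these counters. A minimal sketch in Python, parsing the `name=value` lines as printed by `unbound-control stats`:

```python
# Compute the cache miss rate from unbound-control style "name=value"
# stats lines. The sample values are the counters quoted above.
stats_text = """\
total.num.cachehits=3185
total.num.cachemiss=188
mem.cache.rrset=8319405
mem.cache.message=8729827
"""

stats = {}
for line in stats_text.splitlines():
    name, _, value = line.partition("=")
    stats[name] = int(value)

total = stats["total.num.cachehits"] + stats["total.num.cachemiss"]
miss_rate = stats["total.num.cachemiss"] / total
print(f"miss rate: {miss_rate:.1%}")  # about 5.6%, i.e. the ~6% quoted above
```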

(forked configuration)

server:
    # disable chroot as it caused several issues with python's PYTHONHOME vars
    chroot: ""
    verbosity: 0
    # set to num of cores or cpus
    num-threads: 2
    ## slabs
    rrset-cache-slabs: 1
    infra-cache-slabs: 1
    key-cache-slabs: 1
    msg-cache-slabs: 1
    ## cache sizes
    msg-cache-size: 250m
    # 2X msg-cache-size
    rrset-cache-size: 500m
    outgoing-range: 950
    # 2X outgoing range
    num-queries-per-thread: 512
    # requires: sudo sysctl -w net.core.rmem_max=8388608
    so-rcvbuf: 8m
    interface: 0.0.0.0
    interface: ::0
    port: 53
    access-control: 0.0.0.0/0 allow
    module-config: "python iterator"
    prefetch: yes
    cache-min-ttl: 3600

python:
    python-script: "XYZ"

remote-control:
    control-enable: yes

forward-zone:
    name: "."
    forward-addr: XYZ

Question:

Even with this setup, most of the domains return a TTL of 3600 at the start of a fresh namebench run, which means they were recursed for instead of answered from cache. This is causing the 150ms+ average response times for these 35K sites. It is the exact same 35K sites being scanned by namebench; why aren't these answered from the cache instead of being iterated over? Are these sites not cached for the full 3600 seconds?

With prefetch and a cache-min-ttl of 1 hour, why isn't an hourly scan of these 35K sites populating my cache and giving me a <50ms average response time?

With the same setup, if I take 500 sites and run namebench back to back on those fixed 500 sites, my average response time approaches 40-50ms, which is where I am trying to be with the 35K sites.

Where am I going wrong, and how can I debug and fix this issue?

Vinay.

I tried using a prefetch trigger at 90% instead of 10% (intentionally inefficient) and a cache-min-ttl of 5400, so that an hourly scan is guaranteed to find a cached entry from the previous scan and will also reset its TTL back to 5400 by forcing a new iteration.
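The timing argument here can be sanity-checked with a little arithmetic. A sketch under the stated assumptions (prefetch refreshing an entry once its remaining TTL drops below a trigger fraction of the original TTL; 10% is the usual behavior, 90% is the modified trigger described above):

```python
# Check whether a periodic scan keeps an entry permanently cached, given a
# minimum TTL and a prefetch trigger expressed as a fraction of the TTL.
def stays_cached(scan_interval_s, min_ttl_s):
    # An entry written at one scan must still be alive at the next scan.
    return scan_interval_s < min_ttl_s

def prefetch_fires(scan_interval_s, ttl_s, trigger_fraction):
    # Prefetch refreshes an entry when a query arrives while its remaining
    # TTL is below trigger_fraction * ttl. Remaining TTL at the next scan:
    remaining = ttl_s - scan_interval_s
    return remaining > 0 and remaining < trigger_fraction * ttl_s

# Hourly scan (3600s), cache-min-ttl 5400, prefetch trigger at 90% of TTL:
print(stays_cached(3600, 5400))         # True: entry is still valid
print(prefetch_fires(3600, 5400, 0.9))  # True: 1800s left < 4860s trigger
print(prefetch_fires(3600, 5400, 0.1))  # False: 1800s left > 540s trigger
```

So on paper, an hourly scan with these settings should always hit a live entry and also trigger a prefetch refresh, which is exactly why the observed misses point at eviction rather than expiry.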

If I do a new namebench run, I still get a 150ms+ average response time, and I see several responses with a TTL of 5400, meaning they were cache misses.

blog.sina.com.cn.        A  814.8959   5400  2  blogx.sina.com.cn. -> 218.30.115.254
www.utorrent.com.        A  203.908    5400  1  67.215.233.130
in.youtube.com.          A   29.46731  2982  2  youtube-ui.l.google.com. -> 74.125.224.46, 74.125.224.32, 74.125.224.33, 74.125.224.34, 74.125.224.35, 74.125.224.36, 74.125.224.37, 74.125.224.38, 74.125.224.39, 74.125.224.40, 74.125.224.41
search.mywebsearch.com.  A   72.6796   2985  2  www154.mywebsearch.com. -> 74.113.233.48
wenwen.soso.com.         A  296.7475   5236  1  202.55.10.153
www.uploading.com.       A   26.0509   3007  1  195.191.207.40
www.ebay.com.            A   31.4418   5400  1  66.135.200.161, 66.135.200.181, 66.135.210.61, 66.135.210.181, 66.211.181.161, 66.211.181.181

So this looks like unexpected behavior to me: if www.utorrent.com was scanned less than an hour ago and was supposed to be cached for 5400 seconds, why would a random scan like the one above find it flushed from the cache (TTL back at 5400) and requiring a fresh iteration (203.9 ms response time)?

Hi Vinay,

The Unbound name server uses a least-recently-used (LRU) algorithm to decide what to evict when the cache gets full. There is no 'garbage collection' as in BIND. Doing it the way Unbound does should be somewhat faster and easier on the processor, but the flip side is that prefetching does not work all that well (good entries may get thrown out when prefetching).

I notice that you don't have that much memory. Maybe your memory is only large enough for, say, 10 minutes of cached entries. If nobody asks for www.utorrent.com during those 10 minutes, it will be evicted. Try increasing your cache memory size, or query the sites you want cached more often.
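The eviction behavior described here can be illustrated with a toy model (a sketch, not Unbound's actual implementation; Unbound's capacity is in bytes rather than entry counts): an entry with plenty of TTL left is still discarded the moment capacity runs out and it is the least recently used.

```python
from collections import OrderedDict

# Toy LRU cache. The eviction rule mirrors the one described above:
# least recently used goes first, regardless of remaining TTL.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, name, ttl):
        if name in self.entries:
            self.entries.move_to_end(name)
        self.entries[name] = ttl
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry

    def get(self, name):
        if name not in self.entries:
            return None                       # cache miss -> full recursion
        self.entries.move_to_end(name)
        return self.entries[name]

cache = LRUCache(capacity=2)
cache.put("www.utorrent.com.", ttl=5400)
cache.put("www.ebay.com.", ttl=5400)
cache.put("blog.sina.com.cn.", ttl=5400)  # capacity exceeded
print(cache.get("www.utorrent.com."))     # None: evicted despite TTL 5400
```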

/Stephan

Hi Stephan,

Thanks. It does indeed look like LRU eviction is happening. The RAM number was a typo: I had 6.5 GBytes (not MBytes) free, while even at peak load Unbound was using less than 50-100 MBytes of RAM across all of its components.

With sufficient memory and my setup below, can you see why least-recently-used entries are still being forced out of the cache?
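As a rough sanity check that raw cache capacity isn't the limit here, a back-of-the-envelope estimate (the ~512 bytes per cached entry is an assumption for illustration, not a measured Unbound figure):

```python
# Back-of-the-envelope: do 35K sites fit in a 500m rrset cache?
sites = 35_000
bytes_per_entry = 512                  # assumed average, for illustration
rrset_cache_bytes = 500 * 1024 * 1024  # rrset-cache-size: 500m

needed = sites * bytes_per_entry
print(f"needed: {needed / 1e6:.1f} MB of {rrset_cache_bytes / 1e6:.1f} MB")
print(needed < rrset_cache_bytes)      # True: raw size is not the limit
```

Even with a generous per-entry size, 35K sites need on the order of tens of megabytes, far below the configured 500m, which is what makes the evictions puzzling.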

To work around this behavior, I am running a continuous scan instead of an hourly cron job, so my LRU entries get re-queried every 25 minutes instead of every 60. With an 80% prefetch trigger and a 3600 cache-min-ttl, I now get 40-80ms average response times, mostly from the cache. The downside, however, is a steady 40-50% CPU load from all the scanning. This is clearly a very inefficient setup just to get my response times down to ~50ms.

I would love to hear how else I can keep my oldest entries from being purged from the cache when memory and cache-min-ttl are both large enough.

Vinay.