I am using an Amazon Large EC2 instance (4 ECUs, 2 cores) for my unbound, configured as below. I am seeing a 150ms+ average response time as reported by a namebench Alexa 2K run. To reduce my lookup times, I am running an hourly scan of these 35K sites (from the namebench dat files) so that clients get a cached response whenever possible. On average, my cache-miss rate is 6%, as shown below. My cache-min-ttl is 1 hour, so these entries should be cached at all times. The cache misses, I am guessing, are from sites my python module looks up and responds to in a special way:
6.5Mbytes of free RAM
total.num.cachehits=3185
total.num.cachemiss=188
mem.cache.rrset=8319405
mem.cache.message=8729827
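For reference, the miss rate follows directly from those two counters; a quick sketch using the numbers reported above:

```python
# Cache counters as reported by unbound above.
hits = 3185
misses = 188
total = hits + misses

miss_rate = misses / total
print(f"miss rate: {miss_rate:.1%}")  # prints "miss rate: 5.6%", i.e. roughly 6%
```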
(forked configuration)
server:
    #disable chroot as it caused several issues with python's PYTHONHOME vars
    chroot: ""
    verbosity: 0
    #set to num of cores or cpus
    num-threads: 2
    ##slabs
    rrset-cache-slabs: 1
    infra-cache-slabs: 1
    key-cache-slabs: 1
    msg-cache-slabs: 1
    ##cache sizes
    msg-cache-size: 250m
    #2X msg-cache-size
    rrset-cache-size: 500m
    outgoing-range: 950
    #2X outgoing range
    num-queries-per-thread: 512
    #requires: sudo sysctl -w net.core.rmem_max=8388608
    so-rcvbuf: 8m
    interface: 0.0.0.0
    interface: ::0
    port: 53
    access-control: 0.0.0.0/0 allow
    module-config: "python iterator"
    prefetch: yes
    cache-min-ttl: 3600
python:
    python-script: "XYZ"
remote-control:
    control-enable: yes
forward-zone:
    name: "."
    forward-addr: XYZ
Question:
Even with this setup, most of the domains return a TTL of 3600 at the start of a random namebench run, which means they were iterated/recursed over instead of answered from cache. This is causing 150ms+ average response times for these 35K sites. It's the exact same 35K sites being scanned by namebench – why aren't these answered from the cache instead of being iterated over? Are these sites not cached for the full 3600 seconds?
With prefetch, cache-min-ttl of 1hour, why isn’t an hourly scan of these 35K sites populating my cache and giving me a <50ms response time on average?
With the same setup, if I take 500 sites and run namebench back to back for these fixed 500 sites, my average response time starts approaching 40-50ms which is where I am trying to be with the 35K sites.
Where am I going wrong, and how can I debug and fix this issue?
Vinay.
I tried using a prefetch trigger at 90% instead of 10% (intentionally inefficient) and a cache-min-ttl of 5400, so that an hourly scan is guaranteed to find a cached entry from the last scan and will also reset its TTL back to 5400 by forcing a new iteration.
If I do a new namebench run, I still get a 150ms+ avg response time, and I see several responses with a TTL of 5400, meaning they were cache misses.
Hostname                | Type | Time (ms) | TTL  | Records | Answer
blog.sina.com.cn.       | A    | 814.8959  | 5400 | 2       | blogx.sina.com.cn. -> 218.30.115.254
www.utorrent.com.       | A    | 203.908   | 5400 | 1       | 67.215.233.130
in.youtube.com.         | A    | 29.46731  | 2982 | 2       | youtube-ui.l.google.com. -> 74.125.224.46, 74.125.224.32, 74.125.224.33, 74.125.224.34, 74.125.224.35, 74.125.224.36, 74.125.224.37, 74.125.224.38, 74.125.224.39, 74.125.224.40, 74.125.224.41
search.mywebsearch.com. | A    | 72.6796   | 2985 | 2       | www154.mywebsearch.com. -> 74.113.233.48
wenwen.soso.com.        | A    | 296.7475  | 5236 | 1       | 202.55.10.153
www.uploading.com.      | A    | 26.0509   | 3007 | 1       | 195.191.207.40
www.ebay.com.           | A    | 31.4418   | 5400 | 1       | 66.135.200.161, 66.135.200.181, 66.135.210.61, 66.135.210.181, 66.211.181.161, 66.211.181.181
So this looks like unexpected behavior to me: if www.utorrent.com was scanned less than an hour ago and was supposed to be cached for 5400 seconds, why would a random scan like the one above find it flushed out of the cache (TTL of 5400) and requiring an iteration (203.9 msecs response time)?
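One way to spot the fresh iterations in output like the rows above: any response whose TTL equals cache-min-ttl (5400 on this run) was just fetched, while a smaller TTL has been aging in the cache. A rough sketch, with hostnames and timings taken from the rows above:

```python
CACHE_MIN_TTL = 5400  # matches the cache-min-ttl used for this run

# (hostname, response time in ms, TTL) from the namebench rows above
rows = [
    ("blog.sina.com.cn.", 814.8959, 5400),
    ("www.utorrent.com.", 203.908, 5400),
    ("in.youtube.com.", 29.46731, 2982),
    ("search.mywebsearch.com.", 72.6796, 2985),
    ("wenwen.soso.com.", 296.7475, 5236),
    ("www.uploading.com.", 26.0509, 3007),
    ("www.ebay.com.", 31.4418, 5400),
]

# TTL == cache-min-ttl => the answer was (re)fetched for this query,
# i.e. a cache miss; anything lower has been sitting in the cache.
misses = [host for host, _, ttl in rows if ttl == CACHE_MIN_TTL]
print(misses)  # the three 5400-TTL rows
```

Note the correlation in the table: the 5400-TTL rows are also the slow ones (200ms+), while the aged-TTL rows answer in tens of milliseconds.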
Hi Vinay,
The unbound name server uses a Least Recently Used (LRU) algorithm to decide what to throw out when the cache gets full. There is no 'garbage collection' like in BIND. Doing it the way unbound does should be somewhat faster and nicer to the processor, but the flip side is that prefetching does not work quite as well (good stuff might get thrown out when prefetching).
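The eviction behavior Stephan describes can be sketched with a tiny LRU cache (an assumed simplification of unbound's per-slab LRU, not its actual implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: on overflow, evict the least recently used entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry

# A prefetched entry that nobody queries again becomes the oldest,
# so fresh traffic pushes it out regardless of its remaining TTL.
cache = LRUCache(capacity=2)
cache.put("www.utorrent.com.", "67.215.233.130")   # prefetched, never re-queried
cache.put("www.ebay.com.", "66.135.200.161")       # client query
cache.put("blog.sina.com.cn.", "218.30.115.254")   # client query -> triggers eviction
print(cache.get("www.utorrent.com."))  # prints "None": evicted despite a valid TTL
```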
I notice that you don't have that much memory. Maybe your memory is only large enough for, say, 10 minutes of cached entries. If nobody asks for www.utorrent.com during those 10 minutes, it will be kicked out. Try increasing your cache memory size, or ask more often for the sites that you want to make sure stay cached.
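A back-of-the-envelope check of whether 35K names could even fill the configured message cache (the per-entry size here is an assumption; real message and rrset entries vary in size):

```python
MSG_CACHE_SIZE = 250 * 1024 * 1024   # msg-cache-size: 250m from the config above
AVG_ENTRY_BYTES = 1024               # assumed, deliberately generous per-entry size

capacity = MSG_CACHE_SIZE // AVG_ENTRY_BYTES  # rough number of entries that fit
sites = 35_000
print(capacity, capacity >= sites)  # prints "256000 True"
```

Under this assumption the configured cache should hold the working set many times over, so actual memory pressure would have to come from much larger entries or from other traffic sharing the cache.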
/Stephan
Hi Stephan,
Thanks. It indeed seems like LRU eviction is happening. The RAM number was a typo – I had 6.5 GBytes (not MBytes) free, while even at peak load unbound was taking <50-100 MBytes of RAM across all of its components.
With sufficient memory and my setup below, can you see why least-recently-used entries are still forced out of the cache?
To work around this behavior, I am running a continuous scan instead of an hourly cronjob, so my LRU entries get re-queried every 25 minutes instead of every 60. With an 80% prefetch trigger and a 3600 cache-min-ttl, I am now getting 40-80ms avg response times, mostly from the cache. The downside, however, is a steady 40-50% CPU load from all the scanning. I definitely have a very inefficient setup right now just to get my response times down to ~50ms.
I would love to hear how else I can keep my oldest entries from being purged from the cache, given that memory and cache-min-ttl are both large enough.
Vinay.