Looping module stopped

Hi,

I’m seeing the following in my unbound logs:

Sep 09 23:53:20 unbound[96721:0] error: internal error: looping module stopped
Sep 09 23:53:20 unbound[96721:0] info: pass error for qstate <sipinternal.tcp.slb.com. SRV IN>
Sep 10 05:07:31 unbound[96721:0] error: internal error: looping module stopped
Sep 10 05:07:31 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 10 05:55:33 unbound[96721:0] error: internal error: looping module stopped
Sep 10 05:55:33 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 10 05:55:47 unbound[96721:0] error: internal error: looping module stopped
Sep 10 05:55:47 unbound[96721:0] info: pass error for qstate <sipinternal.tcp.slb.com. SRV IN>

Unbound continues to run, but I would like to know what is causing the errors.

Can anyone shed some light?

Cheers,

Gareth

Hi Gareth,

This error means that a module was activated about 1000 times for a single
query, which is unusual and should not happen. The query is answered with
SERVFAIL, and this log message is printed.

This should not be happening; usually, failing lookups give up much earlier.
My quick try says that the name is an NXDOMAIN, and it resolves fine for me.
So SERVFAIL instead of NXDOMAIN is no great loss for the users: the
failsafe activated and nothing bad happened.

So the question is: why is it looping (why is the state machine activated
so often)?

I would like to see a high-verbosity trace of a resolution of this name
when it reports looping; what is that module doing?
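For anyone who wants to capture such a trace, logging can be raised in unbound.conf; a minimal sketch (the logfile path is only an example):

```
# unbound.conf fragment: raise logging for a module-level trace
server:
    verbosity: 4                       # logs query processing and module states
    logfile: "/var/log/unbound.log"    # example path
    log-time-ascii: yes                # readable timestamps in the log
```

Restarting and re-querying the failing name should then show what the module is doing.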

Best regards,
   Wouter

Hi Wouter,

Thanks for the reply.

I’ll enable logging on one of my dev boxes and see if I get the same errors.

Latest from the production box:

Sep 13 18:59:05 unbound[96721:0] error: internal error: looping module stopped
Sep 13 18:59:05 unbound[96721:0] info: pass error for qstate <sipinternal.tcp.slb.com. SRV IN>
Sep 13 19:44:01 unbound[96721:0] error: internal error: looping module stopped
Sep 13 19:44:01 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 13 20:31:11 unbound[96721:0] error: internal error: looping module stopped
Sep 13 20:31:11 unbound[96721:0] info: pass error for qstate <sipinternal.tcp.slb.com. SRV IN>
Sep 13 20:48:15 unbound[96721:0] error: internal error: looping module stopped
Sep 13 20:48:15 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 13 22:07:07 unbound[96721:0] error: internal error: looping module stopped
Sep 13 22:07:07 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 14 06:35:03 unbound[96721:0] error: internal error: looping module stopped
Sep 14 06:35:03 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 14 07:04:02 unbound[96721:0] error: internal error: looping module stopped
Sep 14 07:04:02 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>
Sep 14 07:24:37 unbound[96721:0] error: internal error: looping module stopped
Sep 14 07:24:37 unbound[96721:0] info: pass error for qstate <sipinternaltls.tcp.slb.com. SRV IN>

Cheers,

Gareth

I’ve just seen the following regarding the _tcp.slb.com zonefile.

There are 262 nameservers specified. Unbound should not have any issues with that, right?

See attached.

Cheers,

Gareth

(attachments)

unbound.txt (20.9 KB)

Hi Gareth,

If the 262 nameservers all time out, and each gets 4 tries, it hits
the 1000 limit, of course.

I'll have to increase that limit ...
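The arithmetic is easy to check; assuming roughly 4 tries per server, 262 servers already exceed the cap:

```shell
# 262 nameservers times 4 tries each, versus the ~1000 activation failsafe
tries=$((262 * 4))
echo "$tries"    # prints 1048
```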

Best regards,
   Wouter

W.C.A. Wijngaards wrote:

Hi Gareth,

If the 262 nameservers all time out, and each gets 4 tries, it hits
the 1000 limit, of course.

A single query for "-t ns _tcp.slb.com" on the unbound svn build from ~2
weeks ago causes:
error: internal error: looping module stopped

I'm wondering, does this impact other, regular queries if enough queries
for the broken zone come in?

Kind regards,

Felix

Hi

Currently we are running Bind 9.5+ on Solaris 10/Sun Netra 240s, and
RHEL 5.5 on Dell 2970s.

I have been trying different flavours of Unbound, both compiling myself,
and using the precompiled flavours from M n M.

I am also trying out Nominum, as well as Bind and Unbound.

I see a marked improvement in QPS over Bind on the Solaris 10 servers in
both Unbound, and Nominum.

On the RHEL servers I see improvement over Bind with Nominum (I get
20,000+ QPS with Bind alone), but degradation when trying Unbound, both
with my own compiles and with the precompiled flavours from M n M.

Does anyone have an unbound.conf with reasonable QPS on a RHEL 5.x server
that I can see?

Much Thanks

Bruce

Bruce Hayward, MTS Allstream Inc., (p) 204-958-1983 (e)
bruce.hayward@mtsallstream.com

Is it really necessary to print this email?

MTS ALLSTREAM INC. CONFIDENTIALITY WARNING: This email message is confidential and intended only for the named recipient(s). If you are not the intended recipient, or an agent responsible for delivering it to the intended recipient, or if this message has been sent to you in error, you are hereby notified that any review, use, dissemination, distribution or copying of this message or its contents is strictly prohibited. If you have received this message in error, please notify the sender immediately and delete the original message. If there is an agreement attached with this message, such agreement will not be binding until it is signed by all parties named therein.

Note that the EPEL builds for unbound use --enable-debug, which might make it slower.
I am not sure how big the impact is. The stock unbound.conf is also not aggressively
using most of your memory for a dedicated nameserver, so you might want to tweak the
stock config file. For some hints, see:
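For reference, the knobs usually involved are along these lines; a sketch only, with illustrative values rather than recommendations for any particular box:

```
# unbound.conf fragment: example tuning for a dedicated resolver
server:
    num-threads: 4            # match the number of cores
    msg-cache-size: 256m      # stock default is much smaller
    rrset-cache-size: 512m    # commonly about twice msg-cache-size
    msg-cache-slabs: 4        # power of 2, near num-threads; reduces lock contention
    rrset-cache-slabs: 4
    outgoing-range: 4096      # more sockets per thread for high QPS
    so-rcvbuf: 4m             # larger receive buffer absorbs query bursts
```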

Paul

Thanks, I have already used that page on optimizing

For my own compile I have been using:

./configure --prefix=/opt/unbound --with-libs=/usr/local/lib
--libexecdir=/opt/unbound/lib --sysconfdir=/var/unbound/etc
--sharedstatedir=/var/unbound --localstatedir=/var/unbound
--with-conf-file=/var/unbound/etc/unbound.conf
--with-run-dir=/var/unbound --with-chroot-dir=/var/unbound
--with-pidfile=/var/unbound/run/unbound.pid --with-username=unbound
--with-openssl=/lib64 --without-pthreads --without-solaris-threads
--with-libevent=/usr/local/libevent/

Is the default --enable-debug?

Bruce


Thanks, I have already used that page on optimizing

Okay.

For my own compile I have been using:

./configure --prefix=/opt/unbound --with-libs=/usr/local/lib
--libexecdir=/opt/unbound/lib --sysconfdir=/var/unbound/etc
--sharedstatedir=/var/unbound --localstatedir=/var/unbound
--with-conf-file=/var/unbound/etc/unbound.conf
--with-run-dir=/var/unbound --with-chroot-dir=/var/unbound
--with-pidfile=/var/unbound/run/unbound.pid --with-username=unbound
--with-openssl=/lib64 --without-pthreads --without-solaris-threads
--with-libevent=/usr/local/libevent/

I would not use chroot on a dedicated nameserver. All your important stuff
is already inside the chroot, not outside it. Also, with RHEL/CentOS you
should use and trust the SELinux policies - they provide a much better
security context without having to install or link various (sometimes outdated)
binaries, special devices, or config files into the chroot. And no surprises
when you send the daemon a signal and it can no longer read its config
files or includes.

Is the default --enable-debug?

No, it is not the default, so you should be fine. It is still surprising that
you're not outrunning bind, though. Are you sure you are comparing similar
configurations, e.g. with DNSSEC validation and the root key loaded, and perhaps
with DLV?

What version of libevent are you using?
Why are you disabling threads?
Is it finding SSL? (You did not add --with-ssl.) I've seen a lot of speed
differences with different versions of OpenSSL.

Paul

All your important stuff
is already inside the chroot, not outside it.

Assuming there is a bug in unbound (OpenBSD are thinking of adopting it,
so it must be good) that makes where your important stuff is matter,
then likely so do all the binaries etc. (if they have not been
removed) that may be used for privilege escalation. It certainly can't
harm.

(sometimes outdated)
binaries or special devices or config files in the chroot.

Will you look after it, or leave it to get dusty?

Is it finding ssl (you did not add --with-ssl). I've seen a lot of
speed differences with different versions of openssl.

Can you remember which one was slow and which was fast?

Assuming there is a bug in unbound (OpenBSD are thinking of adopting it,
so it must be good) that makes where your important stuff is matter,
then likely so do all the binaries etc. (if they have not been
removed) that may be used for privilege escalation. It certainly can't
harm.

What I meant was: the only valuable data on a dedicated nameserver resides
within the chroot; there is no need to get outside it. It's the compromise
of the nameserver data that matters, not the host (the host is really just
a container).

(sometimes outdated)
binaries or special devices or config files in the chroot.

Will you look after it, or leave it to get dusty?

I don't use chroot. So I do not have duplicate/old binaries around.

Is it finding SSL? (You did not add --with-ssl.) I've seen a lot of
speed differences with different versions of OpenSSL.

Can you remember which one was slow and which was fast?

0.9.[678] was faster than 1.0.0-beta, but I think 1.0.0 was fastest.

Paul

Hi Bruce,

Not sure, but could this be the kernel version of RHEL5 (which is very
old), which has a 'thundering herd' bug so that only your first thread
gets to do all the work? (I have had private reports about such old
kernels showing bad behaviour.)

Could it be that if you hard-line the interfaces (not 0.0.0.0 in the
unbound.conf but the exact IP value) then the OS route-selection does
not take time to determine a source interface? (Unless you use
interface-automatic, then this does not make a difference).

You are increasing num-threads, right? (even if you compile without
pthreads, num-threads will still use multiple CPUs).
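In unbound.conf terms those two suggestions look something like this (the address is only an example):

```
# unbound.conf fragment: bind an exact address instead of 0.0.0.0
server:
    interface: 192.0.2.53    # example IP; avoids per-reply source-route selection
    num-threads: 4           # still uses multiple CPUs even without pthreads
```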

Best regards,
   Wouter

One area of Bind that we use is views to direct traffic.

Before we can switch to Unbound, we would need a means of emulating
views.

In researching this (on Google) I came across a thread discussing this:
http://www.mail-archive.com/unbound-users@unbound.net/msg00337.html

Has anyone documented steps to accomplish this?

Thanks

Bruce



Hey Bruce,

I think it's pretty well documented in the mail you sent a link
to... you set up two unbound instances and mangle the traffic from a
set of IP addresses using the standard firewall/NAT features your
operating system has.

Anyway, if you can explain what you are trying to accomplish, then
maybe we can propose an alternative without views.

Ondrej

Hey

On specific resolvers we use bind views to direct those who come from an IP in a specific CIDR to use a specific zone. We have two cases of these views.

We also use views to isolate those that should only use internal zones versus those that should not use internal zones (external customers)

Those that do not come from an IP in a specific CIDR use a global zone.

"Views" were introduced in Bind 9.

http://oreilly.com/pub/a/oreilly/networking/news/views_0501.html

Bruce


Hi Bruce,

It should be fairly easy to accomplish both options using DNAT on Linux
(or other translation mechanisms, either on the router or on the
end box).

For example, on Linux you can use:

- 10.10.10.1 is the normal address
- 10.10.10.2 is an extra address you use to serve internal clients (can
be localhost if NATed on the box)
- 192.168.1.1/32 is the specific CIDR

iptables -t nat -A PREROUTING -s 192.168.1.1/32 -d 10.10.10.1 -j DNAT
--to-destination 10.10.10.2

If you do the NAT on the router in front, it has the added benefit of
splitting the load (so you can provide a less loaded service to your
customers, etc.).
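Putting the pieces together, the instance serving the internal clients (the DNAT target) might be configured roughly like this; the addresses and zone name are examples:

```
# unbound.conf for the internal-clients instance at 10.10.10.2
server:
    interface: 10.10.10.2
    access-control: 192.168.1.0/24 allow   # the internal CIDR
    # internal-only answers served by this instance
    local-zone: "internal.example.com." static
    local-data: "host1.internal.example.com. IN A 10.0.0.10"
```

The instance at 10.10.10.1 would carry the global data, with no knowledge of the internal zone.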

Ondrej

Hi Ondrej

Thanks for the direction, but how does Unbound know to serve a specific zone to IPs in a given range?

Bruce


Unbound doesn't have to know. You just configure multiple instances
of unbound (e.g. running on 127.0.0.2, 127.0.0.3, etc.) and do all
the logic at the routing level. Of course it's not suitable if you
have a complicated setup with many views or overlapping views.

AFAIK the design decision for unbound was to keep it simple,
efficient, secure and fast, so it doesn't implement everything you'll
find in other DNS software.

Ondrej