Unbound stop working without error-log

Hello

we have unound 1.4.6 operating as caching resolver asking a border Bind 9.7 as forwarder. This has worked for some time but today we got trouble to resolve more and more domains/hosts. A restart a unbound cleared the problem for some time but it failed again after some minutes. There is nothing in the normal log either for the border Bind 9.7 or for unbound. I have captured one failing event (www.computeruniverse.net) on the wire an with unbound loglevel 4. The border Bind does resolve fine and quick, but unbound does not deliver the result further on. This has happened for many different domains over night according to mailserver logs. Any hints how to debug this problem?

Thanks

Andreas

(attachments)

unbound-problem.zip (7.67 KB)

Hi Andreas,

we have unound 1.4.6 operating as caching resolver asking a border Bind
9.7 as forwarder. This has worked for some time but today we got trouble
to resolve more and more domains/hosts. A restart a unbound cleared the
problem for some time but it failed again after some minutes. There is
nothing in the normal log either for the border Bind 9.7 or for unbound.
I have captured one failing event (www.computeruniverse.net) on the wire
an with unbound loglevel 4. The border Bind does resolve fine and quick,
but unbound does not deliver the result further on. This has happened
for many different domains over night according to mailserver logs. Any
hints how to debug this problem?

In the debug log, it seems like unbound receives the reply just fine,
and should be trying to do a send() call to send a reply to the user.

The tcpdump does not show any reply from unbound to a user. Unbound
must be trying to send, and the syscall seems to work, or it would log
an error. Is there some routing issue? (try interface-automatic: yes?)

Best regards,
   Wouter

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

we have unound 1.4.6 operating as caching resolver asking a border Bind
9.7 as forwarder. This has worked for some time but today we got trouble
to resolve more and more domains/hosts. A restart a unbound cleared the
problem for some time but it failed again after some minutes. There is
nothing in the normal log either for the border Bind 9.7 or for unbound.
I have captured one failing event (www.computeruniverse.net) on the wire
an with unbound loglevel 4. The border Bind does resolve fine and quick,
but unbound does not deliver the result further on. This has happened
for many different domains over night according to mailserver logs. Any
hints how to debug this problem?

In the debug log, it seems like unbound receives the reply just fine,
and should be trying to do a send() call to send a reply to the user.

The tcpdump does not show any reply from unbound to a user. Unbound
must be trying to send, and the syscall seems to work, or it would log
an error. Is there some routing issue? (try interface-automatic: yes?)

Best regards,
   Wouter

The request is orginating at localhost with "nameserver 127.0.0.1" as the only content in resolv.conf so routing should not be the problem. If have disabled "interface-automatic: yes" because the machine is multihomed and have bound unbound to the needed interfaces "by hand" with the "interface:" statement. I have also set "do-ip6: no" because the machine has ipv6 capabilities but no ipv6 connectivity until yet.

Regards

Andreas

Hi Andreas,

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

The request is orginating at localhost with "nameserver 127.0.0.1" as
the only content in resolv.conf so routing should not be the problem. If
have disabled "interface-automatic: yes" because the machine is
multihomed and have bound unbound to the needed interfaces "by hand"
with the "interface:" statement. I have also set "do-ip6: no" because
the machine has ipv6 capabilities but no ipv6 connectivity until yet.

Looks like this is reasonable config to me.

So the packetdump was from the eth interface, and did not include lo0
traffic, that is why the reply is not in the packet capture?

Best regards,
   Wouter

The packet capture is from the connection unbound (10.5.0.1) -> upstream bind (10.5.0.3) to see if bind deliver results or not. If have done no localhost capture but i can try later. It is a bit difficult because it's the main resolver for the internal net and people don't like to be without internet access today :wink:

Regards

Andreas

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

The request is orginating at localhost with "nameserver 127.0.0.1" as
the only content in resolv.conf so routing should not be the problem. If
have disabled "interface-automatic: yes" because the machine is
multihomed and have bound unbound to the needed interfaces "by hand"
with the "interface:" statement. I have also set "do-ip6: no" because
the machine has ipv6 capabilities but no ipv6 connectivity until yet.

Looks like this is reasonable config to me.

So the packetdump was from the eth interface, and did not include lo0
traffic, that is why the reply is not in the packet capture?

Best regards,
   Wouter

Caught an other one:

Nov 2 12:34:02 mailer unbound: [10971:1] debug: validator[module 0] operate: extstate:module_state_initial event:module_event_new
Nov 2 12:34:02 mailer unbound: [10971:1] info: validator operate: query <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] debug: iterator[module 1] operate: extstate:module_state_initial event:module_event_pass
Nov 2 12:34:02 mailer unbound: [10971:1] info: resolving <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] info: processQueryTargets: <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] info: sending query: <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] debug: sending to target: <.> 10.5.0.3#53
Nov 2 12:34:02 mailer unbound: [10971:1] debug: cache memory msg=167968 rrset=261019 infra=2880 val=92563
Nov 2 12:34:02 mailer unbound: [10971:1] debug: iterator[module 1] operate: extstate:module_wait_reply event:module_event_reply
Nov 2 12:34:02 mailer unbound: [10971:1] info: iterator operate: query <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] info: response for <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] info: reply from <.> 10.5.0.3#53
Nov 2 12:34:02 mailer unbound: [10971:1] info: query response was ANSWER
Nov 2 12:34:02 mailer unbound: [10971:1] info: finishing processing for <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] debug: validator[module 0] operate: extstate:module_wait_module event:module_event_moddone
Nov 2 12:34:02 mailer unbound: [10971:1] info: validator operate: query <interrisk.de. MX IN>
Nov 2 12:34:02 mailer unbound: [10971:1] debug: cache memory msg=168222 rrset=261597 infra=2880 val=92563

The packet capture is from localhost for this event.

Regards

Andreas

(attachments)

server-failure.cap (200 Bytes)

Hi Andreas,

There is no failure in these logs, it looks fine. Maybe if you increase
verbosity even more (it'll log the exact packet it receives and so on)?
The answer from bind seems fine, the log looks fine, but the response
to the client is SERVFAIL. Your config also looks fine.

Best regards,
   Wouter

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

There is no failure in these logs, it looks fine. Maybe if you increase
verbosity even more (it'll log the exact packet it receives and so on)?
The answer from bind seems fine, the log looks fine, but the response
to the client is SERVFAIL. Your config also looks fine.

That's the point which is baffling me too. After all it is working for some time...
I will try to do level 5 logging and see what is happening.

Regards

Andreas

Update for the disaster-tourists on the list - unbound logs with
val-log-level: 2 that the upstream bind sends expired signatures -
sleuthing continues ...

Best regards,
   Wouter

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Update for the disaster-tourists on the list - unbound logs with
val-log-level: 2 that the upstream bind sends expired signatures -
sleuthing continues ...

It seems more that unbound and bind disagree in their opinion if the signature is expired or not. As said the time unbound starts failing the same queries done directly to the upstream resolve *and* validate fine. So the options are:

- Bind does not send the same data it is using for validation to the downtsream (unbound) client. Would be a Bind bug i guess.
- Unbound and Bind do validation different (should not happen IMHO)
- Validation in Unbound for some cases is broken. Would be a bug in Unbound i guess.

It would be nice to get help how to debug this as DNSSEC "by-hand" is somewhat challenging.

Regards

Andreas

Hi Andreas,

It seems more that unbound and bind disagree in their opinion if the
signature is expired or not. As said the time unbound starts failing the
same queries done directly to the upstream resolve *and* validate fine.
So the options are:

That is strange. Your clocks are synchronised, so that is not it.
Could it have been the recent daylight-savings change somehow?

Both bind and unbound may have some leeway for expired signatures that
you can configure; val-sig-skew-max and val-sig-skew-min config options
for unbound.

- Bind does not send the same data it is using for validation to the
downtsream (unbound) client. Would be a Bind bug i guess.

Try doing a dig @<bind> name +dnssec and then with +dnssec +cdflag. If
that is different, then this is happening.

- Unbound and Bind do validation different (should not happen IMHO)

Yes.

- Validation in Unbound for some cases is broken. Would be a bug in
Unbound i guess.

Well, when unbound refuses to validate it, enable val-log-level: 2, and
take a look in the log file, it gives a detailed error. Then dig
+dnssec and dig +dnssec +cdflag when it mentions (also to the unbound so
see what is in the cache, and also at the IP address it mentions).

If you enable val-log-level: 2 (and you can have verbosity low), it
gives one line per validation failure. This is a (relatively) low
amount of logging, but very useful, as it tells you why exactly unbound
failed the query.

It would be nice to get help how to debug this as DNSSEC "by-hand" is
somewhat challenging.

This is pretty easy, the RRSIG notes ....
  RRSIG bla bla expiration inception bla bla.
They are in yyyymmddhhmmss format UTC.

Most signers leave a couple weeks headroom in the expiration date.

Best regards,
   Wouter

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

It seems more that unbound and bind disagree in their opinion if the
signature is expired or not. As said the time unbound starts failing the
same queries done directly to the upstream resolve *and* validate fine.
So the options are:

That is strange. Your clocks are synchronised, so that is not it.
Could it have been the recent daylight-savings change somehow?

Both bind and unbound may have some leeway for expired signatures that
you can configure; val-sig-skew-max and val-sig-skew-min config options
for unbound.

- Bind does not send the same data it is using for validation to the
downtsream (unbound) client. Would be a Bind bug i guess.

Try doing a dig @<bind> name +dnssec and then with +dnssec +cdflag. If
that is different, then this is happening.

- Unbound and Bind do validation different (should not happen IMHO)

Yes.

- Validation in Unbound for some cases is broken. Would be a bug in
Unbound i guess.

Well, when unbound refuses to validate it, enable val-log-level: 2, and
take a look in the log file, it gives a detailed error. Then dig
+dnssec and dig +dnssec +cdflag when it mentions (also to the unbound so
see what is in the cache, and also at the IP address it mentions).

If you enable val-log-level: 2 (and you can have verbosity low), it
gives one line per validation failure. This is a (relatively) low
amount of logging, but very useful, as it tells you why exactly unbound
failed the query.

It would be nice to get help how to debug this as DNSSEC "by-hand" is
somewhat challenging.

This is pretty easy, the RRSIG notes ....
  RRSIG bla bla expiration inception bla bla.
They are in yyyymmddhhmmss format UTC.

Most signers leave a couple weeks headroom in the expiration date.

I will try to capture the follwoing:

- Logging from unbound as suggested
- dig from both as the error happens
- Packet dump from unbound <--> bind communication over the wire

May i send this to you in private to not clutter the list with attachments?

Thanks for your help

Andreas

* W. C. A. Wijngaards:

That is strange. Your clocks are synchronised, so that is not it.

Process time and system time are not necessarily equal. Depending on
which clock you ask, there might be some magic that prevents the
process clock from going backwards and jumping forward (which would
otherwise wreck havoc with timeout handling in select() loops).

clock_gettime(CLOCK_MONOTONIC) should not be affected by this.
I'm not sure about the gettimeofday() behavior.

Zitat von lst_hoe02@kwsoft.de:

Zitat von "W.C.A. Wijngaards" <wouter@NLnetLabs.nl>:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andreas,

It seems more that unbound and bind disagree in their opinion if the
signature is expired or not. As said the time unbound starts failing the
same queries done directly to the upstream resolve *and* validate fine.
So the options are:

That is strange. Your clocks are synchronised, so that is not it.
Could it have been the recent daylight-savings change somehow?

Both bind and unbound may have some leeway for expired signatures that
you can configure; val-sig-skew-max and val-sig-skew-min config options
for unbound.

- Bind does not send the same data it is using for validation to the
downtsream (unbound) client. Would be a Bind bug i guess.

Try doing a dig @<bind> name +dnssec and then with +dnssec +cdflag. If
that is different, then this is happening.

- Unbound and Bind do validation different (should not happen IMHO)

Yes.

- Validation in Unbound for some cases is broken. Would be a bug in
Unbound i guess.

Well, when unbound refuses to validate it, enable val-log-level: 2, and
take a look in the log file, it gives a detailed error. Then dig
+dnssec and dig +dnssec +cdflag when it mentions (also to the unbound so
see what is in the cache, and also at the IP address it mentions).

If you enable val-log-level: 2 (and you can have verbosity low), it
gives one line per validation failure. This is a (relatively) low
amount of logging, but very useful, as it tells you why exactly unbound
failed the query.

It would be nice to get help how to debug this as DNSSEC "by-hand" is
somewhat challenging.

This is pretty easy, the RRSIG notes ....
  RRSIG bla bla expiration inception bla bla.
They are in yyyymmddhhmmss format UTC.

Most signers leave a couple weeks headroom in the expiration date.

I will try to capture the follwoing:

- Logging from unbound as suggested
- dig from both as the error happens
- Packet dump from unbound <--> bind communication over the wire

I was not able to reproduce the problem any more but maybe it was even related to that one:

Fix for Bind 9.7.2-P3:
It was discovered that Bind would incorrectly mark zone data as insecure
when the zone is undergoing a key algorithm rollover. (CVE-2010-3614)

Regards

Andreas