serve-expired: "yes" and cache-min-ttl: 30 unsafe?

Dear Folks,

Thank you for an excellent piece of software.

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

By multilevel, I mean clients talk to one server, which forwards to
another, and for some clients, there is a third level of caching.

So it was unwise to add:
serve-expired: "yes"
cache-min-ttl: 30

to the server section of these DNS servers running unbound 1.6.8 on
up-to-date RHEL 7? Please could anyone cast some light on why this
was so? I will be spending some time examining the cause.
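
For reference, the added fragment of the unbound.conf server section looked roughly like this (a minimal sketch of just the two options mentioned above; comments summarise the documented behaviour):

```
server:
    # allow answering with expired (stale) records from cache
    serve-expired: yes
    # never cache an RRset for less than 30 seconds, even if the
    # authoritative TTL is lower (including 0)
    cache-min-ttl: 30
```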

If you need more information, please let me know.

Dear Folks,

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

Dear Folks,

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

We use 1.6.8 with both those settings, and observed prolonged SERVFAIL periods.

In our case, the upstream server became inaccessible for a period of time, but when contact resumed the SERVFAILs persisted.

We reduced the infra-host-ttl value to compensate.

(Why is infra-host-ttl's default 900 seconds? That seems like a long time to wait to retry the upstream server.)

    M.

Dear Marc,

Thank you for your reply.

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

We use 1.6.8 with both those settings, and observed prolonged SERVFAIL periods.

In our case, the upstream server became inaccessible for a period of time, but when contact resumed the SERVFAILs persisted.

This behaviour was quite catastrophic, and to me, unexpected.

Do you have any idea of the mechanism behind this failure?

Is there a way to deal better with zero TTL names?

We reduced the infra-host-ttl value to compensate.

Did that bring your system to a functioning condition?

Dear Folks,

I would never have expected that a combination of these two apparently
innocuous configuration values could cause a massive outage.

This appears to be a very serious bug in Unbound. Does anyone think
this behaviour (described below) is in any way expected?

Dear Marc,

Thank you for your reply.

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

We use 1.6.8 with both those settings, and observed prolonged SERVFAIL periods.

In our case, the upstream server became inaccessible for a period of time, but when contact resumed the SERVFAILs persisted.

This behaviour was quite catastrophic, and to me, unexpected.

Do you have any idea of the mechanism behind this failure?

Is there a way to deal better with zero TTL names?

We reduced the infra-host-ttl value to compensate.

(Sorry for my slow response -- this slipped through the cracks.)

Did that bring your system to a functioning condition?

Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we are only affected by this for (up to) 30 seconds after upstream access returns. That is adequate for our purposes.

So I think the mechanism is pretty clear, and I think it's good for unbound to cache the upstream server's status for a period of time. I'm just not convinced that 900 seconds is a reasonable default time.
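
For anyone wanting to try the same workaround, this is the fragment we use (a sketch; the value is the one described above, and the default is 900):

```
server:
    # cache upstream host infrastructure data (RTT estimates,
    # timeout backoff, lameness) for only 30 seconds instead of
    # the default 900, so a recovered forwarder is retried sooner
    infra-host-ttl: 30
```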

(BTW, our case has nothing to do with zero TTL names: The IP address configured as the zone's forward-addr became inaccessible. No names involved. That said, I do not know how unbound deals with 0-TTL names.)

I do not think our case is a bug. It also has nothing to do with serve-expired or cache-min-ttl. But since we use those settings, I wanted to relate our experience with a confusing SERVFAIL situation.

In your multi-level system, are you 100% sure that all the forward-addr IPs are *always* accessible? If they are, then you may be seeing SERVFAILs for a different reason.

    M.

Dear Marc and anyone else interested in why severe outages can be
caused by serve-expired: "yes" and cache-min-ttl: 30:

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

We use 1.6.8 with both those settings, and observed prolonged SERVFAIL periods.

In our case, the upstream server became inaccessible for a period of time, but when contact resumed the SERVFAILs persisted.

This behaviour was quite catastrophic, and to me, unexpected.

And career affecting.

Do you have any idea of the mechanism behind this failure?

Is there a way to deal better with zero TTL names?

We reduced the infra-host-ttl value to compensate.

(Sorry for my slow response -- this slipped through the cracks.)

Did that bring your system to a functioning condition?

Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we are only affected by this for (up to) 30 seconds after upstream access returns. That is adequate for our purposes.

So I think the mechanism is pretty clear, and I think it's good for unbound to cache the upstream server's status for a period of time. I'm just not convinced that 900 seconds is a reasonable default time.

(BTW, our case has nothing to do with zero TTL names: The IP address configured as the zone's forward-addr became inaccessible. No names involved. That said, I do not know how unbound deals with 0-TTL names.)

I do not think our case is a bug. It also has nothing to do with serve-expired or cache-min-ttl. But since we use those settings, I wanted to relate our experience with a confusing SERVFAIL situation.

How busy are your systems?

In your multi-level system, are you 100% sure that all the
forward-addr IPs are *always* accessible? If they are, then you may
be seeing SERVFAILs for a different reason.

Absolutely; they are all in our local network. And when I removed
those two configuration values, everything came back to normal
behaviour almost immediately. Perhaps a distinguishing factor is that
some of these systems handle on the order of 50,000 mixed queries
per second.

The result was so unexpected and surprisingly severe that I categorise
our situation as the result of a very serious bug. Tomorrow there are
repercussions for me personally.

Defining the bug is complicated by the fact that before this
happened, I had chosen to change jobs within the same company, so I no
longer have access to these systems to test the effects of those
configuration values. I don't know if it was one, the other, or a
combination of both that caused the problem. Perhaps no one but me
wants to find out.

(Why is infra-host-ttl's default 900 seconds? That seems like a long time to wait to retry the upstream server.)

    M.

By multilevel, I mean clients talk to one server, which forwards to
another, and for some clients, there is a third level of caching.

So it was unwise to add:
serve-expired: "yes"
cache-min-ttl: 30

to the server section of these DNS servers running unbound 1.6.8 on
up-to-date RHEL 7?

Hint: the answer is an unreserved "YES!".

Hi Nick,

Sorry to hear Unbound has caused you problems. I'm trying to figure out
the reason for the observed SERVFAIL responses.

Was the serve-expired and cache-min-ttl configured on the Unbound
instance that has the forward configured, or the instance the queries
are forwarded to? Or both?

Any chance the SERVFAILs were only for DNSSEC-signed domains? Did you
have a chance to see the reason for the SERVFAIL responses in the
Unbound log? Maybe the forwarder was returning expired DNSSEC
signatures?

-- Ralph

Dear Marc and anyone else interested in why severe outages can be
caused by serve-expired: "yes" and cache-min-ttl: 30:

I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.

I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolver, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.

Any ideas in understanding the mechanism would be very welcome.

We use 1.6.8 with both those settings, and observed prolonged SERVFAIL periods.

In our case, the upstream server became inaccessible for a period of time, but when contact resumed the SERVFAILs persisted.

This behaviour was quite catastrophic, and to me, unexpected.

And career affecting.

Ugh, that's terrible. I am sorry you're suffering such serious consequences.

Do you have any idea of the mechanism behind this failure?

Is there a way to deal better with zero TTL names?

We reduced the infra-host-ttl value to compensate.

(Sorry for my slow response -- this slipped through the cracks.)

Did that bring your system to a functioning condition?

Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we are only affected by this for (up to) 30 seconds after upstream access returns. That is adequate for our purposes.

So I think the mechanism is pretty clear, and I think it's good for unbound to cache the upstream server's status for a period of time. I'm just not convinced that 900 seconds is a reasonable default time.

(BTW, our case has nothing to do with zero TTL names: The IP address configured as the zone's forward-addr became inaccessible. No names involved. That said, I do not know how unbound deals with 0-TTL names.)

I do not think our case is a bug. It also has nothing to do with serve-expired or cache-min-ttl. But since we use those settings, I wanted to relate our experience with a confusing SERVFAIL situation.

How busy are your systems?

Moderately busy, but nowhere near your level. 1-2 thousand qps, max.

In our case, the upstream servers (indeed, the whole Internet) is on the other side of a slow (~600ms RTT) and occasionally unreliable link.

In your multi-level system, are you 100% sure that all the
forward-addr IPs are *always* accessible? If they are, then you may
be seeing SERVFAILs for a different reason.

Absolutely; they are all in our local network. And when I removed
those two configuration values, everything came back to normal
behaviour almost immediately. Perhaps a distinguishing factor is that
some of these systems handle on the order of 50,000 mixed queries
per second.

It appears to me that your problem is different from what we saw. Our setup is pretty modest. Maybe you're using some feature that we're not, and the problem is coming from there. Here's some more info about our setup:

  * Unbound version 1.6.8
  * IPv4 only
  * No DNSSEC
  * No TCP that I'm aware of (very little, if any)
  * Moderate query rate
  * Modest cache size (16MB RRSET).
  * Forward-everything (zone name is ".")
  * Only 1 or 2 forward-addrs.
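
In unbound.conf terms, that setup would look roughly like this (a sketch of the list above; 192.0.2.1 is a placeholder, not our real forwarder address):

```
server:
    # modest cache, per the list above
    rrset-cache-size: 16m

forward-zone:
    # forward everything
    name: "."
    forward-addr: 192.0.2.1
```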

The result was so unexpected and surprisingly severe that I categorise
our situation as the result of a very serious bug. Tomorrow there are
repercussions for me personally.

Certainly severe. However, we never saw any problem like this when we enabled serve-expired and cache-min-ttl.

Defining the bug is complicated by the fact that before this
happened, I had chosen to change jobs within the same company, so I no
longer have access to these systems to test the effects of those
configuration values. I don't know if it was one, the other, or a
combination of both that caused the problem. Perhaps no one but me
wants to find out.

I wish I could help you more. All I can say with any certainty is that your problem does not seem to be common. Not much of a comfort, I know.

I hope your repercussions aren't too severe!

    M.

Dear Ralph,

Sorry to hear Unbound has caused you problems. I'm trying to figure out
the reason for the observed SERVFAIL responses.

Thank you.

Was the serve-expired and cache-min-ttl configured on the Unbound
instance that has the forward configured, or the instance the queries
are forwarded to? Or both?

Both.

Any chance the SERVFAILs were only for DNSSEC-signed domains?

No, a particular name in our domain, which is not signed, often came
back with SERVFAIL after it expired from the cache.

Did you have a chance to see the reason for the SERVFAIL responses in
the Unbound log? Maybe the forwarder was returning expired DNSSEC
signatures?

There were many SERVFAIL responses for queries for DS records.

Hi Nick,

The report says that you were using 1.6.8, but 1.7.1 contains this
bugfix:
  - Fix #3736: Fix 0 TTL domains stuck on SERVFAIL unless manually
    flushed with serve-expired on.
The SERVFAILs were likely triggered by the brief outage, and that bug
then kept them going. So it could be something that is already fixed,
but if not, it would be good to get details, to reproduce and fix it.
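
The manual flush that changelog entry refers to can be done with unbound-control, assuming remote control is set up; example.com below is a placeholder name:

```
# drop the stuck name (all its record types) from the cache
unbound-control flush example.com

# inspect cached infrastructure state (RTT estimates, timeout
# backoff) for the upstream addresses
unbound-control dump_infra
```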

Best regards, Wouter

Dear Wouter,

Thank you for your helpful reply.

The report says that you were using 1.6.8, but 1.7.1 contains this
bugfix:
  - Fix #3736: Fix 0 TTL domains stuck on SERVFAIL unless manually
    flushed with serve-expired on.
The SERVFAILs were likely triggered by the brief outage, and that bug
then kept them going. So it could be something that is already fixed,
but if not, it would be good to get details, to reproduce and fix it.

Yes, it certainly would. However, I have left that job and no longer
have access to those machines.