Reloading NSD zone configuration

Hi,

I've been looking into how I can let NSD pick up changes in its config
file (zone additions, changes and removals).

Unless I am mistaken somewhere, NSD will only pick this up after a full
restart, which can take up to several seconds on my system with, so far,
a very small number of zones (<10).

I would like to change the zone configuration of NSD up to every minute,
so I'm a bit worried about the outage window of an NSD restart
(potentially) every minute, especially if the number of zones grows
much larger.

As far as I've been able to measure, my actual window of unresponsiveness
is quite short, but I'm having doubts about my measurement method.

My questions:
(a) Is there a better way to change zone configuration
    while NSD is running?
(b) Am I right in assuming that, if I have a large number of zones,
    the restarts will start taking proportionally longer?
(c) Is there anyone that has tried to do something similar before,
    and found a way to avoid this problem?

cheers,
Erik Romijn

I've been looking into how I can let NSD pick up changes in its config
file (zone additions, changes and removals).

Unless I am mistaken somewhere, NSD will only pick this up after a full
restart, which can take up to several seconds on my system with, so far,
a very small number of zones (<10).

It picks up zone changes when you tell it to reload. A reload just will
not pick up removed or added zones.

I would like to change the zone configuration of NSD up to every minute,

What is the use case for this? :-)

My questions:
(a) Is there a better way to change zone configuration
   while NSD is running?

Not that I know.

(b) Am I right in assuming that, if I have a large number of zones,
   the restarts will start taking proportionally longer?

Yes. But you might speed things up by running the zone compiler manually
yourself, so that your waiting time does not include zone compilation.
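Paul's suggestion could be sketched like this; the `zonec`/`nsdc`
invocations, flags, and paths are assumptions (NSD 3-era tooling), so
adjust them for your system:

```shell
# Pre-compile the zone database first, so the restart itself only has to
# load an already-built nsd.db instead of waiting on zone compilation.
# The -c/-f flags and paths are assumptions - check zonec(8) locally.
precompile_and_restart() {
    zonec -c /etc/nsd/nsd.conf -f /var/db/nsd/nsd.db || return 1
    nsdc restart
}
```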

(c) Is there anyone that has tried to do something similar before,
   and found a way to avoid this problem?

I'd be interested in seeing why you need to add/remove zones every minute...

Paul

Paul Wouters wrote:

Unless I am mistaken somewhere, NSD will only pick this up after a full
restart, which can take up to several seconds on my system with, so far,
a very small number of zones (<10).

It picks up zone changes when you tell it to reload. A reload just will not
pick up removed or added zones.

With 'changes' I'm referring to changes of master IPs, which, according
to my testing, are also only picked up after a full restart.

I would like to change the zone configuration of NSD up to every minute,

What is the use case for this? :-)

Hosting zones for users, and promising them very quick set-up and master
change times because they are very impatient :-)

But, if what I want can't realistically be done, 5 minutes would
probably be acceptable.

cheers,
Erik

I'm basically building a smaller version of http://www.everydns.com/ and
don't want to drop any queries every time a new domain gets added or removed
by a customer. How do I do this with NSD?

EveryDNS currently uses tinydns to serve its data, which bakes all of the
information it needs to serve into its CDB.

Somewhat similarly, the NSD zone compiler already needs to parse nsd.conf,
and already puts some information from there into the nsd.db. Would it be
difficult to put everything necessary to serve a zone into the nsd.db and
re-read that on a reload?

Alternatively, all of my zones have the exact same config except that the
"name:" and "zonefile:" change. I'd be happy if I could just tell NSD a set
of options to use for all zones, and to have it serve everything it finds in
the nsd.db using those options rather than bothering to re-read the config.
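To illustrate the repetition (the names and paths here are invented):
every zone block is identical except for `name:` and `zonefile:`, so a
single shared template plus a zone list would carry the same information:

```
zone:
	name: "example.com"
	zonefile: "/etc/nsd/zones/example.com.zone"
	# ...identical options for every zone...

zone:
	name: "example.net"
	zonefile: "/etc/nsd/zones/example.net.zone"
	# ...identical options again...
```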

                                     -- Aaron

Apparently your users think the DNS is more dynamic than it really is!
(and indeed more than it is meant to be!)

5 minute updates are more than adequate given the realistic minimum TTL
for any record is 5m anyway -- i.e. use that as your justification if
necessary.

Realistically users shouldn't expect DNS updates even that often -- I'd
suggest advertising a 30 minute minimum setup time cycle just for your
own sanity, never mind that of the software and protocols.

Indeed there are lots of instances of application-level caching which
can last as long as 30 minutes irrespective of the TTL in the RR
delivered to the application (e.g. M$-IE).

Assuming you have truly geographically diverse authoritative nameservers
for all these hosted zones (as you should), then you're also going to
have to moderate your SLA for your DNS hosting based on whatever you can
expect for reach-ability of those remote nameservers too.

Manage your user expectations up front and they hopefully won't blow up
at you when things don't go exactly as planned.

Hi,

We are also concerned about the length of the service break when adding new zones to the configuration file. When one or more new slave zones are added to the configuration file and NSD is restarted, how long should it take until NSD continues to serve the previously added zones that already exist in the database? That is, is NSD able to continue serving the
previously added zones even if the zone transfers for the new zones have not been completed?

In our case, the total number of zones is a couple of thousand, which is why we'd like to avoid unnecessary restarts and service breaks of NSD. Being able to reload the NSD configuration file without restarting NSD would be a very nice feature for us, too. I know this topic has been discussed quite a lot...

Antti

Antti Ristimäki wrote:

Hi,

We are also concerned about the length of the service break when adding new zones to the configuration file. When one or more new slave zones are added to the configuration file and NSD is restarted, how long should it take until NSD continues to serve the previously added zones that already exist in the database? That is, is NSD able to continue serving the
previously added zones even if the zone transfers for the new zones have not been completed?

If you restart NSD, with some new slave zones added, it will serve existing zones as soon as it is up (i.e. within seconds on most systems, see below for my private setup and some very anecdotal timing benchmarks). It will also start to transfer the new slave zones, but while it is doing that it already serves existing ones.

In our case, the total number of zones is a couple of thousand, which is why we'd like to avoid unnecessary restarts and service breaks of NSD. Being able to reload the NSD configuration file without restarting NSD would be a very nice feature for us, too. I know this topic has been discussed quite a lot...

I have a small server at home which is running 18677 zones on a 1 GHz 256 MB RAM machine. Compiling those into an nsd.db file takes a while (several minutes), but restarting nsd itself takes about 7 to 10 seconds. Throwing queries against it from the other side, I would estimate that 1 or 2 seconds of those are spent waiting for the previous process to stop, at which point it is still serving.

We are looking into it (if only because the question comes up about once a week now), but my personal guess is that a full reload would not gain that much if not implemented really really well, so we want to take a good look at what could actually be gained and how before doing anything.

Jelte

Moin!

We are looking into it (if only because the question comes up about once a week now), but my personal guess is that a full reload would not gain that much if not implemented really really well, so we want to take a good look at what could actually be gained and how before doing anything.

I think the main requirement here is to add and remove zones on the fly without restarting, something that is possible with other servers. I think for people who are doing a lot of end-customer DNS hosting and usually have thousands or millions of small zones, this is a must-have before they can even consider NSD. If you have, say, 3600 zone additions per day, that is one hour of downtime even if each restart/reload takes just one second.

So long
-Ralf

If you restart NSD, with some new slave zones added, it will serve existing
zones as soon as it is up (i.e. within seconds on most systems, see below for my
private setup and some very anecdotal timing benchmarks). It will also start to
transfer the new slave zones, but while it is doing that it already serves
existing ones.

Thank you for this very valuable information. Restart times of this magnitude would be acceptable for us, given that the frequency of zone additions is rather low in our environment.

Throwing queries against it
from the other side, I would estimate that 1 or 2 seconds of those are spent
waiting for the previous process to stop, at which point it is still serving.

Regarding the process-stopping phase, what would be expected to happen in case one or more zone transfers are pending when SIGTERM is sent to the previous process?

BR,

Antti

Hi guys,

I've been interested in this issue for a while now, and I hope NSD gets this feature soon. But for the time being, I propose a workaround.

I'm not a big fan of zone transfers; I've hated them since the day I set up my first DNS server. Currently I use a patched version of VegaDNS with a backend Perl script to manage my zones. The Perl script generates the configuration and zone files, then copies them to all my servers.

As for adding/removing a zone, at the end of the Perl script:
1) Shut down server A
2) Wait 5 seconds
3) Start server A
4) Wait 5 seconds
5) Shut down server B
...

With this, only one of your servers is restarting at any given moment.
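The steps above can be sketched as a small shell loop; the hostnames, ssh
access, and init-script path are all placeholders for whatever your own
management mechanism looks like:

```shell
#!/bin/sh
# Restart each nameserver in turn, pausing between steps so that at most
# one server is down at any given moment.
rolling_restart() {
    for host in "$@"; do
        ssh "$host" '/etc/init.d/nsd stop'
        sleep 5
        ssh "$host" '/etc/init.d/nsd start'
        sleep 5    # let it settle before touching the next server
    done
}

# Example: rolling_restart ns1.example.net ns2.example.net
```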

When I wrote the script, just restarting NSD caused it to generate an error; if I remember correctly, it couldn't bind to port 53. This happened only the first time NSD was restarted after a server reboot, which was weird. Since the script works properly as it is, I haven't bothered checking it again.

Hope that helps.

Regards,
Mohammad H. Al-Shami

Howdy,

There is/was some movement in the IETF toward a management protocol
that would allow adding new zones from the master to the primary
without operator help, and I would oppose adding this feature to NSD
before some such protocol is standardized.

But even so, there are several options you could use to achieve what you need:

Option A)

Add a second server and deploy H-A (VRRP/UCARP/Heartbeat/whatever) to
switch the primary IP address between servers in the cluster. Do a
reload on the passive server, then switch nodes.

Option B)

Same, but with only one server. Make one NSD listen on port 1053 and a
second NSD listen on port 2053, both serving data from the same
database. Use your firewall rules to redirect port 53 udp+tcp to the
active server. When adding a new zone, apply the same logic as in
Option A: reload the inactive daemon, switch to it using iptables, then
reload the second daemon.
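For option B, the firewall switch might look like the following sketch;
the chain name `NSD_ACTIVE` and the exact rule layout are my own
invention, so verify the flags against your iptables version:

```shell
# Point incoming port-53 traffic at whichever NSD instance is active
# (the one listening on 1053 or 2053). Assumes a dedicated nat chain
# named NSD_ACTIVE already exists and is hooked into PREROUTING.
switch_to() {
    port="$1"
    iptables -t nat -F NSD_ACTIVE
    iptables -t nat -A NSD_ACTIVE -p udp --dport 53 -j REDIRECT --to-ports "$port"
    iptables -t nat -A NSD_ACTIVE -p tcp --dport 53 -j REDIRECT --to-ports "$port"
}

# Reload the instance on 2053 while 1053 serves, then: switch_to 2053
```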

Option C)

Just add one more DNS server and reload one at a time. DNS resolvers
can cope with that just fine (even with just two, but three are better
in case one goes down for some other reason).

Option D)

Maybe you don't have the right tool for what you are doing. There is no
point in forcing the use of NSD if another tool/daemon would serve you
better.

Ondrej

Greg A. Woods wrote:

Apparently your users think the DNS is more dynamic than it really is!
(and indeed more than it is meant to be!)

5 minute updates are more than adequate given the realistic minimum TTL
for any record is 5m anyway -- i.e. use that as your justification if
necessary.
  
You have some good points, but the picture looks different from other
perspectives, and I believe some customers are not seeing it from your
angle.

While TTL plays a role in some situations, I don't think this is one of
them. The Internet motto is "do it now". A reasonable question could be
"why does the user have to wait?" (today's answer is: because the computer
wants you to).

What we're seeing is that people are working on a task to
register/redelegate a number of domain names and don't need the work
flow disturbed unnecessarily.

Some TLDs require the DNS to be set up correctly before accepting
requests for new domains or for redelegation of existing ones. If the
user doesn't have the domain name ready on the new servers when the
registry receives the request, they will reject it. So the user is often
only waiting for their DNS provider to create a ~10 RR-strong zone on a
few servers.

A name server handling hundreds or thousands of requests per second
surely can create 10 RRs and a zone-cut "on the fly"?

I know I make it sound very simple, and it isn't -- but it should be :-)

If NSD can help deliver faster updates, its market share will
increase.

Robert Martin-Legène writes:

I know I make it sound very simple, and it isn't -- but it should be :-)

NSD currently ties this to updating and removing zones.

IMO, removing zones is not at all time-critical. It's okay if zones stay active for a few hours extra. Updating and adding zones can be much more pressing.

Maybe ignoring zone removal makes the problem simpler... e.g. nsd could load a supplementary database. If a server serves, say, 10k zones, loading a hundred supplements into RAM won't be much of a problem (and it saves recompiling 10k zone files because one changed). Eventually nsd has to restart with a freshly compiled database, of course.

Arnt

Hi,

I think the main requirement here is to add and remove zones on the fly without restarting, something that is possible with other servers. I think for people who are doing a lot of end-customer DNS hosting and usually have thousands or millions of small zones, this is a must-have

Second that.

Cheers,
Wolfgang

That would be really interesting in view of the ICANN new-gTLD proposal
(http://afp.google.com/article/ALeqM5jienXKDbIYHNPcywgq84IqyHtbPw),
which would allow more global TLDs rather than limiting to .com, .cc, etc.

Cheers
Srini

Hi all,

There has been a good discussion regarding NSD zone additions and restarts. However, I still have a little question regarding the NSD restarting process. In our environment, the number of (slave) zones will be about two thousand, so it is very likely that one or more zone transfers are pending when the NSD process is restarted.

Does anyone know how NSD will handle these kinds of situations? Is it able to terminate the pending transfers in a controlled way or is this a possible point of failure?

Regards,
Antti Ristimäki

While TTL plays a role in some situations, I don't think this is one of
them. The Internet motto is "do it now". A reasonable question could be
"why does the user have to wait?" (today's answer is: because the computer
wants you to).

Well, for sure the TTL of DNS RRs must, by definition, play a part in
every change to, or deletion of, any existing RR in the DNS. That's
just a fact of the protocol and most modern DNS cache implementations.

This includes changes to NS records -- i.e. including re-delegations.

While per-record TTLs don't affect brand-new, un-delegated zones,
zone transfer delays (or their equivalent) will affect the time it takes
for all authoritative servers to be ready to serve a new zone. You
cannot give a customer a green light to rely upon their new zone until:
(a) it is loaded in all of the authoritative servers it has been
delegated to; and (b) all the parent zone nameservers are successfully
handing out all of the new delegating NS records for the domain. As I
believe I said before, if you're following the full spirit of RFC 2182
then as part of that you must allow for the potential that the initial
primary server may not always be able to reach all secondary servers in
near real time, and this goes for both the parent's servers, as well as
the delegated servers.

May I humbly suggest that if you are making promises you cannot possibly
keep because of protocol design and real-world limitations then perhaps
you should more urgently be attempting to change your customer's
expectations such that they will be more in line with the reality of the
DNS protocol design and the other real-world limitations of the global
Internet. Education and awareness is always a good thing, even for
customers! :-)

What we're seeing is that people are working on a task to
register/redelegate a number of domain names and don't need the work
flow disturbed unnecessarily.

Some TLDs require the DNS to be set up correctly before accepting
requests for new domains or for redelegation of existing ones. If the
user doesn't have the domain name ready on the new servers when the
registry receives the request, they will reject it. So the user is often
only waiting for their DNS provider to created a ~10 RR-strong zone on a
few servers.

I don't really understand the problem. I.e. I understand what you're
saying, but I don't agree that it's a problem that needs fixing in
anything remotely resembling making NSD do dynamic configuration changes
while still answering queries.

If you're doing both jobs, i.e. hosting DNS and registering domains,
then it should be trivial to schedule rolling batch jobs that will do
the right things, in the right order, for all pending registrations.
S.M.O.P.

If you're only doing one of the jobs then it's either not a problem in
the first place (if you're just hosting the DNS -- assuming you set the
user's expectations appropriately), or it's again just a SMOP to ensure
things are done in the right order (if you're the registrar waiting for
all, or at least one reachable, delegated servers to answer
authoritatively before you delegate the new domain in the parent zone).

If you give your users tools providing reports and feedback then they
can manage their work flow in step with the realities of the protocol
requirements and other real-world limitations which may affect the setup
and operation of their domains.

A name server handling hundreds or thousands of requests per second
surely can create 10 RRs and a zone-cut "on the fly"?

As I said before, if you are following the full spirit of RFC 2182 then
it really _really_ will not matter if even minutes worth of requests get
dropped on the floor by one authoritative server (or indeed each in
rolling succession if you happen to control them all), even if that
server handles millions of domains. The DNS was designed with this very
certain eventuality in mind. Indeed it is good to reduce the outage
time as much as is possible, but it is also good to ensure there are
more than two geographically and topologically separated authoritative
nameservers for each and every domain such that at least one will be
reachable and running from every possible user's perspective. Really
large DNS providers can probably afford through economies of scale to
also use many other networking tricks to make things even more reliable
and redundant in the face of very high volumes of domains and queries,
but none of this requires changing NSD in any way for their use. Indeed
changing NSD wouldn't solve any of the real problems here anyway.
Making an authoritative DNS server capable of dynamic changes to its
configuration is really just papering over the wrong issues.

You are confusing the worst-case with the average case here.

Of course some things will fail some of the time, but if 99.9% of the time I
have no reachability problems between my servers, why can't I promise to get
their new zones set up and ready to go within a few seconds 99.9% of the
time?

And of course the DNS infrastructure is designed to handle outages of
individual servers and I'm willing to accept the 3 second stall on the
lookup if one of my servers or datacenters is actually down. However, that
is a significantly rarer occurrence than how frequently customers want to
add zones.

Why should one customer adding a zone cause lookups on all other zones for
all other customers to stall every time? This is just unnecessarily
worsening the quality of service. I could also add some code to NSD to
randomly discard 1% of queries, which DNS should handle somewhat gracefully
too, but I wouldn't be satisfied with that result either.

                                     -- Aaron

Well, personally from an SLA point of view I firmly believe user
expectations should be set for the _worst_ case if you wish to avoid
issues. It also offers you the chance to impress users when you can
consistently exceed their expectations in the normal average case.

If you tell your new users that it will take on the order of 30
minutes (plus the TTL of any RRs if the zone was previously served by
any other NS, plus the time for the parent zone servers to consistently
(re)delegate the domain) under ideal circumstances before everything is
in sync and their new domain is fully functional, and you tell them that
you will send them an e-mail notification(s) when all of this has
happened, e.g. showing confirmation that all NS hosts are actively
serving the new zone, and then again when all NS hosts are properly and
consistently delegated in all the NS hosts for the parent zone, then
they will have proper and appropriate expectations. I.e. tell them what
to expect, and give job completion confirmation automatically. That's
what expectation management is all about. Without knowing what to
expect then users will make up their own ideas, and they will ignore
critical issues such as TTLs for any existing NS records, parent zone
update cycles and timing, etc., etc., etc. If you fail to set
reasonable user expectations then you will end up in the uncomfortable
place of being unable to (easily) meet their unreasonable demands.

I believe one of the most important things to do in implementing your
infrastructure is to ensure that configuration changes are batched.
Don't try to trigger reloads on customer demand -- do regular rolling
reloads every 20 minutes or so (thus the suggested 30-minute SLA). This
design was best when using BIND, and it is still best for NSD of course.

I think the primary concept of using periodic batched config updates on
a rolling schedule to each auth server should answer all the practical
problems for most small DNS providers. Of course those hosting very
large numbers of domains should also consider some of the other more
expensive suggestions given in this thread, such as using load balancing
to a cluster, or a hot standby with CARP, etc. -- not to improve new
customer setup time, but rather to improve reliability and availability.
Also there may be some pending optimizations on this front as per the
TODO file, especially if someone steps up to code them.

Regular rolling reloads every 20 minutes with an optimistic 15 seconds of
downtime (with a ton of zones) per reload yield a best-case per-nameserver
SLA of 98.75%. Sure there's redundancy, but 1.25% of all lookups
still result in a 3-second stall.
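For reference, the arithmetic behind that figure, so you can plug in your
own downtime and cycle numbers:

```shell
# 15 seconds of downtime out of every 20-minute (1200 s) reload cycle:
awk 'BEGIN { down = 15; cycle = 20 * 60
             printf "%.2f%% best-case availability\n", 100 * (1 - down / cycle) }'
# prints "98.75% best-case availability"
```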

I won't choose to drop any packets in normal operation. If your users are
happy with you doing so, great. Mine aren't.

                                     -- Aaron