Release 0.7.2

Hi,

Unbound closed-beta version 0.7.2 is available on http://unbound.net
or directly via this link
http://unbound.net/downloads/unbound-0.7.2.tar.gz
SHA1 checksum: 2fa26e8bb7b691c80f4146b00a0eef7e0352ae91

This version is for the closed-beta test.

The fixes are:
~ - building in a subdirectory works again
~ - does not complain about a lack of entropy when starting
~ instead it falls back to using no entropy for randomness.

Please inform me about your experience with unbound!

Best regards,
~ Wouter

Hello Wouter

Sorry for not getting back to you sooner. I just didn't have the time
to start testing in December. I have built unbound as a package for
Debian/Linux during the past days and will deploy it on some of our
anycast instances very soon as described in a previous message. I
still need to do a little work to integrate it into the management
system I use to monitor the BIND caches.

Hopefully, I can start operating unbound as a production service early
next week.

Just a quick update. I've installed unbound 0.7.2 on one of our
anycast instances. It is currently receiving around 50-70 queries per
second. DNSSEC validation is enabled and I use trust anchors for our
own zones as well as all of the signed zones from RIPE NCC and the se
ccTLD (useful feature that I can use the trusted-key file from BIND,
btw).

For the time being, I'll switch back to BIND when I'm not in the
office (and I'll be away Thursday and Friday, so no more testing until
next week).

Would you like me to collect statistics of some sort (queries, process
resource usage, whatever)?

Alexander Gall wrote:

For the time being, I'll switch back to BIND when I'm not in the
office (and I'll be away Thursday and Friday, so no more testing until
next week).

Smart. :slight_smile:

Would you like me to collect statistics of some sort (queries, process
resource usage, whatever)?

When it exits, it prints a couple lines of statistics. One is the number
of cache hits (number of replies, and how many from the cache). This
number may suggest whether you are giving unbound enough memory for
cache storage.

Otherwise, I'd like to know your operational experience.

Best regards,
~ Wouter

> Would you like me to collect statistics of some sort (queries, process
> resource usage, whatever)?
When it exits, it prints a couple lines of statistics. One is the number
of cache hits (number of replies, and how many from the cache). This
number may suggest whether you are giving unbound enough memory for
cache storage.

Here it is for the first run today (going for lunch now :slight_smile:

[1200478446] unbound[1381:0] info: start of service (unbound 0.7.2).
[1200482710] unbound[1381:0] error: tcp connect: No route to host

Oh, didn't notice this before.

[1200483456] unbound[1381:0] info: mesh has 4 states (4 with reply, 0 detached), 13 waiting replies
[1200483456] unbound[1381:0] info: sent 16820 replies, with average wait of 1.152015 sec
[1200483456] unbound[1381:0] info: histogram of reply wait times
[1200483456] unbound[1381:0] info: lower(secs) upper(secs) replycount
[1200483456] unbound[1381:0] info: 0.000008 0.000016 2
[1200483456] unbound[1381:0] info: 0.000016 0.000032 173
[1200483456] unbound[1381:0] info: 0.000032 0.000064 917
[1200483456] unbound[1381:0] info: 0.000064 0.000128 199
[1200483456] unbound[1381:0] info: 0.000128 0.000256 16
[1200483456] unbound[1381:0] info: 0.000256 0.000512 1
[1200483456] unbound[1381:0] info: 0.000512 0.001024 1603
[1200483456] unbound[1381:0] info: 0.001024 0.002048 3682
[1200483456] unbound[1381:0] info: 0.002048 0.004096 872
[1200483456] unbound[1381:0] info: 0.004096 0.008192 1438
[1200483456] unbound[1381:0] info: 0.008192 0.016384 893
[1200483456] unbound[1381:0] info: 0.016384 0.032768 1757
[1200483456] unbound[1381:0] info: 0.032768 0.065536 1068
[1200483456] unbound[1381:0] info: 0.065536 0.131072 1515
[1200483456] unbound[1381:0] info: 0.131072 0.262144 1439
[1200483456] unbound[1381:0] info: 0.262144 0.524288 653
[1200483456] unbound[1381:0] info: 0.524288 1.000000 187
[1200483456] unbound[1381:0] info: 1.000000 2.000000 115
[1200483456] unbound[1381:0] info: 2.000000 4.000000 43
[1200483456] unbound[1381:0] info: 4.000000 8.000000 33
[1200483456] unbound[1381:0] info: 8.000000 16.000000 37
[1200483456] unbound[1381:0] info: 16.000000 32.000000 25
[1200483456] unbound[1381:0] info: 32.000000 64.000000 22
[1200483456] unbound[1381:0] info: 64.000000 128.000000 68
[1200483456] unbound[1381:0] info: 128.000000 256.000000 45
[1200483456] unbound[1381:0] info: 256.000000 512.000000 15
[1200483456] unbound[1381:0] info: 512.000000 1024.000000 2
[1200483456] unbound[1381:0] info: server stats for thread 0: 262132 queries, 245299 from cache
[1200483456] unbound[1381:0] info: server stats for thread 0: requestlist max 71 avg 3.63779 exceeded 0

Otherwise, I'd like to know your operational experience.

Good so far :slight_smile: I'll keep you informed.

Alexander Gall wrote:

Here it is for the first run today (going for lunch now :slight_smile:

[1200478446] unbound[1381:0] info: start of service (unbound 0.7.2).
[1200482710] unbound[1381:0] error: tcp connect: No route to host

Oh, didn't notice this before.

Probably some domain with a nameserver address 10.0.0.0 or so, after
failing UDP couple of times, unbound even tries TCP, which fails with
this message.
If it gets annoying; I can try to bump the error up one verbosity level.

[1200483456] unbound[1381:0] info: mesh has 4 states (4 with reply, 0

detached), 13 waiting replies

[1200483456] unbound[1381:0] info: sent 16820 replies, with average

wait of 1.152015 sec

[1200483456] unbound[1381:0] info: histogram of reply wait times

histogram looks fine (just like I have).
The average above is a bit misleading due to the outliers, the trunk
version prints the median as well.

[1200483456] unbound[1381:0] info: server stats for thread 0: 262132

queries, 245299 from cache

[1200483456] unbound[1381:0] info: server stats for thread 0:

requestlist max 71 avg 3.63779 exceeded 0

93% is fine. requestlist did not exceed (there is config to up it), good

Good so far :slight_smile: I'll keep you informed.

Thanks!
~ Wouter

Wouter,

Here's an update on my testing of 0.7.2 (I had the flu last week, back
to work now :slight_smile:

Total run time so far is about 62 hours with a bit over 10 million
answered queries. I was confident enough to let unbound run for
almost two days just now, which should have been enough to test most
of the TTL and DNSSEC validation expiration logic I guess. BTW, here
is the configuration I use:

server:
  interface: 127.0.0.1
  interface: ::1
  interface: 130.59.31.248
  interface: 2001:620:0:ff::3
  interface: 130.59.35.34
  interface: 2001:620:0:109:208:2ff:fea0:cffd
  chroot: ""
  directory: "/etc/unbound"
  logfile: "/var/log/unbound/unbound"
  pidfile: "/var/run/unbound/unbound.pid"
  trusted-keys-file: "/etc/bind/trusted-keys"

Everything else is left at the default values. Please let me know if
you want me to test anything non-default in particular.

I did not notice any problems at all and didn't get any reports from
our users neitherl. I count this as a success :slight_smile:

unbound also correctly dealt with the situation that it had a trust
anchor defined for example.net (from the set of trust anchors
distributed by RIPE NCC
<https://www.ripe.net/projects/disi/keys/ripe-ncc-dnssec-keys-new.txt&gt;\),
but the corresponding DNSKEY is missing from the zone:

[1201660228] unbound[16620:0] info: failed to prime trust anchor -- could not fetch DNSKEY rrset <example.net. DNSKEY IN>

Operational experience: I was able to integrate unbound into our
anycast caching system without problems. This allows me to run BIND
and unbound in parallel on different anycast instances just as I had
planned to do. All of this is looking very good.

Hi Alexander,

Alexander Gall wrote:

Wouter,

Here's an update on my testing of 0.7.2 (I had the flu last week, back
to work now :slight_smile:

Thank you very much for the test. I'll examine your logs below.

Total run time so far is about 62 hours with a bit over 10 million
answered queries. I was confident enough to let unbound run for
almost two days just now, which should have been enough to test most
of the TTL and DNSSEC validation expiration logic I guess.

Good to know that validation was tested too; and works :slight_smile:

Everything else is left at the default values. Please let me know if
you want me to test anything non-default in particular.

I did not notice any problems at all and didn't get any reports from
our users neitherl. I count this as a success :slight_smile:

Great!
10 million real user queries is not something that I can try myself.

unbound also correctly dealt with the situation that it had a trust
anchor defined for example.net (from the set of trust anchors
distributed by RIPE NCC
<https://www.ripe.net/projects/disi/keys/ripe-ncc-dnssec-keys-new.txt&gt;\),
but the corresponding DNSKEY is missing from the zone:

[1201660228] unbound[16620:0] info: failed to prime trust anchor --

could not fetch DNSKEY rrset <example.net. DNSKEY IN>

Ah, yes, in such a situation you can wait for example.net to fix their
zone and after 900 (default bogus ttl) unbound will pick that up, or
instead change your config and kill -HUP unbound (this also clears the
cache).

Operational experience: I was able to integrate unbound into our
anycast caching system without problems. This allows me to run BIND
and unbound in parallel on different anycast instances just as I had
planned to do. All of this is looking very good.

Oh this is really nice. Would be interesting to know of any noticable
differences between bind and unbound. Apart from version.bind CH TXT, of
course.

Your log file. Thank you for sharing the statistics.
* There are many TCP connect errors. I assume now, that someone
configured a zone as example.com NS 10.0.0.100, with a nameserver on
their local subnet. Unbound cannot contact that nameserver, and tries
(finally) to use TCP on it; which gives this error.

I think the log file should not be cluttered with zone administration
mistakes by others. I can demote this particular error to a higher
verbosity level (2), or I can print the address that failed and then you
(the operator) or a script can pickup those and block them
(do-not-query-address: 10.0.0.0/8 in the config file).

I think I'll demote the error message, as it does not bother the
resolver operator. What do you think? Would you like to have the
addresses printed to the log anyway?

* You have 93% cache hits. With the default 4+4 Mb cache (4 mb for rrset
data, 4 mb for message data), so unbound caps memory usage at about
20-30 Mb total for the process. For 10 million queries. This is
impressive. You could try to increase cache size to improve cache hits;
but it doesn't seem worth the effort.

* The requestlist (this is the to-do list of pending recursive queries)
stays nice and small as well. If the computer were unable to bear the
load, this number would shoot up as requests come in faster than they
could be handled.

* For the histogram, onlookers please note the average reply time
printed a) does not include the cache responses (which are better
measured in qps then seconds per query) b) is skewed because of really
large upper numbers caused by unbound retrying very hard for a couple of
records (remote server down). In a newer unbound the median is printed
as well, a nicer way to average recursion speed.

* There is a significant bump on the lower end of your histogram, at 32
microsec. I assume this is because a lot of recursion requests are due
to a CNAME. Like where a CNAME is used to load-balance with DNS.
Consequently, I need to pay attention to CNAME-processing when I do
optimization, good to know.

Thank you,

Best regards,
~ Wouter

> unbound also correctly dealt with the situation that it had a trust
> anchor defined for example.net (from the set of trust anchors
> distributed by RIPE NCC
> <https://www.ripe.net/projects/disi/keys/ripe-ncc-dnssec-keys-new.txt&gt;\),
> but the corresponding DNSKEY is missing from the zone:
>
> [1201660228] unbound[16620:0] info: failed to prime trust anchor --
could not fetch DNSKEY rrset <example.net. DNSKEY IN>

Ah, yes, in such a situation you can wait for example.net to fix their
zone and after 900 (default bogus ttl) unbound will pick that up, or
instead change your config and kill -HUP unbound (this also clears the
cache).

> Operational experience: I was able to integrate unbound into our
> anycast caching system without problems. This allows me to run BIND
> and unbound in parallel on different anycast instances just as I had
> planned to do. All of this is looking very good.
>

Oh this is really nice. Would be interesting to know of any noticable
differences between bind and unbound. Apart from version.bind CH TXT, of
course.

One thing I've noticed is a slight difference in the way BIND and
unbound deal with dead servers at delegation points. There was one
domain for which all name servers were unreachable. Both caches
eventually marked these servers as bad and returned "SERFVAIL" quickly
(for as long as they cache this type of information), but I had the
impression that BIND was timing out on the queries for a longer period
of time than unbound before marking the servers as bad.

Your log file. Thank you for sharing the statistics.
* There are many TCP connect errors. I assume now, that someone
configured a zone as example.com NS 10.0.0.100, with a nameserver on
their local subnet. Unbound cannot contact that nameserver, and tries
(finally) to use TCP on it; which gives this error.

Possibly.

I think the log file should not be cluttered with zone administration
mistakes by others. I can demote this particular error to a higher
verbosity level (2), or I can print the address that failed and then you
(the operator) or a script can pickup those and block them
(do-not-query-address: 10.0.0.0/8 in the config file).

I think I'll demote the error message, as it does not bother the
resolver operator. What do you think? Would you like to have the
addresses printed to the log anyway?

Yes, that would be useful and yes, you can move the message itself to
a higher verbosity level. Speaking of logging: I think I prefer the
modular approach to logging used by BIND. If you only have a
verbosity level to tune, it can be hard to get to certain information
without having to collect vast amounts of unwanted information as
well. Verbosity 2 is already so verbose that I don't want to use it
on a regular basis.

What about classifying all messages and let the user set the verbosity
for each class (which is essentially what BIND does)? I don't mind
much if there is only a single log file, as long as the messages are
tagged with the class so I can pick out what I need with simple tools.

In general, what I would find exetremely useful (and what I've been
missing so much with BIND) is a catalog of log messages, possibly
including additional information what events can cause a message to be
generated. If you're going to do this, please add unique tags for
every message, so you can change the message text in later releases
but still identify the message itself.

I have this Perl script that analyzes my BIND logs and I have to
change it with every release of BIND because they often change the
message texts.

* You have 93% cache hits. With the default 4+4 Mb cache (4 mb for rrset
data, 4 mb for message data), so unbound caps memory usage at about
20-30 Mb total for the process. For 10 million queries. This is
impressive. You could try to increase cache size to improve cache hits;
but it doesn't seem worth the effort.

I think a large percentage of the queries is for names in switch.ch,
which can explain the good hit ratio.

* The requestlist (this is the to-do list of pending recursive queries)
stays nice and small as well. If the computer were unable to bear the
load, this number would shoot up as requests come in faster than they
could be handled.

* For the histogram, onlookers please note the average reply time
printed a) does not include the cache responses (which are better
measured in qps then seconds per query) b) is skewed because of really
large upper numbers caused by unbound retrying very hard for a couple of
records (remote server down). In a newer unbound the median is printed
as well, a nicer way to average recursion speed.

Ah, ok.

* There is a significant bump on the lower end of your histogram, at 32
microsec. I assume this is because a lot of recursion requests are due
to a CNAME. Like where a CNAME is used to load-balance with DNS.
Consequently, I need to pay attention to CNAME-processing when I do
optimization, good to know.

Cool.

Hi Alexander,

Alexander Gall wrote:

One thing I've noticed is a slight difference in the way BIND and
unbound deal with dead servers at delegation points. There was one
domain for which all name servers were unreachable. Both caches
eventually marked these servers as bad and returned "SERFVAIL" quickly
(for as long as they cache this type of information), but I had the
impression that BIND was timing out on the queries for a longer period
of time than unbound before marking the servers as bad.

Unbound gives up *this* query quickly. The server is not marked bad yet,
but further attempts are only done on subsequent queries. The first try
fails quickly, but later tries get more time. 2 minutes max timeout,
then the server is marked bad.

I think I'll demote the error message, as it does not bother the
resolver operator. What do you think? Would you like to have the
addresses printed to the log anyway?

Yes, that would be useful and yes, you can move the message itself to
a higher verbosity level. Speaking of logging: I think I prefer the
modular approach to logging used by BIND. If you only have a
verbosity level to tune, it can be hard to get to certain information
without having to collect vast amounts of unwanted information as
well. Verbosity 2 is already so verbose that I don't want to use it
on a regular basis.

OK have bumped it to verbosity 2. Have also added remote address info to
the error messages. (in svn trunk).

What about classifying all messages and let the user set the verbosity
for each class (which is essentially what BIND does)? I don't mind
much if there is only a single log file, as long as the messages are
tagged with the class so I can pick out what I need with simple tools.

In general, what I would find exetremely useful (and what I've been
missing so much with BIND) is a catalog of log messages, possibly
including additional information what events can cause a message to be
generated. If you're going to do this, please add unique tags for
every message, so you can change the message text in later releases
but still identify the message itself.

I have this Perl script that analyzes my BIND logs and I have to
change it with every release of BIND because they often change the
message texts.

Nice idea. Some type names and a verbosity value for each.
I'll think about it. It would mean a bit of a revision of the logging
system.

The trouble is keeping such a catalog up to date. Maybe
source-code-annotation can help here :slight_smile:

Best regards,
~ Wouter

Yes, that would sound like a good way to do it. I think that would be
a cool thing to have. It would also keep down the volume on this list
once unbound is taking off big time, because people would actually
understand what the software is telling them :slight_smile: