Edns client subnets

Larry_Havemann · April 19, 2014, 12:07am

Hello,

I was wondering if there was a timeline for completing this addition to unbound. Looking at the svn branch for edns client subnets, it looks like the last commit was about 6 months ago (2013/11/19).

Thanks,

-Larry

Yuri_Schaeffer · May 1, 2014, 8:52pm

Hi Larry,

I was wondering if there was a timeline for completing this
addition to unbound. Looking at the svn branch for edns client
subnets it looks like the last commit was about 6 months
ago (2013/11/19).

There have been no commits to this branch since then because the
feature is complete. We've been in a catch-22: To our knowledge nobody
actually tried to use it so we are hesitant to call it production
code, but everyone interested seems to wait until we call it
production code.

To get out of this situation we've decided to include it as a patch in
contrib/ of the regular release. We do however need to do some work to
get it there (think continuous integration tests). I don't have a clear
timeline for it as it is low priority, but I intend to allocate some
time for it each week.

Regards,
Yuri

Larry_Havemann · May 2, 2014, 8:00pm

Hi Yuri,

I’ve done a bit of testing with this and found a few issues.

The returned record does not update based on geoip when using different subnets. This happens only when the first request for a given name does not have a client subnet passed with it:

root@dnsr001:~/src/edns-subnet# /EdgeCast/ecdns/bin/dig_iana +ttl @localhost gp1.wpc.edgecastcdn.net

; <<>> DiG 9.9.3-P1 <<>> +ttl @localhost gp1.wpc.edgecastcdn.net
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43765
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 72.21.81.253

;; Query time: 7 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri May 02 19:48:02 UTC 2014
;; MSG SIZE rcvd: 68

root@dnsr001:~/src/edns-subnet# /EdgeCast/ecdns/bin/dig_iana +ttl @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24

; <<>> DiG 9.9.3-P1 <<>> +ttl @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21321
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 110.232.0.0/24/0
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3591 IN A 72.21.81.253

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri May 02 19:48:11 UTC 2014
;; MSG SIZE rcvd: 79

root@dnsr001:~/src/edns-subnet# unbound-control flush gp1.wpc.edgecastcdn.net
ok

root@dnsr001:~/src/edns-subnet# /EdgeCast/ecdns/bin/dig_iana +ttl @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24

; <<>> DiG 9.9.3-P1 <<>> +ttl @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36195
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 110.232.0.0/24/19
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 117.18.232.133

;; Query time: 3 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri May 02 19:48:56 UTC 2014
;; MSG SIZE rcvd: 79
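
For anyone comparing the transcripts above: as far as I understand dig's output, the CLIENT-SUBNET line reads address / source prefix length / scope prefix length, and a scope of 0 means the answer was not tailored to the subnet. Below is a minimal Python sketch that splits such a line into its parts. The names `ClientSubnet` and `parse_client_subnet` are made up for illustration and are not part of dig or Unbound.

```python
import ipaddress
from typing import NamedTuple, Union


class ClientSubnet(NamedTuple):
    """Parsed form of dig's '; CLIENT-SUBNET: addr/source/scope' line."""
    network: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    scope: int  # scope prefix length reported in the response


def parse_client_subnet(line: str) -> ClientSubnet:
    # Everything after the label is 'address/source/scope'.
    _, _, value = line.partition("CLIENT-SUBNET:")
    address, source, scope = value.strip().split("/")
    # strict=False tolerates host bits set beyond the source prefix.
    network = ipaddress.ip_network(f"{address}/{source}", strict=False)
    return ClientSubnet(network, int(scope))


def is_tailored(subnet: ClientSubnet) -> bool:
    # Assumption: scope 0 is read as "answer valid for every client".
    return subnet.scope > 0


before_flush = parse_client_subnet("; CLIENT-SUBNET: 110.232.0.0/24/0")
after_flush = parse_client_subnet("; CLIENT-SUBNET: 110.232.0.0/24/19")
print(is_tailored(before_flush), is_tailored(after_flush))
```

Read this way, the /24/0 reply before the flush is the generic answer and the /24/19 reply after the flush is the geo-specific one.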

The TTL returned when edns-subnet is passed does not change over time:

At one point I had a working patch to fix this issue, but I am unable to find the whole patch at this time. I do have a small patch that sets the correct TTL in the reply passed from edns-subnet/subnetmod.c to util/data/msgreply.c, but I’m missing the msgreply.c piece that correctly sets the response (see the attached patch for the first part). I believe this is happening because the cache tree for client subnets is different from the standard cache tree.

root@dnsr001:~/src/edns-subnet# date; /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
Fri May 2 16:23:20 UTC 2014

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33335
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 110.232.0.0/24/19
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 117.18.232.133

;; Query time: 3 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri May 02 16:23:20 UTC 2014
;; MSG SIZE rcvd: 79

root@dnsr001:~/src/edns-subnet# date; /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
Fri May 2 16:29:49 UTC 2014

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17943
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 110.232.0.0/24/19
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 117.18.232.133

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri May 02 16:29:49 UTC 2014
;; MSG SIZE rcvd: 79
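
To put a number on what the second reply should have shown, here is a small Python sketch of the arithmetic. It assumes the entry was cached at the time of the first query (16:23:20) with a TTL of 3600. The function name `expected_ttl` is only for illustration.

```python
from datetime import datetime


def expected_ttl(original_ttl: int, cached_at: datetime, now: datetime) -> int:
    """Remaining TTL a cache should report, never below zero."""
    elapsed = int((now - cached_at).total_seconds())
    return max(original_ttl - elapsed, 0)


first_query = datetime(2014, 5, 2, 16, 23, 20)
second_query = datetime(2014, 5, 2, 16, 29, 49)

# 6 minutes 29 seconds elapsed, so 3600 - 389 seconds should remain.
print(expected_ttl(3600, first_query, second_query))  # 3211
```

Under that assumption the second answer should carry a TTL of about 3211 rather than 3600.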

unbound-control marks all edns-subnet hits as misses:

root@dnsr001:~/src/edns-subnet# unbound-control stats_noreset
thread0.num.queries=5
thread0.num.cachehits=0
thread0.num.cachemiss=5
thread0.num.prefetch=0
thread0.num.recursivereplies=5
thread0.requestlist.avg=0
thread0.requestlist.max=0
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
thread0.requestlist.current.all=0
thread0.requestlist.current.user=0
thread0.recursion.time.avg=0.000522
thread0.recursion.time.median=6.25e-07
total.num.queries=5
total.num.cachehits=0
total.num.cachemiss=5
total.num.prefetch=0
total.num.recursivereplies=5
total.requestlist.avg=0
total.requestlist.max=0
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.000522
total.recursion.time.median=6.25e-07
time.now=1399048264.960805
time.up=616.002507
time.elapsed=616.002507

May 02 16:29:49 unbound[13363:0] info: 127.0.0.1 gp1.wpc.edgecastcdn.net. A IN
May 02 16:29:49 unbound[13363:0] debug: udp request from ip4 127.0.0.1 port 50867 (len 16)
May 02 16:29:49 unbound[13363:0] debug: mesh_run: start
May 02 16:29:49 unbound[13363:0] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
May 02 16:29:49 unbound[13363:0] info: subnet operate: query gp1.wpc.edgecastcdn.net. A IN
May 02 16:29:49 unbound[13363:0] debug: subnet: answered from cache
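
A quick way to watch for this is to compute the hit ratio from the stats output. The Python sketch below only parses the key=value lines shown above; the helper names are hypothetical, and the sample text is a shortened copy of the output in this post.

```python
from typing import Dict


def parse_stats(text: str) -> Dict[str, float]:
    """Turn 'key=value' lines from stats_noreset into a dict."""
    stats: Dict[str, float] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip blank or malformed lines
            stats[key] = float(value)
    return stats


def cache_hit_ratio(stats: Dict[str, float], scope: str = "total") -> float:
    hits = stats[f"{scope}.num.cachehits"]
    misses = stats[f"{scope}.num.cachemiss"]
    answered = hits + misses
    return hits / answered if answered else 0.0


sample = """total.num.queries=5
total.num.cachehits=0
total.num.cachemiss=5
"""

print(cache_hit_ratio(parse_stats(sample)))
```

With the numbers above the ratio is 0, even though the log shows "subnet: answered from cache" for the last query.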

(attachments)

subnetmod.c-patch (384 Bytes)

Yuri_Schaeffer · May 2, 2014, 11:55pm

Hi Larry,

Thank you, this is very useful indeed!

1) The returned record does not update based on geoip when using
different subnets. This happens only when the first request for a given
name does not have a client subnet passed with it:

1) dig: answer x
2) dig +client: answer x (110.232.0.0/24/0)
3) flush cache
4) dig +client: answer y (110.232.0.0/24/19)

I'm aware of this and don't consider it an issue. It is doing best
effort for ECS queries to non-whitelisted servers, but does not go the
extra mile to get an optimal answer. A quick explanation on what is
happening:

at 1) the query is not in the cache, a full recurse is started. Since
you don't have the particular authority whitelisted no ECS is passed
to it. The answer will end up in the regular cache. Subsequent queries
are looked up in that cache directly without going through the whole module
chain, making it cheap.

at 2) ECS is passed in the query, this time the initial cache lookup
is skipped as ECS queries are in a secondary cache. Of course this
cache is empty and thus a full recurse is started. This recurse grabs
records from the cache as much as possible and indeed, the record is
in the cache and no packets need to go over the wire.

Had the server been whitelisted, unbound would have sent ECS to it in
step 1). And the answer would not end up in the regular cache.
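
The lookup order described above can be sketched as a toy model in Python. This is not Unbound code: `ToyResolver` and `geo_authority` are invented for illustration, the ECS cache is keyed by the exact subnet string (ignoring scope), the whitelisted path is not modelled, and `flush` clears everything rather than a single name. The two addresses are the ones from Larry's transcripts.

```python
from typing import Callable, Dict, Optional, Tuple

Authority = Callable[[str, Optional[str]], str]


class ToyResolver:
    """Toy model of the two caches for a non-whitelisted authority."""

    def __init__(self, authority: Authority) -> None:
        self.authority = authority
        self.regular_cache: Dict[str, str] = {}
        self.ecs_cache: Dict[Tuple[str, str], str] = {}
        self.upstream_queries = 0

    def _ask_upstream(self, name: str, subnet: Optional[str]) -> str:
        self.upstream_queries += 1
        return self.authority(name, subnet)

    def resolve(self, name: str, subnet: Optional[str] = None) -> str:
        if subnet is None:
            # Plain query: regular cache first, else recurse without ECS.
            if name not in self.regular_cache:
                self.regular_cache[name] = self._ask_upstream(name, None)
            return self.regular_cache[name]
        # ECS query: the initial regular-cache lookup is skipped.
        key = (name, subnet)
        if key in self.ecs_cache:
            return self.ecs_cache[key]
        # The recursion still reuses records already in the regular cache.
        if name in self.regular_cache:
            return self.regular_cache[name]
        self.ecs_cache[key] = self._ask_upstream(name, subnet)
        return self.ecs_cache[key]

    def flush(self) -> None:
        self.regular_cache.clear()
        self.ecs_cache.clear()


def geo_authority(name: str, subnet: Optional[str]) -> str:
    # Hypothetical authority: one tailored answer, one default answer.
    return "117.18.232.133" if subnet == "110.232.0.0/24" else "72.21.81.253"


resolver = ToyResolver(geo_authority)
name = "gp1.wpc.edgecastcdn.net"
step1 = resolver.resolve(name)                    # 1) dig
step2 = resolver.resolve(name, "110.232.0.0/24")  # 2) dig +client
resolver.flush()                                  # 3) flush cache
step4 = resolver.resolve(name, "110.232.0.0/24")  # 4) dig +client
print(step1, step2, step4, resolver.upstream_queries)
```

In this model step 2 returns the default answer without any upstream query, and only after the flush does the ECS query reach the authority and get the tailored answer.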

2) The TTL returned when edns-subnet is passed does not change over
time:
3) unbound-control marks all edns-subnet hits as misses:

Indeed! I had not considered this before. I will see what I can do after
the weekend. Thanks for reporting your findings.

Regards,
Yuri

Larry_Havemann · May 5, 2014, 6:11pm

I’m aware of this and don’t consider it an issue. It is doing best
effort for ECS queries to non whitelisted servers, but does not go the
extra mile to get an optimal answer. A quick explanation on what is
happening:

at 1) the query is not in the cache, a full recurse is started. Since
you don’t have the particular authority whitelisted no ECS is passed
to it. The answer will end up in the regular cache. Subsequent queries
are looked up in that cache directly without going to the whole module
chain, making it cheap.

at 2) ECS is passed in the query, this time the initial cache lookup
is skipped as ECS queries are in a secondary cache. Of course this
cache is empty and thus a full recurse is started. This recurse grabs
records from the cache as much as possible and indeed, the record is
in the cache and no packets need to go over the wire.

This creates an issue: a user coming in with ECS 1.2.3.0/24 gets the correct answer, then a user coming in without ECS gets the default answer, and finally a user with ECS 1.2.4.0/24 gets the default instead of the correct answer. For my use this is a major show stopper. Is there any way the lookup order could be reversed if ECS is enabled, so that all queries first try the ECS cache and then go to the normal cache?

Thanks,
Larry

Yuri_Schaeffer · May 6, 2014, 9:36am

Hi Larry,

This creates an issue: a user coming in with ECS 1.2.3.0/24 gets the
correct answer, then a user coming in without ECS gets the default
answer, and finally a user with ECS 1.2.4.0/24 gets the default
instead of the correct answer.

Did you test this?

I believe this is not what would happen. The final user asking for ecs
1.2.4.0/24 would get an answer from the ecs cache. If it was not in
cache, an upstream lookup would be done with ecs 1.2.4.0/24, yielding the
most accurate response.

Also, I'd like to stress that an answer from the 'regular' cache is not
wrong, merely suboptimal. It is what a non ECS aware resolver would
answer.

Please note there is no need for the clients to 'speak' ECS. When you
whitelist a server to do ECS, Unbound will ask it for the most specific
answer for its client. Relaying ECS, especially to non-whitelisted
servers, is a courtesy to the client and not mandated by the draft.

To me and for my use this is a major show stopper.

I'm interested in your use case and what functionality you are looking
for. What do you expect from a recursor? Do you not intend to use the
whitelist functionality?

Is there any way the lookup order could be reversed if ECS is
enabled, so that all queries first try the ECS cache and then go to
the normal cache?

While possible, it would affect every single incoming query. It is
assumed ECS is only communicated with a fraction of authority servers.
It would mean a significant performance hit, especially since the ECS
cache is not a straightforward key:value lookup.
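
To illustrate why this is not a plain key:value lookup, here is a minimal Python sketch of a per-name cache that has to find the most specific stored subnet containing the client address. It is an assumption-laden toy, not Unbound's data structure: it scans a list linearly, and the class name `SubnetCache` is made up. The networks and answers are taken from the transcripts in this thread.

```python
import ipaddress
from typing import Dict, List, Optional, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class SubnetCache:
    """Answers stored per (name, scope network); lookup is longest-prefix."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Tuple[Network, str]]] = {}

    def store(self, name: str, scope_network: str, answer: str) -> None:
        network = ipaddress.ip_network(scope_network)
        self._entries.setdefault(name, []).append((network, answer))

    def lookup(self, name: str, client_address: str) -> Optional[str]:
        address = ipaddress.ip_address(client_address)
        best: Optional[Tuple[Network, str]] = None
        for network, answer in self._entries.get(name, []):
            if address.version != network.version or address not in network:
                continue
            # Keep the entry with the longest (most specific) prefix.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, answer)
        return best[1] if best else None


cache = SubnetCache()
name = "gp1.wpc.edgecastcdn.net"
cache.store(name, "110.232.0.0/19", "117.18.232.133")  # scope /19 reply
cache.store(name, "46.22.74.0/24", "93.184.221.133")   # scope /24 reply

print(cache.lookup(name, "110.232.5.9"))
print(cache.lookup(name, "46.22.74.10"))
print(cache.lookup(name, "8.8.8.8"))
```

Every lookup has to compare the client address against the stored prefixes, which is the extra work compared with a single hash lookup in the regular cache.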

Best regards,
Yuri

Larry_Havemann · May 6, 2014, 4:51pm

Here is an example showing the behavior described above:

Client 1 subnet 110.232.0.0/24 Asia:
root@dnsr001:/usr/etc/unbound# /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net +client=110.232.0.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12638
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 110.232.0.0/24/19
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 117.18.232.133 <=== Asia IP

;; Query time: 45 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue May 06 16:32:57 UTC 2014
;; MSG SIZE rcvd: 79

Client 2 no subnet:
root@dnsr001:/usr/etc/unbound# /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44048
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 72.21.81.253 <==== NA IP

;; Query time: 2 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue May 06 16:33:01 UTC 2014
;; MSG SIZE rcvd: 68

Client 3 46.22.74.0/24 Europe:
root@dnsr001:/usr/etc/unbound# /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net +client=46.22.74.0/24

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net +client=46.22.74.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33795
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 46.22.74.0/24/0
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3583 IN A 72.21.81.253 <===== NA IP

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue May 06 16:33:18 UTC 2014
;; MSG SIZE rcvd: 79

This is a show stopper because a client coming in from 46.22.74.0 is in Europe and, with this result, is being directed to servers located in North America. This is a major performance hit for the end user. The correct response should be:

root@dnsr001:/usr/etc/unbound# /EdgeCast/ecdns/bin/dig_iana @localhost gp1.wpc.edgecastcdn.net +client=46.22.74.0/24

; <<>> DiG 9.9.3-P1 <<>> @localhost gp1.wpc.edgecastcdn.net +client=46.22.74.0/24
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31443
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 46.22.74.0/24/24
;; QUESTION SECTION:
;gp1.wpc.edgecastcdn.net. IN A

;; ANSWER SECTION:
gp1.wpc.edgecastcdn.net. 3600 IN A 93.184.221.133 <==== Europe IP

;; Query time: 3 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue May 06 16:35:07 UTC 2014
;; MSG SIZE rcvd: 79

Here is the use case where this becomes a show stopper. As a CDN we pull content from many different customers. If a customer has geoip rules set up so that content for the EU is pulled from an EU origin server and content for NA is pulled from an NA origin server, this behavior would cause the pull to happen from the wrong part of the world. If thousands of servers in the EU and thousands of servers in NA all pull from the same location, instead of distributing the load as described in the geoip rule, the NA origin server could fail. This would lead to the customer’s content either never loading or becoming stale. So for me the result from the regular cache is in fact wrong. I did find a way to force it to always do an ECS lookup by whitelisting 1-255.0.0.0/8, but this seems like a very sloppy way of doing things.

I think the performance hit would be acceptable to anyone enabling ECS so long as it is documented that there is a performance hit.

Larry_Havemann · July 9, 2014, 4:53pm

Hi Yuri,

I was wondering if you’d made any headway on the ECS TTL bug or the bug in unbound-control that causes ECS queries to be counted as misses even when they are returned from cache.

Thanks,
Larry