I'm running NSD 2.2.1 on Solaris; and have some questions about the stats
dumps. I've set nsd to dump at 300 sec intervals, to a file. Reading that
file, I'm doing some tail|grep|ssh to get my customer the stats on query
amount, etc.
However, through a combination of deficiencies in their collection system
(outside my jurisdiction) and strange deviations in dump intervals, I'm
getting inconsistent data and a very unhappy customer. The dump time is not
at a steady 300 sec cycle, but deviates at strange intervals, and way too
much at times. Given the non-real-time characteristics of Unices, I'm
prepared to accept 10-20 second offsets, but not values ranging from 20% to
1400%. What is going on?
A graphical representation of the data can be found at:
<http://www.noc.kth.se/~mansaxel/wdel.png>
I notice two NSD processes running. Are they writing concurrently to the
log? If so; is there any way to differentiate between them?
Måns Nilsson wrote:
However, through a combination of deficiencies in their collection system
(outside my jurisdiction) and strange deviations in dump intervals, I'm
getting inconsistent data and a very unhappy customer. The dump time is not
at a steady 300 sec cycle, but deviates at strange intervals, and way too
much at times. Given the non-real-time characteristics of Unices, I'm
prepared to accept 10-20 second offsets, but not values ranging from 20% to
1400%. What is going on?
NSD also dumps statistics when the process exits. So the shorter dump interval can be explained by database reloads (which causes NSD to restart itself).
The two big peaks in the green line are surprising. I'm not sure what could cause this. It looks like at some point the periodic SIGALRM that is used to dump statistics is lost, and dumping is not resumed until an NSD reload happens...
A graphical representation of the data can be found at:
<http://www.noc.kth.se/~mansaxel/wdel.png>
I notice two NSD processes running. Are they writing concurrently to the
log? If so; is there any way to differentiate between them?
Only the child process dumps statistics, since the parent process isn't answering any queries. If you run with multiple child servers (using the -N option) there is currently no way to distinguish the servers. Using syslog instead of a log file in this case.
Erik
NSD also dumps statistics when the process exits. So the shorter dump
interval can be explained by database reloads (which causes NSD to
restart itself).
Ok, this explains things.
The two big peaks in the green line are surprising. I'm not sure what
could cause this. It looks like at some point the periodic SIGALRM that
is used to dump statistics is lost, and dumping is not resumed until an
NSD reload happens...
Is it the mother process who sends those to the child? I saw them
only after a restart where I had truncated the log file.
Only the child process dumps statistics, since the parent process isn't
answering any queries. If you run with multiple child servers (using the
-N option) there is currently no way to distinguish the servers. Using
syslog instead of a log file in this case.
There is no such multiple statement.
An additional question: As I understand the log format, the second time_t
value on a ?STATS line is the start of the global period, while the first
is dumptime, and all querycounts are since the second time_t. This means
that one can derive the average query count since initiation per second by
doing something like
qcount_since_start
Måns Nilsson wrote:
Subject: Re: Statistics dumps. Date: Thu, Jul 28, 2005 at 04:38:07PM
+0200 Quoting Erik Rozendaal (erik@nlnetlabs.nl):
NSD also dumps statistics when the process exits. So the shorter dump interval can be explained by database reloads (which causes NSD to restart itself).
Ok, this explains things.
An additional question: As I understand the log format, the second time_t
value on a ?STATS line is the start of the global period, while the first
is dumptime, and all querycounts are since the second time_t. This means
that one can derive the average query count since initiation per second by
doing something like
qcount_since_start
--------------------
(time_field_1 - time_field_2)
True, except for the case a reload has happened. In this case all the counters are reset to 0, but the second time_t is not reset. This may be a bug...
...but if I want to get the qps in 5 minute intervals (and take a stats
dump every 300 secs to achieve that resolution.) I can do post-processing
by saving the last dump time and doing some awk math on that (which would
catch the not-at-300-sec-intervals dumps) but I still get strange value
dips (again refering to the graph URL in my last mail) as soon as I've
gotten an extra reload. It might be my math, of course, but your opinions
on the best sampling method to get qps in 5 minute or better resolution
from NSD stats dumps would be most appreciated.
I'm not sure if you can get the math right without also knowing a reboot/reload happened. And detecting a reboot/reload is not easy without the pid information.
Erik