regular statistics dumps getting out of sync

The BIND8-like STATS in nsd are of great help, so I have a nameserver (2.3.3)
running with "-s 60" to dump those statistics every minute. The data is
read, processed and fed into rrdtool to provide some query graphs.

Now it turns out that every now and then an interval doesn't last 60 seconds
but 61 instead. This may happen since alarm() doesn't guarantee instant
delivery. It would be nice, though, if the alarm restart tried to resync
the interval.

Instead of

#ifdef BIND8_STATS
          alarm(nsd.st.period);
#endif

I'd like to propose that nsd set the alarm to
  nsd->st.period - (time(NULL) - nsd->st.boot) % nsd->st.period
that is, the remaining time for a multiple of the "-s" interval.

Even better for rrdtool use, the first dump would occur on a predictable
boundary, e.g. for 60 second intervals it would be written on the first
full minute, so the initial value would read

  nsd->st.period - nsd->st.boot % nsd->st.period

There's still a potential race condition, of course. And you wouldn't want
to try "-s 1".

Opinions?

-Peter

Hi Peter,

Although we could resync the interval, you would keep the race condition: you
are trying to run two processes, both at 1-minute intervals, and to keep
them synchronised. I think that a solution that avoids race conditions
altogether (and keeps NSD feature-clean :-) ) is to signal rrdtool when
STATS appear.

I do not know exactly how you are feeding the data into rrdtool, but the
docs I could find say that rrdtool can take data from any point in time,
even if you want to see it on the minute (it interpolates).

You can start a shell script:

  tail -f $nsd_logfile | grep '[XN]STATS' | \
    while read; do kill -HUP $rrdtool_pid; done &

so that when STATS appear in the logfile, $rrdtool_pid is signalled.

So tail will be blocking on a read on the logfile; when the file is
written, it unblocks and your PID of choice is sent a signal. Note that it
HUPs twice per dump, once for NSTATS and once for XSTATS; I am not sure
whether rrdtool likes that.

Hope this helps,

Best regards,
   Wouter

Peter Koch wrote:

data is read from the logfile every "some" minutes and entered into the
rrdtool repository, so it's not real time. Interpolation is what I was
trying to avoid, since rrdtool doesn't seem to deal too well with bursty
data; but maybe you're right in suggesting I have an rrdtool problem, not
an nsd one. Still, getting the interval boundaries more stable seems a
useful goal to me, even if a (much smaller) race condition remains.

Thanks,
  Peter

I've done such things, and in my experience, the quality of the output is better if you resync.

If you resync, you get effectively the right interval until conditions are completely horrible, and then it falls back to 2*interval, 3*interval, etc. The interval's _effectively_ right because when a signal is delivered a second late, generally the same reason has also prevented you from doing anything that would be reflected in the statistics you report.

I've only seen the 2*interval thing in case of disasters. True disasters like another process eating all RAM+swap. (IIRC rrdtool can be configured to detect 2*interval periods and display them as outages.)

By comparison, if you don't resync, the period changes by a much smaller factor, and it starts deteriorating much sooner. You don't need a fork bomb to affect data quality, you just need a bit of overload or bad luck.

The algorithms I've used are (translating from my select() to nsd's alarm()):

     alarm( nsd->st.period - ( time(NULL) - nsd->st.boot ) % nsd->st.period );

and

     alarm( nsd->st.period - ( time(NULL) % nsd->st.period ) );

The first gives better data for a single process, since its first st.period is optimally reported. The second gives better aggregate data across a process restart, or when data from several nsds are combined.

Arnt