Selecting the Prometheus metrics exposed by Cascade

With PR 405 merged, we have support for Prometheus metrics. However, only a handful of metrics are actually exported right now. During development, I made a table of potential metrics and the type each of them should have.

This post serves two purposes: to keep track of the metrics implemented so far (until they are documented in the official docs) and to gather input on which metrics we should implement in the future (for production and beyond).

| Status (yes/no/maybe) | Metric name | Metric type | Comment |
|---|---|---|---|
| implemented | `zones_configured` | Gauge | |
| implemented | `zones_loaded` | Gauge | |
| implemented | `zones_active` | Gauge | |
| implemented | `zones_halted {name="example.com", mode="<SoftHalt,HardHalt>"}`, or | Gauge | |
| implemented | `zones_waiting` | Gauge | Waiting in the signing queue |
| implemented | `zones_unsigned` | Gauge | |
| implemented | `zones_signed` | Gauge | The number of zones that have been signed but not yet moved to the publishing step |
| maybe | `zones_status {name="example.com"}` (number-mapped enum) | Gauge (enum, see spec) | The pipeline status per zone. Implement a custom type that wraps/implements the Gauge-related trait? Duplicates zones_active, zones_halted, etc. Poor human readability because the status would be encoded as integers |
| maybe | `signing_operations_active {name="example.com"}` | Gauge | |
| maybe | `signing_operations_completed {name="example.com"}` | Counter | Should not decrease |
| maybe | `signing_operations_last_started {name="example.com"}` | Gauge (unix-time) | |
| maybe | `signing_operations_last_completed {name="example.com"}` | Gauge (unix-time) | |
| maybe | `zones_signing_frequency` | Gauge | TODO: does this have to be turned into a time and a number of signing operations, per the spec? Is this made obsolete by signing_operations_completed, with Prometheus mapping that over time? |
| maybe | `zones_signing_speed` | Gauge | Signatures per second |
| maybe | `zones_pending_rrsets_to_sign {name="example.com"}` | Gauge | |
| maybe | `zones_disk_usage` | Gauge | Memory/disk usage per zone, and per version of a zone |
| maybe | `zones_memory_usage` | Gauge | |
| maybe | `zones_expected_next_sign` | Gauge | When zones are expected to be re-signed |
| maybe | `zones_expected_next_key_rollover` | Gauge | When automated key rollover operations are expected to happen |
| maybe | `zones_earliest_signature_expire_time` | Gauge | When signatures in the signed zone will expire |
| maybe | `zones_soa_serial` | Gauge | Gauge because the serial can roll over |
| maybe | `zones_num_records_signed` | Gauge | Basic details about the contents of zones |
| maybe | `zones_num_records_unsigned` | Gauge | Number of unsigned records |
| maybe | `zones_num_records {name="example.com", type="<RRSIG,...>"}` | Gauge | (Maybe merge with num_record_(un)signed?) Number of records of different rrtypes |
| maybe | `zones_num_rrsets {name="example.com"}` | Gauge | (Maybe merge with num_record_(un)signed?) Number of records of different rrtypes (NSEC, NSEC3, DNSKEY, DS, NS, glue) |
| no; instead: ↓ | `zones_time_spent_in_stage_X` | Gauge | Latency measurements of the various stages a zone passes through: specifically the most recent, shortest, and longest latencies, for the loading, unsigned-approving, signing, and signed-approving stages, and their total |
| maybe | `zone_pipeline_latency {stage="<loading,...>"}` | Gauge | Latency measurements of the various stages a zone passes through. Statistics like the most recent, shortest, and longest latency for the different stages (loading, unsigned-approving, signing, signed-approving) and their total are calculated by the scraper/frontend (e.g. a Grafana dashboard) |
| | | | The availability of upstream zone servers: |
| maybe/no | `zone_upstream_last_queried_successfully {name="example.com"}` | Gauge (unix-time) | When they were last successfully queried |
| maybe/no | `zone_upstream_last_queried_unsuccessfully {name="example.com"}` | Gauge (unix-time) | When they were last unsuccessfully queried |
| maybe | `zone_upstream_queries_total {name="example.com"}` | Counter | The success rate of queries would be calculated by the metrics frontend |
| maybe | `zone_upstream_queries_successful {name="example.com"}` | Counter | |
| maybe | `zone_upstream_queries_failed {name="example.com"}` | Counter | |
| maybe | `keyset_errors {name="example.com"}` | Counter | Number of keyset errors per zone |
| maybe | `signing_errors {name="example.com"}` | Counter | Number of signing errors per zone |
| maybe | `general_errors` | Counter | Simple counters for general errors/warnings created by the cascade daemon |
| maybe | `general_warnings` | Counter | |
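
To make the split between "exported by Cascade" and "calculated by the frontend" a bit more concrete, here is a minimal sketch in plain Python (not Cascade code, and not using any Prometheus client library). It shows how a labelled sample from the table would look in the Prometheus text exposition format, and how a frontend could derive the upstream success rate and the latest/shortest/longest latency from raw values. The helper names `render_sample`, `success_rate`, and `latency_summary` are hypothetical, and all values are made up for illustration.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_sample(name, labels, value):
    """Format one sample as a line of the Prometheus text exposition format."""
    if not labels:
        return f"{name} {value}"
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}"


def success_rate(successful_delta, total_delta):
    """Fraction of successful upstream queries between two scrapes.

    Both arguments are increases of the respective counters over the
    same time window. Returns None if no queries happened in the window.
    """
    if total_delta == 0:
        return None
    return successful_delta / total_delta


def latency_summary(samples):
    """Most recent, shortest, and longest latency from scraped gauge values.

    `samples` is a list of values in scrape order (oldest first).
    """
    if not samples:
        return None
    return {
        "latest": samples[-1],
        "shortest": min(samples),
        "longest": max(samples),
    }


if __name__ == "__main__":
    # Unlabelled gauge and per-zone counter, with illustrative values.
    print(render_sample("zones_active", {}, 3))
    print(render_sample("zone_upstream_queries_total", {"name": "example.com"}, 10))
    # Derived on the frontend side, not exported by the daemon.
    print(success_rate(9, 10))
    print(latency_summary([2.0, 0.5, 1.5]))
```

The point of the sketch is that the daemon only needs to export monotonic counters and the current gauge value per stage; rates and min/max statistics can be computed later over whatever time window the dashboard chooses.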