With PR 405 merged, we have support for Prometheus metrics. However, only a handful of metrics are actually exported right now. During development I made a table of potential metrics and the type each should have.
This post is meant both to keep track of the metrics that are implemented (until they are documented in the official docs) and to gather input on which metrics we should implement in the future (for production and beyond).
| status | metric name | metric type | comment |
|---|---|---|---|
| implemented | zones_configured | Gauge | |
| implemented | zones_loaded | Gauge | |
| implemented | zones_active | Gauge | |
| implemented | zones_halted {name="example.com", mode="<SoftHalt,HardHalt>"}, or | Gauge | |
| implemented | zones_waiting | Gauge | Zones waiting in the signing queue |
| implemented | zones_unsigned | Gauge | |
| implemented | zones_signed | Gauge | Number of zones that have been signed but not yet moved to the publishing step |
| maybe | zones_status {name="example.com"} (number mapped enum) | Gauge (enum see spec) | The pipeline status per zone; implement a custom type that wraps/implements the Gauge-related trait? Duplicates zones_active, zones_halted, etc. Poor human readability because the status would be encoded as integers |
| maybe | signing_operations_active {name="example.com"} | Gauge | |
| maybe | signing_operations_completed {name="example.com"} | Counter | Should not decrease |
| maybe | signing_operations_last_started {name="example.com"} | Gauge (unix-time) | |
| maybe | signing_operations_last_completed {name="example.com"} | Gauge (unix-time) | |
| maybe | zones_signing_frequency | Gauge | TODO: does this have to be turned into a time and a number of signing operations per the spec? Is this made obsolete by signing_operations_completed, with Prometheus mapping that over time? |
| maybe | zones_signing_speed | Gauge | Signatures per second |
| maybe | zones_pending_rrsets_to_sign {name="example.com"} | Gauge | |
| maybe | zones_disk_usage | Gauge | Memory/disk usage per zone, and per version of a zone |
| maybe | zones_memory_usage | Gauge | |
| maybe | zones_expected_next_sign | Gauge | When zones are expected to be re-signed |
| maybe | zones_expected_next_key_rollover | Gauge | When automated key rollover operations are expected to happen |
| maybe | zones_earliest_signature_expire_time | Gauge | When signatures in the signed zone will expire |
| maybe | zones_soa_serial | Gauge | (Gauge because the serial can roll over) |
| maybe | zones_num_records_signed | Gauge | Basic details about the contents of zones |
| maybe | zones_num_records_unsigned | Gauge | Number of unsigned records |
| maybe | zones_num_records {name="example.com", type="<RRSIG,...>"} | Gauge | (maybe merge with num_records_(un)signed?) Number of records of different rrtypes |
| maybe | zones_num_rrsets {name="example.com"} | Gauge | (maybe merge with num_records_(un)signed?) Number of records of different rrtypes (NSEC, NSEC3, DNSKEY, DS, NS, glue) |
| no; instead: ↓ | zones_time_spent_in_stage_X | Gauge | Latency measurements of the various stages a zone passes through: specifically the most recent, shortest, and longest latencies, for the loading, unsigned-approving, signing, and signed-approving stages, and their total |
| maybe | zone_pipeline_latency {stage="<loading,...>"} | Gauge | Latency measurements of the various stages a zone passes through. Statistics like the most recent, shortest, and longest latency for the different stages (loading, unsigned-approving, signing, signed-approving) and their total are calculated by the scraper/frontend (e.g. a Grafana dashboard) |
| The availability of upstream zone servers: | | | |
| maybe/no | zone_upstream_last_queried_successfully {name="example.com"} | Gauge (unix-time) | When they were last successfully queried |
| maybe/no | zone_upstream_last_queried_unsuccessfully {name="example.com"} | Gauge (unix-time) | When they were last unsuccessfully queried |
| maybe | zone_upstream_queries_total {name="example.com"} | Counter | The success rate of queries would be calculated by the metrics frontend |
| maybe | zone_upstream_queries_successful {name="example.com"} | Counter | |
| maybe | zone_upstream_queries_failed {name="example.com"} | Counter | |
| maybe | keyset_errors {name="example.com"} | Counter | Number of keyset errors per zone |
| maybe | signing_errors {name="example.com"} | Counter | Number of signing errors per zone |
| maybe | general_errors | Counter | Simple counters for general errors/warnings created by the cascade daemon |
| maybe | general_warnings | Counter | |
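
To make the table a bit more concrete, here is a minimal sketch in plain Python (not the daemon's actual code) of how a few of these metrics would look in the Prometheus text exposition format, and how a frontend could derive the upstream success rate from the two counters. `Metric`, `render`, and `success_rate` are hypothetical helpers invented for this illustration, and all sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Hypothetical metric family: a name, a type, and labelled samples."""
    name: str
    kind: str  # "gauge" or "counter"
    samples: dict = field(default_factory=dict)  # sorted label tuple -> value

    def set(self, value, **labels):
        # Gauges may go up or down, so just overwrite the sample.
        self.samples[tuple(sorted(labels.items()))] = value

    def inc(self, amount=1, **labels):
        # Counters must never decrease.
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        self.samples[key] = self.samples.get(key, 0) + amount


def escape(value):
    """Escape a label value for the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metrics):
    """Render metric families as Prometheus text exposition lines."""
    lines = []
    for m in metrics:
        lines.append(f"# TYPE {m.name} {m.kind}")
        for labels, value in m.samples.items():
            label_str = ",".join(f'{k}="{escape(v)}"' for k, v in labels)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{m.name}{suffix} {value}")
    return "\n".join(lines) + "\n"


def success_rate(successful, total, **labels):
    """What a frontend would compute from the two upstream counters."""
    key = tuple(sorted(labels.items()))
    count = total.samples.get(key, 0)
    return successful.samples.get(key, 0) / count if count else None


# Made-up sample values, purely for illustration.
zones_configured = Metric("zones_configured", "gauge")
zones_configured.set(3)

zones_halted = Metric("zones_halted", "gauge")
zones_halted.set(1, name="example.com", mode="SoftHalt")

queries_total = Metric("zone_upstream_queries_total", "counter")
queries_ok = Metric("zone_upstream_queries_successful", "counter")
for succeeded in (True, True, False, True):
    queries_total.inc(name="example.com")
    if succeeded:
        queries_ok.inc(name="example.com")

print(render([zones_configured, zones_halted, queries_total, queries_ok]))
print(success_rate(queries_ok, queries_total, name="example.com"))
```

Keeping the success rate out of the daemon, as the comment on `zone_upstream_queries_total` suggests, means only the two monotonically increasing counters need to be exported; the ratio (and its change over time) is left to the scraper or dashboard.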