URL: https://dev.to/soniarotglam/10-production-grade-alert-rules-for-cosmos-validators-with-real-promql-4i5g

10 production-grade alert rules for Cosmos validators (with real PromQL)


Most Cosmos validator monitoring is one of two things: a check that the process is "up", or a 200-panel Grafana dashboard nobody looks at until after the incident. Neither pages you before you get jailed.

Quick disclosure: I am the co-founder who does the marketing at The Good Shell, not the one on call with the validators at 3am. But I sit next to the engineers who are, and their alerting config was living in a private repo doing nobody else any good. So I wrote it up. These are the 10 alert rules our team actually pages on for Cosmos Hub and Cosmos SDK validators. Real PromQL, production thresholds, nothing decorative. Everything you need is in this post: the rules, the scrape config, and the alert routing. Copy it, change the thresholds, ship it.

Before the rules: where the metrics come from

CometBFT exposes Prometheus metrics on :26660/metrics when you set instrumentation.prometheus = true in config.toml. Node-level metrics (disk, memory, clock) come from node_exporter.

One gotcha that breaks copy-pasted rules: the metric namespace. Modern CometBFT uses the cometbft_ prefix. Older chains, or any chain running instrumentation.namespace = "tendermint", expose the same metrics under tendermint_. Check yours before deploying:

curl -s localhost:26660/metrics | grep validator_missed_blocks

Examples below use cometbft_. Swap the prefix if your node uses tendermint_.
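If you run nodes with both namespaces and want a single set of rules, one option is to rename the metrics at scrape time instead of editing every expression. A minimal sketch, assuming the job name and target from the scrape config later in this post; adapt both to your setup:

```yaml
scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]
    metric_relabel_configs:
      # Rewrite tendermint_<rest> to cometbft_<rest> before ingestion,
      # so rules written against cometbft_ match on older nodes too.
      - source_labels: [__name__]
        regex: tendermint_(.+)
        target_label: __name__
        replacement: cometbft_${1}
```

Metrics that already carry the cometbft_ prefix do not match the regex and pass through unchanged.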

1. Validator is missing blocks (jailing risk)

The single most important rule. Page on the rate first, because by the time you hit the absolute jail threshold it is often too late.

- alert: ValidatorMissingBlocks
  expr: increase(cometbft_consensus_validator_missed_blocks[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Validator missed {{ $value }} blocks in the last 5m"

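Cosmos Hub uses a 10,000 block signed-blocks window with min_signed_per_window = 0.05. That means you must sign at least 500 blocks in every window, so the slashing module jails you once you miss more than 9,500 of them. Page long before that: add a critical rule with a wide buffer, and tune the number to your chain's signed-blocks window: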

- alert: ValidatorJailImminent
  expr: cometbft_consensus_validator_missed_blocks > 400
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Approaching jail threshold: {{ $value }} missed blocks"

2. Dropped from the active set or jailed

Voting power goes to zero when you are jailed or fall out of the active set. This is the "you are already out" alarm.

- alert: ValidatorNotInActiveSet
  expr: cometbft_consensus_validator_power == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Validator power is 0 (jailed or outside the active set)"
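Before shipping a threshold change, an expression like this one can be checked against synthetic data with promtool's unit-test format. A minimal sketch, assuming a hypothetical file name and made-up series values; it tests the bare expression rather than the alert, so it does not depend on how your rules file is laid out:

```yaml
# validator_power_test.yml (hypothetical file name)
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic samples, one per minute: power drops to 0 at t=2m.
      - series: 'cometbft_consensus_validator_power{instance="validator:26660"}'
        values: '100 100 0 0 0 0'
    promql_expr_test:
      # At t=3m the comparison should return the series with value 0.
      - expr: cometbft_consensus_validator_power == 0
        eval_time: 3m
        exp_samples:
          - labels: 'cometbft_consensus_validator_power{instance="validator:26660"}'
            value: 0
```

Run it with promtool test rules validator_power_test.yml. The same pattern works for any of the other expressions once you swap in matching input series.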

3. Block height is not advancing (node halted)

If height stops moving, the node is stuck: a failed upgrade, a corrupted DB, or a panic loop. This fires even when the process is technically "up".

- alert: BlockHeightStalled
  expr: increase(cometbft_consensus_height[3m]) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No new blocks in 3m, node is stuck"
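One caveat worth covering: if Prometheus cannot scrape the node at all, increase() has no samples to work with and returns nothing, so the rule above stays silent. A possible companion rule is sketched below. The alert name is made up for illustration, and the job label assumes the cometbft job from the scrape config later in this post:

```yaml
# Sketch: fires when the metrics endpoint itself is unreachable,
# which is the case the height-based rule cannot see.
- alert: CometBFTScrapeDown
  expr: up{job="cometbft"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus cannot scrape the CometBFT metrics endpoint"
```

This is not a substitute for the height check. It only closes the gap where missing data would otherwise look like a healthy node.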

4. In the set but not signing recent blocks

Distinct from missed_blocks: this catches a validator that is active but whose signer stopped producing signatures (a dead remote signer, a key issue). It compares the chain height to the last height you actually signed.

- alert: ValidatorNotSigning
  expr: cometbft_consensus_height - cometbft_consensus_validator_last_signed_height > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Last signed height is {{ $value }} blocks behind chain head"

5. Low peer count

A validator behind sentries should always have peers. A collapsing peer count means a sentry is down or you are being partitioned, both of which lead to missed blocks.

- alert: LowPeerCount
  expr: cometbft_p2p_peers < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Only {{ $value }} peers connected"

6. Block production is slowing down

Rising block intervals mean the network (or your node) is struggling to finalize. Useful as an early "something is wrong" signal before blocks are outright missed.

- alert: SlowBlockProduction
  expr: avg_over_time(cometbft_consensus_block_interval_seconds[5m]) > 8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Average block interval is {{ $value }}s over 5m"

(Bonus: cometbft_consensus_rounds > 1 sustained tells you consensus is taking multiple rounds to commit, another stress signal.)
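Depending on the CometBFT version, the block interval may be exposed as a histogram rather than a single gauge, in which case the endpoint shows _sum, _count and _bucket series and the avg_over_time form matches nothing. A sketch of the same alert for that case, assuming those series exist on your node (check the /metrics output first):

```yaml
- alert: SlowBlockProduction
  # Mean interval over 5m = seconds accumulated / blocks observed.
  expr: |
    rate(cometbft_consensus_block_interval_seconds_sum[5m])
      / rate(cometbft_consensus_block_interval_seconds_count[5m]) > 8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Average block interval is {{ $value }}s over 5m"
```

Use one form or the other, not both, since they share an alert name.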

7. Disk almost full

Chain data grows continuously. A validator that runs out of disk halts instantly. Alert with enough runway to prune or expand.

- alert: DiskSpaceCritical
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} /
    node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Root filesystem at {{ $value | humanize }}% free"
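A fixed percentage says nothing about how fast the disk is filling. If you want the alert to express runway directly, PromQL's predict_linear can extrapolate the recent trend. A sketch only: the alert name, the 6h lookback and the 24h horizon are illustrative choices, not values from our production config:

```yaml
- alert: DiskFullWithin24h
  # Fit a line to the last 6h of free space and project it 24h ahead;
  # a negative projection means the disk would be full by then.
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Root filesystem projected to fill within 24h"
```

A long for: keeps a short burst of writes, such as a snapshot, from paging anyone.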

8. Memory pressure (upgrade OOM risk)

Under normal load gaiad sits at 16 to 32GB. During coordinated upgrades it spikes, and an OOM kill at upgrade height is a classic jailing event. Catch the pressure before the kernel does.

- alert: HighMemoryPressure
  expr: |
    (node_memory_MemAvailable_bytes /
    node_memory_MemTotal_bytes) * 100 < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Only {{ $value | humanize }}% memory available"

9. Remote signer is down (TMKMS or Horcrux)

If your signer dies, the node keeps running but cannot sign, and you march toward the jail threshold silently. How long you have is the number of misses your chain tolerates multiplied by its block time, so work that out from your slashing parameters rather than guessing. This assumes you scrape your signer host (a blackbox or port check works if TMKMS has no native exporter).

- alert: RemoteSignerDown
  expr: up{job="tmkms"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Remote signer target is down, validator cannot sign"
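For the port-check route, a scrape job through blackbox_exporter might look like the sketch below. Everything host-specific here is a placeholder: signer-host:1234 stands in for whatever TCP endpoint indicates liveness in your signer setup, blackbox:9115 for wherever your blackbox_exporter runs, and tcp_connect is assumed to be defined in your blackbox config:

```yaml
scrape_configs:
  - job_name: tmkms
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ["signer-host:1234"]   # placeholder endpoint to probe
    relabel_configs:
      # Pass the original target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to blackbox_exporter.
      - target_label: __address__
        replacement: blackbox:9115
```

With this setup, up{job="tmkms"} only tells you whether blackbox_exporter answered. The result of the probe itself is in probe_success, so the alert expression would become probe_success{job="tmkms"} == 0.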

10. Clock drift (NTP)

Underrated and brutal. With Proposer-Based Timestamps, a validator whose clock drifts past the chain's precision bound starts seeing valid proposals as "not timely" and prevotes nil, and its own proposals get rejected. The fix is monitoring the offset, not assuming chrony is fine. Needs the node_exporter timex collector.

- alert: ClockDrift
  expr: abs(node_timex_offset_seconds) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Clock offset is {{ $value }}s, consensus timing at risk"

Wiring it up

Point Prometheus at the node and node_exporter:

scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]
  - job_name: node
    static_configs:
      - targets: ["validator:9100"]
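The scrape jobs alone do not load any rules or deliver any alerts. Prometheus also needs to know where the rules file lives and where Alertmanager is. A sketch of the two extra sections in prometheus.yml, assuming the rules are saved as rules.yml next to it and that Alertmanager is reachable at the hypothetical address alertmanager:9093:

```yaml
# Load the alert rules from disk.
rule_files:
  - rules.yml

# Tell Prometheus where to send firing alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # hypothetical host, default port
```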

Then route severity to where it belongs: critical to PagerDuty (wake someone up), warning to Slack. The point of splitting them is that you should be able to ignore Slack at 3am and still get paged for the things that actually jail you (rules 1 to 4, 7, 8, 9).

Take it

That is the whole baseline. Drop the ten rules into a rules.yml, then give Alertmanager a route that splits by severity:

route:
  receiver: slack
  group_by: [alertname]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <your-pagerduty-key>
  - name: slack
    slack_configs:
      - api_url: <your-slack-webhook>
        channel: "#validator-alerts"
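As for rules.yml itself, Prometheus expects the rules wrapped in a group rather than as a bare list. A skeleton with one rule filled in, where the group name is an arbitrary choice for illustration:

```yaml
groups:
  - name: cosmos-validator   # arbitrary group name
    rules:
      - alert: ValidatorNotInActiveSet
        expr: cometbft_consensus_validator_power == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Validator power is 0 (jailed or outside the active set)"
      # Remaining rules go here, at the same indentation.
```

Running promtool check rules rules.yml before a reload catches indentation and syntax mistakes.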

Swap the thresholds for your chain's parameters and you have real alerting in an afternoon. This is the baseline our engineers run for Cosmos validators day to day. If it saves you one 3am page, it did its job. Better thresholds and war stories welcome in the comments.