URL: https://www.zabbix.com/integrations/etcd

etcd monitoring and integration with Zabbix


etcd

etcd is an open source distributed key-value store used to hold and manage the critical information that distributed systems need to keep running. Most notably, it manages the configuration data, state data, and metadata for Kubernetes, the popular container orchestration platform.

Available solutions




This template is for Zabbix version: 7.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/7.4

Etcd by HTTP

Overview

This template monitors etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics with the HTTP agent from the /metrics endpoint.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5 some metrics have been deprecated. For details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 7.4 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.

  2. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to run the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.

  3. Add the template to the etcd node. Set the hostname or IP address of the etcd host in the {$ETCD.HOST} macro. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag.

For more details, see the etcd documentation.

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change macros: {$ETCD.SCHEME} and {$ETCD.PORT}.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template or override them at the host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as the macros set the trigger thresholds.
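
The reachability checks in steps 1 and 2 can also be scripted. The following is a minimal Python sketch, not part of the template: it builds the metrics URL from values that correspond to {$ETCD.SCHEME}, {$ETCD.HOST}, and {$ETCD.PORT} and fetches it with the standard library. The function names and the etcd-host value are illustrative assumptions.

```python
from urllib.request import urlopen


def metrics_url(scheme: str, host: str, port: int, path: str = "/metrics") -> str:
    """Build the URL that corresponds to the scheme, host, and port macros."""
    return f"{scheme}://{host}:{port}{path}"


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text served at the URL (raises on failure)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical host name; replace it with your etcd node address.
    url = metrics_url("http", "etcd-host", 2379)
    print(url)
    # Uncomment to perform the actual request against a reachable node:
    # print(fetch_metrics(url)[:200])
```

If the request fails from the Zabbix server or proxy host, the HTTP agent items of the template are likely to fail in the same way.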

Macros used

{$ETCD.HOST}
  Description: The hostname or IP address of the etcd API endpoint.
  Default: <SET ETCD HOST>

{$ETCD.PORT}
  Description: The port of the etcd API endpoint.
  Default: 2379

{$ETCD.SCHEME}
  Description: The request scheme which may be http or https.
  Default: http

{$ETCD.USER}

{$ETCD.PASSWORD}

{$ETCD.LEADER.CHANGES.MAX.WARN}
  Description: The maximum number of leader changes.
  Default: 5

{$ETCD.PROPOSAL.FAIL.MAX.WARN}
  Description: The maximum number of proposal failures.
  Default: 2

{$ETCD.HTTP.FAIL.MAX.WARN}
  Description: The maximum number of HTTP request failures.
  Default: 2

{$ETCD.PROPOSAL.PENDING.MAX.WARN}
  Description: The maximum number of proposals in queue.
  Default: 5

{$ETCD.OPEN.FDS.MAX.WARN}
  Description: The maximum percentage of used file descriptors.
  Default: 90

{$ETCD.GRPC_CODE.MATCHES}
  Description: The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
  Default: .*

{$ETCD.GRPC_CODE.NOT_MATCHES}
  Description: The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
  Default: CHANGE_IF_NEEDED

{$ETCD.GRPC.ERRORS.MAX.WARN}
  Description: The maximum number of gRPC request failures.
  Default: 1

{$ETCD.GRPC_CODE.TRIGGER.MATCHES}
  Description: The filter of discoverable gRPC codes, which will create triggers.
  Default: Aborted|Unavailable
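
To illustrate how the three gRPC code filter macros interact, here is a small Python sketch. It assumes the filters behave as regular expression "matches" and "does not match" conditions; Python's re module stands in for the regex engine Zabbix uses, so edge cases may differ. The function name and the sample code list are hypothetical.

```python
import re


def discovered_codes(codes, matches=r".*", not_matches=r"CHANGE_IF_NEEDED"):
    """Keep codes that match `matches` and do not match `not_matches`."""
    return [
        code
        for code in codes
        if re.search(matches, code) and not re.search(not_matches, code)
    ]


# Hypothetical list of gRPC codes reported by a node.
codes = ["OK", "Aborted", "Unavailable", "NotFound"]

# Defaults: every code is discovered, because the placeholder excludes nothing.
all_items = discovered_codes(codes)

# Default of {$ETCD.GRPC_CODE.TRIGGER.MATCHES}: only these codes get triggers.
with_triggers = discovered_codes(codes, matches=r"Aborted|Unavailable")
```

With the defaults, items would be created for all four sample codes and triggers only for Aborted and Unavailable.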

Items

Name Description Type Key and additional info
Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Get node metrics HTTP agent etcd.get_metrics
Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m

Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)

Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs received per second

The number of gRPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Get version HTTP agent etcd.get_version
Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)
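
Many of the items above combine two preprocessing steps: a Prometheus pattern that extracts one value from the /metrics text, followed by "Change per second". The Python sketch below imitates both steps for unlabeled metrics only. It is an approximation for illustration, not the Zabbix implementation, and the sample metric text and timestamps are made up.

```python
import re


def prometheus_value(text: str, metric: str) -> float:
    """Return the value of an unlabeled sample such as `etcd_server_has_leader 1`."""
    pattern = re.compile(rf"^{re.escape(metric)}\s+(\S+)", re.MULTILINE)
    match = pattern.search(text)
    if match is None:
        raise KeyError(metric)
    return float(match.group(1))


def change_per_second(prev_value, prev_time, value, time):
    """Rate of a counter between two polls: (value - prev_value) / (time - prev_time)."""
    return (value - prev_value) / (time - prev_time)


# Two hypothetical scrapes taken 60 seconds apart.
scrape_t0 = (
    "# TYPE etcd_server_proposals_committed_total gauge\n"
    "etcd_server_proposals_committed_total 1000\n"
    "etcd_server_has_leader 1\n"
)
scrape_t1 = scrape_t0.replace(" 1000\n", " 1060\n")

has_leader = prometheus_value(scrape_t1, "etcd_server_has_leader")
committed_rate = change_per_second(
    prometheus_value(scrape_t0, "etcd_server_proposals_committed_total"), 0,
    prometheus_value(scrape_t1, "etcd_server_proposals_committed_total"), 60,
)
```

Here the committed counter grows by 60 over 60 seconds, so the rate item would store 1 proposal per second.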

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests to the etcd instance failed with a 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning
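
Two of the trigger expressions above involve a little arithmetic: the leader-change delta over 15 minutes and the percentage of used file descriptors. The Python sketch below restates them over plain lists of collected values. The function names and sample histories are hypothetical; the default thresholds are the macro defaults listed earlier.

```python
def too_many_leader_changes(changes_15m, max_warn=5):
    """max(...) - min(...) over the window exceeds {$ETCD.LEADER.CHANGES.MAX.WARN}."""
    return max(changes_15m) - min(changes_15m) > max_warn


def open_fds_too_high(open_fds_5m, max_fds, max_warn=90):
    """min(open fds over 5m) / max fds * 100 exceeds {$ETCD.OPEN.FDS.MAX.WARN}."""
    return min(open_fds_5m) / max_fds * 100 > max_warn


# Hypothetical histories: the counter moved from 3 to 10 within the window,
# and at least 950 of 1024 descriptors were open for the whole 5 minutes.
leader_alert = too_many_leader_changes([3, 3, 4, 10])
fds_alert = open_fds_too_high([950, 960, 990], max_fds=1024)
```

Using min() over the window means a single spike does not fire the file descriptor trigger; usage has to stay above the threshold for the whole 5 minutes.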

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
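
The JavaScript step of this item prototype is not shown on this page. As a rough idea of what such a step may do, the Python sketch below sums the samples of grpc_server_handled_total that carry a given grpc_code label. It assumes that "Prometheus to JSON" yields a list of objects with "value" and "labels" fields; the sample data and the function name are hypothetical, and the real script in the template may differ.

```python
import json


def sum_by_code(prometheus_json: str, code: str) -> float:
    """Sum the values of all samples whose grpc_code label equals `code`."""
    samples = json.loads(prometheus_json)
    return sum(
        float(sample["value"])
        for sample in samples
        if sample.get("labels", {}).get("grpc_code") == code
    )


# Hypothetical output of the "Prometheus to JSON" step, reduced to two fields.
converted = json.dumps([
    {"value": "10", "labels": {"grpc_code": "OK", "grpc_method": "Range"}},
    {"value": "2", "labels": {"grpc_code": "OK", "grpc_method": "Put"}},
    {"value": "1", "labels": {"grpc_code": "Unavailable", "grpc_method": "Watch"}},
])

ok_total = sum_by_code(converted, "OK")
unavailable_total = sum_by_code(converted, "Unavailable")
```

The following "Change per second" step would then turn such a running total into a per-second rate for each discovered code.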

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
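
For orientation, the Python sketch below shows how peer IDs could be derived from the etcd_network_peer_sent_bytes_total samples that feed the Peers discovery rule. It assumes the converted samples expose the peer ID in a "To" label, as the Bytes sent item prototype suggests; the sample data and the function name are hypothetical.

```python
import json


def discover_peers(prometheus_json: str):
    """Return one discovery row per distinct value of the `To` label."""
    samples = json.loads(prometheus_json)
    peers = sorted(
        {
            sample["labels"]["To"]
            for sample in samples
            if "To" in sample.get("labels", {})
        }
    )
    return [{"{#ETCD.PEER}": peer} for peer in peers]


# Hypothetical converted samples for a node with two peers.
converted = json.dumps([
    {"value": "1024", "labels": {"To": "8e9e05c52164694d"}},
    {"value": "2048", "labels": {"To": "91bc3c398fb3c146"}},
])

rows = discover_peers(converted)
```

Each row would produce one set of the four peer item prototypes listed above.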

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums

This template is for Zabbix version: 7.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/7.2

Etcd by HTTP

Overview

This template monitors etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics with the HTTP agent from the /metrics endpoint.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5 some metrics have been deprecated. For details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 7.2 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.

  2. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to run the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.

  3. Add the template to the etcd node. Set the hostname or IP address of the etcd host in the {$ETCD.HOST} macro. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag.

For more details, see the etcd documentation.

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change macros: {$ETCD.SCHEME} and {$ETCD.PORT}.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template or override them at the host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as the macros set the trigger thresholds.

Macros used

{$ETCD.HOST}
  Description: The hostname or IP address of the etcd API endpoint.
  Default: <SET ETCD HOST>

{$ETCD.PORT}
  Description: The port of the etcd API endpoint.
  Default: 2379

{$ETCD.SCHEME}
  Description: The request scheme which may be http or https.
  Default: http

{$ETCD.USER}

{$ETCD.PASSWORD}

{$ETCD.LEADER.CHANGES.MAX.WARN}
  Description: The maximum number of leader changes.
  Default: 5

{$ETCD.PROPOSAL.FAIL.MAX.WARN}
  Description: The maximum number of proposal failures.
  Default: 2

{$ETCD.HTTP.FAIL.MAX.WARN}
  Description: The maximum number of HTTP request failures.
  Default: 2

{$ETCD.PROPOSAL.PENDING.MAX.WARN}
  Description: The maximum number of proposals in queue.
  Default: 5

{$ETCD.OPEN.FDS.MAX.WARN}
  Description: The maximum percentage of used file descriptors.
  Default: 90

{$ETCD.GRPC_CODE.MATCHES}
  Description: The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
  Default: .*

{$ETCD.GRPC_CODE.NOT_MATCHES}
  Description: The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.
  Default: CHANGE_IF_NEEDED

{$ETCD.GRPC.ERRORS.MAX.WARN}
  Description: The maximum number of gRPC request failures.
  Default: 1

{$ETCD.GRPC_CODE.TRIGGER.MATCHES}
  Description: The filter of discoverable gRPC codes, which will create triggers.
  Default: Aborted|Unavailable

Items

Name Description Type Key and additional info
Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Get node metrics HTTP agent etcd.get_metrics
Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m

Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)

Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs received per second

The number of gRPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Get version HTTP agent etcd.get_version
Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests to the etcd instance failed with a 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Send failures

The number of send failures for the peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures for the peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
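The peer item prototypes above share one shape: a Prometheus pattern step picks a single sample by its peer label, and a custom on-fail handler substitutes 0 when that sample is missing. The sketch below imitates that behavior in Python for the Bytes sent pattern. It is not the Zabbix implementation: the parser is deliberately naive (it assumes no timestamps and no commas inside label values), and the sample text and peer IDs are made up.

```python
import re

# Hypothetical excerpt of a /metrics response.
SAMPLE = """\
# TYPE etcd_network_peer_sent_bytes_total counter
etcd_network_peer_sent_bytes_total{To="8e9e05c52164694d"} 1024
etcd_network_peer_sent_bytes_total{To="91bc3c398fb3c146"} 2048
"""


def peer_value(text, metric, peer_id, default=0.0):
    """Return the value of metric{To="<peer_id>"}, or default if absent."""
    pattern = re.compile(
        r"^" + re.escape(metric) + r"\{([^}]*)\}\s+(\S+)\s*$", re.MULTILINE
    )
    wanted = 'To="' + peer_id + '"'
    for labels, raw in pattern.findall(text):
        # Naive label split; good enough for this illustration only.
        if wanted in [part.strip() for part in labels.split(",")]:
            try:
                return float(raw)
            except ValueError:
                break
    return default  # mirrors "Custom on fail: Set value to: 0"


print(peer_value(SAMPLE, "etcd_network_peer_sent_bytes_total", "91bc3c398fb3c146"))
print(peer_value(SAMPLE, "etcd_network_peer_sent_bytes_total", "unknown-peer"))
```

Falling back to 0 keeps the dependent item supported when a peer has not exchanged any traffic yet, at the cost of hiding the difference between "no traffic" and "no sample".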

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.

This template is for Zabbix version: 7.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/7.0

Etcd by HTTP

Overview

This template is designed for monitoring etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics from the /metrics endpoint using the HTTP agent.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5, some metrics have been deprecated. For more details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 7.0 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.

  2. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.

  3. Add the template to the etcd node. Set the hostname or IP address of the etcd host in the {$ETCD.HOST} macro. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag.

For more details, see the etcd documentation.

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change the {$ETCD.SCHEME} and {$ETCD.PORT} macros.
  • If necessary, you can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros on the template or override them at the host level.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as it will set the trigger values.
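The curl checks in the setup steps can also be scripted. The following is a minimal sketch, assuming plain HTTP access without authentication; the function names are hypothetical, and the defaults simply mirror the {$ETCD.SCHEME} and {$ETCD.PORT} defaults from the macros table.

```python
import urllib.request


def metrics_url(scheme="http", host="localhost", port=2379):
    """Build the metrics URL from the same parts the template macros hold."""
    return f"{scheme}://{host}:{port}/metrics"


def has_etcd_metrics(body):
    """True if at least one sample line starts with the etcd_ prefix."""
    return any(line.startswith("etcd_") for line in body.splitlines())


def check_endpoint(url, timeout=5.0):
    """Fetch the URL and report whether it looks like etcd metrics output.

    Raises urllib.error.URLError if the endpoint cannot be reached.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return has_etcd_metrics(response.read().decode("utf-8", "replace"))
```

Calling check_endpoint(metrics_url(host="<etcd_node_address>")) from the Zabbix server or proxy host gives roughly the same signal as the curl command in step 2. An https endpoint with a private CA would additionally need an SSL context, which this sketch leaves out.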

Macros used

Name Description Default
{$ETCD.HOST}

The hostname or IP address of the etcd API endpoint.

<SET ETCD HOST>
{$ETCD.PORT}

The port of the etcd API endpoint.

2379
{$ETCD.SCHEME}

The request scheme which may be http or https.

http
{$ETCD.USER}
{$ETCD.PASSWORD}
{$ETCD.LEADER.CHANGES.MAX.WARN}

The maximum number of leader changes.

5
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

The maximum number of proposal failures.

2
{$ETCD.HTTP.FAIL.MAX.WARN}

The maximum number of HTTP request failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

The maximum number of proposals in queue.

5
{$ETCD.OPEN.FDS.MAX.WARN}

The maximum percentage of used file descriptors.

90
{$ETCD.GRPC_CODE.MATCHES}

The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC.ERRORS.MAX.WARN}

The maximum number of gRPC request failures.

1
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

The filter of discoverable gRPC codes, which will create triggers.

Aborted|Unavailable
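To make the three gRPC code filter macros more concrete, the sketch below applies their default values to a list of status codes. It assumes unanchored regular-expression matching and assumes that the template combines the filters in the obvious way (discover what matches and is not excluded, create triggers only for the trigger filter); the function names are hypothetical.

```python
import re


def is_discovered(code, matches=r".*", not_matches=r"CHANGE_IF_NEEDED"):
    """Apply {$ETCD.GRPC_CODE.MATCHES} and {$ETCD.GRPC_CODE.NOT_MATCHES}."""
    return (
        re.search(matches, code) is not None
        and re.search(not_matches, code) is None
    )


def gets_trigger(code, trigger_matches=r"Aborted|Unavailable"):
    """Apply {$ETCD.GRPC_CODE.TRIGGER.MATCHES}."""
    return re.search(trigger_matches, code) is not None


codes = ["OK", "Aborted", "NotFound", "Unavailable"]
discovered = [code for code in codes if is_discovered(code)]
with_trigger = [code for code in discovered if gets_trigger(code)]
print(discovered)
print(with_trigger)
```

With the defaults, every code is discovered as an item, while only Aborted and Unavailable would receive a trigger.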

Items

Name Description Type Key and additional info
Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Get node metrics HTTP agent etcd.get_metrics
Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m

Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)

Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs received per second

The number of RPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Get version HTTP agent etcd.get_version
Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)
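Many items in the table above end with a Change per second step, which turns a cumulative Prometheus counter into a rate. A minimal sketch of that calculation follows; the function name and the tuple layout are assumptions made for the example. Returning None when the counter decreases reflects the documented Zabbix behavior of discarding a negative difference and waiting for the next value.

```python
def change_per_second(prev, curr):
    """Compute (value - prev_value) / (time - prev_time).

    prev and curr are (timestamp_seconds, counter_value) pairs.
    Returns None when no rate can be produced.
    """
    prev_time, prev_value = prev
    curr_time, curr_value = curr
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None  # samples out of order or taken at the same instant
    if curr_value < prev_value:
        return None  # counter went backwards, e.g. after an etcd restart
    return (curr_value - prev_value) / elapsed


# Hypothetical samples of etcd_server_proposals_committed_total, 60 s apart.
print(change_per_second((0, 100), (60, 160)))
```

This is why a restart of etcd shows up as a gap in the rate items rather than as a large negative value.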

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests either a high client load or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests to the etcd instance failed with a 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd cluster version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning
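The leader-changes trigger in the table above works on a cumulative counter, so it subtracts the smallest value seen in the last 15 minutes from the largest to get the number of changes in that window. The sketch below shows the same arithmetic; the function name and samples are hypothetical, and the default of 5 is the {$ETCD.LEADER.CHANGES.MAX.WARN} default from the macros table.

```python
def too_many_leader_changes(samples, max_changes=5):
    """Evaluate (max(window) - min(window)) > threshold for a counter window."""
    if not samples:
        return False  # no data collected in the window
    changes_in_window = max(samples) - min(samples)
    return changes_in_window > max_changes


# Hypothetical etcd.leader.changes values collected over 15 minutes.
print(too_many_leader_changes([10, 10, 12, 17]))
print(too_many_leader_changes([10, 12, 15]))
```

Note that the comparison is strict: exactly five changes in the window does not fire the trigger with the default threshold.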

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h
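The JavaScript step of this discovery rule is not shown on this page. A plausible reading is that it reduces the Prometheus to JSON output to one low-level discovery row per distinct gRPC code. The sketch below illustrates that idea only; the input shape (objects with a labels mapping) and the grpc_code label name are assumptions, and the function name is hypothetical.

```python
import json


def discover_grpc_codes(prom_json):
    """Turn a JSON array of samples into LLD rows, one per distinct code."""
    rows = json.loads(prom_json)
    codes = sorted(
        {
            row["labels"]["grpc_code"]
            for row in rows
            if "grpc_code" in (row.get("labels") or {})
        }
    )
    return json.dumps([{"{#GRPC.CODE}": code} for code in codes])


# Hypothetical converted samples of grpc_server_handled_total.
sample = json.dumps(
    [
        {"name": "grpc_server_handled_total", "value": "5",
         "labels": {"grpc_code": "OK", "grpc_method": "Range"}},
        {"name": "grpc_server_handled_total", "value": "1",
         "labels": {"grpc_code": "Unavailable", "grpc_method": "Put"}},
        {"name": "grpc_server_handled_total", "value": "9",
         "labels": {"grpc_code": "OK", "grpc_method": "Put"}},
    ]
)
print(discover_grpc_codes(sample))
```

Deduplicating matters because the same code appears once per method and service, while the item prototypes are keyed by code alone.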

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Send failures

The number of send failures for the peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures for the peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at the Zabbix forums.

This template is for Zabbix version: 6.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/6.4

Etcd by HTTP

Overview

This template is designed for monitoring etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics from the /metrics endpoint using the HTTP agent.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5, some metrics have been deprecated. For more details, see Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 6.4 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

  1. Make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.

  2. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.

  3. Add the template to the etcd node. Set the hostname or IP address of the etcd host in the {$ETCD.HOST} macro. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag.

For more details, see the etcd documentation.

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change the {$ETCD.SCHEME} and {$ETCD.PORT} macros.
  • If necessary, you can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros on the template or override them at the host level.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as it will set the trigger values.

Macros used

Name Description Default
{$ETCD.HOST}

The hostname or IP address of the etcd API endpoint.

<SET ETCD HOST>
{$ETCD.PORT}

The port of the etcd API endpoint.

2379
{$ETCD.SCHEME}

The request scheme which may be http or https.

http
{$ETCD.USER}
{$ETCD.PASSWORD}
{$ETCD.LEADER.CHANGES.MAX.WARN}

The maximum number of leader changes.

5
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

The maximum number of proposal failures.

2
{$ETCD.HTTP.FAIL.MAX.WARN}

The maximum number of HTTP request failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

The maximum number of proposals in queue.

5
{$ETCD.OPEN.FDS.MAX.WARN}

The maximum percentage of used file descriptors.

90
{$ETCD.GRPC_CODE.MATCHES}

The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC.ERRORS.MAX.WARN}

The maximum number of gRPC request failures.

1
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

The filter of discoverable gRPC codes, which will create triggers.

Aborted|Unavailable

Items

Name Description Type Key and additional info
Etcd: Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Etcd: Get node metrics HTTP agent etcd.get_metrics
Etcd: Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Etcd: Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Etcd: Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m

Etcd: Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Etcd: Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Etcd: Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Etcd: Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Etcd: Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)

Etcd: Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
Etcd: HTTP requests received

The number of requests received into the system (successfully parsed and authenticated).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs received per second

The number of RPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Get version HTTP agent etcd.get_version
Etcd: Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Etcd: Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

Etcd: DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Etcd: Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Etcd: Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Etcd: Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Etcd: Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Etcd: Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

Etcd: CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Etcd: Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Etcd: Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
Etcd: PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Etcd: Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Etcd: Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Etcd: Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)
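The Node health item at the top of the table above chains three steps: extract $.health, convert the boolean to a decimal, and fall back to 0 if the conversion fails. The sketch below walks through that chain in Python. The word lists are a reduced, illustrative subset rather than the full set Zabbix accepts, and the response bodies are assumed examples of what the health endpoint returns.

```python
import json

# Reduced, illustrative subsets of the accepted boolean spellings.
TRUE_WORDS = {"true", "t", "yes", "y", "on", "up", "ok"}
FALSE_WORDS = {"false", "f", "no", "n", "off", "down", "err"}


def health_to_decimal(body):
    """Map a health response body to 1 (healthy) or 0 (unhealthy or unknown)."""
    raw = json.loads(body)["health"]   # JSON Path: $.health
    word = str(raw).strip().lower()    # accepts "true" as well as JSON true
    if word in TRUE_WORDS:
        return 1
    if word in FALSE_WORDS:
        return 0
    return 0  # Custom on fail: Set value to: 0


print(health_to_decimal('{"health": "true"}'))
print(health_to_decimal('{"health": "false"}'))
```

As in the template, only the conversion step has a fallback: a body that is not valid JSON or lacks the health field still raises an error here, which corresponds to the item becoming unsupported.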

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{$ETCD.HOST}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests either a high client load or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests failed on etcd instance with the 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning
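
To illustrate how the "Instance has seen too many leader changes" expression above behaves, the following Python sketch reproduces its arithmetic outside Zabbix. The function name and the sample values are hypothetical; only the max-minus-min comparison against the {$ETCD.LEADER.CHANGES.MAX.WARN} threshold comes from the expression itself.

```python
def too_many_leader_changes(samples, max_warn):
    """Return True when the counter grew by more than max_warn within the window.

    samples: values of etcd.leader.changes collected during the last 15 minutes.
    max_warn: the value of {$ETCD.LEADER.CHANGES.MAX.WARN}.
    """
    if not samples:
        # Assumption for this sketch: an empty window never fires.
        return False
    return (max(samples) - min(samples)) > max_warn


if __name__ == "__main__":
    # Hypothetical window in which the counter went from 12 to 19.
    print(too_many_leader_changes([12, 12, 14, 19], 5))
```

Because the item stores a cumulative counter rather than a rate, the difference between the largest and smallest value in the window is the number of leader changes seen during that window.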

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning
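
The JavaScript steps of the gRPC items are not reproduced on this page, so the following Python sketch is only a guess at the kind of aggregation such a step might perform: summing grpc_server_handled_total samples per grpc_code. It assumes the row shape (name, value, labels) produced by the Prometheus to JSON preprocessing step; the function name and the sample rows are hypothetical.

```python
import json
from collections import defaultdict


def sum_by_grpc_code(prometheus_json):
    """Sum sample values per grpc_code label.

    prometheus_json: a JSON array of rows, each assumed to carry a string
    "value" and a "labels" object.
    """
    totals = defaultdict(float)
    for row in json.loads(prometheus_json):
        code = row.get("labels", {}).get("grpc_code")
        if code is not None:
            totals[code] += float(row["value"])
    return dict(totals)


# Hypothetical rows for illustration only.
SAMPLE = json.dumps([
    {"name": "grpc_server_handled_total", "value": "10",
     "labels": {"grpc_code": "OK", "grpc_method": "Range"}},
    {"name": "grpc_server_handled_total", "value": "5",
     "labels": {"grpc_code": "OK", "grpc_method": "Put"}},
    {"name": "grpc_server_handled_total", "value": "2",
     "labels": {"grpc_code": "Unavailable", "grpc_method": "Watch"}},
])

if __name__ == "__main__":
    print(sum_by_grpc_code(SAMPLE))
```

Please see the template for the actual script.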

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
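
The peer item prototypes above share one chain: pick a labelled sample with a Prometheus pattern, fall back to 0 on failure, then apply "Change per second". The Python sketch below imitates that chain on a hypothetical /metrics excerpt. The function names, the peer IDs, and the regular expression are assumptions made for illustration and are not how Zabbix implements these steps internally.

```python
import re


def prometheus_value(text, metric, label, label_value):
    """Return the value of metric{label="label_value"} from exposition text, or None."""
    pattern = re.compile(
        r"^" + re.escape(metric)
        + r"\{[^}]*\b" + re.escape(label) + r'="' + re.escape(label_value) + r'"[^}]*\}'
        + r"\s+(\S+)",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return float(match.group(1)) if match else None


def value_or_zero(value):
    """Mimic "Custom on fail: Set value to: 0"."""
    return 0.0 if value is None else value


def change_per_second(prev_value, prev_ts, value, ts):
    """(value - prev_value) / (time - prev_time), as in "Change per second"."""
    return (value - prev_value) / (ts - prev_ts)


# Hypothetical excerpt; the peer IDs are made up.
SAMPLE = """# TYPE etcd_network_peer_sent_bytes_total counter
etcd_network_peer_sent_bytes_total{To="8e9e05c52164694d"} 1500
etcd_network_peer_sent_bytes_total{To="91bc3c398fb3c146"} 300
"""

if __name__ == "__main__":
    current = value_or_zero(
        prometheus_value(SAMPLE, "etcd_network_peer_sent_bytes_total", "To", "8e9e05c52164694d")
    )
    # Previous sample of 900 bytes taken 60 seconds earlier (hypothetical).
    print(change_per_second(900.0, 0, current, 60))
```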

Feedback

Please report any issues with the template at https://support.zabbix.com.

You can also provide feedback, discuss the template, or ask for help on the Zabbix forums.

This template is for Zabbix version: 6.2

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/6.2

Etcd by HTTP

Overview

For Zabbix version: 6.2 and higher. This template is designed to monitor etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics from the /metrics endpoint using the HTTP agent.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5, some metrics have been deprecated. See more details on Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance, or use an older version of the Etcd by HTTP template.

This template has been tested on:

  • Etcd, version 3.5.6

Setup

See Zabbix template operation for basic instructions.

Follow these instructions:

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.
  3. Check if etcd is accessible from Zabbix proxy or Zabbix server depending on where you are planning to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.
  4. Add the template to each etcd node. By default, the template uses the client port. You can change the location of the metrics endpoint with the --listen-metrics-urls flag (for more details, see the etcd documentation).

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change macros: {$ETCD.SCHEME} and {$ETCD.PORT}.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as the macros set the trigger threshold values.
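
The curl checks in the setup steps can also be scripted. The following Python sketch builds the metrics URL from values corresponding to {$ETCD.SCHEME}, a host address, and {$ETCD.PORT}, and reports whether the endpoint answers with HTTP 200. The function names and the example host name are hypothetical, and the sketch assumes an endpoint that needs no authentication or custom TLS settings.

```python
import urllib.request


def metrics_url(scheme="http", host="localhost", port=2379):
    """Build the /metrics URL from scheme, host and port."""
    return f"{scheme}://{host}:{port}/metrics"


def metrics_reachable(url, timeout=5.0):
    """Return True when the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, as are socket timeouts.
        return False


if __name__ == "__main__":
    # "etcd-node-1" is a placeholder host name.
    print(metrics_url("https", "etcd-node-1", 2379))
```

Run metrics_reachable(metrics_url(...)) from the Zabbix server or proxy that will do the polling, since that is the machine that needs network access to etcd.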

Configuration

No specific Zabbix configuration is required.

Macros used

Name Description Default
{$ETCD.GRPC.ERRORS.MAX.WARN}

The maximum number of gRPC request failures.

1
{$ETCD.GRPC_CODE.MATCHES}

The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

The filter of discoverable gRPC codes, which will create triggers.

Aborted|Unavailable
{$ETCD.HTTP.FAIL.MAX.WARN}

The maximum number of HTTP request failures.

2
{$ETCD.LEADER.CHANGES.MAX.WARN}

The maximum number of leader changes.

5
{$ETCD.OPEN.FDS.MAX.WARN}

The maximum percentage of used file descriptors.

90
{$ETCD.PASSWORD}

-

``
{$ETCD.PORT}

The port of etcd API endpoint.

2379
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

The maximum number of proposal failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

The maximum number of proposals in queue.

5
{$ETCD.SCHEME}

The request scheme which may be http or https.

http
{$ETCD.USER}

-

``

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
gRPC codes discovery

-

DEPENDENT etcd.grpc_code.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Filter:

AND

- {#GRPC.CODE} NOT_MATCHES_REGEX {$ETCD.GRPC_CODE.NOT_MATCHES}

- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.MATCHES}

Overrides:

trigger
- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.TRIGGER.MATCHES}
- TRIGGER_PROTOTYPE LIKE Too many failed gRPC requests
- DISCOVER

Peers discovery

-

DEPENDENT etcd.peer.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_network_peer_sent_bytes_total
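
The filter and override of the gRPC codes discovery rule above can be read as two regular-expression checks. The Python sketch below imitates them with the macro defaults listed on this page (.* and CHANGE_IF_NEEDED for the filter, Aborted|Unavailable for the trigger override, as given in the 6.0 macro table). The function names are hypothetical, and the sketch assumes that the trigger prototype is created only for codes matched by the override.

```python
import re


def is_discovered(code, matches=r".*", not_matches=r"CHANGE_IF_NEEDED"):
    """AND filter: the code must match MATCHES and must not match NOT_MATCHES."""
    return (re.search(matches, code) is not None
            and re.search(not_matches, code) is None)


def creates_trigger(code, trigger_matches=r"Aborted|Unavailable"):
    """Override: only codes matching TRIGGER.MATCHES get the trigger prototype."""
    return re.search(trigger_matches, code) is not None


if __name__ == "__main__":
    for code in ("OK", "Aborted", "Unavailable", "NotFound"):
        print(code, is_discovered(code), creates_trigger(code))
```

With these defaults every code becomes an item, while triggers are limited to Aborted and Unavailable.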

Items collected

Group Name Description Type Key and additional info
Etcd Etcd: Service's TCP port state

-

SIMPLE net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Node health

-

HTTP_AGENT etcd.health

Preprocessing:

- JSONPATH: $.health

- BOOL_TO_DECIMAL

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

DEPENDENT etcd.is.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_is_leader

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

DEPENDENT etcd.has.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_has_leader

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Leader changes

The number of leader changes the member has seen since its start.

DEPENDENT etcd.leader.changes

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_leader_changes_seen_total

Etcd Etcd: Proposals committed per second

The number of consensus proposals committed.

DEPENDENT etcd.proposals.committed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_committed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals applied per second

The number of consensus proposals applied.

DEPENDENT etcd.proposals.applied.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_applied_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals failed per second

The number of failed proposals seen.

DEPENDENT etcd.proposals.failed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_failed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals pending

The current number of pending proposals to commit.

DEPENDENT etcd.proposals.pending

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_pending

Etcd Etcd: Reads per second

The number of read actions by get/getRecursive, local to this member.

DEPENDENT etcd.reads.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_reads_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

DEPENDENT etcd.writes.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_writes_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

DEPENDENT etcd.network.grpc.received.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_received_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

DEPENDENT etcd.network.grpc.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_sent_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: HTTP requests received

The number of requests received into the system (successfully parsed and authd).

DEPENDENT etcd.http.requests.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

DEPENDENT etcd.http.requests.5xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"5.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

DEPENDENT etcd.http.requests.4xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"4.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs received per second

The number of RPC stream messages received on the server.

DEPENDENT etcd.grpc.received.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

DEPENDENT etcd.grpc.sent.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_sent_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs started per second

The number of RPCs started on the server.

DEPENDENT etcd.grpc.started.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_started_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Server version

The version of the etcd server.

DEPENDENT etcd.server.version

Preprocessing:

- JSONPATH: $.etcdserver

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: Cluster version

The version of the etcd cluster.

DEPENDENT etcd.cluster.version

Preprocessing:

- JSONPATH: $.etcdcluster

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: DB size

The total size of the underlying database.

DEPENDENT etcd.db.size

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_db_total_size_in_bytes

Etcd Etcd: Keys compacted per second

The number of DB keys compacted per second.

DEPENDENT etcd.keys.compacted.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_compaction_keys_total

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Keys expired per second

The number of expired keys per second.

DEPENDENT etcd.keys.expired.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_store_expires_total

- CHANGE_PER_SECOND

Etcd Etcd: Keys total

The total number of keys.

DEPENDENT etcd.keys.total

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_keys_total

Etcd Etcd: Uptime

Etcd server uptime.

DEPENDENT etcd.uptime

Preprocessing:

- PROMETHEUS_PATTERN: process_start_time_seconds

- JAVASCRIPT: return (Math.floor(Date.now()/1000)-Number(value)); // use boottime to calculate uptime

Etcd Etcd: Virtual memory

The size of virtual memory expressed in bytes.

DEPENDENT etcd.virtual.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_virtual_memory_bytes

Etcd Etcd: Resident memory

The size of resident memory expressed in bytes.

DEPENDENT etcd.res.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_resident_memory_bytes

Etcd Etcd: CPU

The total user and system CPU time spent in seconds.

DEPENDENT etcd.cpu.util

Preprocessing:

- PROMETHEUS_PATTERN: process_cpu_seconds_total

- CHANGE_PER_SECOND

Etcd Etcd: Open file descriptors

The number of open file descriptors.

DEPENDENT etcd.open.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_open_fds

Etcd Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

DEPENDENT etcd.max.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_max_fds

Etcd Etcd: Deletes per second

The number of deletes seen by this member per second.

DEPENDENT etcd.delete.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_delete_total

- CHANGE_PER_SECOND

Etcd Etcd: PUT per second

The number of puts seen by this member per second.

DEPENDENT etcd.put.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_mvcc_put_total

- CHANGE_PER_SECOND

Etcd Etcd: Range per second

The number of ranges seen by this member per second.

DEPENDENT etcd.range.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Transaction per second

The number of transactions seen by this member per second.

DEPENDENT etcd.txn.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Pending events

The total number of pending events to be sent.

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_pending_events_total

Etcd Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

DEPENDENT etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_bytes_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

DEPENDENT etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Zabbix raw items Etcd: Get node metrics

-

HTTP_AGENT etcd.get_metrics
Zabbix raw items Etcd: Get version

-

HTTP_AGENT etcd.get_version
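
The Etcd: Uptime item listed above derives uptime from process_start_time_seconds by subtracting it from the current Unix time. The following Python sketch mirrors that calculation and the 10-minute comparison used by the "Host has been restarted" trigger. The function names and sample timestamps are hypothetical.

```python
import math
import time


def uptime_seconds(process_start_time, now=None):
    """Current Unix time (floored to whole seconds) minus the process start time."""
    if now is None:
        now = time.time()
    return math.floor(now) - float(process_start_time)


def restarted_recently(uptime, limit_seconds=600):
    """True when uptime is below 10 minutes."""
    return uptime < limit_seconds


if __name__ == "__main__":
    # Hypothetical start time five minutes before the supplied "now".
    uptime = uptime_seconds("1700000000", now=1700000300.9)
    print(uptime, restarted_recently(uptime))
```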

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable

-

last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0 AVERAGE

Manual close: YES

Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 AVERAGE

Depends on:

- Etcd: Service is unavailable

Etcd: Failed to fetch info data

Zabbix has not received data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 WARNING

Manual close: YES

Depends on:

- Etcd: Service is unavailable

Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 AVERAGE
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} WARNING
Etcd: Too many proposal failures

Proposal failures are normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} WARNING
Etcd: Too many proposals are queued to commit

A rising number of pending proposals suggests either a high client load or that the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} WARNING
Etcd: Too many HTTP requests failures

Too many requests failed on etcd instance with the 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} WARNING
Etcd: Server version has changed

The Etcd version has changed. Acknowledge to close manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 INFO

Manual close: YES

Etcd: Cluster version has changed

The Etcd version has changed. Acknowledge to close manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 INFO

Manual close: YES

Etcd: Host has been restarted

The host uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m INFO

Manual close: YES

Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.

If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} WARNING
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE}

-

min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} WARNING
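
The "Current number of open files is too high" expression above compares a percentage with {$ETCD.OPEN.FDS.MAX.WARN} (default 90). The Python sketch below reproduces that arithmetic. The function name and sample values are hypothetical, and returning False for an empty window or a non-positive limit is an assumption of the sketch.

```python
def open_fds_too_high(open_fds_samples, max_fds, max_warn_percent=90):
    """min(open fds over the window) / max fds * 100 > threshold."""
    if not open_fds_samples or max_fds <= 0:
        return False
    return min(open_fds_samples) / max_fds * 100 > max_warn_percent


if __name__ == "__main__":
    # Hypothetical 5-minute window against a limit of 1024 descriptors.
    print(open_fds_too_high([950, 960, 990], 1024))
    print(open_fds_too_high([950, 800, 990], 1024))
```

Using the minimum over the window means a single sample below the threshold keeps the trigger from firing, so only sustained high usage raises a problem.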

Feedback

Please report any issues with the template at https://support.zabbix.com.

This template is for Zabbix version: 6.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/6.0

Etcd by HTTP

Overview

This template is designed to monitor etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

The Etcd by HTTP template collects metrics from the /metrics endpoint using the HTTP agent.

Refer to the vendor documentation.

For users of etcd version <= 3.4!

In etcd v3.5, some metrics have been deprecated. See more details on Upgrade etcd from 3.4 to 3.5. Please upgrade your etcd instance, or use an older version of the Etcd by HTTP template.

Requirements

Zabbix version: 6.0 and higher.

Tested versions

This template has been tested on:

  • Etcd 3.5.6

Configuration

Zabbix should be configured according to the instructions in the Templates out of the box section.

Setup

Follow these instructions:

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows the collection of metrics. You can test it by running: curl -L http://localhost:2379/metrics.
  3. Check if etcd is accessible from Zabbix proxy or Zabbix server depending on where you are planning to do the monitoring. To verify it, run curl -L http://<etcd_node_address>:2379/metrics.
  4. Add the template to each etcd node. By default, the template uses the client port. You can change the location of the metrics endpoint with the --listen-metrics-urls flag (for more details, see the etcd documentation).

Additional points to consider:

  • If you have specified a non-standard port for etcd, don't forget to change macros: {$ETCD.SCHEME} and {$ETCD.PORT}.
  • You can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level if necessary.
  • To test availability, run: zabbix_get -s etcd-host -k etcd.health.
  • See the macros section, as the macros set the trigger threshold values.

Macros used

Name Description Default
{$ETCD.PORT}

The port of etcd API endpoint.

2379
{$ETCD.SCHEME}

The request scheme which may be http or https.

http
{$ETCD.USER}
{$ETCD.PASSWORD}
{$ETCD.LEADER.CHANGES.MAX.WARN}

The maximum number of leader changes.

5
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

The maximum number of proposal failures.

2
{$ETCD.HTTP.FAIL.MAX.WARN}

The maximum number of HTTP request failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

The maximum number of proposals in queue.

5
{$ETCD.OPEN.FDS.MAX.WARN}

The maximum percentage of used file descriptors.

90
{$ETCD.GRPC_CODE.MATCHES}

The filter of discoverable gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

The filter to exclude discovered gRPC codes. See more details on https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC.ERRORS.MAX.WARN}

The maximum number of gRPC request failures.

1
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

The filter of discoverable gRPC codes, which will create triggers.

Aborted|Unavailable

Items

Name Description Type Key and additional info
Etcd: Service's TCP port state Simple check net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing

  • Discard unchanged with heartbeat: 10m

Etcd: Get node metrics HTTP agent etcd.get_metrics
Etcd: Node health HTTP agent etcd.health

Preprocessing

  • JSON Path: $.health

  • Boolean to decimal

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Etcd: Server is a leader

Indicates whether this member is a leader:

1 - it is;

0 - otherwise.

Dependent item etcd.is.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_is_leader)

    ⛔️Custom on fail: Set value to: 0

  • Discard unchanged with heartbeat: 10m

Etcd: Server has a leader

Indicates whether a leader exists:

1 - it exists;

0 - it does not.

Dependent item etcd.has.leader

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_has_leader)

  • Discard unchanged with heartbeat: 10m

Etcd: Leader changes

The number of leader changes the member has seen since its start.

Dependent item etcd.leader.changes

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_leader_changes_seen_total)

Etcd: Proposals committed per second

The number of consensus proposals committed.

Dependent item etcd.proposals.committed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_committed_total)

  • Change per second
Etcd: Proposals applied per second

The number of consensus proposals applied.

Dependent item etcd.proposals.applied.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_applied_total)

  • Change per second
Etcd: Proposals failed per second

The number of failed proposals seen.

Dependent item etcd.proposals.failed.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_failed_total)

  • Change per second
Etcd: Proposals pending

The current number of pending proposals to commit.

Dependent item etcd.proposals.pending

Preprocessing

  • Prometheus pattern: VALUE(etcd_server_proposals_pending)

Etcd: Reads per second

The number of read actions by get/getRecursive, local to this member.

Dependent item etcd.reads.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_reads_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Writes per second

The number of writes (e.g., set/compareAndDelete) seen by this member.

Dependent item etcd.writes.rate

Preprocessing

  • Prometheus to JSON: etcd_debugging_store_writes_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

Dependent item etcd.network.grpc.received.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_received_bytes_total)

  • Change per second
Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

Dependent item etcd.network.grpc.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_client_grpc_sent_bytes_total)

  • Change per second
Etcd: HTTP requests received

The number of requests received into the system (successfully parsed and authd).

Dependent item etcd.http.requests.rate

Preprocessing

  • Prometheus to JSON: etcd_http_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: HTTP 5XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 5XX.

Dependent item etcd.http.requests.5xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"5.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: HTTP 4XX

The number of handled failures of requests (non-watches), by the method (GET/PUT etc.), and the code 4XX.

Dependent item etcd.http.requests.4xx.rate

Preprocessing

  • Prometheus to JSON: etcd_http_failed_total{code=~"4.+"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs received per second

The number of RPC stream messages received on the server.

Dependent item etcd.grpc.received.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_received_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

Dependent item etcd.grpc.sent.rate

Preprocessing

  • Prometheus to JSON: grpc_server_msg_sent_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: RPCs started per second

The number of RPCs started on the server.

Dependent item etcd.grpc.started.rate

Preprocessing

  • Prometheus to JSON: grpc_server_started_total

  • JavaScript: The text is too long. Please see the template.

  • Change per second
Etcd: Get version HTTP agent etcd.get_version
Etcd: Server version

The version of the etcd server.

Dependent item etcd.server.version

Preprocessing

  • JSON Path: $.etcdserver

  • Discard unchanged with heartbeat: 1d

Etcd: Cluster version

The version of the etcd cluster.

Dependent item etcd.cluster.version

Preprocessing

  • JSON Path: $.etcdcluster

  • Discard unchanged with heartbeat: 1d

Etcd: DB size

The total size of the underlying database.

Dependent item etcd.db.size

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_db_total_size_in_bytes)

Etcd: Keys compacted per second

The number of DB keys compacted per second.

Dependent item etcd.keys.compacted.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_db_compaction_keys_total)

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Keys expired per second

The number of expired keys per second.

Dependent item etcd.keys.expired.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_store_expires_total)

  • Change per second
Etcd: Keys total

The total number of keys.

Dependent item etcd.keys.total

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_keys_total)

Etcd: Uptime

Etcd server uptime.

Dependent item etcd.uptime

Preprocessing

  • Prometheus pattern: VALUE(process_start_time_seconds)

  • JavaScript: The text is too long. Please see the template.

Etcd: Virtual memory

The size of virtual memory expressed in bytes.

Dependent item etcd.virtual.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_virtual_memory_bytes)

Etcd: Resident memory

The size of resident memory expressed in bytes.

Dependent item etcd.res.bytes

Preprocessing

  • Prometheus pattern: VALUE(process_resident_memory_bytes)

Etcd: CPU

The total user and system CPU time spent in seconds.

Dependent item etcd.cpu.util

Preprocessing

  • Prometheus pattern: VALUE(process_cpu_seconds_total)

  • Change per second
Etcd: Open file descriptors

The number of open file descriptors.

Dependent item etcd.open.fds

Preprocessing

  • Prometheus pattern: VALUE(process_open_fds)

Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

Dependent item etcd.max.fds

Preprocessing

  • Prometheus pattern: VALUE(process_max_fds)

Etcd: Deletes per second

The number of deletes seen by this member per second.

Dependent item etcd.delete.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_delete_total)

  • Change per second
Etcd: PUT per second

The number of puts seen by this member per second.

Dependent item etcd.put.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_mvcc_put_total)

  • Change per second
Etcd: Range per second

The number of ranges seen by this member per second.

Dependent item etcd.range.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Etcd: Transaction per second

The number of transactions seen by this member per second.

Dependent item etcd.txn.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_range_total)

  • Change per second
Etcd: Pending events

The total number of pending events to be sent.

Dependent item etcd.events.sent.rate

Preprocessing

  • Prometheus pattern: VALUE(etcd_debugging_mvcc_pending_events_total)
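
The Etcd: Node health item earlier in this table chains three steps: JSON Path $.health, Boolean to decimal, and a custom failure value of 0. The Python sketch below imitates that chain. It assumes a response body of the form {"health":"true"}; the function name and the sample bodies are hypothetical.

```python
import json


def node_health(body):
    """Return 1 for a healthy node, 0 for an unhealthy node or any parsing failure."""
    try:
        value = json.loads(body)["health"]  # JSON Path: $.health
    except (ValueError, KeyError, TypeError):
        return 0  # Custom on fail: Set value to: 0
    # Boolean to decimal; anything that is not recognisably true becomes 0 here.
    return 1 if str(value).strip().lower() == "true" else 0


if __name__ == "__main__":
    for body in ('{"health":"true"}', '{"health":"false"}', "not json"):
        print(body, node_health(body))
```

Mapping failures to 0 means an unreachable or malformed health response is treated the same as an unhealthy node by the "Node healthcheck failed" trigger.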

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0 Average Manual close: Yes
Etcd: Node healthcheck failed

See more details on https://etcd.io/docs/v3.5/op-guide/monitoring/#health-check.

last(/Etcd by HTTP/etcd.health)=0 Average Depends on:
  • Etcd: Service is unavailable
Etcd: Failed to fetch info data

Zabbix has not received any data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 Warning Manual close: Yes
Depends on:
  • Etcd: Service is unavailable
Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 Average
Etcd: Instance has seen too many leader changes

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} Warning
Etcd: Too many proposal failures

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} Warning
Etcd: Too many proposals are queued to commit

Rising pending proposals suggests there is a high client load, or the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} Warning
Etcd: Too many HTTP requests failures

Too many requests failed on etcd instance with the 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} Warning
Etcd: Server version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 Info Manual close: Yes
Etcd: Cluster version has changed

Etcd version has changed. Acknowledge to close the problem manually.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 Info Manual close: Yes
Etcd: Host has been restarted

Uptime is less than 10 minutes.

last(/Etcd by HTTP/etcd.uptime)<10m Info Manual close: Yes
Etcd: Current number of open files is too high

Heavy file descriptor usage (i.e., close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.
If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} Warning
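As a rough illustration of the open files expression above, the sketch below evaluates the same arithmetic on plain Python lists. The function name and the sample numbers are hypothetical, and the threshold of 90 is only an assumed value for {$ETCD.OPEN.FDS.MAX.WARN}.

```python
def fd_usage_trigger(open_fds_window, max_fds, threshold_pct=90):
    """True when min(open fds in window) / last(max fds) * 100 > threshold."""
    usage_pct = min(open_fds_window) / max_fds * 100
    return usage_pct > threshold_pct


# Every sample in the window is near the limit, so the trigger fires.
print(fd_usage_trigger([950, 960, 940], 1024))  # True
# One low sample in the window is enough to keep the trigger quiet.
print(fd_usage_trigger([950, 100, 940], 1024))  # False
```

Using min() over the window means the problem is raised only when usage stays above the threshold for the whole 5 minutes, not on a single spike.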

LLD rule gRPC codes discovery

Name Description Type Key and additional info
gRPC codes discovery Dependent item etcd.grpc_code.discovery

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total

  • JavaScript: The text is too long. Please see the template.

  • Discard unchanged with heartbeat: 1h
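The JavaScript step of this rule is not reproduced on this page. The sketch below is therefore only an assumption of what such a step might do: it takes rows shaped like the output of a Prometheus to JSON step (objects with a labels map) and emits one low-level discovery row per distinct grpc_code label. The function name and the sample rows are hypothetical.

```python
import json


def discover_grpc_codes(prom_json):
    """Build LLD rows with one {#GRPC.CODE} entry per distinct grpc_code label."""
    rows = json.loads(prom_json)
    codes = sorted({
        row["labels"]["grpc_code"]
        for row in rows
        if "grpc_code" in row.get("labels", {})
    })
    return json.dumps([{"{#GRPC.CODE}": code} for code in codes])


# Made-up rows for grpc_server_handled_total; real output has more fields.
SAMPLE = json.dumps([
    {"name": "grpc_server_handled_total", "value": "12",
     "labels": {"grpc_code": "OK", "grpc_method": "Range"}},
    {"name": "grpc_server_handled_total", "value": "3",
     "labels": {"grpc_code": "OK", "grpc_method": "Put"}},
    {"name": "grpc_server_handled_total", "value": "1",
     "labels": {"grpc_code": "Unavailable", "grpc_method": "Watch"}},
])

lld_rows = discover_grpc_codes(SAMPLE)
print(lld_rows)
```

Refer to the template itself for the actual script.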

Item prototypes for gRPC codes discovery

Name Description Type Key and additional info
Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

Dependent item etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing

  • Prometheus to JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

  • JavaScript: The text is too long. Please see the template.

  • Change per second

Trigger prototypes for gRPC codes discovery

Name Description Expression Severity Dependencies and additional info
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} Warning

LLD rule Peers discovery

Name Description Type Key and additional info
Peers discovery Dependent item etcd.peer.discovery

Preprocessing

  • Prometheus to JSON: etcd_network_peer_sent_bytes_total

Item prototypes for Peers discovery

Name Description Type Key and additional info
Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: VALUE(etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"})

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from a peer with the ID {#ETCD.PEER}.

Dependent item etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing

  • Prometheus pattern: The text is too long. Please see the template.

    ⛔️Custom on fail: Set value to: 0

  • Change per second
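The peer items above select one labeled series and fall back to 0 when the series is missing (Custom on fail: Set value to: 0). A minimal sketch of that behaviour follows; the helper name, the peer ID, and the sample payload are made up, and the label parsing is deliberately simplified.

```python
import re


def labeled_value(exposition, metric, label, label_value, on_fail=None):
    """Return the value of metric{label="label_value"}, or on_fail if absent."""
    pattern = re.compile("^" + re.escape(metric) + r"\{([^}]*)\}\s+(\S+)")
    for line in exposition.splitlines():
        match = pattern.match(line)
        if not match:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(1)))
        if labels.get(label) == label_value:
            return float(match.group(2))
    if on_fail is not None:
        return on_fail  # mimics "Custom on fail: Set value to"
    raise KeyError(metric)


PEER = "8e9e05c52164694d"  # hypothetical peer ID
SAMPLE = 'etcd_network_peer_sent_bytes_total{To="%s"} 4096\n' % PEER

print(labeled_value(SAMPLE, "etcd_network_peer_sent_bytes_total", "To", PEER))
print(labeled_value(SAMPLE, "etcd_network_peer_sent_failures_total", "To", PEER, on_fail=0.0))
```

A failure counter is typically absent until the first failure occurs, which is one plausible reason for the fallback value.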

Feedback

Please report any issues with the template at https://support.zabbix.com

You can also provide feedback, discuss the template, or ask for help at ZABBIX forums

This template is for Zabbix version: 5.4

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/5.4

Etcd by HTTP

Overview

For Zabbix version: 5.4 and higher
This template monitors Etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template Etcd collects metrics with the HTTP agent from the /metrics endpoint. See https://etcd.io/docs/v3.4.0/op-guide/monitoring/#metrics-endpoint.

This template was tested on:

  • Etcd, version 3.0+

Setup

See Zabbix template operation for basic instructions.

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows metric collection. Test by running: curl -L http://localhost:2379/metrics
  3. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify, run curl -L http://<etcd_node_address>:2379/metrics
  4. Add the template to each node with etcd. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag (see the etcd docs).

If you have specified a non-standard port for etcd, don't forget to change the {$ETCD.SCHEME} and {$ETCD.PORT} macros.

If needed, you can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level.

Test availability: zabbix_get -s etcd-host -k etcd.health

Also see the Macros section, since the macros set the trigger threshold values.
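The curl checks in the setup steps can also be scripted. The sketch below is one possible way to do it with the Python standard library; the function names are hypothetical, and the defaults mirror the scheme, host, and port used in the curl examples.

```python
from urllib.request import urlopen


def metrics_url(scheme="http", host="localhost", port=2379):
    """Build the metrics endpoint URL from scheme, host, and port."""
    return f"{scheme}://{host}:{port}/metrics"


def metrics_reachable(url, timeout=5.0):
    """True if the endpoint answers with HTTP 200 (redirects are followed)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


print(metrics_url())  # http://localhost:2379/metrics
```

Call metrics_reachable(metrics_url(host="<etcd_node_address>")) from the Zabbix server or proxy host to check the same path that the template will use.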

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name Description Default
{$ETCD.GRPC.ERRORS.MAX.WARN}

Maximum number of gRPC requests failures.

1
{$ETCD.GRPC_CODE.MATCHES}

Filter of discoverable gRPC codes https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

Filter to exclude discovered gRPC codes https://github.com/grpc/grpc/blob/master/doc/statuscodes.md.

CHANGE_IF_NEEDED
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

Filter of discoverable gRPC codes for which triggers will be created.

`Aborted
{$ETCD.HTTP.FAIL.MAX.WARN}

Maximum number of HTTP requests failures.

2
{$ETCD.LEADER.CHANGES.MAX.WARN}

Maximum number of leader changes.

5
{$ETCD.OPEN.FDS.MAX.WARN}

Maximum percentage of used file descriptors.

90
{$ETCD.PASSWORD}

-

``
{$ETCD.PORT}

The port of Etcd API endpoint.

2379
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

Maximum number of proposal failures.

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

Maximum number of proposals in queue.

5
{$ETCD.SCHEME}

Request scheme which may be http or https.

http
{$ETCD.USER}

-

``

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
gRPC codes discovery

-

DEPENDENT etcd.grpc_code.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Filter:

AND

- {#GRPC.CODE} NOT_MATCHES_REGEX {$ETCD.GRPC_CODE.NOT_MATCHES}

- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.MATCHES}

Overrides:

trigger
- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.TRIGGER.MATCHES}
- TRIGGER_PROTOTYPE LIKE Too many failed gRPC requests - DISCOVER

Peers discovery

-

DEPENDENT etcd.peer.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_network_peer_sent_bytes_total
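The filter and override of the gRPC codes discovery rule can be read as two regular-expression checks. The sketch below illustrates that logic under stated assumptions: the function names are hypothetical, the list of codes is made up, and the trigger regex passed in is an illustrative value rather than the template default.

```python
import re


def keep_code(code, matches=".*", not_matches="CHANGE_IF_NEEDED"):
    """Filter: the code must match MATCHES and must not match NOT_MATCHES."""
    return (re.search(matches, code) is not None
            and re.search(not_matches, code) is None)


def creates_trigger(code, trigger_matches):
    """Override: the trigger prototype is discovered only for matching codes."""
    return re.search(trigger_matches, code) is not None


codes = ["OK", "Aborted", "NotFound"]  # made-up discovered codes
discovered = [c for c in codes if keep_code(c)]
with_triggers = [c for c in discovered if creates_trigger(c, "Aborted")]

print(discovered)     # ['OK', 'Aborted', 'NotFound']
print(with_triggers)  # ['Aborted']
```

With the default values, every code gets an item, while triggers are limited to the codes selected by {$ETCD.GRPC_CODE.TRIGGER.MATCHES}.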

Items collected

Group Name Description Type Key and additional info
Etcd Etcd: Service's TCP port state

-

SIMPLE net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Node health

-

HTTP_AGENT etcd.health

Preprocessing:

- JSONPATH: $.health

- BOOL_TO_DECIMAL

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server is a leader

Whether or not this member is a leader. 1 if it is, 0 otherwise.

DEPENDENT etcd.is.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_is_leader

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server has a leader

Whether or not a leader exists. 1 if a leader exists, 0 if not.

DEPENDENT etcd.has.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_has_leader

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Leader changes

The number of leader changes the member has seen since its start.

DEPENDENT etcd.leader.changes

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_leader_changes_seen_total

Etcd Etcd: Proposals committed per second

The number of consensus proposals committed.

DEPENDENT etcd.proposals.committed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_committed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals applied per second

The number of consensus proposals applied.

DEPENDENT etcd.proposals.applied.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_applied_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals failed per second

The number of failed proposals seen.

DEPENDENT etcd.proposals.failed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_failed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals pending

The current number of pending proposals to commit.

DEPENDENT etcd.proposals.pending

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_pending

Etcd Etcd: Reads per second

Number of read actions (get/getRecursive), local to this member.

DEPENDENT etcd.reads.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_reads_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Writes per second

Number of writes (e.g. set/compareAndDelete) seen by this member.

DEPENDENT etcd.writes.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_writes_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

DEPENDENT etcd.network.grpc.received.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_received_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

DEPENDENT etcd.network.grpc.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_sent_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: HTTP requests received

Number of requests received into the system (successfully parsed and authd).

DEPENDENT etcd.http.requests.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 5XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 5XX.

DEPENDENT etcd.http.requests.5xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"5.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 4XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 4XX.

DEPENDENT etcd.http.requests.4xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"4.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs received per second

The number of RPC stream messages received on the server.

DEPENDENT etcd.grpc.received.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

DEPENDENT etcd.grpc.sent.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_sent_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs started per second

The number of RPCs started on the server.

DEPENDENT etcd.grpc.started.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_started_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Server version

Version of the Etcd server.

DEPENDENT etcd.server.version

Preprocessing:

- JSONPATH: $.etcdserver

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: Cluster version

Version of the Etcd cluster.

DEPENDENT etcd.cluster.version

Preprocessing:

- JSONPATH: $.etcdcluster

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: DB size

Total size of the underlying database.

DEPENDENT etcd.db.size

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_total_size_in_bytes

Etcd Etcd: Keys compacted per second

The number of DB keys compacted per second.

DEPENDENT etcd.keys.compacted.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_compaction_keys_total

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Keys expired per second

The number of expired keys per second.

DEPENDENT etcd.keys.expired.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_store_expires_total

- CHANGE_PER_SECOND

Etcd Etcd: Keys total

Total number of keys.

DEPENDENT etcd.keys.total

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_keys_total

Etcd Etcd: Uptime

Etcd server uptime.

DEPENDENT etcd.uptime

Preprocessing:

- PROMETHEUS_PATTERN: process_start_time_seconds

- JAVASCRIPT: return (Math.floor(Date.now()/1000)-Number(value)); // use boottime to calculate uptime

Etcd Etcd: Virtual memory

Virtual memory size in bytes.

DEPENDENT etcd.virtual.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_virtual_memory_bytes

Etcd Etcd: Resident memory

Resident memory size in bytes.

DEPENDENT etcd.res.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_resident_memory_bytes

Etcd Etcd: CPU

Total user and system CPU time spent in seconds.

DEPENDENT etcd.cpu.util

Preprocessing:

- PROMETHEUS_PATTERN: process_cpu_seconds_total

- CHANGE_PER_SECOND

Etcd Etcd: Open file descriptors

Number of open file descriptors.

DEPENDENT etcd.open.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_open_fds

Etcd Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

DEPENDENT etcd.max.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_max_fds

Etcd Etcd: Deletes per second

The number of deletes seen by this member per second.

DEPENDENT etcd.delete.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_delete_total

- CHANGE_PER_SECOND

Etcd Etcd: PUT per second

The number of puts seen by this member per second.

DEPENDENT etcd.put.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_put_total

- CHANGE_PER_SECOND

Etcd Etcd: Range per second

The number of ranges seen by this member per second.

DEPENDENT etcd.range.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Transaction per second

The number of transactions seen by this member per second.

DEPENDENT etcd.txn.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Pending events

Total number of pending events to be sent.

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_pending_events_total

Etcd Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

DEPENDENT etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to peer with ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from peer with ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_bytes_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from peer with ID {#ETCD.PEER}.

DEPENDENT etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from the peer with ID {#ETCD.PEER}.

DEPENDENT etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Zabbix_raw_items Etcd: Get node metrics

-

HTTP_AGENT etcd.get_metrics
Zabbix_raw_items Etcd: Get version

-

HTTP_AGENT etcd.get_version
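The Node health item chains a JSONPath step, a boolean-to-decimal step, and a custom value of 0 on failure. A simplified sketch of that chain is shown below. The function name is hypothetical, the sample payloads are assumptions about the shape of the health response, and only the literal "true" is treated as healthy here.

```python
import json


def node_health(body):
    """Return 1 if the payload reports health "true", otherwise 0."""
    try:
        raw = json.loads(body)["health"]  # JSONPath: $.health
    except (ValueError, KeyError, TypeError):
        return 0  # on fail: custom value 0
    # Simplified boolean-to-decimal conversion.
    return 1 if str(raw).strip().lower() == "true" else 0


print(node_health('{"health": "true"}'))   # 1
print(node_health('{"health": "false"}'))  # 0
print(node_health("connection refused"))   # 0
```

Because failures map to 0, an unparsable response and an unhealthy node both satisfy the `last(/Etcd by HTTP/etcd.health)=0` trigger expression.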

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable

-

last(/Etcd by HTTP/net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"])=0 AVERAGE

Manual close: YES

Etcd: Node healthcheck failed

https://etcd.io/docs/v3.4.0/op-guide/monitoring/#health-check

last(/Etcd by HTTP/etcd.health)=0 AVERAGE

Depends on:

- Etcd: Service is unavailable

Etcd: Failed to fetch info data (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes.

nodata(/Etcd by HTTP/etcd.is.leader,30m)=1 WARNING

Manual close: YES

Depends on:

- Etcd: Service is unavailable

Etcd: Member has no leader

If a member does not have a leader, it is totally unavailable.

last(/Etcd by HTTP/etcd.has.leader)=0 AVERAGE
Etcd: Instance has seen too many leader changes (over {$ETCD.LEADER.CHANGES.MAX.WARN} for 15m)

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

(max(/Etcd by HTTP/etcd.leader.changes,15m)-min(/Etcd by HTTP/etcd.leader.changes,15m))>{$ETCD.LEADER.CHANGES.MAX.WARN} WARNING
Etcd: Too many proposal failures (over {$ETCD.PROPOSAL.FAIL.MAX.WARN} for 5m)

Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster.

min(/Etcd by HTTP/etcd.proposals.failed.rate,5m)>{$ETCD.PROPOSAL.FAIL.MAX.WARN} WARNING
Etcd: Too many proposals are queued to commit (over {$ETCD.PROPOSAL.PENDING.MAX.WARN} for 5m)

Rising pending proposals suggests there is a high client load or the member cannot commit proposals.

min(/Etcd by HTTP/etcd.proposals.pending,5m)>{$ETCD.PROPOSAL.PENDING.MAX.WARN} WARNING
Etcd: Too many HTTP requests failures (over {$ETCD.HTTP.FAIL.MAX.WARN} for 5m)

Too many requests failed on the etcd instance with a 5xx HTTP code.

min(/Etcd by HTTP/etcd.http.requests.5xx.rate,5m)>{$ETCD.HTTP.FAIL.MAX.WARN} WARNING
Etcd: Server version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

last(/Etcd by HTTP/etcd.server.version,#1)<>last(/Etcd by HTTP/etcd.server.version,#2) and length(last(/Etcd by HTTP/etcd.server.version))>0 INFO

Manual close: YES

Etcd: Cluster version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

last(/Etcd by HTTP/etcd.cluster.version,#1)<>last(/Etcd by HTTP/etcd.cluster.version,#2) and length(last(/Etcd by HTTP/etcd.cluster.version))>0 INFO

Manual close: YES

Etcd: has been restarted (uptime < 10m)

Uptime is less than 10 minutes

last(/Etcd by HTTP/etcd.uptime)<10m INFO

Manual close: YES

Etcd: Current number of open files is too high (over {$ETCD.OPEN.FDS.MAX.WARN}% for 5m)

Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue.

If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.

min(/Etcd by HTTP/etcd.open.fds,5m)/last(/Etcd by HTTP/etcd.max.fds)*100>{$ETCD.OPEN.FDS.MAX.WARN} WARNING
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} (over {$ETCD.GRPC.ERRORS.MAX.WARN} in 5m)

-

min(/Etcd by HTTP/etcd.grpc.handled.rate[{#GRPC.CODE}],5m)>{$ETCD.GRPC.ERRORS.MAX.WARN} WARNING
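The leader changes expression in the table above compares the spread of a counter over 15 minutes with a threshold. A minimal sketch of that comparison follows; the function name and the sample values are made up, and the default of 5 is taken from the {$ETCD.LEADER.CHANGES.MAX.WARN} macro listed earlier.

```python
def too_many_leader_changes(samples_15m, max_changes=5):
    """True when max - min of the counter in the window exceeds the threshold."""
    return max(samples_15m) - min(samples_15m) > max_changes


# Counter went from 3 to 9 within the window: 6 changes, above the threshold.
print(too_many_leader_changes([3, 3, 4, 9]))  # True
# Exactly 5 changes does not exceed the threshold.
print(too_many_leader_changes([3, 4, 5, 8]))  # False
```

Taking max minus min works because etcd.leader.changes stores the raw counter rather than a per-second rate.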

Feedback

Please report any issues with the template at https://support.zabbix.com

This template is for Zabbix version: 5.0

Source: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/etcd_http?at=release/5.0

Template App Etcd by HTTP

Overview

For Zabbix version: 5.0 and higher
This template monitors Etcd with Zabbix and works without any external scripts. Most of the metrics are collected in one go, thanks to Zabbix bulk data collection.

Template Etcd collects metrics with the HTTP agent from the /metrics endpoint. See https://etcd.io/docs/v3.4.0/op-guide/monitoring/#metrics-endpoint.

This template was tested on:

  • Etcd, version 3.0+

Setup

See Zabbix template operation for basic instructions.

  1. Import the template into Zabbix.
  2. After importing the template, make sure that etcd allows metric collection. Test by running: curl -L http://localhost:2379/metrics
  3. Check that etcd is accessible from the Zabbix proxy or Zabbix server, depending on where you plan to do the monitoring. To verify, run curl -L http://<etcd_node_address>:2379/metrics
  4. Add the template to each node with etcd. By default, the template uses the client port. You can configure the metrics endpoint location with the --listen-metrics-urls flag (see the etcd docs).

If you have specified a non-standard port for etcd, don't forget to change the {$ETCD.SCHEME} and {$ETCD.PORT} macros.

If needed, you can set the {$ETCD.USER} and {$ETCD.PASSWORD} macros in the template for use at the host level.

Test availability: zabbix_get -s etcd-host -k etcd.health

Also see the Macros section, since the macros set the trigger threshold values.

Zabbix configuration

No specific Zabbix configuration is required.

Macros used

Name Description Default
{$ETCD.GRPC.ERRORS.MAX.WARN}

Maximum number of gRPC requests failures

1
{$ETCD.GRPC_CODE.MATCHES}

Filter of discoverable gRPC codes https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

.*
{$ETCD.GRPC_CODE.NOT_MATCHES}

Filter to exclude discovered gRPC codes https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

CHANGE_IF_NEEDED
{$ETCD.GRPC_CODE.TRIGGER.MATCHES}

Filter of discoverable gRPC codes which will create triggers

`Aborted
{$ETCD.HTTP.FAIL.MAX.WARN}

Maximum number of HTTP requests failures

2
{$ETCD.LEADER.CHANGES.MAX.WARN}

Maximum number of leader changes

5
{$ETCD.OPEN.FDS.MAX.WARN}

Maximum percentage of used file descriptors

90
{$ETCD.PASSWORD}

-

``
{$ETCD.PORT}

The port of Etcd API endpoint

2379
{$ETCD.PROPOSAL.FAIL.MAX.WARN}

Maximum number of proposal failures

2
{$ETCD.PROPOSAL.PENDING.MAX.WARN}

Maximum number of proposals in queue

5
{$ETCD.SCHEME}

Request scheme which may be http or https

http
{$ETCD.USER}

-

``

Template links

There are no template links in this template.

Discovery rules

Name Description Type Key and additional info
gRPC codes discovery DEPENDENT etcd.grpc_code.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total

- JAVASCRIPT: The text is too long. Please see the template.

- DISCARD_UNCHANGED_HEARTBEAT: 1h

Filter:

AND

- A: {#GRPC.CODE} NOT_MATCHES_REGEX {$ETCD.GRPC_CODE.NOT_MATCHES}

- B: {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.MATCHES}

Overrides:

trigger
- {#GRPC.CODE} MATCHES_REGEX {$ETCD.GRPC_CODE.TRIGGER.MATCHES}
- TRIGGER_PROTOTYPE LIKE Too many failed gRPC requests - DISCOVER

Peers discovery DEPENDENT etcd.peer.discovery

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_network_peer_sent_bytes_total
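The gRPC codes discovery rule above ends with a DISCARD_UNCHANGED_HEARTBEAT step of 1h. The sketch below shows one reading of that step: a value is kept when it differs from the last kept value, or when at least the heartbeat interval has passed since the last kept value. The function name and sample data are hypothetical, and the exact boundary handling in Zabbix is an assumption.

```python
def discard_unchanged_with_heartbeat(samples, heartbeat):
    """Keep (timestamp, value) pairs that changed or whose heartbeat expired."""
    kept = []
    last_value = object()  # sentinel that never equals a real value
    last_kept_time = None
    for timestamp, value in samples:
        if value != last_value or timestamp - last_kept_time >= heartbeat:
            kept.append((timestamp, value))
            last_value = value
            last_kept_time = timestamp
    return kept


# Timestamps in seconds; heartbeat of 1h = 3600 seconds.
samples = [(0, "a"), (600, "a"), (3600, "a"), (4000, "b"), (4100, "b")]
print(discard_unchanged_with_heartbeat(samples, 3600))
# [(0, 'a'), (3600, 'a'), (4000, 'b')]
```

For a discovery rule, this keeps the server from reprocessing an identical list of gRPC codes on every poll.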

Items collected

Group Name Description Type Key and additional info
Etcd Etcd: Service's TCP port state

-

SIMPLE net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"]

Preprocessing:

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Node health

-

HTTP_AGENT etcd.health

Preprocessing:

- JSONPATH: $.health

- BOOL_TO_DECIMAL

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server is a leader

Whether or not this member is a leader. 1 if it is, 0 otherwise.

DEPENDENT etcd.is.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_is_leader

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Server has a leader

Whether or not a leader exists. 1 if a leader exists, 0 if not.

DEPENDENT etcd.has.leader

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_has_leader

- DISCARD_UNCHANGED_HEARTBEAT: 10m

Etcd Etcd: Leader changes

The number of leader changes the member has seen since its start.

DEPENDENT etcd.leader.changes

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_leader_changes_seen_total

Etcd Etcd: Proposals committed per second

The number of consensus proposals committed.

DEPENDENT etcd.proposals.committed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_committed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals applied per second

The number of consensus proposals applied.

DEPENDENT etcd.proposals.applied.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_applied_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals failed per second

The number of failed proposals seen.

DEPENDENT etcd.proposals.failed.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_failed_total

- CHANGE_PER_SECOND

Etcd Etcd: Proposals pending

The current number of pending proposals to commit.

DEPENDENT etcd.proposals.pending

Preprocessing:

- PROMETHEUS_PATTERN: etcd_server_proposals_pending

Etcd Etcd: Reads per second

Number of read actions (get/getRecursive), local to this member.

DEPENDENT etcd.reads.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_reads_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Writes per second

Number of writes (e.g. set/compareAndDelete) seen by this member.

DEPENDENT etcd.writes.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_debugging_store_writes_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC received bytes per second

The number of bytes received from gRPC clients per second.

DEPENDENT etcd.network.grpc.received.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_received_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: Client gRPC sent bytes per second

The number of bytes sent to gRPC clients per second.

DEPENDENT etcd.network.grpc.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_client_grpc_sent_bytes_total

- CHANGE_PER_SECOND

Etcd Etcd: HTTP requests received

Number of requests received into the system (successfully parsed and authd).

DEPENDENT etcd.http.requests.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 5XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 5XX.

DEPENDENT etcd.http.requests.5xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"5.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: HTTP 4XX

Number of handle failures of requests (non-watches), by method (GET/PUT etc.), and code 4XX.

DEPENDENT etcd.http.requests.4xx.rate

Preprocessing:

- PROMETHEUS_TO_JSON: etcd_http_failed_total{code=~"4.+"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs received per second

The number of RPC stream messages received on the server.

DEPENDENT etcd.grpc.received.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_received_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs sent per second

The number of gRPC stream messages sent by the server.

DEPENDENT etcd.grpc.sent.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_msg_sent_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: RPCs started per second

The number of RPCs started on the server.

DEPENDENT etcd.grpc.started.rate

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_started_total

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Server version

Version of the Etcd server.

DEPENDENT etcd.server.version

Preprocessing:

- JSONPATH: $.etcdserver

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: Cluster version

Version of the Etcd cluster.

DEPENDENT etcd.cluster.version

Preprocessing:

- JSONPATH: $.etcdcluster

- DISCARD_UNCHANGED_HEARTBEAT: 1d

Etcd Etcd: DB size

Total size of the underlying database.

DEPENDENT etcd.db.size

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_total_size_in_bytes

Etcd Etcd: Keys compacted per second

The number of DB keys compacted per second.

DEPENDENT etcd.keys.compacted.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_db_compaction_keys_total

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Keys expired per second

The number of expired keys per second.

DEPENDENT etcd.keys.expired.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_store_expires_total

- CHANGE_PER_SECOND

Etcd Etcd: Keys total

Total number of keys.

DEPENDENT etcd.keys.total

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_keys_total

Etcd Etcd: Uptime

Etcd server uptime.

DEPENDENT etcd.uptime

Preprocessing:

- PROMETHEUS_PATTERN: process_start_time_seconds

- JAVASCRIPT: return (Math.floor(Date.now()/1000)-Number(value)); // use boottime to calculate uptime

Etcd Etcd: Virtual memory

Virtual memory size in bytes.

DEPENDENT etcd.virtual.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_virtual_memory_bytes

Etcd Etcd: Resident memory

Resident memory size in bytes.

DEPENDENT etcd.res.bytes

Preprocessing:

- PROMETHEUS_PATTERN: process_resident_memory_bytes

Etcd Etcd: CPU

Total user and system CPU time spent in seconds.

DEPENDENT etcd.cpu.util

Preprocessing:

- PROMETHEUS_PATTERN: process_cpu_seconds_total

- CHANGE_PER_SECOND

Etcd Etcd: Open file descriptors

Number of open file descriptors.

DEPENDENT etcd.open.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_open_fds

Etcd Etcd: Maximum open file descriptors

The maximum number of open file descriptors.

DEPENDENT etcd.max.fds

Preprocessing:

- PROMETHEUS_PATTERN: process_max_fds

Etcd Etcd: Deletes per second

The number of deletes seen by this member per second.

DEPENDENT etcd.delete.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_delete_total

- CHANGE_PER_SECOND

Etcd Etcd: PUT per second

The number of puts seen by this member per second.

DEPENDENT etcd.put.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_put_total

- CHANGE_PER_SECOND

Etcd Etcd: Range per second

The number of ranges seen by this member per second.

DEPENDENT etcd.range.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Transaction per second

The number of transactions seen by this member per second.

DEPENDENT etcd.txn.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_range_total

- CHANGE_PER_SECOND

Etcd Etcd: Events sent per second

The number of events sent by this member per second.

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_events_total

- CHANGE_PER_SECOND

Etcd Etcd: Pending events

Total number of pending events to be sent.

DEPENDENT etcd.events.sent.rate

Preprocessing:

- PROMETHEUS_PATTERN: etcd_debugging_mvcc_pending_events_total

Etcd Etcd: RPCs completed with code {#GRPC.CODE}

The number of RPCs completed on the server with grpc_code {#GRPC.CODE}.

DEPENDENT etcd.grpc.handled.rate[{#GRPC.CODE}]

Preprocessing:

- PROMETHEUS_TO_JSON: grpc_server_handled_total{grpc_method="{#GRPC.CODE}"}

- JAVASCRIPT: The text is too long. Please see the template.

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes sent

The number of bytes sent to the peer with ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.sent.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_bytes_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Bytes received

The number of bytes received from the peer with ID {#ETCD.PEER}.

DEPENDENT etcd.bytes.received.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_bytes_total{From="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Send failures

The number of send failures from the peer with ID {#ETCD.PEER}.

DEPENDENT etcd.sent.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_sent_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND

Etcd Etcd: Etcd peer {#ETCD.PEER}: Receive failures

The number of receive failures from the peer with ID {#ETCD.PEER}.

DEPENDENT etcd.received.fail.rate[{#ETCD.PEER}]

Preprocessing:

- PROMETHEUS_PATTERN: etcd_network_peer_received_failures_total{To="{#ETCD.PEER}"}

⛔️ON_FAIL: CUSTOM_VALUE -> 0

- CHANGE_PER_SECOND
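
The per-peer items above select one sample out of several by its label. A minimal sketch of that selection is shown below, assuming a simplified Prometheus text format; the peer IDs, byte counts, and the helper name `labeled_value` are hypothetical and only serve the illustration.

```python
import re
from typing import Optional

# Hypothetical scrape excerpt; peer IDs and values are made up.
SAMPLE = """\
etcd_network_peer_sent_bytes_total{To="peer-a"} 1024
etcd_network_peer_sent_bytes_total{To="peer-b"} 2048
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def labeled_value(text: str, name: str, label: str,
                  wanted: str) -> Optional[float]:
    """Return the value of the sample whose label equals `wanted`, else None."""
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("name") != name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get(label) == wanted:
            return float(match.group("value"))
    return None


sent_to_b = labeled_value(
    SAMPLE, "etcd_network_peer_sent_bytes_total", "To", "peer-b")
missing = labeled_value(
    SAMPLE, "etcd_network_peer_sent_bytes_total", "To", "peer-c")
```

A lookup that finds no matching sample returns `None` here; in the template, that is the case the ON_FAIL custom value of 0 covers.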

Zabbix_raw_items Etcd: Get node metrics

-

HTTP_AGENT etcd.get_metrics

Zabbix_raw_items Etcd: Get version

-

HTTP_AGENT etcd.get_version

Triggers

Name Description Expression Severity Dependencies and additional info
Etcd: Service is unavailable

-

{TEMPLATE_NAME:net.tcp.service["{$ETCD.SCHEME}","{HOST.CONN}","{$ETCD.PORT}"].last()}=0 AVERAGE

Manual close: YES

Etcd: Node healthcheck failed

https://etcd.io/docs/v3.4.0/op-guide/monitoring/#health-check

{TEMPLATE_NAME:etcd.health.last()}=0 AVERAGE

Depends on:

- Etcd: Service is unavailable

Etcd: Failed to fetch info data (or no data for 30m)

Zabbix has not received data for items for the last 30 minutes

{TEMPLATE_NAME:etcd.is.leader.nodata(30m)}=1 WARNING

Manual close: YES

Depends on:

- Etcd: Service is unavailable

Etcd: Member has no leader

"If a member does not have a leader, it is totally unavailable."

{TEMPLATE_NAME:etcd.has.leader.last()}=0 AVERAGE
Etcd: Instance has seen too many leader changes (over {$ETCD.LEADER.CHANGES.MAX.WARN} for 15m)

Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

{TEMPLATE_NAME:etcd.leader.changes.delta(15m)}>{$ETCD.LEADER.CHANGES.MAX.WARN} WARNING
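
To make the leader-change expression concrete: `delta(15m)` is the difference between the largest and smallest values seen in the window. The sketch below models that with plain lists; the readings and the threshold of 5 are made-up values, and the function names are hypothetical.

```python
from typing import Sequence


def delta(values: Sequence[float]) -> float:
    """Difference between the maximum and minimum value in the window."""
    return max(values) - min(values)


def too_many_leader_changes(window: Sequence[float], max_warn: float) -> bool:
    """True when the counter grew by more than max_warn within the window."""
    return delta(window) > max_warn


# Made-up counter readings collected over 15 minutes.
unstable = too_many_leader_changes([3, 3, 4, 9], max_warn=5)
stable = too_many_leader_changes([3, 3, 4], max_warn=5)
```

Because the underlying metric is a counter, the delta over the window equals the number of leader changes observed in it, provided the process did not restart in between.
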
Etcd: Too many proposal failures (over {$ETCD.PROPOSAL.FAIL.MAX.WARN} for 5m)

"Normally related to two issues: temporary failures related to a leader election or longer downtime caused by a loss of quorum in the cluster."

{TEMPLATE_NAME:etcd.proposals.failed.rate.min(5m)}>{$ETCD.PROPOSAL.FAIL.MAX.WARN} WARNING
Etcd: Too many proposals are queued to commit (over {$ETCD.PROPOSAL.PENDING.MAX.WARN} for 5m)

"Rising pending proposals suggests there is a high client load or the member cannot commit proposals."

{TEMPLATE_NAME:etcd.proposals.pending.min(5m)}>{$ETCD.PROPOSAL.PENDING.MAX.WARN} WARNING
Etcd: Too many HTTP requests failures (over {$ETCD.HTTP.FAIL.MAX.WARN} for 5m)

"Too many requests failed on etcd instance with 5xx HTTP code"

{TEMPLATE_NAME:etcd.http.requests.5xx.rate.min(5m)}>{$ETCD.HTTP.FAIL.MAX.WARN} WARNING
Etcd: Server version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

{TEMPLATE_NAME:etcd.server.version.diff()}=1 and {TEMPLATE_NAME:etcd.server.version.strlen()}>0 INFO

Manual close: YES

Etcd: Cluster version has changed (new version: {ITEM.VALUE})

Etcd version has changed. Ack to close.

{TEMPLATE_NAME:etcd.cluster.version.diff()}=1 and {TEMPLATE_NAME:etcd.cluster.version.strlen()}>0 INFO

Manual close: YES
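
The two version-change triggers above combine `diff()` and `strlen()`. A minimal sketch of that condition, assuming `diff()` compares the last value with the previous one and `strlen()` measures the last value; the function name and the version strings are hypothetical.

```python
def version_changed(previous: str, last: str) -> bool:
    """Model of `diff()=1 and strlen()>0` for a text item."""
    return previous != last and len(last) > 0


upgraded = version_changed("3.5.5", "3.5.6")
unchanged = version_changed("3.5.6", "3.5.6")
emptied = version_changed("3.5.6", "")
```

The length check keeps the trigger quiet when the version string becomes empty, so a failed lookup is not reported as a version change.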

Etcd: has been restarted (uptime < 10m)

Uptime is less than 10 minutes.

{TEMPLATE_NAME:etcd.uptime.last()}<10m INFO

Manual close: YES

Etcd: Current number of open files is too high (over {$ETCD.OPEN.FDS.MAX.WARN}% for 5m)

"Heavy file descriptor usage (i.e., near the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files."

{TEMPLATE_NAME:etcd.open.fds.min(5m)}/{TEMPLATE_NAME:etcd.max.fds.last()}*100>{$ETCD.OPEN.FDS.MAX.WARN} WARNING
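
The file descriptor expression above can be read as "the lowest usage seen in the last 5 minutes, as a percentage of the limit, exceeds the threshold". A rough Python model follows; the readings, the limit of 1000, and the threshold of 90 are made-up values, and `open_fds_alert` is a hypothetical name.

```python
from typing import Sequence


def open_fds_alert(open_fds_window: Sequence[float], max_fds: float,
                   threshold_pct: float) -> bool:
    """True when even the minimum usage in the window is above the threshold."""
    if not open_fds_window or max_fds <= 0:
        return False  # no data or no usable limit: nothing to evaluate
    usage_pct = min(open_fds_window) / max_fds * 100
    return usage_pct > threshold_pct


# Made-up readings: usage stays above 90% for the whole window.
sustained = open_fds_alert([950, 930, 960], max_fds=1000, threshold_pct=90)
# A single dip to 50% within the window keeps the trigger quiet.
spiky = open_fds_alert([950, 500, 960], max_fds=1000, threshold_pct=90)
```

Using the minimum rather than the last value means a short spike does not fire the trigger; usage has to stay high for the full window.
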
Etcd: Too many failed gRPC requests with code: {#GRPC.CODE} (over {$ETCD.GRPC.ERRORS.MAX.WARN} in 5m)

-

{TEMPLATE_NAME:etcd.grpc.handled.rate[{#GRPC.CODE}].min(5m)}>{$ETCD.GRPC.ERRORS.MAX.WARN} WARNING

Feedback

Please report any issues with the template at https://support.zabbix.com
