
URL: https://deepwiki.com/friendsofhyperf/components/6.3-health-monitoring-and-fault-tolerance

Health Monitoring and Fault Tolerance | friendsofhyperf/components | DeepWiki


Last indexed: 14 February 2026 (15d5ca)

Health Monitoring and Fault Tolerance

This document describes the health monitoring and fault tolerance mechanisms in the MySQL Binlog Trigger system. These features ensure reliable operation in production environments by providing distributed locking for high availability, position persistence for crash recovery, and active monitoring for detecting replication issues.

For information about the core architecture and event processing, see Architecture and Consumer Process. For details on trigger execution, see Event Processing and Trigger Execution.

Overview of Fault Tolerance Architecture

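The Trigger system implements three primary fault tolerance mechanisms that work together to ensure reliable binlog consumption. Distributed locking (ServerMutexInterface) ensures that only one instance consumes a connection's binlog. Position persistence (BinLogCurrentSnapshotInterface) lets a restarted consumer resume where it left off. Health monitoring (HealthMonitor) detects replication stalls.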


Sources: src/trigger/src/Consumer.php:45-70, src/trigger/src/Monitor/HealthMonitor.php:23-33, src/trigger/src/Mutex/RedisServerMutex.php:22-59

Distributed Locking with ServerMutex

The ServerMutexInterface provides distributed locking to ensure only one consumer instance processes a given MySQL connection's binlog stream at any time. This prevents duplicate event processing in clustered deployments.

RedisServerMutex Implementation

The RedisServerMutex class implements the mutex using Redis atomic operations.


Sources: src/trigger/src/Mutex/RedisServerMutex.php:66-113
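The locking pattern can be modeled without a Redis server. The sketch below is a minimal Python model with three assumptions. First, the lock is taken with an atomic set-if-absent-with-TTL write (the Redis `SET key value NX EX seconds` form). Second, it is renewed only while the stored owner still matches. Third, it is deleted only by its owner. `InMemoryRedis`, `ServerMutex`, `attempt`, `keepalive` and `release` are hypothetical names, not the RedisServerMutex API. The key layout (prefix plus connection name) and the owner addresses are also assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryRedis:
    """Hypothetical stand-in for the few Redis string commands used below."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]  # expired keys behave as if absent
            entry = None
        return None if entry is None else entry[0]

    def set_nx_ex(self, key: str, value: str, ttl: float) -> bool:
        # Models `SET key value NX EX ttl`: write only if the key is absent.
        if self.get(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def expire(self, key: str, ttl: float) -> bool:
        if self.get(key) is None:
            return False
        self._data[key] = (self._data[key][0], self._clock() + ttl)
        return True

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class ServerMutex:
    """Hypothetical owner-aware lock; not the RedisServerMutex API."""

    def __init__(self, redis: InMemoryRedis, name: str, owner: str,
                 prefix: str = "trigger:server_mutex:", expires: float = 30) -> None:
        self.redis = redis
        self.key = prefix + name  # assumed key layout: prefix + connection name
        self.owner = owner        # e.g. the instance's internal IP
        self.expires = expires

    def attempt(self) -> bool:
        return self.redis.set_nx_ex(self.key, self.owner, self.expires)

    def keepalive(self) -> bool:
        # Renew only while we still own the lock. Against real Redis the
        # compare and the EXPIRE should be one atomic step (e.g. a Lua script).
        if self.redis.get(self.key) != self.owner:
            return False
        return self.redis.expire(self.key, self.expires)

    def release(self) -> None:
        # Same caveat: compare-and-delete must be atomic against real Redis.
        if self.redis.get(self.key) == self.owner:
            self.redis.delete(self.key)


# Usage with a controllable clock instead of real time.
now = [0.0]
redis = InMemoryRedis(clock=lambda: now[0])
a = ServerMutex(redis, "default", owner="10.0.0.1")
b = ServerMutex(redis, "default", owner="10.0.0.2")
assert a.attempt() and not b.attempt()    # a holds the lock, b must wait
now[0] = 20.0
assert a.keepalive()                      # TTL pushed out to t=50
now[0] = 51.0                             # a stopped renewing; key expired
assert b.attempt() and not a.keepalive()  # b takes over; a cannot renew
```

Storing the owner as the value is what makes the owner checks in `keepalive` and `release` possible. Without it, a slow instance could extend or delete a lock that another instance now holds.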

Configuration Parameters

The mutex is configured in trigger.php:

| Parameter | Default | Description |
|---|---|---|
| enable | true | Enable/disable distributed locking |
| prefix | trigger:server_mutex: | Redis key prefix |
| expires | 30 seconds | Lock expiration time (TTL) |
| keepalive_interval | 10 seconds | Frequency of lock renewal |
| retry_interval | 10 seconds | Frequency of lock acquisition attempts |

Sources: src/trigger/publish/trigger.php:32-38

Lock Lifecycle

The mutex lifecycle is managed by the Consumer class.


Sources: src/trigger/src/Mutex/RedisServerMutex.php:66-113, src/trigger/src/Consumer.php:112-116
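The lifecycle can be pictured as a timeline. This sketch models a consumer with three behaviors. It retries acquisition every `retry_interval` seconds while another instance holds the lock. Once it owns the lock, it renews every `keepalive_interval` seconds. It releases the lock when it stops. `lock_lifecycle` and its `free_at`/`stop_at` parameters are hypothetical modelling devices, not part of the library.

```python
from typing import List, Tuple


def lock_lifecycle(free_at: float, stop_at: float,
                   retry_interval: float = 10,
                   keepalive_interval: float = 10) -> List[Tuple[float, str]]:
    """Return the (time, action) steps one consumer takes.

    Hypothetical model: another instance holds the lock until `free_at`,
    and this consumer is told to stop at `stop_at`.
    """
    steps: List[Tuple[float, str]] = []
    t = 0.0
    # Phase 1: poll for the lock every retry_interval seconds.
    while t < free_at:
        steps.append((t, "attempt failed"))
        t += retry_interval
    steps.append((t, "acquired"))
    # Phase 2: renew the TTL every keepalive_interval seconds while consuming.
    renew_at = t + keepalive_interval
    while renew_at < stop_at:
        steps.append((renew_at, "keepalive"))
        renew_at += keepalive_interval
    # Phase 3: release on shutdown so a standby need not wait for the TTL.
    steps.append((stop_at, "released"))
    return steps


for when, action in lock_lifecycle(free_at=15.0, stop_at=50.0):
    print(f"t={when:>4}s  {action}")
```

With the defaults, the lock becomes free at t=15 s, but this consumer acquires it only at its next poll (t=20 s). That gap is the cost of polling instead of being notified.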

Owner Identification

Each consumer instance is identified by its internal IP address, obtained via Util::getInternalIp(). This owner identifier is stored as the Redis lock value, enabling lock ownership verification and debugging.

Sources: src/trigger/src/Mutex/RedisServerMutex.php:36-57, src/trigger/src/Consumer.php:66

Position Persistence and Crash Recovery

The BinLogCurrentSnapshotInterface provides persistent storage of the current binlog position, enabling crash recovery by resuming from the last known position.

RedisBinLogCurrentSnapshot Implementation


Sources: src/trigger/src/Snapshot/RedisBinLogCurrentSnapshot.php:32-63, src/trigger/src/Subscriber/SnapshotSubscriber.php:23-41, src/trigger/src/Consumer.php:210-218

Snapshot Storage Key Structure

The Redis key for position storage follows this pattern:

trigger:snapshot:binLogCurrent:{version}:{connection}

For example: trigger:snapshot:binLogCurrent:1.0:default

Sources: src/trigger/src/Snapshot/RedisBinLogCurrentSnapshot.php:65-74
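Putting the key pattern and the TTL together, a snapshot store can be sketched as below. This is a hypothetical in-memory stand-in for RedisBinLogCurrentSnapshot, built on three assumptions:

- The position is a binlog file name plus a byte offset.
- It is serialized as JSON (the real serialization format is not described here).
- `get()` returns nothing once `snapshot.expires` has elapsed.

The two-field `BinLogCurrent` and the file name `mysql-bin.000042` are illustrative only.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class BinLogCurrent:
    """Simplified position: binlog file name plus byte offset."""
    file: str
    position: int


class InMemoryBinLogSnapshot:
    """Hypothetical stand-in for RedisBinLogCurrentSnapshot."""

    def __init__(self, connection: str, version: str = "1.0", expires: float = 86400,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Key layout from this page: trigger:snapshot:binLogCurrent:{version}:{connection}
        self.key = f"trigger:snapshot:binLogCurrent:{version}:{connection}"
        self.expires = expires
        self.clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}  # key -> (json, expires_at)

    def set(self, current: BinLogCurrent) -> None:
        # Every write refreshes the TTL, like SET with an expiry.
        self._store[self.key] = (json.dumps(asdict(current)), self.clock() + self.expires)

    def get(self) -> Optional[BinLogCurrent]:
        entry = self._store.get(self.key)
        if entry is None or entry[1] <= self.clock():
            return None  # never written, or older than `expires` seconds
        return BinLogCurrent(**json.loads(entry[0]))


now = [0.0]
snapshot = InMemoryBinLogSnapshot("default", clock=lambda: now[0])
print(snapshot.key)  # trigger:snapshot:binLogCurrent:1.0:default
snapshot.set(BinLogCurrent("mysql-bin.000042", 1234))
print(snapshot.get())
now[0] = 86400.0 + 1  # just past the 24 h TTL
print(snapshot.get())  # None
```

Because the version is part of the key, a change to the snapshot format can use a new version string. Old entries are then never read and simply expire.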

Position Persistence Configuration

| Parameter | Default | Description |
|---|---|---|
| snapshot.version | 1.0 | Version identifier for the snapshot format |
| snapshot.expires | 86400 (24 h) | TTL of the snapshot in Redis |
| snapshot.interval | 10 seconds | Frequency of position persistence |

Sources: src/trigger/publish/trigger.php:45-49
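Because positions are written on a fixed interval rather than per event, only the latest position at each tick reaches Redis. The function below sketches that behavior, assuming a timer that fires every `snapshot.interval` seconds and writes whatever position is in memory at that moment. `persisted_positions` and the sample updates are hypothetical.

```python
from typing import List, Optional, Tuple


def persisted_positions(updates: List[Tuple[float, int]], interval: float,
                        until: float) -> List[Tuple[float, int]]:
    """Return the (tick_time, position) pairs a periodic snapshot timer writes.

    Assumption: every `interval` seconds the timer writes the most recent
    in-memory position (if any); updates between ticks are overwritten.
    """
    ordered = sorted(updates)
    writes: List[Tuple[float, int]] = []
    latest: Optional[int] = None
    i = 0
    tick = interval
    while tick <= until:
        # Apply every update that happened up to this tick.
        while i < len(ordered) and ordered[i][0] <= tick:
            latest = ordered[i][1]
            i += 1
        if latest is not None:
            writes.append((tick, latest))
        tick += interval
    return writes


# Positions seen at t=1, 4, 12 and 25 seconds; snapshot.interval = 10 s.
print(persisted_positions([(1, 100), (4, 200), (12, 300), (25, 400)], 10, 30))
```

In this sample, a crash at t=29 would leave position 300 in Redis while 400 was already in memory. The events between the two positions would be read again after a restart.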

Crash Recovery Flow

When a consumer starts, it attempts to resume from the last persisted position.


Sources: src/trigger/src/Consumer.php:182-238, src/trigger/src/Snapshot/RedisBinLogCurrentSnapshot.php:38-62
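The resume decision reduces to one branch. This sketch assumes that when no snapshot exists, the consumer leaves the start point to the replication client's default. That covers a first start and a key that expired after `snapshot.expires`. `choose_start_point`, `StartPoint` and the two-field `BinLogCurrent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BinLogCurrent:
    file: str
    position: int


@dataclass(frozen=True)
class StartPoint:
    file: Optional[str]      # None: let the replication client choose
    position: Optional[int]


def choose_start_point(snapshot: Optional[BinLogCurrent]) -> StartPoint:
    """Hypothetical: resume from the persisted position when one exists."""
    if snapshot is None:
        # First run, or the snapshot expired: fall back to the client's default.
        return StartPoint(file=None, position=None)
    return StartPoint(file=snapshot.file, position=snapshot.position)


print(choose_start_point(BinLogCurrent("mysql-bin.000042", 1234)))
print(choose_start_point(None))
```

One consequence of the 24-hour default TTL: a consumer that stays down longer than `snapshot.expires` finds no snapshot. It then cannot resume from its old position.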

Health Monitor System

The HealthMonitor class actively monitors the binlog replication health by tracking position updates and detecting stalls.

Dual-Timer Architecture

The health monitor uses two independent timers.


Sources: src/trigger/src/Monitor/HealthMonitor.php:34-85
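Independent periodic timers interleave rather than wait for each other. The sketch merges the firing times of two such timers. It assumes one timer logs the position every `health_monitor.interval` (30 s) and the other persists and checks the position every `snapshot.interval` (10 s). Which interval drives the second timer is an assumption, and `timer_schedule` is a hypothetical helper.

```python
import heapq
from typing import Dict, List, Tuple


def timer_schedule(timers: Dict[str, float], until: float) -> List[Tuple[float, str]]:
    """Merge the firing times of independent periodic timers up to `until`."""
    heap = [(interval, name, interval) for name, interval in timers.items()]
    heapq.heapify(heap)
    fired: List[Tuple[float, str]] = []
    while heap and heap[0][0] <= until:
        when, name, interval = heapq.heappop(heap)
        fired.append((when, name))
        heapq.heappush(heap, (when + interval, name, interval))  # schedule next tick
    return fired


# Assumed intervals: health_monitor.interval = 30 s, snapshot.interval = 10 s.
for when, name in timer_schedule({"log_position": 30, "persist_and_check": 10}, until=60):
    print(when, name)
```

Keeping the timers separate lets the logging frequency be tuned without changing how often the position is persisted, and vice versa.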

Position Update Flow

The health monitor receives position updates from the SnapshotSubscriber.


Sources: src/trigger/src/Subscriber/SnapshotSubscriber.php:23-41, src/trigger/src/Monitor/HealthMonitor.php:62-84

Health Monitor Configuration

| Parameter | Default | Description |
|---|---|---|
| health_monitor.enable | true | Enable/disable health monitoring |
| health_monitor.interval | 30 seconds | Frequency of position logging |

Sources: src/trigger/publish/trigger.php:40-43

Stall Detection Logic

The monitor compares the cached position from Redis with the current in-memory position.


Sources: src/trigger/src/Monitor/HealthMonitor.php:74-81
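The comparison can be sketched as a small state machine. This model rests on three assumptions:

- The subscriber pushes every new position into memory.
- A periodic check compares that position with the last persisted copy and signals a stop when the two are equal.
- After the comparison, the check persists the current value.

`HealthMonitorSketch`, `set_binlog_current`, `check_tick` and the callback are hypothetical. The callback stands in for dispatching OnReplicationStop, and the `cached` field stands in for the Redis snapshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class BinLogCurrent:
    file: str
    position: int


class HealthMonitorSketch:
    """Hypothetical model of the stall check; not the library's HealthMonitor API."""

    def __init__(self, on_replication_stop: Callable[[BinLogCurrent], None]) -> None:
        self.current: Optional[BinLogCurrent] = None    # updated on every event
        self.cached: Optional[BinLogCurrent] = None     # stands in for the Redis snapshot
        self.on_replication_stop = on_replication_stop  # stands in for OnReplicationStop

    def set_binlog_current(self, current: BinLogCurrent) -> None:
        # Called by the subscriber for each event: memory only, no I/O.
        self.current = current

    def check_tick(self) -> None:
        # Periodic check: has the position moved since the last persisted copy?
        if self.current is None:
            return  # nothing received yet, nothing to compare
        if self.cached == self.current:
            self.on_replication_stop(self.current)
        self.cached = self.current  # persist the latest position for the next tick


stops: List[BinLogCurrent] = []
monitor = HealthMonitorSketch(stops.append)
monitor.set_binlog_current(BinLogCurrent("mysql-bin.000042", 100))
monitor.check_tick()  # nothing persisted yet, so no stall can be detected
monitor.set_binlog_current(BinLogCurrent("mysql-bin.000042", 250))
monitor.check_tick()  # position advanced: healthy
monitor.check_tick()  # no new events since the last tick: stall signalled
print(stops)
```

Equality only shows that the position did not advance between two checks. In this sketch, an idle server with no writes would look the same as a broken stream.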

Error Handling and Retry Mechanisms

Consumer Error Recovery

The Consumer class implements error handling with automatic retry.


Sources: src/trigger/src/Consumer.php:72-117

Error Handling Code Flow

The consume loop wraps replication operations in a try-catch block.


Sources: src/trigger/src/Consumer.php:96-109
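A retry loop of this shape can be sketched in a few lines. The sketch assumes the loop logs the exception, waits a fixed delay and reconnects until the consumer is stopped. The delay value (10 s here) and the names `consume_with_retry`, `run_once`, `is_stopped` and the logger name are hypothetical.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("trigger.consumer")  # hypothetical logger name


def consume_with_retry(run_once: Callable[[], None],
                       is_stopped: Callable[[], bool],
                       retry_interval: float = 10,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Run replication until stopped; on any error, log, wait, and try again.

    Returns how many failures occurred (useful only for illustration).
    """
    failures = 0
    while not is_stopped():
        try:
            run_once()  # connect and consume binlog events; raises on failure
        except Exception:  # broad on purpose: any failure leads to a retry
            failures += 1
            logger.exception("binlog consumption failed, retrying in %s s", retry_interval)
            sleep(retry_interval)
    return failures


attempts = {"n": 0}


def flaky_run_once() -> None:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("MySQL connection lost")  # simulated failure


print(consume_with_retry(flaky_run_once, lambda: attempts["n"] >= 3, sleep=lambda s: None))  # 2
```

Injecting `sleep` keeps the example instant; in a long-running consumer the default `time.sleep` (or the framework's coroutine-friendly equivalent) would be used.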

Graceful Shutdown

The consumer listens for the worker exit signal.


Sources: src/trigger/src/Consumer.php:90-179
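The shutdown path can be sketched with a flag that the exit signal sets and the consume loop polls. A `finally` block releases resources on the way out. `StopFlag`, `consume` and the `on_exit` callback (standing in for releasing the server mutex) are hypothetical. The real consumer reacts to the worker exit signal rather than to a `threading.Event`.

```python
import threading
from typing import Callable, List


class StopFlag:
    """Hypothetical stand-in for the consumer's stopped state."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def stop(self) -> None:  # called when the worker exit signal arrives
        self._event.set()

    def is_stopped(self) -> bool:  # polled by the consume loop
        return self._event.is_set()


def consume(events: List[int], flag: StopFlag, handle: Callable[[int], None],
            on_exit: Callable[[], None]) -> None:
    try:
        for event in events:
            if flag.is_stopped():
                break  # stop before taking the next event
            handle(event)
    finally:
        on_exit()  # e.g. release the server mutex, even on errors


flag = StopFlag()
handled: List[int] = []


def handle(event: int) -> None:
    handled.append(event)
    if event == 2:
        flag.stop()  # simulate the exit signal arriving mid-stream


consume([1, 2, 3, 4], flag, handle, on_exit=lambda: print("mutex released"))
print(handled)  # [1, 2]
```

Checking the flag between events lets the event already in progress finish instead of being cut off halfway.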

Integration with Consumer Lifecycle

Initialization Sequence

The fault tolerance components are initialized during consumer construction.


Sources: src/trigger/src/Consumer.php:45-70

Start Sequence with Fault Tolerance


Sources: src/trigger/src/Consumer.php:72-117
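The start order can be sketched as follows. The sketch assumes the consumer acquires the lock first and only then reads the snapshot, matching Scenario 1 below. It then starts the health monitor and begins consuming. Function and parameter names are hypothetical, and the position is simplified to a `(file, offset)` tuple.

```python
from typing import Callable, List, Optional, Tuple

Position = Tuple[str, int]  # (binlog file, offset): simplified, hypothetical shape


def start_consumer(try_acquire: Callable[[], bool],
                   load_snapshot: Callable[[], Optional[Position]],
                   wait: Callable[[float], None],
                   log: List[str],
                   mutex_enabled: bool = True,
                   monitor_enabled: bool = True,
                   retry_interval: float = 10) -> Optional[Position]:
    """Hypothetical start-up order: lock, then snapshot, then monitor, then consume."""
    if mutex_enabled:
        while not try_acquire():  # another instance owns this binlog stream
            log.append("lock busy")
            wait(retry_interval)
        log.append("lock acquired")
    start = load_snapshot()  # read only once we own the stream
    log.append(f"resume from {start}" if start else "no snapshot")
    if monitor_enabled:
        log.append("health monitor started")
    log.append("consuming")
    return start


outcomes = iter([False, True])  # first attempt fails, second succeeds
steps: List[str] = []
start_consumer(lambda: next(outcomes), lambda: ("mysql-bin.000042", 1234),
               wait=lambda s: None, log=steps)
print(steps)
```

Reading the snapshot only after the lock is held avoids two instances both resuming from the same stored position.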

Configuration Reference

Complete Fault Tolerance Configuration

All configuration options related to health monitoring and fault tolerance are defined in src/trigger/publish/trigger.php and are listed in the parameter tables above.


Sources: src/trigger/publish/trigger.php:32-49
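The same options can be gathered into one structure. The sketch below mirrors them as a Python dictionary with the defaults from the tables above and overlays user settings. The real file is a PHP array in trigger.php, and grouping the mutex keys under `server_mutex` is an assumption. `resolve_fault_tolerance_config` is hypothetical, and so is its check that renewals happen more often than the lock expires.

```python
from typing import Any, Dict

# Defaults from the parameter tables; the "server_mutex" grouping is assumed.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "server_mutex": {"enable": True, "prefix": "trigger:server_mutex:", "expires": 30,
                     "keepalive_interval": 10, "retry_interval": 10},
    "health_monitor": {"enable": True, "interval": 30},
    "snapshot": {"version": "1.0", "expires": 86400, "interval": 10},
}


def resolve_fault_tolerance_config(user: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Overlay user settings on the defaults, section by section."""
    resolved = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        resolved.setdefault(section, {}).update(values)
    mutex = resolved["server_mutex"]
    if mutex["keepalive_interval"] >= mutex["expires"]:
        # The lock would lapse before the first renewal could extend it.
        raise ValueError("keepalive_interval must be shorter than expires")
    return resolved


print(resolve_fault_tolerance_config({"health_monitor": {"enable": False}})["health_monitor"])
```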

Command-Line Tools

Server Mutex Management

The trigger:server-mutex command provides utilities for managing distributed locks.


Usage examples:


Sources: src/trigger/src/Command/ServerMutexCommand.php:31-81

Dependency Injection Configuration

The fault tolerance components are registered in the DI container:

| Interface | Implementation | Description |
|---|---|---|
| ServerMutexInterface | RedisServerMutex | Distributed lock provider |
| BinLogCurrentSnapshotInterface | RedisBinLogCurrentSnapshot | Position persistence |

Sources: src/trigger/src/ConfigProvider.php:24-26

Error Scenarios and Recovery

Scenario 1: Consumer Crash

  1. Consumer crashes mid-processing
  2. Mutex lock expires after TTL (default 30s)
  3. Another instance acquires the lock
  4. New instance calls BinLogCurrentSnapshot::get()
  5. Resumes from the last persisted position; events received after the last snapshot write (at most one snapshot.interval, 10s by default) are processed again
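The timing in this scenario can be bounded with the default settings. Assume each keepalive resets the TTL to `expires` and a standby polls every `retry_interval`. The lock then becomes free between `expires - keepalive_interval` and `expires` seconds after the crash. A standby may need up to one more `retry_interval` to notice. Network and Redis latency are ignored, and `takeover_window` is a hypothetical helper.

```python
from typing import Tuple


def takeover_window(expires: float = 30, keepalive_interval: float = 10,
                    retry_interval: float = 10) -> Tuple[float, float]:
    """Earliest and latest seconds from a crash until a standby holds the lock."""
    earliest = expires - keepalive_interval  # crash just before a renewal was due
    latest = expires + retry_interval        # crash just after a renewal, poll just missed
    return earliest, latest


print(takeover_window())  # (20, 40) with the default settings
```

Shortening `expires` narrows the window but leaves less room for keepalives delayed by a slow network.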

Scenario 2: Redis Failure

  1. Redis becomes unavailable
  2. Mutex operations fail
  3. Consumer continues processing (if lock already acquired)
  4. Position persistence fails (logged but non-blocking)
  5. On recovery, may need to replay events

Scenario 3: Replication Stall

  1. MySQL stops sending binlog events
  2. SnapshotSubscriber stops updating position
  3. HealthMonitor detects unchanged position
  4. OnReplicationStop event dispatched
  5. Application can take corrective action

Scenario 4: Network Partition

  1. Consumer loses network connectivity
  2. Keepalive timer fails to extend lock
  3. Lock expires, allowing another instance to take over
  4. Original consumer's isStopped() check prevents dual processing
  5. Mutex release fails gracefully

Sources: src/trigger/src/Monitor/HealthMonitor.php:74-81, src/trigger/src/Mutex/RedisServerMutex.php:91-103

