VOOZH about

URL: https://dev.to/michaelfv/gbase-8c-lock-conflicts-and-deadlock-troubleshooting-a-practical-guide-3lk2

⇱ GBase 8c Lock Conflicts and Deadlock Troubleshooting: A Practical Guide - DEV Community


Lock conflicts and deadlocks are frequent culprits behind business disruptions in GBase 8c distributed clusters. Unlike single-node databases, GBase 8c introduces both local locks and cross-node global locks, making diagnosis more challenging. This guide provides a field-tested workflow covering lock mechanisms, common causes, diagnosis steps, and practical solutions.

1. Core Lock Mechanisms in GBase 8c

1.1 Lock Scope

  • Local locks: Confined to a single CN or DN node (e.g., concurrent transactions on one DN).
  • Global locks: Span multiple nodes, typical in distributed transactions and DDL operations.

1.2 Common Lock Modes and Conflicts

Mode Purpose Conflict Scenarios
ShareLock (S) Shared read lock Conflicts with X locks
ExclusiveLock (X) Exclusive write lock Conflicts with all other lock modes
AccessExclusiveLock (AEL) DDL operations (e.g., TRUNCATE) Blocks all reads and writes
Intent Locks (IS/IX) Table-level intent, coordinates with row-level locks Ensures table/row lock compatibility

1.3 Deadlock Types

  • Local deadlock: Occurs within a single node, automatically detected and resolved by GBase 8c.
  • Global deadlock: Cross-node circular wait; may require manual intervention if auto-resolution fails.

2. Five Common Causes of Lock Issues

  1. Poor transaction design: Long-running transactions, inconsistent multi-table operation order.
  2. Ineffective SQL: Missing WHERE clause causing full-table lock, index failure expanding lock scope, DDL during peak hours.
  3. Stuck two-phase transactions: Distributed transactions stuck in prepared state holding locks.
  4. Improper cluster parameters: e.g., lock_wait_timeout too short/long, insufficient thread pool.
  5. Node communication anomalies: Network delays or partitions triggering global deadlocks.

3. Diagnostic Workflow (4 Steps)

3.1 Step 1: Locate Blocked Sessions

-- Find all blocked lock requests
SELECT
 pg_locks.pid,
 pg_class.relname AS table_name,
 pg_locks.mode,
 pg_locks.granted,
 pg_stat_activity.query,
 pg_stat_activity.state
FROM pg_locks
LEFT JOIN pg_class ON pg_locks.relation = pg_class.oid
LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE pg_locks.granted = false;

Also check for stranded two-phase transactions:

SELECT gxid, status, start_time
FROM pg_distributed_transactions
WHERE status = 'prepared';

3.2 Step 2: Analyze the Lock Wait Chain

WITH lock_chain AS (
 SELECT
 pg_locks.pid AS wait_pid,
 pg_locks.relation,
 pg_locks.mode AS wait_mode,
 pg_stat_activity.query AS wait_query,
 (SELECT pid FROM pg_locks l2 WHERE l2.relation = pg_locks.relation AND l2.granted = true AND l2.mode = 'ExclusiveLock') AS hold_pid
 FROM pg_locks
 LEFT JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
 WHERE pg_locks.granted = false
)
SELECT
 wait_pid AS blocked_pid,
 pg_class.relname AS table_name,
 wait_mode,
 wait_query,
 hold_pid AS holding_pid,
 (SELECT query FROM pg_stat_activity WHERE pid = hold_pid) AS holding_query
FROM lock_chain
LEFT JOIN pg_class ON lock_chain.relation = pg_class.oid;

3.3 Step 3: Distinguish Local vs. Global Deadlock

  • Local: Check node logs grep -i "deadlock" /opt/gbase/data/log/postgresql-*.log.
  • Global: Check all CN/DN logs, and query pg_distributed_transactions for prepared transactions.

3.4 Step 4: Verify Root Cause

  • Use EXPLAIN ANALYZE to check index usage.
  • Identify transactions running longer than 5 minutes.
  • Test node connectivity with ping and telnet.

4. Solutions

4.1 Optimise Transaction Design

  • Break long transactions; avoid external API calls inside transactions.
  • Enforce consistent object order across all transactions (e.g., always update user before order).

4.2 Refine SQL and Indexes

-- Add index to prevent full-table scan
CREATE INDEX idx_user_phone ON user(phone);
-- Schedule DDL off-peak, check lock status beforehand

4.3 Clean Up Stuck Prepared Transactions

SELECT pg_terminate_backend(gxid)
FROM pg_distributed_transactions
WHERE status = 'prepared' AND start_time < now() - INTERVAL '1 hour';

4.4 Tune Cluster Parameters

ALTER SYSTEM SET lock_wait_timeout = '10s';
ALTER SYSTEM SET thread_pool_attr = '512, 1, (cpubind:0-3)';
ALTER SYSTEM SET thread_pool_stream_attr = '512, 0.2, 1, (cpubind:0-3)';
-- Apply with SELECT pg_reload_conf(); or restart if necessary.

4.5 Ensure Stable Node Communication

Enable heartbeat checks and adjust inter‑node timeout settings.

5. Real‑World Case Study

Symptom: During peak hours, many queries failed with lock wait timeout, and multiple backend threads hung in waiting state.

Diagnosis: Global lock query revealed two UPDATE transactions in a circular wait — PID 1234 held an X lock on user and waited for order, while PID 5678 held an X lock on order and waited for user. Two prepared transactions also existed.

Resolution:

  1. Manually terminated the stuck prepared transactions.
  2. Standardised transaction order across all applications (user → order).
  3. Adjusted lock_wait_timeout to 10s and optimised thread pool settings.
  4. Repaired network latency and enabled node heartbeat detection.

Result: Lock blocks cleared, API response times returned to normal, and no new deadlocks occurred.

6. Common Pitfalls and Best Practices

  • Never blindly kill sessions; first identify the lock holder.
  • Regularly clean up long‑running prepared transactions.
  • Global deadlocks often need manual intervention.
  • Execute DDL only during maintenance windows.
  • Periodically verify index effectiveness and query plans.
  • Tune lock and thread pool parameters based on actual cluster load.

Effective lock troubleshooting in a gbase database demands a systematic approach — pinpoint the root cause, apply targeted fixes, and establish continuous monitoring to prevent recurrence. Use this guide to keep your GBASE clusters stable and responsive.