VOOZH about

URL: https://dev.to/hajirufai/building-a-distributed-key-value-store-with-raft-consensus-in-python-hp3

⇱ Building a Distributed Key-Value Store with Raft Consensus in Python - DEV Community


If you've ever wondered how systems like etcd, CockroachDB, or Consul maintain consistency across multiple servers, the answer is usually a consensus algorithm. The most popular one today is Raft — designed explicitly to be understandable.

I built RaftKV, a complete distributed key-value store from scratch in Python, implementing the full Raft consensus protocol. No external coordination services. No ZooKeeper. Just pure Python, TCP, and the Raft paper.

🔗 GitHub Repository | 🌐 Live Demo Page


Why Build This?

Every distributed system job interview touches on consensus. "How does Kafka guarantee ordering?" "What happens during a network partition?" "How does leader election work?"

Building Raft from scratch gives you genuine answers to all of these. Plus, it exercises nearly every systems programming skill:

  • Concurrent programming — async event loops, timers, state machines
  • Networking — HTTP RPC, timeouts, failure handling
  • Persistence — write-ahead logs, snapshots, crash recovery
  • Algorithms — leader election, log replication, commit rules

The Raft Algorithm in 5 Minutes

Raft organizes a cluster of nodes into three roles:

Follower ──[election timeout]──▶ Candidate ──[wins majority]──▶ Leader
 ▲ │ │
 │ loses/timeout │ discovers higher │
 └────────────────────────────────┘ term │
 └──────────────────────────────────────────────────────────────┘

Leader Election

  1. Every node starts as a Follower with a random election timeout (1.5–3 seconds)
  2. If a follower doesn't hear from a leader before timeout → becomes a Candidate
  3. The candidate increments its term, votes for itself, and sends RequestVote RPCs
  4. If it receives votes from a majority → becomes Leader
  5. The leader sends periodic heartbeats to maintain authority

The randomized timeouts prevent "election storms" where nodes keep splitting votes.

Log Replication

Once a leader is elected, all client writes go through it:

async def propose(self, command: dict) -> tuple[bool, str]:
 if self.state != NodeState.LEADER:
 return False, "Not leader"

 # 1. Append to local log
 entry = self.log.create_entry(self.current_term, command=command)

 # 2. Persist to WAL
 if self.wal:
 self.wal.append_entry(entry)

 # 3. Wait for majority replication
 future = asyncio.get_event_loop().create_future()
 self._pending_commands[entry.index] = future

 await asyncio.wait_for(future, timeout=5.0)
 return True, f"Committed at index {entry.index}"

The leader replicates entries to followers via AppendEntries RPCs. An entry is committed only when a majority of nodes have it.

Safety Properties

Raft ensures:

  • Election Safety: At most one leader per term
  • Leader Append-Only: Leaders never overwrite or delete log entries
  • Log Matching: If two logs contain an entry with the same index and term, all preceding entries are identical
  • Leader Completeness: A committed entry will be present in all future leaders' logs

Architecture Overview

┌─────────────────────────────────────────────┐
│ Client Request │
│ PUT /kv/name {"value":"Raft"} │
└──────────────────┬──────────────────────────┘
 ▼
┌──────────────────────────────────────────────┐
│ LEADER (Node 1) │
│ HTTP Server → Raft Engine → Log (WAL) → KV │
└─────────┬──────────────────────┬─────────────┘
 │ AppendEntries │ AppendEntries
 ▼ ▼
 ┌─────────────┐ ┌─────────────┐
 │ FOLLOWER │ │ FOLLOWER │
 │ (Node 2) │ │ (Node 3) │
 └─────────────┘ └─────────────┘

Key components:

Module Purpose
consensus.py Core Raft engine — election, replication, commit
log.py Append-only log with entries and conflict resolution
wal.py Write-ahead log persistence and crash recovery
store.py KV state machine — applies committed entries
server.py HTTP server with client API + Raft RPC
client.py Async Python client with leader auto-discovery

The Raft Log

The heart of Raft is the replicated log. Every entry has a term (election epoch) and index:

@dataclass
class LogEntry:
 term: int
 index: int
 entry_type: str = EntryType.COMMAND
 command: dict | None = None

The log handles conflict resolution — when a follower receives entries from a new leader that conflict with its existing log, it truncates:

def append_entries(self, prev_index, prev_term, entries):
 # Consistency check
 if prev_index > 0:
 prev_entry = self.get(prev_index)
 if prev_entry is None or prev_entry.term != prev_term:
 return False # Reject — log inconsistency

 for entry in entries:
 existing = self.get(entry.index)
 if existing and existing.term != entry.term:
 self._truncate_from(entry.index) # Conflict!
 self.entries.append(entry)
 elif existing is None:
 self.entries.append(entry)

 return True

Leader Election Deep Dive

The election logic is the most timing-sensitive part:

async def _start_election(self):
 self.current_term += 1
 self.state = NodeState.CANDIDATE
 self.voted_for = self.node_id

 votes_received = 1 # Self-vote
 votes_needed = self.config.majority

 vote_tasks = [
 self._request_vote(peer, self.current_term)
 for peer in self.peers
 ]
 results = await asyncio.gather(*vote_tasks, return_exceptions=True)

 for result in results:
 if result is True:
 votes_received += 1

 if votes_received >= votes_needed:
 self._become_leader()

A node only grants its vote if the candidate's term is current, the voter hasn't voted for someone else, and the candidate's log is at least as up-to-date.


Write-Ahead Log & Crash Recovery

Durability requires persisting entries before acknowledging them:

class WriteAheadLog:
 def append_entry(self, entry: LogEntry):
 with open(self.wal_path, "a") as f:
 f.write(json.dumps(entry.to_dict()) + "\n")

 def save_state(self, current_term, voted_for, commit_index):
 tmp = self.state_path.with_suffix(".tmp")
 with open(tmp, "w") as f:
 json.dump(state, f)
 tmp.rename(self.state_path) # Atomic write

On startup, nodes replay the WAL from the last snapshot to recover state.


Running a Cluster

Docker Compose

docker-compose up --build
# 3 nodes on ports 8001, 8002, 8003

CLI Usage

# Store values
raftkv put db.host "postgres.internal"
raftkv get db.host
# db.host = postgres.internal

# Check cluster
raftkv status --node http://localhost:8001

HTTP API

curl -X PUT http://localhost:8001/kv/name \
 -d '{"value": "RaftKV"}'

curl http://localhost:8001/kv/name
# {"key": "name", "value": "RaftKV"}

curl http://localhost:8001/cluster/status

Testing

89 tests across 7 modules covering unit, RPC, consensus, and multi-node integration:

@pytest.mark.asyncio
async def test_data_replicated_to_followers(three_node_cluster):
 nodes = three_node_cluster
 leader = find_leader(nodes)

 await leader.propose({"op": "put", "key": "test", "value": "hello"})
 await asyncio.sleep(1.0)

 for node in nodes:
 assert node.store.get("test") == "hello"

Key Takeaways

  1. Timing is everything — Randomized election timeouts are elegant but tricky to tune
  2. The log consistency check is the real heroprevLogIndex + prevLogTerm prevents split-brain corruption
  3. Term numbers are the universal tiebreaker — Higher term always wins
  4. Separating consensus from state machine is powerful — The Raft log is independent of the KV store
  5. Persistence must be atomic — Write-then-rename for state, JSONL append for WAL

Check out the full source code on GitHub. Star it if you find it useful!


This is project #10 in my "Building From Scratch" series. Previous projects: SQL query engine, vector search engine, workflow orchestration engine.