URL: https://dev.to/pendelabhargavasai/k3s-vs-kubernetes-a-deep-dive-into-control-plane-architecture-489k

K3s vs Kubernetes: A Deep Dive into Control Plane Architecture


Not just "what's different", but WHY it's different, HOW each component works under the hood, and WHEN to choose which.


🧠 Why This Post Exists

Every "K3s vs K8s" article you've read probably gave you a table with checkmarks and said "K3s is lightweight." That's true β€” but why is it lightweight? What did Rancher actually strip out, merge, or replace? What are the architectural trade-offs you inherit when you deploy K3s in production?

This post tears open both control planes component by component. We'll go deep into what each piece actually does at the byte level, then see how K3s reimagines it.


🏗️ The Kubernetes Control Plane: A Ground-Up Look

Before comparing, let's build a mental model of each standard Kubernetes control plane component. Not the 30-second version, but the real one.


1. 🔵 kube-apiserver: The Brain's Frontal Lobe

What It Actually Does

The API server is not just a REST endpoint. It is the only component in Kubernetes that talks directly to etcd. Every other component (scheduler, controller-manager, kubelet) communicates exclusively through the API server. This is a deliberate architectural decision called the hub-and-spoke pattern.

When you run kubectl apply -f deployment.yaml, here's what actually happens:

kubectl → HTTPS → kube-apiserver
                   │
                   ├── 1. Authentication (Who are you?)
                   │     └── x509 certs / Bearer tokens / OIDC / Webhook
                   │
                   ├── 2. Authorization (Can you do this?)
                   │     └── RBAC / ABAC / Node / Webhook evaluators
                   │
                   ├── 3. Admission Control (Should this be allowed?)
                   │     ├── Mutating Webhooks   ← can MODIFY the object
                   │     └── Validating Webhooks ← can REJECT the object
                   │
                   ├── 4. Schema Validation
                   │     └── OpenAPI v3 schema enforcement per GVK
                   │
                   └── 5. Persist to etcd
                         └── /registry/deployments/default/my-app
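The pipeline above can be sketched as a chain of stages, each of which may stop the request before anything is stored. This is a toy model, not the API server's real code: `Request`, `STORE`, the stage functions and the "only admin may write" rule are all hypothetical stand-ins chosen for illustration.

```python
# Toy request pipeline: each stage either passes the request on or raises.
# Every name here is illustrative; none of it is real kube-apiserver code.
from dataclasses import dataclass, field

STORE = {}  # stands in for etcd


class Rejected(Exception):
    pass


@dataclass
class Request:
    user: str
    verb: str
    key: str
    obj: dict = field(default_factory=dict)


def authenticate(req):
    if not req.user:
        raise Rejected("401: unknown user")


def authorize(req):
    # Hypothetical rule: only "admin" may write.
    if req.verb in ("create", "update") and req.user != "admin":
        raise Rejected("403: forbidden")


def mutate(req):
    # A mutating admission step may modify the object (here: add a default label).
    req.obj.setdefault("labels", {}).setdefault("managed", "true")


def validate(req):
    # A validating step may only accept or reject.
    if "name" not in req.obj:
        raise Rejected("422: name is required")


def handle(req):
    for stage in (authenticate, authorize, mutate, validate):
        stage(req)
    STORE[req.key] = req.obj  # persist only after every stage has passed
    return "201 Created"
```

A rejected request never reaches `STORE`, which is the property the real pipeline guarantees for etcd.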

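The Watch Mechanism: The Heartbeat of Kubernetes

The API server implements watches as long-lived streaming responses: the connection stays open (chunked over HTTP/1.1, or a stream over HTTP/2) and events are pushed to the client as they happen. This is what makes Kubernetes reactive rather than polling-based.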

# You can see this yourself
kubectl get pods --watch -v=9
# Watch the raw HTTP stream: it's a chunked HTTP response that stays open

Every controller, scheduler, and kubelet maintains a persistent informer: a local cache fed by a watch stream from the API server. The informer pattern:

  1. Does an initial LIST to populate local cache
  2. Starts a WATCH from the resource version of that LIST
  3. On disconnect, re-watches from the last known resourceVersion
  4. The API server buffers events in a watchCache in memory (configurable with --watch-cache-sizes)

           ┌─────────────────────────┐
           │     kube-apiserver      │
           │                         │
           │  ┌─────────────────┐    │
           │  │   etcd watch    │    │
           │  └────────┬────────┘    │
           │           │             │
           │  ┌────────▼────────┐    │
           │  │   watchCache    │    │  ← In-memory ring buffer
           │  └────────┬────────┘    │
           │           │             │
           └───────────┼─────────────┘
                       │
     ┌─────────────────┼──────────────────┐
     │                 │                  │
┌────▼────┐      ┌─────▼─────┐      ┌─────▼─────┐
│Scheduler│      │Controller │      │  kubelet  │
│Informer │      │ Informer  │      │ Informer  │
└─────────┘      └───────────┘      └───────────┘
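To make the LIST-then-WATCH handshake concrete, here is a minimal sketch against an in-memory stand-in. `FakeAPIServer` and `Informer` are hypothetical classes written for this example; real informers (client-go) add resync, indexing and error handling that are left out here.

```python
# Toy LIST + WATCH loop. FakeAPIServer is a hypothetical in-memory stand-in.
class FakeAPIServer:
    def __init__(self):
        self.rv = 0        # global resourceVersion counter
        self.objects = {}  # name -> object
        self.events = []   # (rv, type, name, obj) history, like a watch cache

    def write(self, name, obj):
        self.rv += 1
        kind = "MODIFIED" if name in self.objects else "ADDED"
        self.objects[name] = obj
        self.events.append((self.rv, kind, name, obj))

    def list(self):
        # A snapshot plus the resourceVersion it is consistent with.
        return dict(self.objects), self.rv

    def watch(self, since_rv):
        # Every event newer than since_rv.
        return [e for e in self.events if e[0] > since_rv]


class Informer:
    def __init__(self, api):
        self.api = api
        self.cache = {}
        self.last_rv = 0

    def start(self):
        # Step 1: initial LIST fills the cache and pins a resourceVersion.
        self.cache, self.last_rv = self.api.list()

    def resume(self):
        # Steps 2-3: WATCH from the last seen resourceVersion; the same call
        # is what a reconnect after a dropped stream would use.
        for rv, _kind, name, obj in self.api.watch(self.last_rv):
            self.cache[name] = obj
            self.last_rv = rv
```

Because `resume` always starts from `last_rv`, no event between the LIST and the WATCH can be missed or applied twice.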

Aggregation Layer & CRDs

The API server can extend itself via two mechanisms:

  • CRDs (Custom Resource Definitions): Schema is stored in etcd, handled natively by the API server itself
  • Aggregation Layer (AA): Proxy traffic to an external API server (used by metrics-server, KEDA, etc.)

# CRD: the API server owns the storage (spec omitted for brevity)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com

---
# AA: the API server proxies to an external server (other spec fields omitted)
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system

Production Tuning Knobs

# --max-requests-inflight:          max concurrent non-mutating requests
# --max-mutating-requests-inflight: max concurrent mutating requests
# --watch-cache-sizes:              per-resource watch cache sizes
# (comments live up here: a comment after a trailing backslash breaks the command)
kube-apiserver \
  --max-requests-inflight=400 \
  --max-mutating-requests-inflight=200 \
  --watch-cache-sizes=pods#1000 \
  --enable-admission-plugins=NodeRestriction,PodSecurity \
  --audit-log-path=/var/log/audit.log \
  --audit-policy-file=/etc/k8s/audit-policy.yaml

2. 🟣 etcd: The Distributed Brain's Memory

What etcd Actually Is

etcd is a distributed key-value store built on the Raft consensus algorithm. It's not a database in the traditional sense; it's a fault-tolerant state machine where every write must be agreed upon by a quorum of nodes before it's committed.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   etcd-0    │     │   etcd-1    │     │   etcd-2    │
│  (LEADER)   │◄────│ (FOLLOWER)  │     │ (FOLLOWER)  │
│             │────►│             │     │             │
└──────┬──────┘     └─────────────┘     └──────▲──────┘
       │                                       │
       └───────────────────────────────────────┘
                    Raft Heartbeats

Raft in Plain English

  1. Leader Election: One node becomes leader and sends heartbeats. If a follower hears no heartbeat within its election timeout, it becomes a candidate and calls an election.
  2. Log Replication: Every write goes to the leader. Leader appends it to its log and replicates it to followers. Once a majority acknowledges, the write is committed.
  3. Quorum Math: floor(n/2) + 1 nodes must agree. For 3 nodes: 2. For 5 nodes: 3.

etcd write path:
Client → Leader    APPENDS entry to its log
         Leader    SENDS AppendEntries RPC to all followers
         Followers ACKNOWLEDGE
         Leader    COMMITS once a majority has acknowledged
         Leader    RESPONDS to client
         Leader    NOTIFIES followers of the commit
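The quorum arithmetic is easy to check by hand, and a few lines make the pattern visible. The helper names below are made up for this example; the formula is the majority rule from the list above.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return n - quorum(n)


for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 members tolerate the same single failure as 3, which is why etcd clusters are usually sized at odd numbers.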

How Kubernetes Data Lives in etcd

All Kubernetes objects are stored under /registry/ with the structure:

/registry/{resource-type}/{namespace}/{name}

Examples:
/registry/pods/default/nginx-7d8b9f-xyz
/registry/deployments/kube-system/coredns
/registry/secrets/default/my-secret
/registry/events/default/pod-scheduled-event

Built-in objects are serialized using protobuf (not JSON!) for efficiency; custom resources are stored as JSON. You can inspect it:

# Decode an etcd value
etcdctl get /registry/pods/default/nginx \
 --endpoints=https://127.0.0.1:2379 \
 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
 --cert=/etc/kubernetes/pki/etcd/server.crt \
 --key=/etc/kubernetes/pki/etcd/server.key \
 | auger decode # github.com/jpbetz/auger

MVCC: Multi-Version Concurrency Control

etcd uses MVCC, meaning it keeps multiple historical versions of every key. Each write increments a global revision counter. The API server exposes this revision as an object's resourceVersion and uses it for watch ordering and conflict detection.

# See the revision
etcdctl get /registry/pods/default/nginx -w json | jq .header.revision

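Old revisions are not kept forever. The API server asks etcd to compact its history periodically (every 5 minutes by default, via --etcd-compaction-interval), which deletes superseded revisions. This is why very old watches can fail with "compacted" errors. The 2GB figure often quoted here is a different limit: etcd's default backend quota (--quota-backend-bytes). When the database exceeds it, etcd raises an alarm and rejects writes until space is reclaimed.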
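A toy store shows how revisions, historical reads and compaction fit together. `TinyMVCC` is a hypothetical class for illustration only; it mimics the behavior described above and is not how etcd is implemented internally.

```python
class TinyMVCC:
    """Toy multi-version store: every put bumps one global revision."""

    def __init__(self):
        self.revision = 0
        self.history = {}   # key -> list of (revision, value), oldest first
        self.compacted = 0  # reads below this revision are no longer possible

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, at=None):
        at = self.revision if at is None else at
        if at < self.compacted:
            raise LookupError("required revision has been compacted")
        versions = [v for r, v in self.history.get(key, []) if r <= at]
        return versions[-1] if versions else None

    def compact(self, rev):
        # Keep the newest version at or before rev, drop everything older.
        for key, versions in self.history.items():
            old = [rv for rv in versions if rv[0] <= rev]
            new = [rv for rv in versions if rv[0] > rev]
            self.history[key] = old[-1:] + new
        self.compacted = rev
```

A watcher that tries to resume from a revision older than `compacted` hits the same kind of error the API server reports, and has to fall back to a fresh LIST.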

etcd Failure Modes You Must Know

| Scenario | What Happens |
|---|---|
| 1 node fails (3-node cluster) | Cluster continues. Writes still work. |
| 2 nodes fail (3-node cluster) | Quorum is lost: the cluster stops accepting writes and API server requests start failing (typically with 5xx errors). |
| Leader fails | Election happens. Writes stall until a new leader is elected, roughly one election timeout (1s by default in etcd). |
| Network partition | Minority side loses quorum and stops accepting writes. Majority continues. |
| etcd OOM | API server loses state store. Catastrophic. |

⚠️ This is the critical difference with K3s. If you're running K3s with embedded SQLite, you get zero HA for the datastore by default.


3. 🟡 kube-scheduler: The CPU-Time Auctioneer

What It Actually Does

The scheduler watches for Pods that have no nodeName assigned (they show as Pending) and decides which Node they should run on. It does NOT start the pod: it posts a Binding for the Pod to the API server, which sets spec.nodeName and persists it to etcd. The kubelet on that node then sees the Pod assigned to it and starts the containers.

Pod created (nodeName: "") → Scheduler sees it via watch
                           → Runs filtering + scoring
                           → Writes nodeName to Pod
                           → kubelet on that node sees the Pod
                           → kubelet pulls image + starts container

The Scheduling Framework: Two-Phase Deep Dive

Scheduling happens in two phases: Filtering and Scoring.

Phase 1: Filtering (Hard Constraints, binary pass/fail)

All Nodes
    │
    ▼
┌─────────────────────────────────────────────────────┐
│ Filter Plugins (run in parallel, any fail = remove) │
│                                                     │
│ • NodeUnschedulable - node.spec.unschedulable?      │
│ • NodeAffinity      - matchLabels on node?          │
│ • TaintToleration   - pod tolerates node taints?    │
│ • PodTopologySpread - spread constraints met?       │
│ • VolumeBinding     - PVC can bind to this node?    │
│ • NodeResourcesFit  - enough CPU/mem/GPU?           │
│ • NodePorts         - hostPort conflicts?           │
└─────────────────────────────────────────────────────┘
    │
    ▼
Feasible Nodes (subset)

Phase 2: Scoring (Soft Preferences, 0-100 score)

Feasible Nodes
    │
    ▼
┌─────────────────────────────────────────────────────┐
│ Score Plugins (weighted sum)                        │
│                                                     │
│ • LeastAllocated    - prefer less loaded nodes      │
│ • NodeAffinity      - preferred affinities          │
│ • InterPodAffinity  - co-locate or spread pods      │
│ • ImageLocality     - prefer nodes with image       │
│ • TaintToleration   - fewer preferred taints        │
│ • PodTopologySpread - balance spread                │
└─────────────────────────────────────────────────────┘
    │
    ▼
Highest Score Node → Binding (nodeName written)
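The two phases reduce to "drop what cannot work, then rank what is left". The sketch below assumes a drastically simplified world: CPU is the only resource, and the two scorers and their equal weights are invented for illustration. The real scheduler's plugins and default weights differ.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float
    cpu_used: float
    unschedulable: bool = False
    has_image: bool = False


def fits(pod_cpu, node):
    # Filter: a hard constraint, pass or fail.
    return node.cpu_capacity - node.cpu_used >= pod_cpu


def least_allocated(pod_cpu, node):
    # Score 0-100: share of CPU still free after placing the pod.
    return 100 * (node.cpu_capacity - node.cpu_used - pod_cpu) / node.cpu_capacity


def image_locality(pod_cpu, node):
    # Score 0-100: full marks if the image is already on the node.
    return 100 if node.has_image else 0


# Hypothetical weights, chosen only for this example.
SCORERS = [(least_allocated, 1), (image_locality, 1)]


def schedule(pod_cpu, nodes):
    feasible = [n for n in nodes if not n.unschedulable and fits(pod_cpu, n)]
    if not feasible:
        return None  # nothing passed filtering: this is where preemption would start

    def total(node):
        return sum(weight * scorer(pod_cpu, node) for scorer, weight in SCORERS)

    return max(feasible, key=total).name
```

With these weights a node that already holds the image can beat an emptier node, which is exactly the kind of trade-off the weighted sum is there to express.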

Preemption: What Happens When No Node Passes Filtering

If no node can fit the Pod, the scheduler checks whether lower-priority pods can be removed to make room:

  1. Find nodes where removing lower-priority pods creates enough room
  2. Pick the node that requires removing the fewest/lowest-priority pods
  3. Delete the victim pods (respecting graceful termination) and record the node in the pending Pod's status.nominatedNodeName → the pending Pod is scheduled in a later cycle once the room is free

# Priority classes matter here
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
---
# Built-in classes: system-cluster-critical = 2000000000, system-node-critical = 2000001000
# Pods in these classes will preempt your workloads if nodes are tight

The Binding Cache: Optimistic Concurrency

The scheduler maintains an assumed pod cache. After scoring but before the API server confirms the bind, the scheduler optimistically assumes the pod is placed and accounts for that node's capacity. This prevents scheduling thrash in high-throughput clusters.


4. 🟢 kube-controller-manager: The Reconciliation Engine

What It Actually Is

The controller manager is a single binary that runs ~30+ independent control loops as goroutines. Each controller watches specific resource types and reconciles desired state vs actual state.

// The reconciliation loop in pseudocode (every controller)
for {
    desired := get_desired_state_from_api_server()
    actual := get_actual_state_from_world()

    if desired != actual {
        take_action_to_make_actual_match_desired()
    }

    // Conceptual only: real controllers are woken by watch events, with a
    // periodic resync as a safety net (--min-resync-period defaults to 12h)
    sleep(resync_period)
}

Key Controllers and What They Actually Do

ReplicaSet Controller

Watches: ReplicaSets, Pods
Loop:
  current_pods = list pods with matching selector
  delta = replicaset.spec.replicas - len(current_pods)
  if delta > 0: create `delta` pods
  if delta < 0: delete abs(delta) pods (by priority: unscheduled first)
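As a runnable version of that loop body, the sketch below computes what a single reconcile pass would do. The pod representation (dicts with `name` and `node`) and the function name are assumptions for this example, and the deletion order is reduced to the one rule stated above: unscheduled pods go first.

```python
def reconcile_replicaset(desired_replicas, pods):
    """Return (pods_to_create, names_to_delete) for one reconcile pass.

    pods: list of dicts with 'name' and 'node' (None while unscheduled).
    """
    delta = desired_replicas - len(pods)
    if delta > 0:
        return delta, []
    # Scale down: sort so unscheduled pods (node is None) come first.
    # sorted() is stable, so scheduled pods keep their original order.
    victims = sorted(pods, key=lambda p: p["node"] is not None)[:-delta]
    return 0, [p["name"] for p in victims]
```

The function only decides; creating and deleting pods would go through the API server, and the resulting events would trigger the next pass.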

Deployment Controller

Watches: Deployments, ReplicaSets
Loop:
  desired_rs = compute_hash(deployment.spec.template)
  if no RS with that hash: create new RS
  scale up new RS, scale down old RS (by strategy: RollingUpdate or Recreate)
  update deployment.status (readyReplicas, conditions, etc.)

Node Controller: this one is critical to understand

Watches: Nodes
Loop:
  for each node:
    if no heartbeat for node-monitor-grace-period (historically 40s by default):
      set NodeReady=Unknown
      taint node with node.kubernetes.io/unreachable:NoExecute
      (this triggers pod eviction by the taint manager; pods get a default
       300s toleration for this taint, so eviction starts after ~5min)

EndpointSlice Controller: how Services actually work

Watches: Services, Pods
Loop:
  for each service:
    pods = list pods matching service.spec.selector where pod.status.ready=true
    build EndpointSlices (groups of 100 endpoints each)
    write EndpointSlices to API server
    (kube-proxy watches EndpointSlices and updates iptables/ipvs rules)
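The selection and chunking steps can be sketched in a few lines. The pod dictionaries and the function name are hypothetical; only the "ready pods matching the selector, at most 100 per slice" logic comes from the loop above.

```python
def build_endpoint_slices(pods, selector, max_per_slice=100):
    """Group the IPs of ready, selector-matching pods into slices.

    pods: list of dicts with 'ip', 'labels' and 'ready'.
    selector: dict of label key -> required value.
    """
    ready_ips = [
        p["ip"]
        for p in pods
        if p["ready"] and all(p["labels"].get(k) == v for k, v in selector.items())
    ]
    return [
        ready_ips[i:i + max_per_slice]
        for i in range(0, len(ready_ips), max_per_slice)
    ]
```

Splitting into bounded slices means a single pod change rewrites one small object instead of one huge endpoint list.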

Informer + WorkQueue Architecture

Every controller is built on the same pattern:

API Server Watch
       │
       ▼
   Informer
 (local cache)
       │
       ▼ (on change event)
   WorkQueue ←──── rate-limited, deduplicated
       │
       ▼
Worker goroutines (usually 1-5)
       │
       ▼
Reconcile function
       │
       ├── Success → remove from queue
       └── Failure → re-queue with exponential backoff

This pattern means controllers are eventually consistent: they don't act on every single event, they converge to the desired state over time.
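The two properties that matter in the queue, de-duplication and per-key backoff, fit in a short sketch. `WorkQueue` and `drain` are hypothetical names for this example, and the 5ms base delay is an assumed value rather than a quoted default; the real client-go workqueue also tracks in-flight keys and rate limits globally.

```python
from collections import deque


class WorkQueue:
    """De-duplicating queue with per-key exponential backoff (delays in ms)."""

    def __init__(self, base_delay_ms=5, max_delay_ms=1_000_000):
        self.items = deque()
        self.queued = set()  # keys currently waiting, used for de-duplication
        self.failures = {}   # key -> consecutive failure count
        self.base_delay_ms = base_delay_ms
        self.max_delay_ms = max_delay_ms

    def add(self, key):
        if key not in self.queued:  # ten events for one object = one work item
            self.queued.add(key)
            self.items.append(key)

    def get(self):
        key = self.items.popleft()
        self.queued.discard(key)
        return key

    def next_delay_ms(self, key):
        n = self.failures.get(key, 0)
        self.failures[key] = n + 1
        return min(self.base_delay_ms * 2 ** n, self.max_delay_ms)

    def forget(self, key):
        self.failures.pop(key, None)  # success resets the backoff


def drain(queue, reconcile):
    """Process everything queued once; return (key, delay_ms) pairs to retry."""
    retries = []
    while queue.items:
        key = queue.get()
        try:
            reconcile(key)
            queue.forget(key)
        except Exception:
            retries.append((key, queue.next_delay_ms(key)))
    return retries
```

Queuing keys instead of events is what lets a worker read the latest state from the informer cache and skip every intermediate change.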


5. 🔴 cloud-controller-manager: The Cloud API Bridge

What It Actually Does

The CCM was split out of kube-controller-manager (alpha in Kubernetes 1.6, beta in 1.11) specifically to decouple Kubernetes from cloud provider APIs. It runs cloud-specific control loops:

Node Controller (cloud variant)

On new Node joining:
  1. Fetch instance metadata from cloud API
     (AWS EC2 DescribeInstances / GCP ComputeInstances)
  2. Apply cloud provider labels:
     - topology.kubernetes.io/zone = us-east-1a
     - node.kubernetes.io/instance-type = m5.xlarge
  3. Set node addresses (internal/external IP from cloud metadata)
  4. Check if instance still exists periodically
     → If terminated in cloud: delete the Node object

Route Controller (AWS/GCP specific)

For each node:
  ensure cloud routing table has route:
    pod-cidr (e.g., 10.244.1.0/24) → node instance-id

This is how pod-to-pod routing works across nodes
WITHOUT an overlay network on supported clouds

Service Controller: The LoadBalancer Magic

Watch Services with type=LoadBalancer:
  on CREATE: call cloud API → create load balancer
             update service.status.loadBalancer.ingress with external IP
  on UPDATE: update LB listener rules / health checks
  on DELETE: delete cloud load balancer

This is why kubectl get svc shows <pending> for LoadBalancer services until the cloud LB is provisioned.


⚡ The K3s Control Plane: Architectural Reimagination

Now let's look at what K3s does differently: not just "it's smaller" but architecturally why.


K3s Single Binary Philosophy

K3s ships as a single ~70MB binary (k3s) that embeds:

k3s binary
├── k3s-server (control plane)
│   ├── kube-apiserver
│   ├── kube-controller-manager
│   ├── kube-scheduler
│   ├── kubelet
│   ├── kube-proxy
│   ├── embedded containerd
│   ├── embedded CoreDNS
│   ├── embedded Flannel (CNI)
│   ├── embedded Traefik (ingress)
│   ├── embedded ServiceLB (load balancer)
│   └── embedded local-path-provisioner (storage)
└── k3s-agent (worker)
    ├── kubelet
    ├── kube-proxy
    └── embedded containerd

The Kubernetes components are not containerized: they are linked as Go packages into a single binary and run in one process. Add-ons such as CoreDNS, Traefik and local-path-provisioner are bundled with the binary as manifests and deployed as ordinary pods once the server is up. Startup goes from roughly 3 minutes (typical K8s) to under 30 seconds.


1. K3s API Server: Same Core, Slimmer Defaults

The K3s API server is still the upstream kube-apiserver, but K3s ships it with a trimmed configuration:

Removed/Disabled by Default:

  • Alpha feature gates are disabled
  • In-tree cloud provider plugins are removed; K3s runs a minimal embedded cloud controller instead (start the server with --disable-cloud-controller to run a provider's CCM)
  • Several admission plugins that assume cloud infra

The K3s Tunnel Proxy: Reversing the API Server → Kubelet Connection

K3s introduces a reverse tunnel from agent → server. In standard K8s, the API server connects to the kubelet for exec/logs/port-forward. In K3s:

Standard K8s:
  kube-apiserver → kubelet:10250 (API server initiates)
  Requires API server to reach all nodes directly

K3s:
  k3s-agent → k3s-server:6443 (agent initiates)
  ┌────────────────────────────────────────────────┐
  │ WebSocket tunnel maintained by agent           │
  │ All kubelet traffic flows THROUGH this tunnel  │
  └────────────────────────────────────────────────┘

This is why K3s works behind NAT without special networking: agents reach out, not the server. This is a fundamental architectural shift that enables edge/IoT deployments.
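The direction reversal is the whole trick, and it can be shown with plain sockets. This sketch uses raw TCP on localhost instead of a WebSocket, and `run_agent` and `demo` are hypothetical names: it only demonstrates that the server can send a request to the agent over a connection the agent opened.

```python
import socket
import threading


def run_agent(server_addr):
    # The agent dials OUT to the server (which is what works from behind NAT)...
    with socket.create_connection(server_addr) as conn:
        # ...then answers requests that arrive over that same connection.
        request = conn.recv(1024).decode()
        conn.sendall(f"agent handled: {request}".encode())


def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    agent = threading.Thread(target=run_agent, args=(listener.getsockname(),))
    agent.start()

    tunnel, _ = listener.accept()    # the server never dials the agent
    tunnel.sendall(b"GET /logs")     # server-originated request, agent-originated connection
    reply = tunnel.recv(1024).decode()

    agent.join()
    tunnel.close()
    listener.close()
    return reply
```

Calling `demo()` returns the agent's reply even though the only connection attempt was made by the agent, which mirrors how logs and exec reach a kubelet that the server cannot address directly.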


2. SQLite / Kine: The etcd Abstraction Layer

This is the most significant architectural difference.

K3s introduces Kine ("Kine is not etcd"), a shim that translates etcd's gRPC API into SQL queries.

kube-apiserver
       │
       │ etcd gRPC v3 protocol (Range, Txn, Watch, etc.)
       ▼
┌──────────────┐
│     Kine     │  ← translation layer
│ (etcd shim)  │
└──────┬───────┘
       │ SQL queries
       ▼
┌──────────────┐
│   SQLite /   │  ← actual datastore
│  PostgreSQL  │
│    MySQL     │
└──────────────┘

How Kine Implements the etcd Watch API:

etcd's watch is event-driven via gRPC streams. SQL databases don't natively support this. Kine implements it via:

-- Kine's core table
CREATE TABLE kine (
  id INTEGER PRIMARY KEY AUTOINCREMENT, -- acts as etcd revision
  name TEXT,                            -- the key (/registry/pods/...)
  created INTEGER,
  deleted INTEGER,
  create_revision INTEGER,
  prev_revision INTEGER,
  lease INTEGER,
  value BLOB,                           -- the protobuf-encoded object
  old_value BLOB
);

-- Watch is implemented as polling:
-- SELECT * FROM kine WHERE id > last_seen_id ORDER BY id
-- Run every ~100ms, NOT event-driven like real etcd
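The polling idea is small enough to run with Python's built-in sqlite3 module. This is a sketch of the technique only: the table is cut down to four columns, and `open_store`, `put` and `poll` are hypothetical helpers, not Kine's actual code.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    # Reduced version of the table above: the autoincrement id is the revision.
    db.execute(
        """CREATE TABLE kine (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               name TEXT,
               deleted INTEGER DEFAULT 0,
               value BLOB)"""
    )
    return db


def put(db, name, value):
    # Every write is an append, so the row id doubles as the new revision.
    cur = db.execute("INSERT INTO kine (name, value) VALUES (?, ?)", (name, value))
    db.commit()
    return cur.lastrowid


def poll(db, last_seen_id):
    """One polling round: return rows newer than last_seen_id and the new cursor."""
    rows = db.execute(
        "SELECT id, name, value FROM kine WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last
```

A watcher would call `poll` on a timer and forward each returned row as a watch event, so the delay between a write and its event is bounded by the polling interval.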

The Implications:

  • For small clusters: unnoticeable
  • For large clusters: polling adds latency to watch events
  • SQLite: single-writer, no HA (single node only)
  • PostgreSQL/MySQL with Kine: HA possible but watch latency higher than etcd

Embedded etcd: HA Without an External Database

For HA without an external DB, K3s can run an embedded etcd cluster inside the server processes. Early K3s releases shipped an experimental DQLite backend (distributed SQLite over Raft) for this purpose; it was replaced by embedded etcd around K3s v1.19. The --cluster-init flag shown below bootstraps that embedded etcd cluster.

# K3s with embedded HA using embedded etcd
k3s server --cluster-init # First server (bootstrap)
k3s server --server https://first-server:6443 --token <token> # Join as HA peer

3. K3s Controller Manager: Pruned and Extended

K3s runs the upstream kube-controller-manager and swaps the provider cloud-controller-manager for a minimal embedded cloud controller. As a result, the cloud-specific loops described earlier are gone:

Removed Controllers:

  • cloud-node controller (no cloud metadata fetching)
  • cloud-node-lifecycle controller
  • route controller (no cloud routes)
  • service controller (replaced by ServiceLB)

Added: ServiceLB (a.k.a. Klipper LoadBalancer)

Instead of calling a cloud API to provision a load balancer, K3s runs a DaemonSet-based solution:

Service type=LoadBalancer created
       │
       ▼
ServiceLB Controller watches for it
       │
       ▼
Creates a DaemonSet:
  - Runs a pod on every node with hostPort matching service ports
  - The pod does iptables DNAT → service ClusterIP
       │
       ▼
Every node's IP becomes a valid entry point
(no external LB needed)

# What ServiceLB actually deploys under the hood (abridged: selector and labels omitted)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: svclb-my-service
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: lb-port-80
        image: rancher/klipper-lb:latest
        ports:
        - hostPort: 80 # binds on every node
          containerPort: 80
        env:
        - name: SRC_PORT
          value: "80"
        - name: DEST_PROTO
          value: TCP
        - name: DEST_IP
          value: "10.43.100.50" # ClusterIP (K3s default service CIDR is 10.43.0.0/16)
        - name: DEST_PORT
          value: "80"

4. K3s Scheduler: Unchanged but Co-located

The scheduler in K3s is the unmodified upstream kube-scheduler. However, it runs as a goroutine inside the k3s-server binary rather than as a separate process.

The key difference is operational:

  • In K8s: scheduler can be independently scaled, upgraded, or replaced (e.g., with Volcano, Yunikorn)
  • In K3s: scheduler is embedded, so replacing it means starting the server with --disable-scheduler and running your own scheduler in the cluster

5. The Flannel CNI: Embedded Networking

Standard K8s requires you to install a CNI (Calico, Cilium, Flannel, Weave) separately. K3s embeds Flannel with VXLAN as the default backend.

Pod on Node 1 (10.42.1.5) → Pod on Node 2 (10.42.2.7)

Standard K8s + Calico:
  10.42.1.5 → BGP route → 10.42.2.7 (no encapsulation on supported networks)

K3s + Flannel VXLAN:
  10.42.1.5 → VXLAN encapsulate → eth0 (UDP 8472) → Node 2 → decapsulate → 10.42.2.7
  (works everywhere, slight overhead from encapsulation)

K3s also supports swapping Flannel for Cilium or Calico if you disable the built-in:

k3s server --flannel-backend=none --disable-network-policy
# Then install Cilium/Calico manually

📊 Side-by-Side Deep Comparison

| Dimension | Standard Kubernetes | K3s |
|---|---|---|
| Deployment model | Separate processes (+ etcd cluster) | Single binary, all-in-one |
| API server | Full upstream, all features | Full upstream, conservative defaults |
| Datastore | etcd (Raft, event-driven watch) | SQLite via Kine (SQL polling), external SQL DB via Kine, or embedded etcd |
| Watch latency | ~10ms (event-driven) | ~100ms (polling on SQL backends) |
| HA datastore | etcd cluster (3/5 nodes) | External DB + Kine OR embedded etcd |
| Control plane HA | Multiple API server replicas | Multiple k3s-server nodes possible |
| Cloud integration | cloud-controller-manager | Minimal embedded cloud controller, ServiceLB + node-ip flags |
| LoadBalancer | Cloud LB (AWS ELB, GCP GLB) | ServiceLB DaemonSet (hostPort) |
| Ingress | Bring your own (nginx, traefik) | Traefik bundled |
| CNI | Bring your own | Flannel (VXLAN) embedded |
| DNS | Bring your own CoreDNS | CoreDNS bundled |
| Storage | Bring your own CSI | local-path-provisioner bundled |
| Kubelet location | Separate binary on worker | Embedded in k3s binary |
| API server → kubelet | Direct connection (port 10250) | Reverse WebSocket tunnel |
| Memory (control plane) | ~2GB+ (separate processes) | ~512MB (single process) |
| Startup time | 2-5 minutes | 20-30 seconds |
| Alpha feature gates | Available | Disabled by default |
| Admission webhooks | Full support | Full support |
| CRDs | Full support | Full support |
| RBAC | Full support | Full support |
| Audit logging | Configurable | Configurable |
| Scheduler extensibility | Scheduler profiles, plugins | Embedded; replace with external |
| Controller extensibility | Separate binary, hot-swap | Embedded goroutine |
| Upgrades | Independent component upgrades | Single binary upgrade |
| Edge/NAT traversal | Requires direct reachability | Native via reverse tunnel |
| ARM support | Separate builds | Native multi-arch in single release |

🔑 When to Choose What

Choose Standard Kubernetes When:

✅ 100+ node clusters
✅ Financial / regulated workloads requiring etcd for compliance
✅ You need independent control plane component upgrades
✅ You're using cloud-managed control planes (EKS, GKE, AKS)
✅ You need custom scheduler profiles (ML batch, GPU scheduling)
✅ Multi-tenancy with strong isolation requirements
✅ You need external etcd for ultra-high availability
✅ Team has K8s expertise and infra budget

Choose K3s When:

✅ Edge computing (retail, industrial, remote sites)
✅ IoT / ARM devices (Raspberry Pi clusters)
✅ CI/CD ephemeral clusters (fast startup is critical)
✅ Development environments (minimal resource usage)
✅ Single-node homelab or small on-prem clusters
✅ Clusters behind NAT (reverse tunnel is a killer feature)
✅ Teams that want "it just works" with less Ops overhead
✅ Bare metal without a cloud provider
✅ Air-gapped environments (single binary, easy to ship)

🔭 The Architecture Decision Tree

Do you need >50 nodes?
├── YES → Standard K8s (EKS/GKE/AKS or kubeadm)
└── NO
    ├── Are you on edge/IoT/ARM?
    │   └── YES → K3s (purpose-built for this)
    ├── Do you need cloud LoadBalancer integration?
    │   └── YES → Standard K8s with CCM
    ├── Is startup speed critical? (CI/CD, dev envs)
    │   └── YES → K3s
    ├── Do you need etcd for compliance/audit?
    │   └── YES → Standard K8s
    └── Default recommendation for <20 nodes on-prem?
        └── K3s (less to manage, same K8s API)

🎯 Closing Thoughts

K3s isn't "Kubernetes with stuff removed." It's a purpose-built reimagining of the control plane for constrained environments. Rancher made deliberate trade-offs:

  • etcd → Kine/SQLite: Sacrificed watch latency and native HA for operational simplicity
  • Separate binaries → Single binary: Sacrificed independent upgradeability for atomic deployments
  • CCM → ServiceLB: Sacrificed cloud-native LB for zero-dependency load balancing
  • Direct kubelet access → Reverse tunnel: Sacrificed simplicity for NAT traversal capability

The result is a distribution that runs the full Kubernetes API on a Raspberry Pi with 512MB of RAM, starts in 30 seconds, and works behind NAT: things standard K8s simply wasn't designed for.

Both are Kubernetes. Both run your workloads. The control plane is where the real difference lives.