This hands-on guide shows a practical path to using CRIU and sidecar operators to checkpoint, transfer, and restore running stateful processes, so you can move services across Kubernetes nodes or clusters with minimal downtime. It covers process snapshot/restore, operator design patterns, and the networking tricks required for live migration of stateful workloads.
Why checkpointing matters for stateful mobility
Modern cloud-native orchestration is excellent at scheduling, but moving a running stateful process—complete with open sockets, in‑memory caches, and transient kernel state—remains challenging. Checkpoint/restore enables:
- Planned node maintenance without service interruptions
- Load‑aware rebalancing of memory- or CPU‑heavy stateful pods
- Hybrid/multi‑cluster evacuations and cross‑region mobility
Core building blocks
Successful live migration on Kubernetes needs three coordinated pieces:
- CRIU (Checkpoint/Restore in Userspace) — the low‑level tool to freeze a process and serialize kernel/userland state to disk.
- Sidecar operator or controller pattern — automated control plane that orchestrates checkpointing, storage, and restore steps per pod.
- Network and storage plumbing — tricks to preserve or transparently reconnect sockets, and to ensure persistent volumes follow the service.
Design pattern: Sidecar operator + coordinator
Rather than embedding CRIU logic into the main application container, use a sidecar that runs with elevated capabilities (or uses a host helper) to perform checkpoints and coordinate with an operator. Typical responsibilities:
- Sidecar: runs CRIU, manages local image directories, and communicates with remote storage (S3/MinIO) or object store.
- Operator: Kubernetes controller that discovers eligible pods, triggers checkpoint flows, updates Endpoint/Service objects, and handles leader election for consistency.
- Coordinator: optional lightweight central service (or CRD) that records migration state and orchestrates multi‑pod migrations.
Why sidecar?
Sidecars isolate privilege needs (CRIU typically requires CAP_SYS_ADMIN and CAP_SYS_PTRACE; on kernels 5.9+, CAP_CHECKPOINT_RESTORE can replace much of that), let the main container remain immutable, and make operator logic testable and reusable across workloads.
Practical migration workflow
Below is a compact, reliable workflow for a single‑pod live migration. This assumes the pod uses a sidecar that can run CRIU and that persistent volumes are available via a CSI driver that supports volume relocation or multi‑attach.
Step 1 — Prepare the pod
- Ensure the sidecar image contains CRIU matching the node kernel and required kernel configs (CONFIG_CHECKPOINT_RESTORE, netlink, namespaces).
- Grant the sidecar minimal elevated capabilities: CAP_SYS_ADMIN, CAP_SYS_PTRACE, and CAP_DAC_READ_SEARCH, or run it privileged on trusted nodes.
- Mount an image directory (emptyDir) for CRIU image files and a tokenized credentials volume for object storage.
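The prep steps above might translate into a pod spec fragment like the following. This is a sketch: the image name, Secret name, and mount paths are illustrative, not a canonical layout. Note that Kubernetes capability names drop the `CAP_` prefix.

```shell
# Write a sketch of the sidecar portion of the pod spec.
# All names here (image, secret, paths) are illustrative.
cat > /tmp/criu-sidecar-snippet.yaml <<'EOF'
containers:
- name: criu-sidecar
  image: registry.example.com/criu-sidecar:v1   # hypothetical image
  securityContext:
    capabilities:
      add: ["SYS_ADMIN", "SYS_PTRACE", "DAC_READ_SEARCH"]
  volumeMounts:
  - name: criu-images
    mountPath: /criu/images
  - name: store-creds
    mountPath: /secrets/store
    readOnly: true
volumes:
- name: criu-images
  emptyDir: {}
- name: store-creds
  secret:
    secretName: migration-store-creds   # hypothetical Secret
EOF
```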
Step 2 — Quiesce and checkpoint
The operator triggers a graceful quiesce (application-level freeze) when possible; otherwise, use CRIU's pre-dump (and lazy-pages/userfaultfd) strategies for processes with large memory footprints. Example CRIU command (conceptual):
criu dump -t <pid> --images-dir /criu/images --tcp-established --shell-job
Notes:
- Use --tcp-established to capture open TCP connections (requires kernel support and careful restore handling).
- By default, criu dump kills the process after a successful checkpoint; pass --leave-running if the source should keep serving (e.g., for rehearsal runs).
- For large applications, use iterative pre-dumps to transfer most memory while the process keeps running, so the final freeze window stays short.
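The pre-dump iteration strategy can be sketched as a small wrapper. This assumes root and a matching CRIU on the node; `predump_and_dump` and its paths are illustrative, and the function is only defined here, not run.

```shell
# Iterative pre-dump followed by a final dump (sketch; run on the
# source node as root with CRIU installed).
predump_and_dump() {
  local pid="$1" base="$2" iterations="${3:-2}"
  local prev=""
  for i in $(seq 1 "$iterations"); do
    mkdir -p "${base}/pre${i}"
    # Each pre-dump copies memory while the process keeps running;
    # --prev-images-dir makes the next pass incremental.
    criu pre-dump -t "$pid" --images-dir "${base}/pre${i}" \
      ${prev:+--prev-images-dir "$prev"} --track-mem
    prev="../pre${i}"
  done
  mkdir -p "${base}/final"
  # The final dump freezes the process and captures only the
  # pages dirtied since the last pre-dump.
  criu dump -t "$pid" --images-dir "${base}/final" \
    ${prev:+--prev-images-dir "$prev"} --track-mem \
    --tcp-established --shell-job
}
```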
Step 3 — Transfer images
Sidecar uploads the CRIU image directory to object storage (S3/MinIO) using atomic markers (e.g., upload to /migrations/<id>/ready). The operator records the migration target node/cluster in a CRD and ensures the destination has the image available before restoring.
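The upload-then-marker pattern looks like this. A local directory stands in for the object store in this sketch, and the migration id and paths are illustrative; with S3/MinIO the same ordering applies. The point is that the payload lands first and the small `ready` marker last, so the restorer never sees a partial image set.

```shell
# Publish a checkpoint image set with an atomic "ready" marker.
MIG_ID="demo-001"                 # illustrative migration id
SRC=/tmp/criu-images-src          # CRIU image dir written by the sidecar
STORE=/tmp/object-store           # local stand-in for the object store

mkdir -p "$SRC" "$STORE/migrations/$MIG_ID/images"
echo "placeholder" > "$SRC/pages-1.img"   # stands in for a real CRIU image

# 1. Upload the payload first...
cp -r "$SRC/." "$STORE/migrations/$MIG_ID/images/"
# 2. ...then create the marker the destination polls for.
touch "$STORE/migrations/$MIG_ID/ready"
```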
Step 4 — Prepare destination pod
- Schedule a destination pod with an identical container image layout, compatible kernel, and the same sidecar present.
- Ensure persistent storage is attached (either the same PV bound, CSI volume clone, or an application-level state resync mechanism).
- Set up a temporary network proxy (traffic buffer) so incoming connections are accepted while the destination restores.
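A minimal traffic buffer can be built with socat. This is a sketch: `start_migration_proxy` is a hypothetical helper, defined here but not started, and the ports are illustrative.

```shell
# Temporary traffic buffer during restore. socat accepts connections
# on the service port and retries the destination until it is up.
start_migration_proxy() {
  local listen_port="$1" dest_host="$2" dest_port="$3"
  # fork: one child per connection; reuseaddr: fast restart on cutover;
  # retry/interval: keep trying the destination while it restores.
  socat "TCP-LISTEN:${listen_port},fork,reuseaddr" \
        "TCP:${dest_host}:${dest_port},retry=30,interval=1" &
  echo $!   # PID, so the operator can kill the proxy after cutover
}
```

The operator records the returned PID and tears the proxy down once cutover completes.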
Step 5 — Restore and cut over
On the destination, sidecar downloads image files and calls CRIU restore:
criu restore --images-dir /criu/images --tcp-established --shell-job
Once the process is restored and health checks pass, the operator updates Service Endpoints and load‑balancer rules to direct traffic to the new pod, drains the temporary proxy, and removes the original pod.
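One way to implement the cutover is label-based, assuming the Service selects on a `role=active` label (pod names, label, and namespace are illustrative; the helper is defined here, not executed):

```shell
# Cutover sketch: make the Service select the restored pod, then
# remove the source pod.
cutover() {
  local src_pod="$1" dst_pod="$2" ns="${3:-default}"
  # Point the Service at the restored pod first...
  kubectl -n "$ns" label pod "$dst_pod" role=active --overwrite
  # ...then strip the label from the source and delete it.
  kubectl -n "$ns" label pod "$src_pod" role-
  kubectl -n "$ns" delete pod "$src_pod" --wait=false
}
```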
Networking tricks for near‑zero downtime
Maintaining TCP continuity is the hardest part. Strategies include:
- TCP handoff via proxy: Use a lightweight sidecar proxy (Envoy or socat) that holds connections during migration; the proxy is repointed to the new pod after restore.
- IP takeovers and veth tricks: On single‑node mobility, move the network namespace (veth/ip link move) and IP address to the destination; this is complex and limited to same-host or cluster networks.
- Service mesh session draining: Use service mesh traffic shifting (weighted routing) to gradually move new sessions while existing ones are drained after restore.
- Application reconnection logic: Design apps to reestablish stateful connections (session tokens, checkpointed in-memory session store) if full TCP preservation is impossible.
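For the mesh-based option, weighted traffic shifting might look like this Istio VirtualService (Istio is one example mesh; the host, subset names, and weights are illustrative):

```shell
# Shift 80% of new connections to the restored pod while the
# original drains. Subsets would be defined in a DestinationRule.
cat > /tmp/migration-vs.yaml <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
        subset: source      # original pod
      weight: 20
    - destination:
        host: payments
        subset: restored    # migrated pod
      weight: 80
EOF
```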
Storage considerations
Stateful services usually depend on persistent volumes. Options:
- Shared storage (NFS, Ceph, S3): Destination can immediately mount same volume.
- CSI cloning/replication: Use storage backend that supports fast clone/attach to move the PV with the pod.
- Application-level sync: For caches or ephemeral state, checkpoint the critical in-memory state and repopulate after restore.
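For the CSI-cloning option, a clone is requested by creating a PVC whose dataSource points at the original claim. Names, StorageClass, and size are illustrative; the StorageClass must actually support volume cloning.

```shell
# Clone an existing PVC via the CSI dataSource mechanism.
cat > /tmp/pvc-clone.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-clone
spec:
  storageClassName: fast-csi        # must support volume cloning
  dataSource:
    kind: PersistentVolumeClaim
    name: app-data                  # the source PVC
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
```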
Operator patterns and CRD design
Design a Migration CRD with fields: sourcePod, targetNode/cluster, storageRef, imageManifest, ttl, and status phases. Operator responsibilities:
- Validate kernel and CRIU compatibility preflight
- Trigger pre-dump/dump and orchestrate uploads
- Create destination pod with matching spec and coordinate restore
- Manage cutover, endpoint updates, retries and rollback
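A Migration custom resource using the fields above might look like this; the API group, names, and phase values are all illustrative, not a published schema:

```shell
# Example Migration CR with the fields listed above.
cat > /tmp/migration-cr.yaml <<'EOF'
apiVersion: migrations.example.com/v1alpha1   # hypothetical API group
kind: Migration
metadata:
  name: payments-0-move
spec:
  sourcePod: payments-0
  targetNode: node-b
  storageRef: s3://migration-images/payments-0-move   # illustrative bucket
  imageManifest: manifest.json
  ttl: 30m
status:
  # e.g., Preflight -> Dumping -> Transferring -> Restoring -> CutOver -> Done
  phase: Preflight
EOF
```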
Limitations and hard lessons
CRIU is powerful but has constraints: kernel feature parity is needed across nodes, support for some kernel facilities is incomplete (e.g., certain seccomp configurations and eBPF programs), and checkpointing processes that hold GPU or other hardware device state is largely out of scope. TCP preservation is not bulletproof; plan for partial reconnection or application-aware session migration. Always test on a staging cluster that mirrors production.
Checklist for a reliable migration run
- Sidecar and operator images tested and versioned
- CRIU built for node kernels and kernel configs confirmed
- Object storage accessible from both source and destination
- Persistent volume strategy validated (shared/clone/replicate)
- Network proxy or mesh flow prepared for traffic buffering
- Health checks and automated rollback paths in operator
Checkpointing with CRIU and orchestrating via sidecar operators is a practical route to achieving near‑zero downtime mobility for stateful Kubernetes workloads; the approach demands careful attention to kernel compatibility, networking, and storage plumbing but yields powerful operational flexibility.
Ready to try this in your environment? Start by building a CRIU‑enabled sidecar image, test a simple TCP service migration on a dev cluster, and iterate from there.
