On paper, high-availability sounds like a neat little trick for Proxmox users. Being able to access your LXCs and VMs even if the underlying host goes down is pretty handy, especially when you’re running a DNS server, firewall, and other mission-critical services on your nodes. Plus, Proxmox’s HA provisions can migrate your virtual guests to other nodes automatically without requiring any manual input whatsoever.
If you’re rocking a production-heavy home lab or just want to tinker with automation projects, high-availability clusters are definitely worth checking out. Otherwise, the novelty wears off really quickly once you realize you have to buy extra hardware for clusters and commit a lot of time to maintaining your high-availability PVE environment.
You have to contend with the quorum issue of PVE clusters
HA requires at least three nodes (or semi-functional workarounds)
Leaving pve-zsync aside because it’s meant to provide offsite backups instead of redundancy, you’ll typically have to deploy a Proxmox cluster for a high-availability setup. Although the process has become somewhat easier with recent versions of PVE, you’ll need at least three nodes to maintain a quorum for high-availability. Quorum is a mechanism that lets your PVE nodes make decisions for migration, HA failovers, and other cluster tasks via majority voting. Without implementing quorum-based voting, Proxmox would be susceptible to split-brain issues, where sudden disconnections between server nodes could cause them to work independently and result in inconsistencies in cluster operations.
Let’s say you’ve got three nodes and one of them goes down. The quorum is maintained, as 2 out of 3 hosts are still operational. But if you’ve built a Proxmox cluster with just two machines, the cluster loses quorum the moment either node goes down, and the surviving node will refuse most management operations until the other one returns. That’s because you’ve only got 50% of the votes available, as opposed to the strict majority (more than 50%) required to authorize migrations and other cluster changes. You could look into workarounds involving a third, ultra-cheap device acting as the QDevice, but this fix is far from ideal for Ceph clusters that require at least three nodes to function.
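To make the voting math concrete, here’s a minimal Python sketch. It assumes every node (and the QDevice, if you add one) carries exactly one vote, and the function names are my own inventions for illustration rather than anything shipped with Proxmox.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def is_quorate(total_votes: int, votes_online: int) -> bool:
    """True if the votes still reachable are a strict majority of all votes."""
    return votes_online >= votes_needed(total_votes)


# Three nodes, one down: 2 of 3 votes remain, so the cluster keeps quorum.
three_node = is_quorate(total_votes=3, votes_online=2)

# Two nodes, one down: 1 of 2 votes is only 50%, so quorum is lost.
two_node = is_quorate(total_votes=2, votes_online=1)

# Two nodes plus a QDevice (assumed to add one vote), one node down:
# the surviving node and the QDevice together hold 2 of 3 votes.
two_node_qdevice = is_quorate(total_votes=3, votes_online=2)

print(three_node, two_node, two_node_qdevice)
```

The same arithmetic explains why an even node count buys you nothing on its own: a four-node cluster needs three votes, so it tolerates only one failure, just like a three-node cluster.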
Ceph requires a lot of hardware and time investment
And ZFS replication has its own problems
While we’re on the subject of Ceph storage, it’s one of the best ways to provide distributed storage for Proxmox clusters. That said, it’s really easy to mess up your Ceph cluster when you’re putting it together. For example, I always recommend distributing the storage devices across your cluster nodes instead of turning a single system into a massive HDD server and leaving the processing tasks to other hosts.
Even with a fairly minimal setup, Ceph requires a lot of headroom on the network front, and you’d want to go for a 2.5G stack at the very least. The usual advice is to go for more than three nodes (an odd number, of course) when deploying a Ceph cluster, though I can’t say I fully agree, as my three-node HA setup works fine as long as two systems are operational. Still, troubleshooting Ceph can be a royal pain, and even with the extensive utilities available on Proxmox’s web UI, you’ll have to invest a lot of time fixing things when they go wrong. Plus, if your secondary nodes have fewer system resources than the primary PVE host, they might not be able to run the virtual guests if something happens to the latter.
Then you’ve got ZFS replication, which works pretty well for budget-friendly setups, including 2-node Proxmox configurations. Well, you’ll still need a third system for quorum votes, but you can configure two nodes for the HA workflows. ZFS replication uses snapshots to transfer your virtual guest data between host machines, and these synchronization operations run at fixed intervals. In case one of the nodes goes down, Proxmox recovers the virtual guests on the other host and uses the replicated snapshots to spin up a new instance with the data as it stood at the last replication task.
The drawback here is that you’ll have to balance the replication intervals for your LXCs and VMs, or risk choking the network with too many sync operations. Then you have to deal with the data loss issue. If your node goes down before it syncs data with the others, you’ll lose all work since the last replication task, and this can be annoying when your nodes transfer snapshots at longer intervals (say, every few hours). In my tests, migration and recovery times on ZFS replication were also a little longer than those on Ceph clusters, and that was with the sync interval on the former set to 30 minutes.
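If you want a rough feel for that trade-off, the back-of-the-envelope sketch below may help. It’s a simplified model, not how Proxmox schedules anything: it assumes a guest changes data at a steady rate, that each sync only ships the changes made since the previous one, and it ignores compression and protocol overhead. The function names and the sample numbers are hypothetical.

```python
def worst_case_loss_minutes(interval_min: float) -> float:
    """Under this model, a crash right before the next sync loses one full interval."""
    return interval_min


def sync_seconds(changed_mb_per_hour: float,
                 interval_min: float,
                 link_mbit_per_s: float) -> float:
    """Estimated time to ship one incremental snapshot over the link."""
    delta_mb = changed_mb_per_hour * interval_min / 60  # data changed since last sync
    return delta_mb * 8 / link_mbit_per_s               # megabytes -> megabits


# Hypothetical guest that changes 2000 MB per hour, synced every 30 minutes
# over a 1000 Mbit/s link.
loss = worst_case_loss_minutes(30)
transfer = sync_seconds(changed_mb_per_hour=2000, interval_min=30, link_mbit_per_s=1000)

print(f"worst-case loss: {loss} min, estimated sync time: {transfer} s")
```

Shortening the interval shrinks the window of lost work and the size of each transfer, but it multiplies the number of sync jobs competing for the link, which is exactly the balancing act described above.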
Standalone Proxmox nodes are better for most home labs
This is going to sound really controversial for an article on clusters, but the deeper I go into the HA rabbit hole, the more I realize the utility of entirely separate Proxmox nodes. Look, I love clusters as much as any other tinkerer, but considering the monetary investment required for multiple nodes, extra storage, and better networking paraphernalia, it’s not really worth it for home labbers – especially with the RAM apocalypse shooting PC hardware prices through the roof.
But if you love creating DIY projects and don’t want to leave your essential services vulnerable to a botched experiment, going with two separate nodes is a decent option – and I’m willing to fight on this hill until my last breath. You can have a high-end server that caters to your tinkering needs, while a budget-friendly Proxmox node can house mission-critical containers and virtual machines. For extra protection, you can instead invest the money you’d otherwise spend on expensive clusters to deploy a 3-2-1 Proxmox Backup Server workflow, while a Proxmox Datacenter Manager VM running on the “safe” PVE node can help you migrate virtual guests between your tinkering machines.
