It took me far too long to get Proxmox up and running in my home lab, but now that everything is running smoothly, my focus has shifted to mapping, documenting, monitoring, and adding redundancies to ensure minimal downtime. Proxmox itself offers many features to aid with this. The web UI is absolutely awesome to use, and you can easily configure remote access for managing the home lab while away, but I wanted to use a few tools to take this a step further, bolting more protection onto Proxmox's backups, snapshots, and other useful functionality.

Using free tools, I managed to map, monitor, and configure my home lab to automatically heal itself. That sounds like quite the achievement, but it was easier to configure than I initially assumed. The goal was to have my entire home lab running like a production environment and not some PC cases inside a data cabinet. Without overcomplicating things (which is the worst thing to do with a home lab), I wanted to map what actually existed in my home lab and on the LAN. I then wanted to configure some monitoring to check that everything is running.

Finally, I wanted my home lab to be resilient. Not to the point of relaunching services over and over, but enough to recover from the most common failures without requiring human intervention. I managed to achieve all this in a few hours and with some well-documented tools.

The home lab grew large enough

I needed some help keeping tabs on everything

The LAN expanded from just a single switch and fiber link to the outside world to multiple switches, trunking, and access points. I have multiple areas to cover. Throw in network-attached storage (NAS), a custom OPNsense firewall, some LXCs and VMs on Proxmox nodes, and the need to keep all running services online, and I suddenly found myself needing more checks and balances in place. The home lab went from a small hobby on the side for running some cool apps to an integral part of the home. Home Assistant alone is a perfect example of this.

To begin with, I mapped out my entire network. OPNsense was fine for listing all DHCP clients, but I wanted an easier and more thorough scan, which led me to Nmap. This utility is absolutely brilliant. There's also Zenmap, a GUI front end that can be used to achieve similar results. Using either tool allowed me to sweep the entire network in seconds, discovering everything on the LAN. I didn't want fancy diagrams or a visual representation of the network, so NetBox was out of the question. It's overkill for my needs anyway, but it's definitely worth checking out if you want the best.

For my setup, I'm going with Gitea, a self-hosted git platform. I've got a label maker and wanted to use QR codes and printed data to show what each device does, with the ability to scan the QR code on a phone to quickly bring up the Gitea MD file for that network client. It's also great for inventory, so I could kill all birds with a single self-hosted stone. Nmap and Zenmap were used to collate everything. nmap -sV -O -oX scan_results.xml 10.0.0.0/24 (swap 10.0.0.0 for your network range) provided ample data for Gitea to be populated, and the -oX flag exported all the data into an XML file. This was invaluable.
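
To give an idea of how that XML export can turn into per-device pages, here is a minimal sketch using only the Python standard library. It is not the exact process used for this repo: the function names, the Markdown layout, and the output folder are assumptions made for illustration. It relies on the host, address, hostname, port, service, and osmatch elements that Nmap writes with -oX.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_hosts(xml_data):
    """Return one dict per host that nmap reported as up."""
    hosts = []
    for host in ET.fromstring(xml_data).iter("host"):
        status = host.find("status")
        if status is None or status.get("state") != "up":
            continue
        # Map address type (ipv4, ipv6, mac) to its value.
        addrs = {a.get("addrtype"): a.get("addr") for a in host.findall("address")}
        ip = addrs.get("ipv4") or addrs.get("ipv6") or "unknown"
        hostname = host.find("hostnames/hostname")
        os_match = host.find("os/osmatch")
        services = []
        for port in host.findall("ports/port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue
            svc = port.find("service")
            attrs = svc.attrib if svc is not None else {}
            label = " ".join(attrs[k] for k in ("name", "product", "version") if attrs.get(k))
            services.append((f"{port.get('portid')}/{port.get('protocol')}", label or "unknown"))
        hosts.append({
            "name": hostname.get("name") if hostname is not None else ip,
            "ip": ip,
            "mac": addrs.get("mac", "unknown"),
            "os": os_match.get("name") if os_match is not None else "unknown",
            "services": services,
        })
    return hosts


def to_markdown(host):
    """Render one host as a small Markdown page (layout is an assumption)."""
    lines = [
        f"# {host['name']}",
        "",
        f"- IP: {host['ip']}",
        f"- MAC: {host['mac']}",
        f"- OS guess: {host['os']}",
        "",
        "## Open ports",
        "",
    ]
    lines += [f"- {port}: {label}" for port, label in host["services"]] or ["- none detected"]
    return "\n".join(lines) + "\n"


def write_pages(xml_path, out_dir):
    """Write one Markdown file per live host into out_dir (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for host in parse_hosts(Path(xml_path).read_bytes()):
        (out / f"{host['ip']}.md").write_text(to_markdown(host))
```

Calling write_pages("scan_results.xml", "network") would drop one file per live host into a network folder, ready to be tidied up by hand and committed to the repository.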

Moving everything to Gitea

Within Gitea, I created a home lab and network repository. This would be used to show all the relevant data for the entire network stack and home lab hardware. Every service would have its own MD file with everything needed to know precisely what to do when a failure occurs, plus all the details required for connecting to and administering self-hosted content. And because Gitea offers so much control as a self-hosted git platform, an account with full access is required for anything to be readable, adding a security layer against anyone who fancies themselves a black hat hacker and scans the QR codes.

The README.md file acts as a sort of network dashboard, which then links out to various subfolders within the repo. I've got everything categorized in /servers, /smart-home, /network, and /storage. The last doesn't relate to NAS and other storage devices, but rather to inventory for anything that's stashed away in the garage. Gone are the days when I'd think, "Do I still have that old kit of DDR4 RAM now valued as much as my home?" Now I can quickly check Gitea to see where it is or which device has it installed. I've inventoried everything, including all parts in all systems.

Remember those QR codes and the printer? Well, I've even stacked the codes and created small scannable labels that are now stuck to all my components. Quickly scanning that particular device will bring up an MD file with all the details. One could argue this is overkill, much like NetBox on a home network, but I find it liberating to finally understand how everything comes together, not just on the surface, but within each chassis. I can look at my Proxmox nodes and know precisely what's inside each system.
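
For anyone wanting to try the label idea, the part worth scripting is the link each QR code points to. The sketch below builds the web URL for a file in a Gitea repository, following Gitea's /owner/repo/src/branch/name/path pattern. The host name, owner, repository, and file paths are made-up placeholders, and the QR image itself would still need a separate generator (a third-party package or the label maker's own software), which is not shown here.

```python
from urllib.parse import quote


def file_url(base_url, owner, repo, path, branch="main"):
    """Build the Gitea web URL for a file on a given branch."""
    return (
        f"{base_url.rstrip('/')}/{quote(owner)}/{quote(repo)}"
        f"/src/branch/{quote(branch)}/{quote(path.lstrip('/'))}"
    )


def label_urls(base_url, owner, repo, pages, branch="main"):
    """Map each device name to the URL its QR label should encode."""
    return {
        device: file_url(base_url, owner, repo, path, branch)
        for device, path in pages.items()
    }


# Hypothetical devices and page paths, matching the folder layout described above.
EXAMPLE_PAGES = {
    "pve1": "servers/pve1.md",
    "core-switch": "network/core-switch.md",
}
```

Passing EXAMPLE_PAGES to label_urls along with your own Gitea address gives one URL per device, which can then be fed into whatever produces the printed codes.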

Monitor the entire stack, not just apps

Making sure every service, system, and app is running

Proxmox already does a great job at presenting data for all the nodes within the cluster and datacenter, but there's a way to improve this by including all running services, VMs, and more. My esteemed colleague Adam Conway wrote a fantastic guide on using Grafana, Loki, Prometheus, and Alloy to create the ultimate home lab dashboard, and it's one I used as the foundation for my own solution. In short, Prometheus scrapes the Proxmox stack and provides all the necessary data, which can then be used to present everything within a self-contained dashboard.

It quite literally takes five minutes and involves running just two virtual instances to bring everything together. Compared to using something like Homepage and logging into Proxmox to check system status, I can use Grafana to view everything within the home lab. There's even a way to pull this into Home Assistant to create the ultimate smart home platform that allows me to glance at anything relating to the home, network, or home lab in one location. I'm talking CPU by host, RAM usage, ZFS pool space, failed systemd units, network errors, uptime, and more.
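
The same data that feeds the dashboard can also be queried directly. As a rough sketch, the snippet below asks Prometheus which scrape targets are currently down, using its HTTP API and the built-in up metric. The base URL is an assumption (9090 is the Prometheus default port), and the function names are invented for this example.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url, expr, timeout=5):
    """Run a PromQL instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload):
    """List (job, instance) pairs from the response to an `up == 0` query."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        (series["metric"].get("job", "?"), series["metric"].get("instance", "?"))
        for series in payload["data"]["result"]
    )
```

Something like down_targets(instant_query("http://prometheus.lan:9090", "up == 0")) would return an empty list when every target is being scraped successfully.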

Beyond making sure the underlying instances and hardware are all up and running, I needed a way to check that each service is actually accessible. I opted to use Uptime Kuma with a remote system to check external connections at home. The benefit is two-fold. First, it confirms that reverse proxies are still working, certificates are valid, and connections are possible. Second, it alerts me when individual services go offline, so I can quickly check them even if Grafana and the local shell report that all is well.
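
Uptime Kuma handles all of this through its own interface, but the underlying idea is simple enough to sketch. The snippet below is not how Uptime Kuma works internally. It is a bare-bones stand-in that performs a single GET from outside the network and reports whether the service answered. With an HTTPS URL, Python's default certificate verification means an expired or invalid certificate shows up as a failure too.

```python
import urllib.error
import urllib.request


def check_http(url, timeout=5):
    """Return (ok, detail) for a single GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, or TLS/certificate error.
        return False, f"unreachable: {err}"


def check_all(urls, timeout=5):
    """Check several URLs and return {url: (ok, detail)}."""
    return {url: check_http(url, timeout) for url in urls}
```

Run from a machine outside the home network on a schedule, a loop over check_all for your public service URLs gives a crude version of the same outside-in view.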

Re-rolling the home lab as a Cleric

Without an LLM running in the background

Making the home lab more resilient to failure doesn't mean I've connected a locally-hosted LLM to the entire stack. It's about ensuring that, should a systemd service fail for some reason, it's automatically restarted once. If a container on one of the Proxmox nodes stops, it's restarted. If a VM goes offline and doesn't reboot but is marked as required in my setup, a start command is sent through. It's nothing exciting, nor is it groundbreaking, but that's precisely the point. I want my home lab to be responsive ... to a point.
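
Those three rules can be written down as a small decision function. The sketch below is only an illustration of the policy, not the tooling actually used in this setup: the function names and the way stopped guests are passed in are assumptions. The commands it produces (systemctl restart, pct start, and qm start) are the standard systemd and Proxmox CLI calls.

```python
import subprocess


def failed_units():
    """Names of systemd services currently in the failed state."""
    out = subprocess.run(
        ["systemctl", "list-units", "--type=service", "--state=failed",
         "--plain", "--no-legend"],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]


def plan_recovery(failed, stopped_guests, required, already_retried):
    """Decide which commands to run; returns a list of argv lists.

    failed:          failed systemd unit names
    stopped_guests:  (kind, vmid) pairs, kind being "lxc" or "qemu"
    required:        set of VM IDs that must always be running
    already_retried: units restarted on an earlier pass
    """
    commands = []
    for unit in failed:
        if unit not in already_retried:  # each unit gets one restart only
            commands.append(["systemctl", "restart", unit])
    for kind, vmid in stopped_guests:
        if kind == "lxc":  # stopped containers are always started again
            commands.append(["pct", "start", str(vmid)])
        elif kind == "qemu" and vmid in required:  # VMs only if marked required
            commands.append(["qm", "start", str(vmid)])
    return commands


def run_plan(commands):
    """Execute the planned commands, carrying on if one of them fails."""
    for argv in commands:
        subprocess.run(argv, check=False)
```

Keeping the decision separate from the execution makes it easy to print the plan first and confirm it really does stop after one retry before letting it loose on a node.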

Proxmox powers my home lab, so I've got Proxmox High Availability (HA) at the ready. This handy tool can restart or even relocate HA-managed guests if a node fails. Instead of shrugging its shoulders and waiting for me to log in and fix something, Proxmox HA can automatically reset a node if a watchdog timer expires. It's actually really cool once you've configured the cluster to be self-sufficient to a degree. Not everything is covered by HA, but specific instances (Home Assistant, monitoring, etc.) that warrant it are better protected.
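
Guests are placed under HA through the web UI or the ha-manager command on a node. As a hedged sketch, the helper below only assembles that command for a given guest so it can be reviewed before being run. The helper name and the example ID are hypothetical, and the restart and relocate limits of 1 are simply passed through as explicit options.

```python
def ha_add_command(kind, vmid, max_restart=1, max_relocate=1):
    """Build the ha-manager call that puts one guest under HA control.

    kind is "vm" for a virtual machine or "ct" for a container.
    """
    if kind not in ("vm", "ct"):
        raise ValueError("kind must be 'vm' or 'ct'")
    return [
        "ha-manager", "add", f"{kind}:{vmid}",
        "--state", "started",
        "--max_restart", str(max_restart),
        "--max_relocate", str(max_relocate),
    ]
```

For a hypothetical Home Assistant VM with ID 100, " ".join(ha_add_command("vm", 100)) yields a one-liner that can be pasted into a node's shell.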

Coupled with Proxmox HA is Ansible, which can be invaluable for restarting systemd units, restarting failed Docker services, and even updating entire systems. It's really neat and worth exploring if you're serious about the home lab. Combining all of this, I've completely transformed my network and home lab from a fragmented mess of sorts to a cohesive, semi-automated Factorio mega-complex. The best part? It's actually easier to manage and allows me to focus on everything else in life, like experimenting with ways to break it all apart.
