A server health audit catches degradation before it becomes an outage. Whether you run Linux or Windows servers on-premises or in the cloud, this checklist covers the critical metrics — CPU, memory, disk, services, logs, and security posture — that determine server reliability.
Linux Server Checklist
Windows Server Checklist
Resource Utilisation Thresholds
| Metric | Normal | Warning | Critical | Immediate Action |
|---|---|---|---|---|
| CPU Usage (sustained) | < 60% | 60–80% | > 80% | Investigate top processes, scale compute, check for runaway jobs |
| Memory Usage | < 70% | 70–85% | > 85% | Check for memory leaks, add swap, scale instance, restart leaking service |
| Disk Usage | < 70% | 70–90% | > 90% | Run disk cleanup, archive logs, expand volume, investigate top consumers |
| Disk I/O Await (Linux) | < 10ms | 10–20ms | > 20ms | Check SMART, inspect disk-heavy processes, consider SSD/IOPS upgrade |
| Load Average / CPU Cores | < 0.7× | 0.7–1.0× | > 1.0× | Profile CPU-heavy processes, optimise, or scale horizontally |
| Network Packet Loss | 0% | < 0.1% | > 0.5% | Check NIC, switch port, ISP, or investigate potential attack traffic |
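
The bands in the table can be applied mechanically. The following is a minimal Python sketch, not a definitive implementation: the metric keys (`cpu_percent`, `disk_await_ms`, and so on) and the `classify` helper are hypothetical names chosen for illustration, and it assumes a reading that sits exactly on a boundary (for example 80% CPU) falls in the warning band, since the table marks critical as strictly greater than the limit. Packet loss is left out because its bands are defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning_from: float    # readings at or above this are at least "warning"
    critical_above: float  # readings strictly above this are "critical"


# Values taken from the thresholds table; the key names are illustrative only.
THRESHOLDS = {
    "cpu_percent": Threshold(60, 80),
    "memory_percent": Threshold(70, 85),
    # Assumption: anything from 70% up to the 90% critical line counts as warning.
    "disk_percent": Threshold(70, 90),
    "disk_await_ms": Threshold(10, 20),
    "load_per_core": Threshold(0.7, 1.0),
}


def classify(metric: str, value: float) -> str:
    """Return "normal", "warning", or "critical" for a single reading."""
    t = THRESHOLDS[metric]
    if value > t.critical_above:
        return "critical"
    if value >= t.warning_from:
        return "warning"
    return "normal"


if __name__ == "__main__":
    # Example readings (made up for the demonstration).
    for metric, value in [("cpu_percent", 45), ("memory_percent", 78), ("load_per_core", 1.4)]:
        print(f"{metric}={value}: {classify(metric, value)}")
```

Keeping the limits in one dictionary means a team can tune them per server role without touching the classification logic.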
Security Hardening Checklist
Automation & Monitoring Tools
- Prometheus + Grafana — open-source metric collection and dashboarding. Excellent for Linux/containers, free, widely adopted.
- Zabbix — open-source monitoring with agent-based and agentless checks, network discovery, and alerting.
- Nagios Core — the classic server monitoring tool. Still widely used, extensive plugin ecosystem.
- Datadog — commercial, excellent for cloud + container environments, strong APM and anomaly detection.
- New Relic — commercial, strong application performance monitoring, free tier available.
- AWS CloudWatch / Azure Monitor / GCP Cloud Monitoring — native cloud monitoring, ideal for cloud-hosted servers.
- Wazuh — open-source SIEM + EDR. Excellent for security log analysis and compliance (agents available for Linux and Windows).
Automate This Checklist
Schedule this audit to run monthly at minimum. Most monitoring tools (Nagios, Zabbix, Prometheus) can run these checks continuously and alert on threshold breaches — moving you from reactive to proactive operations.
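
Where a full monitoring stack is not yet in place, a small script run from a scheduler can cover part of the checklist. The sketch below is illustrative only and uses just the Python standard library: it samples disk usage for one path and the 1-minute load average per core, then labels each reading against the disk and load thresholds from this page. The function names and the choice of `/` as the audited path are assumptions, `os.getloadavg` is not available on Windows (the load check is skipped there), and the disk percentage computed from `shutil.disk_usage` can differ slightly from what `df` reports on filesystems with reserved blocks.

```python
import os
import shutil

# (warning_from, critical_above) pairs based on the thresholds on this page.
DISK_PERCENT = (70.0, 90.0)
LOAD_PER_CORE = (0.7, 1.0)


def status(value, bounds):
    """Label a reading as normal, warning, or critical."""
    warning_from, critical_above = bounds
    if value > critical_above:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "normal"


def disk_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_core():
    """1-minute load average divided by core count, or None if unavailable."""
    if not hasattr(os, "getloadavg"):
        return None  # e.g. Windows
    try:
        one_minute, _, _ = os.getloadavg()
    except OSError:
        return None
    return one_minute / (os.cpu_count() or 1)


def run_audit(path="/"):
    """Collect the readings and return {name: (value, level)}."""
    readings = {"disk_percent": (disk_percent(path), DISK_PERCENT)}
    load = load_per_core()
    if load is not None:
        readings["load_per_core"] = (load, LOAD_PER_CORE)
    return {
        name: (round(value, 2), status(value, bounds))
        for name, (value, bounds) in readings.items()
    }


if __name__ == "__main__":
    for name, (value, level) in run_audit().items():
        print(f"{name}: {value} [{level}]")
```

On Linux, a cron entry such as `0 6 1 * * /usr/bin/python3 /opt/audit/health_audit.py` (a hypothetical path) would run it at 06:00 on the first day of each month; a dedicated monitoring tool remains the better option for continuous checks and alerting.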