Triage at Scale: Health Checks That Actually Matter

With a fleet this large, the daily job isn’t “check everything”; it’s “detect what matters first.” I built a lightweight triage loop around a small set of high-signal indicators: uptime and latency trends, SSL status, disk and backup health, error logs, and WordPress-specific failure modes such as stuck cron, plugin update drift, and admin lockouts. The technical focus was reducing noise so that every alert was actionable. That matters most when operating across industries where compliance sensitivity (medical cannabis) and conversion sensitivity (realty and construction lead funnels) shape what “urgent” really means.
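
To make the idea concrete, here is a minimal sketch of how such a triage loop might rank findings. It is not the actual implementation: the indicator names, the severity numbers, the industry adjustments, the `Finding` type, and the alert threshold are all assumptions chosen for illustration. The point it shows is that one failing check can score differently depending on the industry of the site, and that anything below the threshold never becomes an alert.

```python
from dataclasses import dataclass

# Hypothetical base severities per indicator (illustrative values only).
BASE_SEVERITY = {
    "uptime": 5,
    "ssl": 4,
    "backup": 4,
    "disk": 3,
    "latency": 3,
    "admin_lockout": 3,
    "error_log": 2,
    "cron_stuck": 2,
    "plugin_drift": 1,
}

# Hypothetical per-industry adjustments: compliance-sensitive sites weight
# SSL and backups higher, lead-funnel sites weight availability and speed.
INDUSTRY_BOOST = {
    "medical_cannabis": {"ssl": 2, "backup": 2, "admin_lockout": 1},
    "realty": {"uptime": 1, "latency": 2},
    "construction": {"uptime": 1, "latency": 2},
}


@dataclass(frozen=True)
class Finding:
    """One check result for one site (assumed shape, not a real API)."""
    site: str
    industry: str
    indicator: str
    failing: bool


def priority(finding: Finding) -> int:
    """Score a finding; passing checks score 0 and never alert."""
    if not finding.failing:
        return 0
    base = BASE_SEVERITY.get(finding.indicator, 1)
    boost = INDUSTRY_BOOST.get(finding.industry, {}).get(finding.indicator, 0)
    return base + boost


def triage(findings, threshold=4):
    """Return (score, finding) pairs at or above the threshold, most urgent first."""
    scored = [(priority(f), f) for f in findings]
    actionable = [(score, f) for score, f in scored if score >= threshold]
    # Sort by descending score, then by site name for a stable order.
    return sorted(actionable, key=lambda pair: (-pair[0], pair[1].site))


if __name__ == "__main__":
    sample = [
        Finding("clinic.example", "medical_cannabis", "ssl", True),
        Finding("homes.example", "realty", "latency", True),
        Finding("blog.example", "other", "plugin_drift", True),
    ]
    for score, f in triage(sample):
        print(score, f.site, f.indicator)
```

Keeping the weights in plain lookup tables means the definition of “urgent” can be tuned per client without touching the loop itself, and the threshold is the single knob that trades alert volume against coverage.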