Watch Owl alert rules
Define threshold-based alerts on Watch Owl host metrics — CPU, memory, disk, network, patches, and offline detection.
Watch Owl alert rules let you fire notifications when host metrics cross a threshold for a sustained duration. Rules are evaluated server-side against the metrics your agents push, so there's no per-host configuration — define a rule once, target it at all hosts or a subset, and route the firings to any of your integrations.
Find them at System Monitor → Alerts.
How rules work
Each rule answers four questions:
- What metric is being watched (CPU, memory, disk, network, patches, etc.).
- What threshold triggers the rule (operator + value, like `> 90`).
- How long the threshold must hold before the rule fires (the duration).
- Which hosts the rule applies to (all, by tag, or a single host).
When the condition holds for `duration_seconds` straight, the rule fires: an event row is created in the timeline and the configured integrations dispatch a notification. When the metric moves back to the non-breaching side of the threshold, the event resolves automatically and a "resolved" notification goes out.
After a fire, the rule enters a cooldown (default: 1 hour) before it can fire again for the same host. This prevents flapping alerts from spamming the on-call.
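Put together, each rule runs a small per-host state machine. The sketch below is illustrative only (the real evaluator runs server-side and isn't published); it shows how the duration hold, auto-resolve, and cooldown interact:

```python
class RuleState:
    """Illustrative per-host state for one rule: duration hold, fire,
    auto-resolve, and cooldown. A sketch, not the actual evaluator."""

    def __init__(self, duration: float, cooldown: float = 3600.0):
        self.duration = duration        # breach must hold this long (seconds)
        self.cooldown = cooldown        # post-fire silence (default 1 hour)
        self.breach_start = None        # when the current breach began
        self.firing = False             # an unresolved event exists
        self.last_fire = float("-inf")  # timestamp of the last fire

    def observe(self, now: float, breaching: bool) -> str | None:
        if breaching:
            if self.breach_start is None:
                self.breach_start = now
            held_long_enough = now - self.breach_start >= self.duration
            cooling = now - self.last_fire < self.cooldown
            if not self.firing and held_long_enough and not cooling:
                self.firing = True
                self.last_fire = now
                return "fire"    # event row created, notifications dispatched
            return None
        self.breach_start = None
        if self.firing:
            self.firing = False
            return "resolve"     # auto-resolve, "resolved" notification
        return None
```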
Available metrics
| Metric | Type | Notes |
|---|---|---|
| `cpu_percent` | numeric | System-wide CPU usage, 0–100 |
| `memory_percent` | numeric | Memory used, 0–100 |
| `disk_percent` | numeric | Per-mount fill, 0–100. Pair with a mount filter |
| `network_rx_bytes_per_sec` | numeric | Inbound throughput per interface |
| `network_tx_bytes_per_sec` | numeric | Outbound throughput per interface |
| `reboot_required` | boolean | Whether the OS flagged a pending reboot |
| `pending_updates` | numeric | Count of available package updates |
| `security_updates` | numeric | Count of security-only updates |
| `host_offline_seconds` | numeric | Seconds since the last heartbeat |
`disk_percent` requires a mount filter (e.g. `/`, `/var`, `/data`); without one, a single noisy mount on one host could fire the rule for the wrong reason. Network metrics similarly take an interface filter (e.g. `eth0`, `ens5`).
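For instance, a network rule scoped to a single interface might look like this, mirroring the example format at the end of this page (the `Interface filter` field name is assumed by analogy with the `Mount filter` shown there, and the threshold is arbitrary):

```
Name: High inbound traffic on eth0
Metric: network_rx_bytes_per_sec
Operator: gt
Threshold: 125000000 // ~1 Gbit/s in bytes; arbitrary illustration
Interface filter: eth0
Duration: 300
Scope: Tagged → role=edge
Cooldown: 3600
```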
Operators
`gt`, `gte`, `lt`, `lte`, `eq`, `ne`, applied to the numeric or boolean threshold you set. For boolean metrics like `reboot_required`, use `eq true`.
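As a rough illustration of the semantics (not the server's actual implementation), the six operators map onto ordinary comparisons:

```python
import operator

# How the six rule operators map onto Python comparisons (illustrative).
OPERATORS = {
    "gt": operator.gt, "gte": operator.ge,
    "lt": operator.lt, "lte": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}

def breaches(op: str, value, threshold) -> bool:
    """True when a sample crosses the rule's threshold."""
    return OPERATORS[op](value, threshold)

assert breaches("gt", 94.2, 90)    # CPU at 94.2% breaches "> 90"
assert breaches("eq", True, True)  # reboot_required breaches "eq true"
```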
Scope: which hosts the rule covers
| Scope type | Behavior |
|---|---|
| All hosts | The rule evaluates against every active host in your organization |
| Tagged | The rule evaluates only against hosts whose tags match. Tags are simple key/value pairs you set on the host (e.g. `env=prod`, `role=db`) |
| Specific host | The rule applies to one named host. Useful for one-off rules on critical singletons |
Tag-based scoping is the most flexible — tag your hosts on enrollment and you can add or remove machines from a rule's scope just by editing the host, without touching the rule.
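A scope with several tags presumably requires a host to carry every listed pair; a minimal sketch of that assumed matching rule:

```python
# Assumed semantics: a tagged rule matches hosts carrying every
# key=value pair in the rule's scope (extra host tags are ignored).
def host_in_scope(host_tags: dict[str, str], scope_tags: dict[str, str]) -> bool:
    return all(host_tags.get(k) == v for k, v in scope_tags.items())

prod_db = {"env": "prod", "role": "db", "dc": "us-east-1"}
assert host_in_scope(prod_db, {"env": "prod", "role": "db"})
assert not host_in_scope(prod_db, {"env": "staging"})
```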
Duration and cooldown
- Duration is the "must hold for" window before firing. Set it to `0` to fire immediately on the first sample that crosses the threshold; set it to `300` (5 minutes) to require a sustained breach. For noisy metrics like CPU, longer durations cut false positives sharply (see the worked trace after this list).
- Cooldown is the post-fire silence period. After a rule fires for a host, it won't fire again for that host until the cooldown expires (default: 3600 seconds / 1 hour). The rule still evaluates and the event still auto-resolves; the cooldown only suppresses a new fire.
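A quick worked trace of the duration clock, on hypothetical one-per-minute samples: a dip below the threshold resets the clock, so the rule fires later than "first breach + duration".

```python
# cpu_percent > 90, duration 300 s, one sample per minute (hypothetical data)
samples = [(0, 95), (60, 96), (120, 88), (180, 97), (240, 98),
           (300, 99), (360, 97), (420, 95), (480, 96)]

breach_start = None
for t, value in samples:
    if value > 90:
        breach_start = t if breach_start is None else breach_start
        if t - breach_start >= 300:
            print(f"fires at t={t}s")  # t=480: the dip at t=120 reset the clock
            break
    else:
        breach_start = None  # a sample back under 90 resets the hold window
```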
Routing alerts to integrations
Each rule has an integration list — pick one or more configured integrations from your Integrations page. When the rule fires (or resolves), every selected integration receives the event. See the Integrations overview for the full set: Email, Discord, Slack, Microsoft Teams, ntfy, and custom webhooks.
A rule with no integrations selected still records events in the timeline — useful if you only want a UI-visible audit trail without a notification.
Event timeline
System Monitor → Alerts → Events shows every fire and resolve, scoped to your organization. Each event records:
- Which rule fired and which host triggered it.
- The exact metric value at the moment of fire (e.g. CPU 94.2%).
- The value at resolution.
- Per-integration dispatch status (`sent`, `failed`, `skipped`) so you can confirm the notification actually went out.
Events are kept indefinitely for post-incident review. If a host is removed from your organization, its historical events are retained — the host UUID is stored on the event, not a foreign key.
Examples
Fire if any production database host is over 90% disk for 10 minutes:
Name: prod-db disk near full
Metric: disk_percent
Operator: gt
Threshold: 90
Mount filter: /var/lib/postgresql
Duration: 600
Scope: Tagged → env=prod, role=db
Cooldown: 3600
Fire immediately if any host requires a reboot for security:
Name: Reboot pending
Metric: reboot_required
Operator: eq
Threshold: true
Duration: 0
Scope: All hosts
Cooldown: 86400 // once per day per host
Fire if a host hasn't checked in for 5 minutes:
Name: Host offline
Metric: host_offline_seconds
Operator: gt
Threshold: 300
Duration: 0
Scope: All hosts
Best practices
- Start broad, narrow with tags. Begin with a single "all hosts" rule per metric, then split into env- or role-tagged variants once you know which thresholds make sense where.
- Always set a duration on noisy metrics. CPU and memory spike briefly during normal workloads. A 5-minute duration eliminates the vast majority of these false alarms.
- Use cooldown to throttle, not silence. If you're tempted to set a 24-hour cooldown to stop spam, that's usually a sign the rule's threshold or duration is wrong — fix those first.
- Pair `host_offline_seconds` with `pending_updates`. Together they cover both "host is down" and "host is up but neglected", the two failure modes of a Linux server fleet.
Limitations
- Alert rules evaluate against metrics from Watch Owl agents only. To alert on uptime monitors (HTTP, TCP, ping, SSL, domain), use the per-monitor alerting on the monitor itself.
- There are no escalation policies (yet); every selected integration receives the notification simultaneously. Build escalation in your downstream tool (PagerDuty, Opsgenie) via a custom webhook integration if needed (see the sketch after this list).
- There are no maintenance windows on alert rules (yet). To silence a rule temporarily, disable it from the rules list.
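As a sketch of that escalation pattern: a small relay accepts the rule's webhook and re-emits it to PagerDuty's Events API v2. The incoming field names (`status`, `rule_name`, `host`) are assumptions for illustration, not a documented schema; check your integration's actual payload first.

```python
# Hedged sketch: forward Watch Owl webhook events to PagerDuty Events API v2.
# The incoming JSON fields used here are ASSUMED, not a documented schema.
import os

import requests
from flask import Flask, request

app = Flask(__name__)
PD_URL = "https://events.pagerduty.com/v2/enqueue"
PD_ROUTING_KEY = os.environ["PD_ROUTING_KEY"]

@app.post("/watch-owl-webhook")
def relay():
    event = request.get_json(force=True)
    action = "trigger" if event.get("status") == "fired" else "resolve"
    body = {
        "routing_key": PD_ROUTING_KEY,
        "event_action": action,
        # Reusing rule+host as dedup_key lets the resolve close the incident.
        "dedup_key": f"{event.get('rule_name')}/{event.get('host')}",
    }
    if action == "trigger":
        body["payload"] = {
            "summary": f"{event.get('rule_name')} on {event.get('host')}",
            "source": str(event.get("host")),
            "severity": "warning",
        }
    requests.post(PD_URL, json=body, timeout=10).raise_for_status()
    return "", 204
```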