Watch Owl alert rules

Define threshold-based alerts on Watch Owl host metrics — CPU, memory, disk, network, patches, and offline detection.

Last updated April 26, 2026

Watch Owl alert rules let you fire notifications when host metrics cross a threshold for a sustained duration. Rules are evaluated server-side against the metrics your agents push, so there's no per-host configuration — define a rule once, target it at all hosts or a subset, and route the firings to any of your integrations.

Find them at System Monitor → Alerts.

How rules work

Each rule answers four questions:

  1. What metric is being watched (CPU, memory, disk, network, patches, etc.).
  2. What threshold triggers the rule (operator + value, like > 90).
  3. How long the threshold must hold before the rule fires (the duration).
  4. Which hosts the rule applies to (all, by tag, or a single host).

When the condition holds for duration_seconds straight, the rule fires — an event row is created in the timeline and the configured integrations dispatch a notification. When the metric drops back below the threshold, the event resolves automatically and a "resolved" notification goes out.

After a fire, the rule enters a cooldown (default: 1 hour) before it can fire again for the same host. This prevents flapping alerts from spamming the on-call.
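The threshold → duration → fire → resolve → cooldown lifecycle described above can be pictured as a small per-host state machine. The sketch below is an illustration of that behavior, not Watch Owl's actual implementation — the class and field names are assumptions, and the operator is fixed to "greater than" for brevity:

```python
import time

class RuleState:
    """Tracks one (rule, host) pair: breach timing, firing, and cooldown.

    Hypothetical sketch of the server-side lifecycle, not Watch Owl's code.
    """

    def __init__(self, threshold, duration_seconds, cooldown_seconds=3600):
        self.threshold = threshold
        self.duration = duration_seconds
        self.cooldown = cooldown_seconds
        self.breach_started = None   # when the threshold was first crossed
        self.firing = False          # an open event exists for this host
        self.last_fired = None       # start of the cooldown window

    def evaluate(self, value, now=None):
        """Return 'fire', 'resolve', or None for one metric sample."""
        now = now if now is not None else time.time()
        if value > self.threshold:   # operator fixed to gt for this sketch
            if self.breach_started is None:
                self.breach_started = now
            held = now - self.breach_started
            in_cooldown = (self.last_fired is not None
                           and now - self.last_fired < self.cooldown)
            if held >= self.duration and not self.firing and not in_cooldown:
                self.firing = True
                self.last_fired = now
                return "fire"
        else:
            # Metric back under the threshold: reset the breach clock and
            # auto-resolve any open event. Cooldown never delays a resolve.
            self.breach_started = None
            if self.firing:
                self.firing = False
                return "resolve"
        return None
```

With `duration_seconds=0` the first breaching sample fires immediately; a later breach inside the cooldown window evaluates but stays silent, matching the anti-flapping behavior above.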

Available metrics

Metric                      Type      Notes
cpu_percent                 numeric   System-wide CPU usage, 0–100
memory_percent              numeric   Memory used, 0–100
disk_percent                numeric   Per-mount fill, 0–100. Pair with a mount filter
network_rx_bytes_per_sec    numeric   Inbound throughput per interface
network_tx_bytes_per_sec    numeric   Outbound throughput per interface
reboot_required             boolean   Whether the OS flagged a pending reboot
pending_updates             numeric   Count of available package updates
security_updates            numeric   Count of security-only updates
host_offline_seconds        numeric   Seconds since the last heartbeat

disk_percent requires a mount filter (e.g. /, /var, /data) — without one, a single noisy mount on one host could fire the rule for the wrong reason. Network metrics similarly take an interface filter (e.g. eth0, ens5).

Operators

gt, gte, lt, lte, eq, ne — applied to the numeric or boolean threshold you set. For boolean metrics like reboot_required, use eq true.
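One way to picture the six operator names is as a lookup table of comparison functions. This mapping is a hypothetical sketch using Python's operator module, not Watch Owl's internals:

```python
import operator

# Hypothetical mapping of rule operator names to comparisons.
OPERATORS = {
    "gt":  operator.gt,
    "gte": operator.ge,
    "lt":  operator.lt,
    "lte": operator.le,
    "eq":  operator.eq,
    "ne":  operator.ne,
}

def breaches(value, op_name, threshold):
    """True when the sampled value breaches the configured threshold."""
    return OPERATORS[op_name](value, threshold)
```

For boolean metrics, `breaches(True, "eq", True)` captures the `eq true` form described above.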

Scope: which hosts the rule covers

Scope type       Behavior
All hosts        The rule evaluates against every active host in your organization
Tagged           The rule evaluates only against hosts whose tags match. Tags are simple key/value pairs you set on the host (e.g. env=prod, role=db)
Specific host    The rule applies to one named host. Useful for one-off rules on critical singletons

Tag-based scoping is the most flexible — tag your hosts on enrollment and you can add or remove machines from a rule's scope just by editing the host, without touching the rule.
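Tag matching amounts to a subset test: every key/value pair the rule requires must be present on the host, while extra host tags are ignored. A minimal sketch, assuming tags are plain dictionaries:

```python
def host_in_scope(host_tags, rule_tags):
    """True when every key/value pair required by the rule is set on the host.

    Illustrative only — the real matching semantics are Watch Owl's.
    """
    return all(host_tags.get(key) == value for key, value in rule_tags.items())
```

A host tagged `env=prod, role=db, dc=eu` matches a rule scoped to `env=prod, role=db`; a host tagged only `env=staging` does not.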

Duration and cooldown

  • Duration is the "must hold for" window before firing. Set it to 0 to fire immediately on the first sample that crosses the threshold; set it to 300 (5 minutes) to require sustained breach. For noisy metrics like CPU, longer durations cut false positives sharply.
  • Cooldown is the post-fire silence period. After a rule fires for a host, it won't fire again for that host until the cooldown expires (default: 3600 seconds / 1 hour). The rule still evaluates and the event still auto-resolves — the cooldown only suppresses a new fire.

Routing alerts to integrations

Each rule has an integration list — pick one or more configured integrations from your Integrations page. When the rule fires (or resolves), every selected integration receives the event. See the Integrations overview for the full set: Email, Discord, Slack, Microsoft Teams, ntfy, and custom webhooks.

A rule with no integrations selected still records events in the timeline — useful if you only want a UI-visible audit trail without a notification.
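For the custom-webhook case, the dispatched event is plausibly a JSON document carrying the rule, host, metric, and value. The field names below are assumptions for illustration, not Watch Owl's documented payload schema — check your integration's configuration for the real shape:

```python
import json

def build_event_payload(rule_name, host_uuid, metric, value, kind):
    """Assemble a JSON body a custom webhook might receive.

    Field names are illustrative assumptions, not a documented schema.
    """
    return json.dumps({
        "rule": rule_name,
        "host": host_uuid,
        "metric": metric,
        "value": value,
        "event": kind,   # "fire" or "resolve"
    })
```

Your downstream tool would POST-receive this body once per selected integration, since every integration on the rule gets the same event.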

Event timeline

System Monitor → Alerts → Events shows every fire and resolve, scoped to your organization. Each event records:

  • Which rule fired and which host triggered it.
  • The exact metric value at the moment of fire (e.g. CPU 94.2%).
  • The value at resolution.
  • Per-integration dispatch status (sent, failed, skipped) so you can confirm the notification actually went out.

Events are kept indefinitely for post-incident review. If a host is removed from your organization, its historical events are retained — the host UUID is stored on the event, not a foreign key.
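The retention note above implies the event record denormalizes the host identity rather than referencing the host row. A sketch of what such a record might hold — the field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AlertEvent:
    """Hypothetical shape of a timeline event, per the fields listed above."""
    rule_name: str
    host_uuid: str                  # stored value, not a foreign key,
                                    # so the event survives host removal
    fired_value: float              # metric value at the moment of fire
    resolved_value: Optional[float] = None
    # integration name -> "sent" | "failed" | "skipped"
    dispatch_status: dict = field(default_factory=dict)
```

Because `host_uuid` is a plain stored string, deleting the host leaves historical events readable, matching the retention behavior described above.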

Examples

Fire if any production database host is over 90% disk for 10 minutes:

text
Name:     prod-db disk near full
Metric:   disk_percent
Operator: gt
Threshold: 90
Mount filter: /var/lib/postgresql
Duration: 600
Scope:    Tagged → env=prod, role=db
Cooldown: 3600

Fire immediately if any host requires a reboot for security:

text
Name:     Reboot pending
Metric:   reboot_required
Operator: eq
Threshold: true
Duration: 0
Scope:    All hosts
Cooldown: 86400  // once per day per host

Fire if a host hasn't checked in for 5 minutes:

text
Name:     Host offline
Metric:   host_offline_seconds
Operator: gt
Threshold: 300
Duration: 0
Scope:    All hosts

Best practices

  • Start broad, narrow with tags. Begin with a single "all hosts" rule per metric, then split into env- or role-tagged variants once you know which thresholds make sense where.
  • Always set a duration on noisy metrics. CPU and memory spike briefly during normal workloads; a 5-minute duration filters out most of those transient spikes before they become alerts.
  • Use cooldown to throttle, not silence. If you're tempted to set a 24-hour cooldown to stop spam, that's usually a sign the rule's threshold or duration is wrong — fix those first.
  • Pair host_offline_seconds with pending_updates. Together they cover both "host is down" and "host is up but neglected" — the two failure modes of a Linux server fleet.

Limitations

  • Alert rules evaluate against metrics from Watch Owl agents only. To alert on uptime monitors (HTTP, TCP, ping, SSL, domain), use the per-monitor alerting on the monitor itself.
  • There are no escalation policies (yet) — every selected integration receives the notification simultaneously. Build escalation in your downstream tool (PagerDuty, Opsgenie) via a custom webhook integration if needed.
  • There are no maintenance windows on alert rules (yet). To silence a rule temporarily, disable it from the rules list.