DevOps & Audit

OpenClaw launchd ThrottleInterval, KeepAlive, and Health Probes: 2026 Matrix on Rented Mac mini

VmMac Engineering Team April 27, 2026 ~23 min read

Platform engineers running OpenClaw under a LaunchAgent on rented Apple Silicon Mac minis from VmMac inherit subtle defaults: a ThrottleInterval that hides crash loops, KeepAlive settings that resurrect poisoned processes, and SuccessfulExit semantics that disagree with how Node gateways exit during upgrades. This 2026 matrix names safe combinations, shows how false-positive restart storms happen, lists headless health-probe patterns, and ties observability to regional parity across Hong Kong, Japan, Korea, Singapore, and the United States. Read it alongside gateway recovery, staging vs production isolation, and structured log rotation so plist edits never become guesswork.

Use the install & deploy guide for baseline paths, the help pages for access, and the pricing page when you split canary gateways onto dedicated hosts.

Why launchd Policy Matters More Than the Node Binary Version

A perfect OpenClaw semver pin still melts if launchd respawns the gateway every 12 seconds because your health probe mistakes LLM latency for death. Treat plist keys as part of the SLA contract: they define how the OS interprets success, failure, and backoff. On a VmMac mini there is no hypervisor-level supervisor watching your workload: launchd is the supervisor.

Node gateways also inherit macOS signal semantics: SIGTERM during deploy must close HTTP servers, flush structured logs, and exit before launchd escalates to SIGKILL. If your process ignores SIGTERM because an upstream WebSocket hangs, you will see “successful” manual restarts in an interactive shell while agents look “flaky.” Document the expected signal ladder in the same README where you pin Node LTS, and rehearse it on a disposable VmMac mini before touching production plists.
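Rehearsal is scriptable. A minimal sketch for a disposable mini, assuming the gateway process matches a hypothetical openclaw-gateway pattern and a 40-second budget:

#!/bin/sh
# Rehearse the SIGTERM ladder: send the signal launchd will send,
# then measure how long the gateway actually takes to exit.
PID=$(pgrep -f "openclaw-gateway" | head -n1)   # hypothetical process pattern
[ -n "$PID" ] || { echo "gateway not running"; exit 1; }
START=$(date +%s)
kill -TERM "$PID"
while kill -0 "$PID" 2>/dev/null; do
  sleep 1
  if [ $(( $(date +%s) - START )) -ge 40 ]; then
    echo "still alive after 40s; fix the SIGTERM handler before tuning plists"
    exit 2
  fi
done
echo "clean exit after $(( $(date +%s) - START ))s; set ExitTimeOut above this"

Run it a few times under realistic load; the worst number, not the average, is what ExitTimeOut must cover.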

Finally, remember that ThrottleInterval is not a substitute for fixing bugs—it only changes how loudly failures scream. Pair OS-level backoff with application-level circuit breakers so a poisoned model configuration cannot peg CPU across all regions during an incident.

  • ThrottleInterval caps restart frequency but can mask sustained partial outages.
  • KeepAlive keeps daemons resident but complicates draining upgrades.
  • ExitTimeOut decides how long graceful shutdown gets before SIGKILL.
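The three keys read together in one plist. A sketch, assuming the label com.vmmac.openclaw and a gateway whose clean drain exits 0; the numbers are starting points to tune, not recommendations:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.vmmac.openclaw</string>
    <!-- Hypothetical binary path; substitute your install layout. -->
    <key>ProgramArguments</key>
    <array>
        <string>/opt/openclaw/bin/openclaw-gateway</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <!-- Restart only after non-zero exits; a clean drain (exit 0) stays down. -->
    <key>KeepAlive</key>
    <dict>
        <key>SuccessfulExit</key>
        <false/>
    </dict>
    <!-- Floor between respawns so a crash loop cannot spin faster than 30s. -->
    <key>ThrottleInterval</key>
    <integer>30</integer>
    <!-- Graceful-drain budget in seconds before launchd escalates to SIGKILL. -->
    <key>ExitTimeOut</key>
    <integer>30</integer>
</dict>
</plist>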

Matrix: ThrottleInterval vs KeepAlive vs ExitTimeOut Trade-offs

Goal | Primary knob | Risk | Mitigation
--- | --- | --- | ---
Stop crash loops | ThrottleInterval ≥ 30s | Slow recovery after real crashes | Pair with an external pager on exit code
Always-on gateway | KeepAlive with SuccessfulExit = false | Respawn during intentional shutdown if the drain exits non-zero | Use a separate maintenance label
Graceful drain | ExitTimeOut 25–40s | Hung shutdown blocks restart | Watchdog SIGKILL after the budget
Prevent thundering herd after reboot | Staggered StartCalendarInterval or randomized sleep in wrapper | Delayed readiness vs peers | Health gate waits for deps, not wall clock

When you co-locate OpenClaw with other launchd agents on the same user session, consider Nice and LowPriorityIO plist keys so a runaway log shipper cannot starve the gateway’s event loop. Those knobs do not replace capacity planning, but they buy minutes during partial outages—often enough for on-call to widen ThrottleInterval deliberately instead of fighting an accidental fork bomb of health wrappers.
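Both are plain plist keys. A fragment for the co-located log shipper's plist, not the gateway's, on the assumption that the shipper is the process you want deprioritized:

<!-- Deprioritize CPU scheduling and disk I/O for the log shipper. -->
<key>Nice</key>
<integer>10</integer>
<key>LowPriorityIO</key>
<true/>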

False-Positive Restart Storms: Symptoms and Root Causes

Storms often begin when health checks hit localhost while the gateway is still binding sockets after disk pressure delayed module loads. Another pattern: Spotlight indexing (mds) spikes lengthen cold start and trip your probe timeout; correlate with indexing policy before blaming OpenClaw itself.

Guardrail: never set probe interval shorter than your cold-start p95 without a warmup gate.
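One way to encode that guardrail, assuming your launch wrapper touches a stamp file just before exec'ing the gateway (both paths are hypothetical):

#!/bin/sh
# Warmup gate: never fail the probe inside the cold-start window.
STAMP=/var/tmp/openclaw.started      # touched by the launch wrapper at start
WARMUP_S=90                          # set above your measured cold-start p95
NOW=$(date +%s)
STARTED=$(stat -f %m "$STAMP" 2>/dev/null || echo 0)
if [ $(( NOW - STARTED )) -lt "$WARMUP_S" ]; then
  echo "inside warmup window; probe deferred"
  exit 0
fi
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1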

Another storm pattern is dependency flapping: if your probe checks an external SaaS before localhost readiness, regional routing blips in Japan or Singapore can mark the gateway unhealthy even though the process is fine. Layer probes so the outer monitor tests loopback first, then optionally samples an external canary with looser timeouts. That split keeps launchd from coupling process lifetime to third-party SLAs you do not control.
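Layering can look like this in a wrapper: loopback alone decides health, while the external canary (an illustrative URL, not an OpenClaw endpoint) only logs:

#!/bin/sh
# Layer 1: loopback readiness is the only check allowed to fail the probe.
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1
# Layer 2: advisory external canary with a looser timeout; log, never fail.
if ! curl -fsS --max-time 10 https://canary.example.com/ping >/dev/null 2>&1; then
  logger -t openclaw-probe "external canary unreachable; loopback healthy"
fi
exit 0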

Disk-full episodes create the cruel illusion of a crash loop: the gateway writes JSONL until APFS returns ENOSPC, exits non-zero, KeepAlive resurrects it, and logging immediately fails again, often faster than ThrottleInterval can cool the system. Mirror the disk guardrails from structured log rotation so restarts slow down for the right reason.
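In the same spirit, the probe can refuse to report healthy on a nearly full disk so the pager names the real cause. A sketch with a hypothetical log path and floor:

#!/bin/sh
# Fail loudly on low disk so on-call sees ENOSPC coming, not a crash loop.
LOG_DIR=/Users/openclaw/logs         # hypothetical log location
MIN_FREE_MB=512
FREE_MB=$(df -m "$LOG_DIR" | awk 'NR==2 {print $4}')
if [ "$FREE_MB" -lt "$MIN_FREE_MB" ]; then
  logger -t openclaw-probe "only ${FREE_MB}MB free, floor is ${MIN_FREE_MB}MB"
  exit 1
fi
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1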

Headless Health Probe Patterns That Work Without VNC

Use a tiny wrapper that curls the gateway health port, disabling TLS verification only on loopback, or, better, calls OpenClaw’s own CLI status subcommand if your version exposes one. Log the HTTP status, TLS handshake milliseconds, and PID after each probe so on-call can tell “process up but wedged” from “process down.”

Where possible, emit a synthetic X-Probe-Trace-Id header and echo it in gateway logs so you can stitch probe failures to in-process slow queries. That single correlation trick has ended more Sev2s than any plist tweak because it proves whether launchd acted correctly given the data it had.

curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1
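Extending that one-liner with the correlation trick above; the X-Probe-Trace-Id header is a convention your gateway must echo, not built-in OpenClaw behavior:

#!/bin/sh
# Probe with a trace id so failures stitch to in-process gateway logs.
TRACE="probe-$(date +%s)-$$"
PID=$(pgrep -f "openclaw-gateway" | head -n1)    # hypothetical process pattern
STATUS=$(curl -sS -o /dev/null -w '%{http_code}' \
  -H "X-Probe-Trace-Id: $TRACE" \
  --max-time 3 http://127.0.0.1:18789/health)
logger -t openclaw-probe "trace=$TRACE status=$STATUS pid=${PID:-absent}"
[ "$STATUS" = "200" ] || exit 1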

Eight-Step Tuning Ladder for 2026

  1. Capture baseline exit codes for seven days before changing the plist (see the sampler after this list).
  2. Set ThrottleInterval to stop any sub-20s restart pattern.
  3. Align ExitTimeOut with documented graceful shutdown SLA.
  4. Add structured logs for every restart reason.
  5. Run canary in one VmMac region for 72 hours.
  6. Compare restart counts across HK, JP, KR, SG, US.
  7. Document rollback plist in git tag launchd-YYYYMMDD.
  8. Quarterly game day: kill -9 the gateway and measure recovery.
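Step one can run from cron or another agent. A sampler sketch, assuming the label com.vmmac.openclaw; note that launchctl print output wording varies slightly across macOS releases:

#!/bin/sh
# Record launchd's own view of the job so a week of exit history exists
# before anyone touches ThrottleInterval.
LABEL=com.vmmac.openclaw              # hypothetical LaunchAgent label
launchctl print "gui/$(id -u)/$LABEL" \
  | grep -E 'state|last exit' \
  | logger -t openclaw-baseline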

Between steps four and five, freeze feature work on the canary label for 72 hours even if metrics look green; many restart storms only appear under weekend traffic shapes. Capture snapshots with the stock macOS sample and footprint tools during the canary window so you can compare memory regressions across VmMac regions without guessing.
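Those snapshots are scriptable. A sketch, again assuming the hypothetical process pattern:

#!/bin/sh
# Capture memory and CPU snapshots during the canary window.
PID=$(pgrep -f "openclaw-gateway" | head -n1)
[ -n "$PID" ] || exit 1
TS=$(date +%Y%m%d-%H%M%S)
footprint "$PID" > "/tmp/footprint-$TS.txt"     # per-process memory report
sample "$PID" 10 -file "/tmp/sample-$TS.txt"    # 10 seconds of call stacks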

Recovery articles describe what to restart; this matrix describes when the OS should restart it for you. If the numbers disagree, humans lose trust and develop kill -9 habits. Keep the same numeric knobs in staging and prod, per the staging isolation guide, so promotions are diffable, not tribal.

Numeric habit: page if restarts exceed 6 per hour for more than 20 minutes—that is almost never “normal LLM variance.”
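If each restart record is a JSON line with an epoch-seconds ts field (the log path is hypothetical), the trailing-hour count the pager needs is one awk away:

# Count restarts in the trailing hour from the JSONL restart log.
awk -v cutoff="$(( $(date +%s) - 3600 ))" \
  'match($0, /"ts":[0-9]+/) { ts = substr($0, RSTART+5, RLENGTH-5); if (ts+0 >= cutoff) n++ }
   END { print n+0 }' /Users/openclaw/logs/restarts.jsonl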

Observability Table: What to Log per Restart

Field | Purpose
--- | ---
launchd_exit_status | Distinguish OOM vs clean exit
uptime_s | Detect infant-mortality loops
rss_mb | Correlate with leaks before SIGKILL
fds_open | Spot descriptor leaks that KeepAlive magnifies
last_http_5xx_ts | Separate upstream outages from local death

Ship these fields as JSON lines to the same sink you already use for gateway requests so SREs can pivot from “launchd restarted job” to “job restarted because RSS crossed 1.2× baseline” without SSHing into each mini. Consistent field names across Hong Kong, Japan, Korea, Singapore, and the United States matter more than perfect cardinality during the first iteration.
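A sketch of a restart-time emitter for those fields. The label and paths are assumptions, uptime_s and last_http_5xx_ts are better sourced from the gateway itself, and launchctl print wording varies by macOS release:

#!/bin/sh
# Emit one JSON line per restart with the launchd-side fields above.
LABEL=com.vmmac.openclaw
PID=$(pgrep -f "openclaw-gateway" | head -n1)
[ -n "$PID" ] || exit 0
EXIT=$(launchctl print "gui/$(id -u)/$LABEL" | awk -F'= ' '/last exit code/ {print $2}')
RSS_MB=$(( $(ps -o rss= -p "$PID" | tr -d ' ') / 1024 ))   # ps reports KB
FDS=$(lsof -p "$PID" 2>/dev/null | wc -l | tr -d ' ')       # sample, do not loop
printf '{"ts":%s,"launchd_exit_status":"%s","rss_mb":%s,"fds_open":%s}\n' \
  "$(date +%s)" "${EXIT:-unknown}" "$RSS_MB" "$FDS" \
  >> /Users/openclaw/logs/restarts.jsonl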

FAQ: launchd and OpenClaw on Mac mini

Should I use a LaunchDaemon instead of a LaunchAgent? Only after an explicit security review: an agent runs inside the user session and inherits the TCC grants your automation may rely on, while a daemon runs system-wide outside that boundary.

Can I tune per region? Only with written exceptions; default is parity.

What about upgrades? Temporarily widen ThrottleInterval during npm migrations.

Does SuccessfulExit false mean “restart on any exit”? No: with KeepAlive set to a dictionary containing SuccessfulExit = false, launchd restarts the job only after a non-zero exit, and a clean exit 0 stays down. Pair it with application exit codes that distinguish drain-complete from crash, or you will fight the job during blue/green deploys.

Should health probes run as root? Prefer the same user context as the gateway so file permissions and keychain/TCC assumptions stay aligned; escalate only when a read-only probe truly requires it.

How do I silence alerts during known maintenance? Toggle a maintenance file your wrapper checks before failing the probe, and auto-expire that file so engineers cannot forget it over a long weekend.
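One auto-expiring implementation, assuming a hypothetical /var/tmp path and a two-hour window:

#!/bin/sh
# Maintenance gate: honored only while the file is under 2h old,
# so a forgotten toggle cannot silence alerts over a long weekend.
MAINT=/var/tmp/openclaw.maintenance
if [ -f "$MAINT" ]; then
  AGE=$(( $(date +%s) - $(stat -f %m "$MAINT") ))
  if [ "$AGE" -lt 7200 ]; then
    echo "maintenance window active (${AGE}s old); probe suppressed"
    exit 0
  fi
  rm -f "$MAINT"    # expired: remove so alerting resumes on its own
fi
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1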

Why Mac mini M4 and VmMac Fit Always-On Gateways

Mac mini M4 gives enough sustained performance to absorb occasional cold starts without tripping aggressive probes, while unified memory reduces OOM flaps compared to oversubscribed VPS hosts. VmMac lets you place gateways in Hong Kong, Japan, Korea, Singapore, and the United States next to your users and data residency needs—then rent a second mini for canary plist experiments without risking production restart semantics.

Renting also shortens the feedback loop when Apple ships macOS security patches that subtly change launchd behavior: you can snapshot your plist matrix, apply the patch on a non-production mini, and replay the same health-probe harness before touching customer-facing automation. That discipline is cheaper than a single overnight page caused by an undocumented restart policy change.

Canary a New plist Before Production

Rent an extra VmMac Mac mini in Singapore or Tokyo to validate ThrottleInterval changes without waking on-call.