OpenClaw launchd ThrottleInterval, KeepAlive, and Health Probes: 2026 Matrix on Rented Mac mini
Platform engineers running OpenClaw under a LaunchAgent on rented Apple Silicon Mac minis from VmMac inherit subtle defaults: a ThrottleInterval that hides crash loops, KeepAlive settings that resurrect poisoned processes, and SuccessfulExit semantics that disagree with how Node gateways exit during upgrades. This 2026 matrix names safe combinations, shows how false-positive restart storms happen, lists headless health-probe patterns, and ties observability to parity across Hong Kong, Japan, Korea, Singapore, and the United States. Read it alongside gateway recovery, staging vs production isolation, and structured log rotation so plist edits never become guesswork.
Use install & deploy for baseline paths, help for access, and pricing when you split canary gateways onto dedicated hosts.
Why launchd Policy Matters More Than the Node Binary Version
A perfect OpenClaw semver pin still melts if launchd respawns the gateway every 12 seconds because your health probe mistakes LLM latency for death. Treat plist keys as part of the SLA contract: they define how the OS interprets success, failure, and backoff. On VmMac you do not have a hypervisor supervisor—launchd is the supervisor.
Node gateways also inherit macOS signal semantics: SIGTERM during deploy must close HTTP servers, flush structured logs, and exit before launchd escalates to SIGKILL. If your process ignores SIGTERM because an upstream WebSocket hangs, you will see “successful” manual restarts in an interactive shell while agents look “flaky.” Document the expected signal ladder in the same README where you pin Node LTS, and rehearse it on a disposable VmMac mini before touching production plists.
Finally, remember that ThrottleInterval is not a substitute for fixing bugs—it only changes how loudly failures scream. Pair OS-level backoff with application-level circuit breakers so a poisoned model configuration cannot peg CPU across all regions during an incident.
- ThrottleInterval caps restart frequency but can mask sustained partial outages.
- KeepAlive keeps daemons resident but complicates draining upgrades.
- ExitTimeOut decides how long graceful shutdown gets before SIGKILL.
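Concretely, those three knobs sit side by side in the LaunchAgent plist. A minimal sketch, assuming a hypothetical label and install path (neither is an OpenClaw default):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.openclaw.gateway</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/openclaw-gateway</string>
  </array>
  <!-- Restart only after unsuccessful exits, so drains stay drained -->
  <key>KeepAlive</key>
  <dict>
    <key>SuccessfulExit</key>
    <false/>
  </dict>
  <!-- Back off crash loops: at most one respawn per 30 seconds -->
  <key>ThrottleInterval</key>
  <integer>30</integer>
  <!-- Grace budget before launchd escalates SIGTERM to SIGKILL -->
  <key>ExitTimeOut</key>
  <integer>30</integer>
</dict>
</plist>
```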
Matrix: ThrottleInterval vs KeepAlive vs ExitTimeOut Trade-offs
| Goal | Primary knob | Risk | Mitigation |
|---|---|---|---|
| Stop crash loops | ThrottleInterval ≥ 30s | Slow recovery after real crashes | Pair with external pager on exit code |
| Always-on gateway | KeepAlive true, or a dict with SuccessfulExit=false | Respawn during intentional shutdown | Use separate maintenance label |
| Graceful drain | ExitTimeOut 25–40s | Hung shutdown blocks restart | Watchdog SIGKILL after budget |
| Prevent thundering herd after reboot | Staggered StartCalendarInterval or randomized sleep in wrapper | Delayed readiness vs peers | Health gate waits for deps, not wall clock |
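The staggered-start row can be sketched as a wrapper. The jitter range, dependency URL, retry budget, and gateway path below are all illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical startup wrapper: randomized jitter plus a dependency gate,
# so a fleet of minis does not hammer shared dependencies in lockstep
# after a regional reboot.

# Pick a 0..max-1 second jitter from /dev/urandom (plain sh has no $RANDOM).
start_jitter() {
  max="$1"
  echo $(( $(od -An -N2 -tu2 /dev/urandom | tr -d ' ') % max ))
}

# Gate on a dependency's readiness instead of a fixed wall-clock delay.
wait_for_dep() {
  url="$1"; tries="$2"; n=0
  while [ "$n" -lt "$tries" ]; do
    curl -fsS --max-time 2 "$url" >/dev/null 2>&1 && return 0
    n=$((n + 1)); sleep 1
  done
  return 1
}

# In the wrapper itself (URL and path are assumptions):
#   sleep "$(start_jitter 20)"
#   wait_for_dep "http://127.0.0.1:8200/ready" 30 || exit 1
#   exec /usr/local/bin/openclaw-gateway
```

Because the gate tests a dependency rather than a clock, a slow peer delays only itself instead of tripping every probe in the region at once.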
When you co-locate OpenClaw with other launchd agents on the same user session, consider Nice and LowPriorityIO plist keys so a runaway log shipper cannot starve the gateway’s event loop. Those knobs do not replace capacity planning, but they buy minutes during partial outages—often enough for on-call to widen ThrottleInterval deliberately instead of fighting an accidental fork bomb of health wrappers.
False-Positive Restart Storms: Symptoms and Root Causes
Storms often begin when health checks hit localhost while the gateway is still binding sockets after disk pressure delayed module loads. Another pattern: Spotlight or mds spikes lengthen cold start, tripping your probe timeout—correlate with indexing policy before blaming OpenClaw itself.
Another storm pattern is dependency flapping: if your probe checks an external SaaS before localhost readiness, regional routing blips in Japan or Singapore can mark the gateway unhealthy even though the process is fine. Layer probes so the outer monitor tests loopback first, then optionally samples an external canary with looser timeouts. That split keeps launchd from coupling process lifetime to third-party SLAs you do not control.
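The layering described above can be sketched as one function: a strict loopback check that is fatal, and a looser external canary that only logs degradation. Both URLs are assumptions supplied by the caller:

```shell
#!/bin/sh
# layered_probe: fail only on the loopback check; the external canary is
# advisory, so third-party SLAs never couple to launchd restarts.
layered_probe() {
  loopback="$1"; canary="$2"
  # Strict inner check: the local process itself must answer quickly.
  curl -fsS --max-time 3 "$loopback" >/dev/null 2>&1 || return 1
  # Looser outer check: record degradation, never fail the job for it.
  curl -fsS --max-time 10 "$canary" >/dev/null 2>&1 || echo "canary-degraded"
  return 0
}

# Usage (URLs are assumptions):
#   layered_probe http://127.0.0.1:18789/health https://status.example.com/ping
```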
Disk-full episodes create the cruel illusion of a crash loop: the gateway writes JSONL until APFS returns ENOSPC, exits non-zero, KeepAlive resurrects it, and logs immediately fail again—often faster than ThrottleInterval can cool the system. Mirror the disk guardrails from structured log rotation so restarts slow down for the right reason.
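Mirroring those guardrails, the probe wrapper can refuse to report healthy once free space falls below a floor, so restarts slow down before ENOSPC does the slowing. A minimal sketch; the path and floor are assumptions:

```shell
#!/bin/sh
# disk_guard: succeed only if the volume holding $1 has at least $2 KiB free.
# Lets a health wrapper fail fast on ENOSPC risk instead of letting the
# gateway loop on failed JSONL writes.
disk_guard() {
  path="$1"; floor_kb="$2"
  # -P forces POSIX single-line output so awk's column math stays stable.
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$floor_kb" ]
}

# Example wrapper usage: fail the probe before logs can hit ENOSPC.
disk_guard /var/log 512000 || echo "disk-low"
```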
Headless Health Probe Patterns That Work Without VNC
Use a tiny wrapper that curls the gateway health port with TLS verification disabled only on loopback—or better, calls OpenClaw’s own CLI status subcommand if your version exposes one. Log HTTP status, TLS handshake ms, and PID after probe so on-call can tell “process up but wedged” from “process down.”
Where possible, emit a synthetic X-Probe-Trace-Id header and echo it in gateway logs so you can stitch probe failures to in-process slow queries. That single correlation trick has ended more Sev2s than any plist tweak because it proves whether launchd acted correctly given the data it had.
```shell
curl -fsS --max-time 3 http://127.0.0.1:18789/health || exit 1
```
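Folding the trace-id trick into that probe yields one JSON log line per attempt. The JSON field names here are assumptions; the port and header name are the ones used above:

```shell
#!/bin/sh
# probe: hit the health endpoint with a fresh X-Probe-Trace-Id, then log the
# HTTP status and trace id as one JSON line for correlation with gateway logs.
probe() {
  url="$1"
  tid=$(od -An -N4 -tx4 /dev/urandom | tr -d ' \n')
  # curl's -w prints 000 when the connection itself fails.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
    -H "X-Probe-Trace-Id: $tid" "$url")
  echo "{\"trace_id\":\"$tid\",\"http_code\":\"$code\"}"
  case "$code" in 2??) return 0 ;; *) return 1 ;; esac
}

# Usage: probe http://127.0.0.1:18789/health || exit 1
```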
Eight-Step Tuning Ladder for 2026
1. Capture baseline exit codes for seven days before changing the plist.
2. Set ThrottleInterval to stop any sub-20s restart pattern.
3. Align ExitTimeOut with the documented graceful-shutdown SLA.
4. Add structured logs for every restart reason.
5. Run a canary in one VmMac region for 72 hours.
6. Compare restart counts across HK, JP, KR, SG, US.
7. Document the rollback plist in a git tag launchd-YYYYMMDD.
8. Quarterly game day: kill -9 the gateway and measure recovery.
Between steps four and five, freeze feature work on the canary label for 72 hours even if metrics look green—many restart storms only appear under weekend traffic shapes. Capture sample and footprint snapshots during the canary window so you can compare memory regressions across VmMac regions without guessing.
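The rollback-tag step can be scripted so a rollback is a checkout, not archaeology. This sketch assumes a plist mirror repo already exists; the paths and repo layout are illustrative:

```shell
#!/bin/sh
# snapshot_plist: copy the live plist into the current git repo, commit it,
# and tag the commit launchd-YYYYMMDD for diffable rollback.
snapshot_plist() {
  src="$1"
  cp "$src" "./$(basename "$src")"
  git add "$(basename "$src")"
  git commit -q -m "launchd matrix snapshot: $(basename "$src")"
  git tag "launchd-$(date +%Y%m%d)"
}

# Usage (paths are assumptions):
#   cd ~/ops/plists && snapshot_plist ~/Library/LaunchAgents/com.openclaw.gateway.plist
```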
How This Matrix Connects to Gateway Recovery Runbooks
Recovery articles describe what to restart; this matrix describes when the OS should restart for you. If numbers disagree, humans lose trust and start kill -9 habits. Keep the same numeric knobs in staging and prod per staging isolation so promotions are diffable, not tribal.
Observability Table: What to Log per Restart
| Field | Purpose |
|---|---|
| launchd_exit_status | Distinguish OOM vs clean exit |
| uptime_s | Detect infant mortality loops |
| rss_mb | Correlate with leaks before SIGKILL |
| fds_open | Spot descriptor leaks that KeepAlive magnifies |
| last_http_5xx_ts | Separate upstream outages from local death |
Ship these fields as JSON lines to the same sink you already use for gateway requests so SREs can pivot from “launchd restarted job” to “job restarted because RSS crossed 1.2× baseline” without SSHing into each mini. Consistent field names across Hong Kong, Japan, Korea, Singapore, and the United States matter more than perfect cardinality during the first iteration.
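A restart wrapper can assemble those fields before launchd sees the exit. This sketch samples only the portable subset (RSS via ps); fds_open and last_http_5xx_ts would come from lsof and the gateway's own logs, so they are omitted here:

```shell
#!/bin/sh
# emit_restart_line: print one JSON line per restart with the exit status,
# uptime, and the process RSS sampled via ps (works on macOS and Linux).
emit_restart_line() {
  pid="$1"; status="$2"; uptime_s="$3"
  rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
  printf '{"launchd_exit_status":%d,"uptime_s":%d,"rss_mb":%d}\n' \
    "$status" "$uptime_s" "$(( ${rss_kb:-0} / 1024 ))"
}

# Usage (variables supplied by the wrapper, names are assumptions):
#   emit_restart_line "$GATEWAY_PID" "$EXIT_STATUS" "$UPTIME_S" >> restarts.jsonl
```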
FAQ: launchd and OpenClaw on Mac mini
Should I use LaunchDaemon instead of LaunchAgent? Only with explicit security review—agents inherit user TCC boundaries you may rely on.
Can I tune per region? Only with written exceptions; default is parity.
What about upgrades? Temporarily widen ThrottleInterval during npm migrations.
Does SuccessfulExit false always mean “restart on any exit”? No. A KeepAlive dictionary with SuccessfulExit set to false relaunches the job only after unsuccessful (non-zero) exits; a plain KeepAlive true restarts on any exit. Either way, pair it with application exit codes that distinguish drain-complete from crash, or you will fight the job during blue/green deploys.
Should health probes run as root? Prefer the same user context as the gateway so file permissions and keychain/TCC assumptions stay aligned; escalate only when a read-only probe truly requires it.
How do I silence alerts during known maintenance? Toggle a maintenance file your wrapper checks before failing the probe, and auto-expire that file so engineers cannot forget it over a long weekend.
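The auto-expiring maintenance file from that answer can be checked like this; the path and TTL are assumptions, and note the GNU/BSD stat split for portability:

```shell
#!/bin/sh
# maintenance_active: true only if the flag file exists AND is younger than
# max_age seconds, so a forgotten flag expires on its own.
maintenance_active() {
  f="$1"; max_age="$2"
  [ -f "$f" ] || return 1
  # GNU stat first, then BSD/macOS stat as the fallback.
  mtime=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
  [ $(( $(date +%s) - mtime )) -lt "$max_age" ]
}

# In the probe wrapper (path and 4-hour TTL are assumptions):
#   maintenance_active /tmp/openclaw-maint 14400 && exit 0  # suppress alerts
```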
Why Mac mini M4 and VmMac Fit Always-On Gateways
Mac mini M4 gives enough sustained performance to absorb occasional cold starts without tripping aggressive probes, while unified memory reduces OOM flaps compared to oversubscribed VPS hosts. VmMac lets you place gateways in Hong Kong, Japan, Korea, Singapore, and the United States next to your users and data residency needs—then rent a second mini for canary plist experiments without risking production restart semantics.
Renting also shortens the feedback loop when Apple ships macOS security patches that subtly change launchd behavior: you can snapshot your plist matrix, apply the patch on a non-production mini, and replay the same health-probe harness before touching customer-facing automation. That discipline is cheaper than a single overnight page caused by an undocumented restart policy change.
Canary a New plist Before Production
Rent an extra VmMac Mac mini in Singapore or Tokyo to validate ThrottleInterval changes without waking on-call.