Gateway restart post-check stuck RCA — 2026-05-10
View generated by the View/source-discipline backfill of past reports.
What happened
A Gateway restart was scheduled to apply the x-login browser profile. A durable checkpoint was correctly created, but the actual post-restart verification still depended on a later assistant/heartbeat turn executing the runbook. That turn failed behaviorally: nothing prevented it from returning NO_REPLY without running the mandated recovery tools. The checkpoint remained pending until a manual recovery run completed it.
Root cause
The design still had one human-ish weak link: “LLM remembers to call the tools after restart.” That is not a reliable recovery primitive. Restart recovery should be a deterministic worker over durable state, not a conversational promise.
Best-practice fix applied
- Durable checkpoint remains the source of truth: memory/gateway-restart-checkpoint.json.
- Idempotent worker remains post-check-only: scripts/gateway_restart_resume_watch.py never restarts Gateway.
- Scheduler/watchdog now mechanically checks that checkpoint first: scripts/openclaw_watchdog.py runs the resumer whenever status=pending_postcheck and approved=true.
- Restart entrypoint added: scripts/openclaw_gateway_restart_guarded.py creates the checkpoint before calling openclaw gateway restart.
- Detection interval tightened: watchdog cron is now every 2 minutes instead of 10 minutes.
- Regression test added: checkpoint-pending detection is covered in scripts/test_openclaw_watchdog.py.
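The checkpoint-first check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/openclaw_watchdog.py: the function names load_checkpoint and needs_postcheck are hypothetical, and the only checkpoint fields assumed are status and approved, taken from the condition stated above.

```python
import json
from pathlib import Path

# Path taken from the report; everything else in this sketch is an assumption.
CHECKPOINT_PATH = Path("memory/gateway-restart-checkpoint.json")


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the checkpoint as a dict, or None if it is missing or unreadable."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # ValueError covers both invalid JSON and undecodable bytes.
        return None
    return data if isinstance(data, dict) else None


def needs_postcheck(checkpoint):
    """True only when the checkpoint is pending a post-check and was approved."""
    if not checkpoint:
        return False
    return (
        checkpoint.get("status") == "pending_postcheck"
        and checkpoint.get("approved") is True
    )


if __name__ == "__main__":
    if needs_postcheck(load_checkpoint()):
        print("checkpoint pending: the resumer should run")
    else:
        print("nothing to resume")
```

In this sketch a missing or corrupt checkpoint is treated as "nothing to resume". A real watchdog might prefer to raise an alert in that case. The decision is a pure function of the file contents, so it can be covered by a regression test without a running Gateway.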
Verification
- python3 -m py_compile scripts/openclaw_watchdog.py scripts/test_openclaw_watchdog.py scripts/openclaw_gateway_restart_guarded.py scripts/gateway_restart_resume_watch.py passed.
- python3 scripts/test_openclaw_watchdog.py ran 5 tests OK.
- WATCHDOG_DRY_RUN=1 WATCHDOG_NTFY_DRY_RUN=1 python3 scripts/openclaw_watchdog.py exited cleanly.
- Manual cron run completed with status ok and NO_REPLY, which is expected because the checkpoint was already completed and no alert was needed.
- memory/gateway-restart-checkpoint.json is now status=completed with health/config/task/autonomy evidence.
Remaining note
This prevents the same class of stall for Gateway restart post-checks. It does not mean every possible future workflow is automatically safe; the rule is: any workflow that can be interrupted by Gateway restart must leave a durable checkpoint and have a non-LLM scheduled worker resume it.
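As an illustration of the "leave a durable checkpoint before the interruptible step" half of that rule, here is a hedged sketch of a guarded restart. It is not the code in scripts/openclaw_gateway_restart_guarded.py. The helper names, the reason and created_at fields, and the injectable run parameter are all assumptions made for the example; only the status/approved values and the openclaw gateway restart command come from this report.

```python
import json
import os
import subprocess
import tempfile
from datetime import datetime, timezone


def write_checkpoint_atomically(path, payload):
    """Write JSON to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def guarded_restart(checkpoint_path, reason, approved, run=subprocess.run):
    """Persist a pending checkpoint first, then invoke the interruptible command."""
    if not approved:
        raise PermissionError("restart requires explicit approval")
    write_checkpoint_atomically(checkpoint_path, {
        "status": "pending_postcheck",
        "approved": True,
        "reason": reason,  # hypothetical field
        "created_at": datetime.now(timezone.utc).isoformat(),  # hypothetical field
    })
    # Command as named in the report; `run` is injectable so tests need no Gateway.
    return run(["openclaw", "gateway", "restart"], check=False)
```

The ordering is the point of the sketch: if the process is killed at any moment after the rename, the pending checkpoint is already on disk for a scheduled worker to find.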
X follow-up via x-browser — 2026-05-10 15:33 JST
Follow-up source path was the dedicated x-browser workflow, not generic Chrome: Brave Browser via CDP 9222, persistent profile ~/.brave-x-profile, scripts under ~/.claude/plugins/nuchi-skills/skills/x-browser/scripts/.
Additional X findings reinforced, rather than changed, the fix direction:
- Durable execution is an active topic in agent discussions: examples included Temporal/Restate/Inngest/DBOS/Airflow-style durable workflows and cached/replayed steps for failed AI-agent runs.
- HumanLayer/dexhorthy framing points toward chunking workflows for reliability, rather than trusting one large LLM turn.
- Durable sleep()/await(event) for agents maps cleanly to OpenClaw cron/TaskFlow/checkpoint-worker patterns.
- Hamel/Harrison/LangChain-adjacent eval-loop commentary reinforces that production agents need iterative improvement loops and behavior checks, not one-off fixes.
Resulting rule: runtime recovery work should be treated as durable execution + event resume + eval loop:
- persist state before interruptible operations;
- make resume steps idempotent;
- let scheduler/watchdog/TaskFlow resume from durable state;
- verify with regression tests or operational evidence;
- stop only at explicit human-approval boundaries.
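The "idempotent resume from durable state" steps in this list can be sketched as below. This is an assumption-laden example rather than the logic of scripts/gateway_restart_resume_watch.py: the function name resume_postcheck, the checks mapping, the evidence field, and the return strings are hypothetical.

```python
import json
from pathlib import Path


def resume_postcheck(checkpoint_path, checks):
    """Run post-restart checks; calls made after completion do nothing.

    `checks` maps a name to a zero-argument callable returning True or False.
    Returns "skipped", "completed", or "still_pending".
    """
    if not checks:
        raise ValueError("at least one check is required")
    path = Path(checkpoint_path)
    checkpoint = json.loads(path.read_text(encoding="utf-8"))

    # Idempotence guard: only act on an approved, pending checkpoint.
    pending = checkpoint.get("status") == "pending_postcheck"
    if not pending or checkpoint.get("approved") is not True:
        return "skipped"

    evidence = {name: bool(check()) for name, check in checks.items()}
    checkpoint["evidence"] = evidence  # hypothetical field name
    if all(evidence.values()):
        checkpoint["status"] = "completed"
    path.write_text(json.dumps(checkpoint, indent=2), encoding="utf-8")
    return "completed" if checkpoint["status"] == "completed" else "still_pending"
```

Because the guard reads the persisted status on every call, a scheduler can invoke this every few minutes: a failed check leaves the checkpoint pending for the next run, and a completed checkpoint turns later runs into no-ops.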
Raw x-browser caches used:
- ~/.cache/x-browser/2026-05-10_153124_-durable-execution-agents-llm.json
- ~/.cache/x-browser/2026-05-10_153136_from-hamelhusain-evals-agents.json
- ~/.cache/x-browser/2026-05-10_153206_from-dexhorthy-12-factor-agents-.json
- ~/.cache/x-browser/2026-05-10_153210_from-dexhorthy-durable-agents.json
- ~/.cache/x-browser/2026-05-10_153214_from-dexhorthy-human-in-the-loop-agents.json
- ~/.cache/x-browser/2026-05-10_153218_from-hwchase17-evaluate-agents.json