Gateway restart post-check stuck RCA — 2026-05-10

View generated by the View/source-discipline backfill for past reports.

What happened

A Gateway restart was scheduled to apply the x-login browser profile. A durable checkpoint was correctly created, but the actual post-restart verification still depended on a later assistant/heartbeat turn executing the runbook. That turn failed behaviorally: it was able to answer NO_REPLY without running the mandated recovery tools. The checkpoint remained pending until a manual recovery run completed it.

Root cause

The design still had one human-ish weak link: “LLM remembers to call the tools after restart.” That is not a reliable recovery primitive. Restart recovery should be a deterministic worker over durable state, not a conversational promise.

Best-practice fix applied

Durable checkpoint remains the source of truth: memory/gateway-restart-checkpoint.json.
Idempotent worker remains post-check-only: scripts/gateway_restart_resume_watch.py never restarts Gateway.
Scheduler/watchdog now mechanically checks that checkpoint first: scripts/openclaw_watchdog.py runs the resumer whenever status=pending_postcheck and approved=true.
Restart entrypoint added: scripts/openclaw_gateway_restart_guarded.py creates the checkpoint before calling openclaw gateway restart.
Detection interval tightened: watchdog cron is now every 2 minutes instead of 10 minutes.
Regression test added: checkpoint-pending detection is covered in scripts/test_openclaw_watchdog.py.
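
A minimal sketch of the checkpoint-pending detection described above, not the actual code in scripts/openclaw_watchdog.py. The status and approved field names come from this report; the function name needs_postcheck and the choice to treat a missing or unreadable file as "nothing to resume" are assumptions made for illustration.

```python
import json
from pathlib import Path


def needs_postcheck(checkpoint_path):
    """Return True only when the checkpoint is pending and explicitly approved."""
    path = Path(checkpoint_path)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Assumption: a missing or corrupt checkpoint means there is nothing
        # to resume. A real watchdog may prefer to raise an alert here.
        return False
    if not isinstance(data, dict):
        return False
    # Require the literal boolean True so that a string such as "true"
    # cannot trigger the resumer by accident.
    return data.get("status") == "pending_postcheck" and data.get("approved") is True
```

Under these assumptions the watchdog would call needs_postcheck("memory/gateway-restart-checkpoint.json") on every cron tick and start the resumer only when it returns True, so the decision depends on file contents rather than on an LLM turn.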

Verification

python3 -m py_compile scripts/openclaw_watchdog.py scripts/test_openclaw_watchdog.py scripts/openclaw_gateway_restart_guarded.py scripts/gateway_restart_resume_watch.py passed.
python3 scripts/test_openclaw_watchdog.py ran 5 tests OK.
WATCHDOG_DRY_RUN=1 WATCHDOG_NTFY_DRY_RUN=1 python3 scripts/openclaw_watchdog.py exited cleanly.
A manual cron run completed with status ok and NO_REPLY, which is expected because the checkpoint was already completed and no alert was needed.
memory/gateway-restart-checkpoint.json is now status=completed with health/config/task/autonomy evidence.

Remaining note

This prevents the same class of stall for Gateway restart post-checks. It does not mean every possible future workflow is automatically safe; the rule is: any workflow that can be interrupted by Gateway restart must leave a durable checkpoint and have a non-LLM scheduled worker resume it.
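
One way to make "leave a durable checkpoint" concrete is to write the file atomically before the interruptible operation starts. The sketch below assumes a JSON checkpoint on a local filesystem; write_checkpoint is a hypothetical helper, not a function from the scripts named in this report.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Persist state as JSON so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it over
    # the target. os.replace is atomic when both paths share a filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A guarded entrypoint would call this with something like {"status": "pending_postcheck", "approved": True} and only then trigger the restart, so a scheduled worker always finds either the previous checkpoint or the complete new one.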

X follow-up via x-browser — 2026-05-10 15:33 JST

Follow-up source path was the dedicated x-browser workflow, not generic Chrome: Brave Browser via CDP 9222, persistent profile ~/.brave-x-profile, scripts under ~/.claude/plugins/nuchi-skills/skills/x-browser/scripts/.

Additional X findings reinforced, rather than changed, the fix direction:

Durable execution is an active topic in agent discussions: examples included Temporal/Restate/Inngest/DBOS/Airflow-style durable workflows and cached/replayed steps for failed AI-agent runs.
HumanLayer/dexhorthy framing points toward chunking workflows for reliability, rather than trusting one large LLM turn.
Durable sleep() / await(event) for agents maps cleanly to OpenClaw cron/TaskFlow/checkpoint-worker patterns.
Hamel/Harrison/LangChain-adjacent eval-loop commentary reinforces that production agents need iterative improvement loops and behavior checks, not one-off fixes.

Resulting rule: runtime recovery work should be treated as durable execution + event resume + eval loop:

persist state before interruptible operations;
make resume steps idempotent;
let scheduler/watchdog/TaskFlow resume from durable state;
verify with regression tests or operational evidence;
stop only at explicit human-approval boundaries.
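
The idempotent-resume step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the contents of scripts/gateway_restart_resume_watch.py: resume_postcheck, the run_checks callback, and the evidence key are hypothetical names, while the status and approved fields follow this report.

```python
import json
from pathlib import Path


def resume_postcheck(checkpoint_path, run_checks):
    """Run the post-check once; calls on a non-pending checkpoint do nothing.

    Returns True if the checks ran on this call and False otherwise.
    """
    path = Path(checkpoint_path)
    state = json.loads(path.read_text(encoding="utf-8"))
    pending = state.get("status") == "pending_postcheck"
    if not pending or state.get("approved") is not True:
        # Already completed, or never approved: repeating the call is a no-op.
        return False
    evidence = run_checks()  # hypothetical callback returning a JSON-serializable dict
    state["status"] = "completed"
    state["evidence"] = evidence  # "evidence" is an assumed key name
    # Simplification: a production worker would write this file atomically.
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return True
```

Because the function re-reads the durable state on every call and exits early once the status is no longer pending, a scheduler can invoke it every two minutes without repeating work. If run_checks raises, the checkpoint stays pending and the next tick retries.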

Raw x-browser caches used:

~/.cache/x-browser/2026-05-10_153124_-durable-execution-agents-llm.json
~/.cache/x-browser/2026-05-10_153136_from-hamelhusain-evals-agents.json
~/.cache/x-browser/2026-05-10_153206_from-dexhorthy-12-factor-agents-.json
~/.cache/x-browser/2026-05-10_153210_from-dexhorthy-durable-agents.json
~/.cache/x-browser/2026-05-10_153214_from-dexhorthy-human-in-the-loop-agents.json
~/.cache/x-browser/2026-05-10_153218_from-hwchase17-evaluate-agents.json