← View index

Gateway restart post-check stuck RCA — 2026-05-10

過去レポートのView/ソース規律バックフィルで生成したView。

What happened

A Gateway restart was scheduled to apply the x-login browser profile. A durable checkpoint was correctly created, but the actual post-restart verification still depended on a later assistant/heartbeat turn executing the runbook. That turn failed behaviorally: it could say nothing (NO_REPLY) without running the mandated recovery tools. The checkpoint remained pending until a manual recovery run completed it.

Root cause

The design still had one human-ish weak link: “LLM remembers to call the tools after restart.” That is not a reliable recovery primitive. Restart recovery should be a deterministic worker over durable state, not a conversational promise.

Best-practice fix applied

Verification

Remaining note

This prevents the same class of stall for Gateway restart post-checks. It does not mean every possible future workflow is automatically safe; the rule is: any workflow that can be interrupted by Gateway restart must leave a durable checkpoint and have a non-LLM scheduled worker resume it.

X follow-up via x-browser — 2026-05-10 15:33 JST

Follow-up source path was the dedicated x-browser workflow, not generic Chrome: Brave Browser via CDP 9222, persistent profile ~/.brave-x-profile, scripts under ~/.claude/plugins/nuchi-skills/skills/x-browser/scripts/.

Additional X findings reinforced, rather than changed, the fix direction:

Resulting rule: runtime recovery work should be treated as durable execution + event resume + eval loop:

  1. persist state before interruptible operations;
  2. make resume steps idempotent;
  3. let scheduler/watchdog/TaskFlow resume from durable state;
  4. verify with regression tests or operational evidence;
  5. stop only at explicit human-approval boundaries.

Raw x-browser caches used:

質問したい箇所を選択
この箇所について質問
✓ 質問を送信しました