Dev Intel: MCP resume needs lifecycle health
Codex MCP resume confusion, stale stdio MCP stacks, and Claude Code observability point to one small OpenClaw health packet.
Generated: 2026-05-24 11:43 JST
Lane: Dev idea discovery
Goal: Turn the latest Codex/Claude Code tool-runtime signals into one practical OpenClaw improvement idea without sending another clustered dev-intel notification.
Assumption: The useful owner takeaway is not "MCP is buggy"; it is that long-lived agent sessions need explicit checks for whether tools are still callable and whether their helper processes have returned to a bounded baseline.
Smallest edit/action: Saved this source-backed packet and rendered a View; no runtime or production change.
Why this is useful:
Recent evidence points to the same operational shape from two sides. One Codex issue reports that after resuming a session on v0.121.0, the model could still talk about MCP tools but repeatedly used shell instead; compacting the conversation made direct MCP calls work again. Another Codex issue reports long-lived multi-agent sessions accumulating duplicate stdio MCP launcher stacks after subagents finish. SigNoz's Claude Code OpenTelemetry post shows the parallel trend: coding agents are becoming measurable systems, with token usage, sessions, latency, success rate, model distribution, and accept/reject decisions flowing into dashboards.
For OpenClaw, the stealable idea is a tiny agent_tool_health packet on resume and after subagent close:
- tool_visibility: tool is listed
- tool_invocation: a harmless direct call actually succeeds
- process_baseline: helper process count returns to expected range
- last_compaction_at: whether a context transition happened before failure
- owner_action: reauth / restart / compact / no action
This would catch the dangerous middle state where the agent "knows" a tool exists but is not actually invoking it, and the quieter state where completed parallel work leaves stale local processes behind.
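A minimal sketch of what such a packet might look like as a record, assuming Python. The five health field names come from the list above; the `tool` field, the types, and the `no_action` spelling are illustrative assumptions, not an existing OpenClaw schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Assumed set of owner actions, taken from the packet description;
# "no_action" is a hypothetical spelling of "no action".
OWNER_ACTIONS = {"reauth", "restart", "compact", "no_action"}


@dataclass
class AgentToolHealth:
    tool: str                          # hypothetical: name of the probed tool
    tool_visibility: bool              # tool is listed
    tool_invocation: bool              # a harmless direct call succeeded
    process_baseline: bool             # helper process count is in range
    last_compaction_at: Optional[str]  # ISO timestamp, or None if unknown
    owner_action: str = "no_action"

    def __post_init__(self) -> None:
        # Reject values outside the small, reviewable action vocabulary.
        if self.owner_action not in OWNER_ACTIONS:
            raise ValueError(f"unknown owner_action: {self.owner_action}")

    def to_json(self) -> str:
        # One JSON object per packet, with stable key order for diffing.
        return json.dumps(asdict(self), sort_keys=True)


# Example: the tool is listed but the direct call failed after resume.
packet = AgentToolHealth(
    tool="example_mcp_tool",
    tool_visibility=True,
    tool_invocation=False,
    process_baseline=True,
    last_compaction_at=None,
    owner_action="compact",
)
print(packet.to_json())
```

Keeping visibility and invocation as separate booleans is the point of the sketch: the "listed but not callable" state only shows up when the two are recorded independently.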
What I made/changed:
- Created this report under projects/heartbeat-editorial-room/.
- Rendered a View for later reading.
- Chose Notify: no because two dev-intel notifications already shipped this morning; this is good material, but not worth another phone ping right now.
Sources/Evidence:
- Codex issue #18233: resumed session can confuse MCP tool invocation until compaction: https://github.com/openai/codex/issues/18233
- Codex issue #14233: long-lived multi-agent sessions can accumulate duplicate stdio MCP stacks: https://github.com/openai/codex/issues/14233
- SigNoz, "Bringing Observability to Claude Code: OpenTelemetry in Action" (updated 2026-05-19): https://signoz.io/blog/claude-code-monitoring-with-opentelemetry/
Prediction:
If OpenClaw adds a resume/close health packet for tool invocation and helper-process baseline, future heartbeat/runtime failures should become easier to classify as context_resume_confusion, stale_tool_processes, auth_needed, or real_tool_error instead of becoming vague "agent got confused" reports.
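The classification in this prediction could be sketched as a small decision function. The four failure labels come from the text; the `healthy` label, the input flags, and the precedence of the checks are assumptions made for illustration only.

```python
def classify_failure(
    visible: bool,
    invoked: bool,
    baseline_ok: bool,
    auth_ok: bool = True,
    compacted_since_resume: bool = False,
) -> str:
    """Map health signals to one label (assumed precedence, not a spec)."""
    if not auth_ok:
        # Credentials are the cheapest thing for the owner to fix first.
        return "auth_needed"
    if visible and not invoked and not compacted_since_resume:
        # Tool is listed but not callable and no compaction has happened
        # yet: matches the resumed-session pattern described above.
        return "context_resume_confusion"
    if not invoked:
        # Still failing after compaction, or not even listed.
        return "real_tool_error"
    if not baseline_ok:
        # Calls work, but helper processes did not return to baseline.
        return "stale_tool_processes"
    return "healthy"  # hypothetical label for the no-failure case


print(classify_failure(visible=True, invoked=False, baseline_ok=True))
```

The ordering is a judgment call: invocation problems are checked before process baseline because a tool that cannot be called blocks work, while stale processes only degrade the host over time.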
Verify by:
- Render the View successfully.
- Update creative state with notify:false.
- Run a small JSON validation for the state file.
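The JSON validation step might look roughly like the sketch below. It assumes the state file is a JSON object with a boolean `notify` key; the real state file's path and full schema are not given here, so the function takes the file's text rather than a path.

```python
import json
from typing import List


def validate_state(text: str) -> List[str]:
    """Return a list of problems; an empty list means the state looks fine."""
    try:
        state = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(state, dict):
        return ["top level must be a JSON object"]
    problems = []
    # Assumed key name, taken from the notify:false step above.
    if state.get("notify") is not False:
        problems.append("expected notify to be false")
    return problems


print(validate_state('{"notify": false}'))
```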
Observed:
- Web fetch succeeded for all three sources.
- The signal overlaps with existing OpenClaw themes: MCP parallelism, outcome traces, and tool-fault classification.
- Notification budget argues for silent save rather than another Telegram message.
Next safe action:
Add a small local script later, agent_tool_health.py, that can record harmless direct-tool probes and local helper-process counts into a JSONL event file after resume/subagent close. Keep it read-only at first.
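One possible starting shape for that script, as a hedged sketch. The event field names, the `marker` substring match, and the use of POSIX `ps` to list command lines are all assumptions; the only write is an append to the JSONL event file, and nothing is started or stopped.

```python
import json
import subprocess
from datetime import datetime, timezone
from typing import List


def list_command_lines() -> List[str]:
    """Read-only snapshot of process command lines (assumes POSIX `ps`)."""
    try:
        out = subprocess.run(
            ["ps", "-A", "-o", "args="],
            capture_output=True, text=True, check=False,
        ).stdout
    except FileNotFoundError:
        return []  # no `ps` on this host; report nothing rather than crash
    return [line.strip() for line in out.splitlines() if line.strip()]


def count_helpers(command_lines: List[str], marker: str) -> int:
    # Count processes whose command line contains the marker substring.
    return sum(1 for line in command_lines if marker in line)


def record_event(path: str, reason: str, probe_ok: bool,
                 helper_count: int, expected_max: int) -> dict:
    """Append one health event as a JSON line and return it."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "reason": reason,                  # e.g. "resume" or "subagent_close"
        "tool_invocation": probe_ok,       # result of the harmless probe
        "helper_count": helper_count,
        "process_baseline": helper_count <= expected_max,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

A call after resume might then be `record_event("agent_tool_health.jsonl", "resume", probe_ok, count_helpers(list_command_lines(), "mcp"), expected_max=4)`, where the file name, the marker, and the threshold are placeholders to tune per host. The probe itself is left to the caller, since what counts as a harmless direct call depends on the tool.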
Notify: no — source-backed and useful, but this morning already had nearby dev-intel pings; saving it is the higher-quality move.