fix: stop broker readiness race and reap prior-session orphans#404
fix: stop broker readiness race and reap prior-session orphans#404cjmckenna wants to merge 1 commit into
Conversation
Two related reliability problems make `task`/review dispatches intermittent
and slow, most visibly on Windows:
1. Broker readiness is a fixed-timer race. ensureBrokerSession spawns a
`codex app-server` broker and waits a hardcoded 2000ms for it to bind its
endpoint (and 150ms when probing an existing one). A cold app-server on
Windows (Defender on-open scanning + MCP server init) routinely needs
longer, so the wait times out, the broker is torn down, and the caller
falls back to spawning a *second* app-server — two cold starts for one
request, non-deterministically. waitForBrokerEndpoint also only watches the
wall clock, so a healthy-but-slow broker is killed the instant it crosses
the deadline.
- Make both windows configurable via CODEX_COMPANION_BROKER_TIMEOUT_MS and
CODEX_COMPANION_BROKER_REUSE_PROBE_MS, and raise the win32 defaults
(15000ms / 750ms; 4000ms / 250ms elsewhere).
- Give waitForBrokerEndpoint an optional child handle so it abandons early
when the spawned broker has actually exited, instead of burning the
(now longer) timeout.
2. Orphans accumulate. Cleanup only runs at SessionEnd and only for the
current session, but on Windows that hook frequently never fires (abrupt
terminal close / crash / window reload). Dead-pid brokers and ghost
running/queued job records from prior sessions pile up, so /codex:status
reports phantom jobs and a stale activeTask can block new dispatches.
- SessionStart now self-heals: neutralize dead-pid running/queued job
records (state + per-job files) and tear down a dead persisted broker.
Best-effort; never blocks startup.
Adds tests/broker-lifecycle.test.mjs covering the timeout and the
child-exit early-bail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9af8d20c24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| job.status = "failed"; | ||
| job.pid = null; |
There was a problem hiding this comment.
Terminate stale job process groups before marking failed
When a detached task-worker exits but leaves its Codex child running in the same detached process group, isPidAlive(job.pid) is false here, so SessionStart only flips the record to failed and never calls the existing terminateProcessTree(job.pid) cleanup path. Since spawnDetachedTaskWorker creates detached workers and terminateProcessTree is what kills their process groups on Unix, this still leaks the orphaned app-server/task subprocesses while making the job look reaped.
Useful? React with 👍 / 👎.
Problem
task/review dispatches are intermittently slow or fail to start, most visibly on Windows. Two related causes in the broker lifecycle:1. Broker readiness is a fixed-timer race.
ensureBrokerSessionspawns acodex app-serverbroker and waits a hardcoded 2000 ms for it to bind its endpoint (150 ms when probing an existing one). A cold app-server on Windows — Defender on-open scanning plus MCP server init — routinely needs longer, so the wait times out, the broker is torn down, and the caller falls back to spawning a second app-server. That's two cold starts for one request, non-deterministically.waitForBrokerEndpointalso only watches the wall clock, so a healthy-but-slow broker gets killed the instant it crosses the deadline.2. Orphans accumulate. Cleanup runs only at
SessionEndand only for the current session, but on Windows that hook frequently never fires (abrupt terminal close / crash / window reload). Dead-pid brokers and ghostrunning/queuedjob records from prior sessions pile up, so/codex:statusreports phantom jobs and a staleactiveTaskcan block new dispatches.Changes
CODEX_COMPANION_BROKER_TIMEOUT_MSandCODEX_COMPANION_BROKER_REUSE_PROBE_MS— and raise the win32 defaults (15000 ms / 750 ms; 4000 ms / 250 ms elsewhere).waitForBrokerEndpointan optional child handle so it abandons early when the spawned broker has actually exited, instead of burning the (now longer) timeout.SessionStartnow self-heals prior-session orphans: neutralize dead-pidrunning/queuedjob records (state + per-job files) and tear down a dead persisted broker. Best-effort; never blocks startup.tests/broker-lifecycle.test.mjscovering the timeout and the child-exit early-bail.Testing
tests/broker-lifecycle.test.mjs: 3/3 pass.broker-endpoint/stateare pre-existing Windows path-separator assertions, identical onmain).