Skip to content

fix: stop broker readiness race and reap prior-session orphans#404

Open
cjmckenna wants to merge 1 commit into
openai:mainfrom
cjmckenna:fix/broker-readiness-and-orphan-reaping
Open

fix: stop broker readiness race and reap prior-session orphans#404
cjmckenna wants to merge 1 commit into
openai:mainfrom
cjmckenna:fix/broker-readiness-and-orphan-reaping

Conversation

@cjmckenna

Copy link
Copy Markdown

Problem

task/review dispatches are intermittently slow or fail to start, most visibly on Windows. Two related causes in the broker lifecycle:

1. Broker readiness is a fixed-timer race. ensureBrokerSession spawns a codex app-server broker and waits a hardcoded 2000 ms for it to bind its endpoint (150 ms when probing an existing one). A cold app-server on Windows — Defender on-open scanning plus MCP server init — routinely needs longer, so the wait times out, the broker is torn down, and the caller falls back to spawning a second app-server. That's two cold starts for one request, non-deterministically. waitForBrokerEndpoint also only watches the wall clock, so a healthy-but-slow broker gets killed the instant it crosses the deadline.

2. Orphans accumulate. Cleanup runs only at SessionEnd and only for the current session, but on Windows that hook frequently never fires (abrupt terminal close / crash / window reload). Dead-pid brokers and ghost running/queued job records from prior sessions pile up, so /codex:status reports phantom jobs and a stale activeTask can block new dispatches.

Changes

  • Make both readiness windows configurable — CODEX_COMPANION_BROKER_TIMEOUT_MS and CODEX_COMPANION_BROKER_REUSE_PROBE_MS — and raise the win32 defaults (15000 ms / 750 ms; 4000 ms / 250 ms elsewhere).
  • Give waitForBrokerEndpoint an optional child handle so it abandons early when the spawned broker has actually exited, instead of burning the (now longer) timeout.
  • SessionStart now self-heals prior-session orphans: neutralize dead-pid running/queued job records (state + per-job files) and tear down a dead persisted broker. Best-effort; never blocks startup.
  • Add tests/broker-lifecycle.test.mjs covering the timeout and the child-exit early-bail.

Testing

  • New tests/broker-lifecycle.test.mjs: 3/3 pass.
  • Existing unit tests unaffected by the change (the two failures in broker-endpoint/state are pre-existing Windows path-separator assertions, identical on main).

Two related reliability problems make `task`/review dispatches intermittent
and slow, most visibly on Windows:

1. Broker readiness is a fixed-timer race. ensureBrokerSession spawns a
   `codex app-server` broker and waits a hardcoded 2000ms for it to bind its
   endpoint (and 150ms when probing an existing one). A cold app-server on
   Windows (Defender on-open scanning + MCP server init) routinely needs
   longer, so the wait times out, the broker is torn down, and the caller
   falls back to spawning a *second* app-server — two cold starts for one
   request, non-deterministically. waitForBrokerEndpoint also only watches the
   wall clock, so a healthy-but-slow broker is killed the instant it crosses
   the deadline.

   - Make both windows configurable via CODEX_COMPANION_BROKER_TIMEOUT_MS and
     CODEX_COMPANION_BROKER_REUSE_PROBE_MS, and raise the win32 defaults
     (15000ms / 750ms; 4000ms / 250ms elsewhere).
   - Give waitForBrokerEndpoint an optional child handle so it abandons early
     when the spawned broker has actually exited, instead of burning the
     (now longer) timeout.

2. Orphans accumulate. Cleanup only runs at SessionEnd and only for the
   current session, but on Windows that hook frequently never fires (abrupt
   terminal close / crash / window reload). Dead-pid brokers and ghost
   running/queued job records from prior sessions pile up, so /codex:status
   reports phantom jobs and a stale activeTask can block new dispatches.

   - SessionStart now self-heals: neutralize dead-pid running/queued job
     records (state + per-job files) and tear down a dead persisted broker.
     Best-effort; never blocks startup.

Adds tests/broker-lifecycle.test.mjs covering the timeout and the
child-exit early-bail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cjmckenna cjmckenna requested a review from a team June 25, 2026 14:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9af8d20c24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +117 to +118
job.status = "failed";
job.pid = null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Terminate stale job process groups before marking failed

When a detached task-worker exits but leaves its Codex child running in the same detached process group, isPidAlive(job.pid) is false here, so SessionStart only flips the record to failed and never calls the existing terminateProcessTree(job.pid) cleanup path. Since spawnDetachedTaskWorker creates detached workers and terminateProcessTree is what kills their process groups on Unix, this still leaks the orphaned app-server/task subprocesses while making the job look reaped.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant