fix: stop broker readiness race and reap prior-session orphans by cjmckenna · Pull Request #404 · openai/codex-plugin-cc

cjmckenna · 2026-06-25T14:40:08Z

Problem

task/review dispatches are intermittently slow or fail to start, most visibly on Windows. Two related causes in the broker lifecycle:

1. Broker readiness is a fixed-timer race. ensureBrokerSession spawns a codex app-server broker and waits a hardcoded 2000 ms for it to bind its endpoint (150 ms when probing an existing one). A cold app-server on Windows — Defender on-open scanning plus MCP server init — routinely needs longer, so the wait times out, the broker is torn down, and the caller falls back to spawning a second app-server. That's two cold starts for one request, non-deterministically. waitForBrokerEndpoint also only watches the wall clock, so a healthy-but-slow broker gets killed the instant it crosses the deadline.

2. Orphans accumulate. Cleanup runs only at SessionEnd and only for the current session, but on Windows that hook frequently never fires (abrupt terminal close / crash / window reload). Dead-pid brokers and ghost running/queued job records from prior sessions pile up, so /codex:status reports phantom jobs and a stale activeTask can block new dispatches.

Changes

Make both readiness windows configurable — CODEX_COMPANION_BROKER_TIMEOUT_MS and CODEX_COMPANION_BROKER_REUSE_PROBE_MS — and raise the win32 defaults (15000 ms / 750 ms; 4000 ms / 250 ms elsewhere).
Give waitForBrokerEndpoint an optional child handle so it abandons early when the spawned broker has actually exited, instead of burning the (now longer) timeout.
SessionStart now self-heals prior-session orphans: neutralize dead-pid running/queued job records (state + per-job files) and tear down a dead persisted broker. Best-effort; never blocks startup.
Add tests/broker-lifecycle.test.mjs covering the timeout and the child-exit early-bail.

Testing

New tests/broker-lifecycle.test.mjs: 3/3 pass.
Existing unit tests unaffected by the change (the two failures in broker-endpoint/state are pre-existing Windows path-separator assertions, identical on main).

Two related reliability problems make `task`/review dispatches intermittent and slow, most visibly on Windows: 1. Broker readiness is a fixed-timer race. ensureBrokerSession spawns a `codex app-server` broker and waits a hardcoded 2000ms for it to bind its endpoint (and 150ms when probing an existing one). A cold app-server on Windows (Defender on-open scanning + MCP server init) routinely needs longer, so the wait times out, the broker is torn down, and the caller falls back to spawning a *second* app-server — two cold starts for one request, non-deterministically. waitForBrokerEndpoint also only watches the wall clock, so a healthy-but-slow broker is killed the instant it crosses the deadline. - Make both windows configurable via CODEX_COMPANION_BROKER_TIMEOUT_MS and CODEX_COMPANION_BROKER_REUSE_PROBE_MS, and raise the win32 defaults (15000ms / 750ms; 4000ms / 250ms elsewhere). - Give waitForBrokerEndpoint an optional child handle so it abandons early when the spawned broker has actually exited, instead of burning the (now longer) timeout. 2. Orphans accumulate. Cleanup only runs at SessionEnd and only for the current session, but on Windows that hook frequently never fires (abrupt terminal close / crash / window reload). Dead-pid brokers and ghost running/queued job records from prior sessions pile up, so /codex:status reports phantom jobs and a stale activeTask can block new dispatches. - SessionStart now self-heals: neutralize dead-pid running/queued job records (state + per-job files) and tear down a dead persisted broker. Best-effort; never blocks startup. Adds tests/broker-lifecycle.test.mjs covering the timeout and the child-exit early-bail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9af8d20c24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-25T14:46:32Z

+    job.status = "failed";
+    job.pid = null;


Terminate stale job process groups before marking failed

When a detached task-worker exits but leaves its Codex child running in the same detached process group, isPidAlive(job.pid) is false here, so SessionStart only flips the record to failed and never calls the existing terminateProcessTree(job.pid) cleanup path. Since spawnDetachedTaskWorker creates detached workers and terminateProcessTree is what kills their process groups on Unix, this still leaks the orphaned app-server/task subprocesses while making the job look reaped.

Useful? React with 👍 / 👎.

cjmckenna requested a review from a team June 25, 2026 14:40

chatgpt-codex-connector Bot reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: stop broker readiness race and reap prior-session orphans#404

fix: stop broker readiness race and reap prior-session orphans#404
cjmckenna wants to merge 1 commit into
openai:mainfrom
cjmckenna:fix/broker-readiness-and-orphan-reaping

cjmckenna commented Jun 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

cjmckenna commented Jun 25, 2026

Problem

Changes

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant