fix(witness): STH persistence + transparent bootstrap-412 retry (#298)#300
Merged
Conversation
…always carries current STH MCP server (client robustness, #298): - Persist the learned Signed Tree Head (STH) to a small per-server state file (default under the OS temp dir; override via LOOPCTL_STH_STATE_PATH) so a fresh MCP process loads it and sends X-Loopctl-Last-Known-STH on its first request, never tripping the one-time bootstrap-grace 412. - Transparently retry the SAME request exactly ONCE when a 412 carries an x-loopctl-current-sth header: cache the STH and resend with the real header, so the tool call succeeds instead of surfacing witness_bootstrap_already_consumed. Bounded to a single attempt (never a loop). Safe because the witness plug halts before the operation runs, so the rejected request had no side effect. - Retry + persistence logic extracted to mcp-server/lib/witness-sth.js (shared with the test suite); apiCall delegates to it. All file I/O degrades gracefully (missing/corrupt/unwritable -> in-memory + retry). Node test suite extended. - Docs: README "Witness protocol (STH)" section + LOOPCTL_STH_STATE_PATH; version bump 2.32.0 -> 2.33.0 with CHANGELOG entry. Server (protocol hardening, security-neutral): - The witness_bootstrap_already_consumed 412 now reuses the STH already read before the atomic consume (no second DB round-trip), so it can never become a 500 and ALWAYS carries a valid x-loopctl-current-sth header for the client's one-shot retry. - Adds a machine-readable retry contract to error.remediation.retry and echoes current_sth in the body so future clients don't rediscover the protocol. - Adds a distinct info log + [:loopctl, :witness, :bootstrap_already_consumed] telemetry event so operators can meter consumed-grace 412s separately from the divergence-alerting path. The one-time bootstrap grace (api_keys.sth_bootstrap_consumed_at, custody-03 / GHSA-36g5-mcrh-rcrm) is unchanged: the request is still rejected, the grace is still one-time and non-repeatable. Only the already-present resync path is made reliable, self-documenting, and observable. Dispatch-based (v2) clients that mint a fresh key per dispatch were already unaffected. Fixes #298.
…-through + cold-start coalescing Addresses the 10 enhanced-review findings on #300 (custody-03 stays intact). MCP server (client): - HIGH-1 (CWE-59): STH state file is written atomically + symlink-safely — a private 0600 temp file opened with O_EXCL, then renamed over the target, so a pre-planted symlink is replaced, never followed (no arbitrary-file clobber); loads lstat the path and refuse a symlinked/foreign-owned file. Fixes the torn-file race too. - HIGH-2: cache file AND in-memory client are keyed by sha256(serverUrl + ":" + apiKey) — distinct keys/tenants on one host no longer collide (which caused spurious 409s + false divergence telemetry). Only a non-secret hash of the key is used; the key never hits disk. README/CHANGELOG corrected: dispatch v2 clients now genuinely get a clean per-key bootstrap. - MEDIUM-6: concurrent cold-start sends coalesce into ONE bootstrap (singleflight). - MEDIUM-4 / LOW-8: retry stays 412-only and now anchors to error.code == witness_bootstrap_already_consumed; 409 witness_divergence is documented as intentionally surfaced (genuine-fork signal), not auto-retried. Server: - HIGH-3: fresh-tenant zero-STH 409 loop fixed — while a tenant has no sealed STH (the ~60s pre-seal window), the all-zero placeholder the bootstrap issued is a pass-through (nothing to diverge from), gated strictly on no sealed STH + the exact zero placeholder. Regression + discriminating tests added. - LOW-9: the bootstrap-consume rescue is narrowed to the STH read + UPDATE only (consume_grace/2); on a DB error it re-checks with a fresh read before choosing the branch (already-consumed 412 vs retryable missing_header). - MEDIUM-7: test asserts the already-consumed 412 reads audit_signed_tree_heads exactly once (reuse, no re-query) — proving the no-second-round-trip claim. - MEDIUM-5: pipeline ordering documented — the +1 rate-limit tick / last_seen write on a retry is accepted (retry fires at most once per cold-start; keeping RateLimiter ahead of the witness plug protects its DB work from a DoS flood). - LOW-10: the weak zero-STH-header test folded into the MEDIUM-7 / HIGH-3 tests. None of this weakens custody-03: the one-time grace column and its atomic conditional UPDATE are unchanged; a consumed key's request is still rejected. Verify: mix precommit green (3658 tests, 0 failures, dialyzer 0); mcp-server node suite 104/104. Refs #298.
…ap-412 # Conflicts: # mcp-server/CHANGELOG.md # mcp-server/package.json
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #298 (and the #299 root cause): a fresh MCP-server process reusing a
long-lived env-var key (
LOOPCTL_AGENT_KEY,LOOPCTL_ORCH_KEY, …) failed itsfirst tool call with
412 witness_bootstrap_already_consumed, because thewitness protocol's one-time bootstrap grace is consumed once per API key and the
MCP server kept the learned Signed Tree Head (STH) in memory only.
Two parts: a client-robustness fix in the MCP server, and a careful,
security-neutral server-side hardening of the witness plug.
Part A — MCP server client robustness
state file (default
<os-tmp>/loopctl-mcp-sth-<hash>.json, override viaLOOPCTL_STH_STATE_PATH) so a fresh process loads it and sendsX-Loopctl-Last-Known-STHon its first request — avoiding the 412 entirely.STHs are append-only (one row per
(tenant_id, chain_position)), so a persistedhistorical STH still validates; it is never a stale-STH 409 trap.
412carrying anx-loopctl-current-sthheaderis caught, the STH is cached, and the SAME request is retried exactly once
with the real header — so the tool call succeeds instead of surfacing the 412.
Bounded to a single attempt (never a loop). Safe because the witness plug halts
before the operation runs, so the rejected request had no side effect (even
for POSTs).
mcp-server/lib/witness-sth.js(shared with the test suite,single source of truth);
apiCalldelegates to it. All file I/O degradesgracefully (missing/corrupt/unwritable → in-memory + retry).
LOOPCTL_STH_STATE_PATH;version bump 2.32.0 → 2.33.0 + CHANGELOG entry.
mcp-server/test/witness_sth.test.js— 412-with-STH triggers a singletransparent retry that succeeds; persisted STH loaded on startup + sent on the
first request; corrupt/missing/unwritable state file degrades to bootstrap +
retry; retry bounded to one attempt. 99/99 node tests pass.
Part B — server-side protocol hardening (security-neutral)
In
lib/loopctl_web/plugs/validate_witness_header.ex:witness_bootstrap_already_consumed412 now reuses the STH already read(successfully, inside the rescue-guarded block) before the atomic consume — no
second DB round-trip — so it can never become a 500 and ALWAYS carries a
valid
x-loopctl-current-sthheader for the client's one-shot retry. This isthe linchpin that makes retry-once reliable.
error.remediation.retry+current_sth) so future clients don't rediscover the protocol.[:loopctl, :witness, :bootstrap_already_consumed]telemetry event so operators can meter consumed-grace 412s separately from the
divergence-alerting path.
Why this does NOT weaken chain-of-custody
The one-time bootstrap grace (
api_keys.sth_bootstrap_consumed_at, custody-03 /GHSA-36g5-mcrh-rcrm) is unchanged: the atomic conditional UPDATE guarded by
is_nil(sth_bootstrap_consumed_at)still lets exactly one first-request win, anda consumed key's request is still rejected and halted (no operation runs). We
only make the already-present resync path (the STH was always returned in the
412 header) reliable, self-documenting, and observable. The STH handed back is
already public via
GET /api/v1/audit/sth/:tenant_id. No role, RLS, or trustboundary changed.
Deliberately NOT done (would re-open custody-03): making the grace repeatable,
per-request, unbounded, or letting a client skip witness enforcement. If reducing
412s further required weakening the gate, this PR stops at the safe changes above.
Dispatch-based (v2) clients are already unaffected — each dispatch mints a
fresh ephemeral key that gets its own clean one-time bootstrap. The affected
population is exactly the long-lived-env-key clients (the MCP server), which
Part A fixes.
Verification
mix precommitgreen (compile--warnings-as-errors, format, credo--strict,deps.audit, dialyzer 0, 3655 tests, 0 failures) — incl. the witness/custody
suite (
validate_witness_header_test.exs, +4 new tests).Draft — do not merge.