Skip to content

fix(witness): STH persistence + transparent bootstrap-412 retry (#298)#300

Merged
mkreyman merged 3 commits into
masterfrom
fix/witness-bootstrap-412
Jul 4, 2026
Merged

fix(witness): STH persistence + transparent bootstrap-412 retry (#298)#300
mkreyman merged 3 commits into
masterfrom
fix/witness-bootstrap-412

Conversation

@mkreyman

@mkreyman mkreyman commented Jul 4, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #298 (and the #299 root cause): a fresh MCP-server process reusing a
long-lived env-var key (LOOPCTL_AGENT_KEY, LOOPCTL_ORCH_KEY, …) failed its
first tool call with 412 witness_bootstrap_already_consumed, because the
witness protocol's one-time bootstrap grace is consumed once per API key and the
MCP server kept the learned Signed Tree Head (STH) in memory only.

Two parts: a client-robustness fix in the MCP server, and a careful,
security-neutral server-side hardening of the witness plug.

Part A — MCP server client robustness

  • Cross-process persistence. The learned STH is cached to a small per-server
    state file (default <os-tmp>/loopctl-mcp-sth-<hash>.json, override via
    LOOPCTL_STH_STATE_PATH) so a fresh process loads it and sends
    X-Loopctl-Last-Known-STH on its first request — avoiding the 412 entirely.
    STHs are append-only (one row per (tenant_id, chain_position)), so a persisted
    historical STH still validates; it is never a stale-STH 409 trap.
  • Transparent retry-once. Any 412 carrying an x-loopctl-current-sth header
    is caught, the STH is cached, and the SAME request is retried exactly once
    with the real header — so the tool call succeeds instead of surfacing the 412.
    Bounded to a single attempt (never a loop). Safe because the witness plug halts
    before the operation runs, so the rejected request had no side effect (even
    for POSTs).
  • Logic extracted to mcp-server/lib/witness-sth.js (shared with the test suite,
    single source of truth); apiCall delegates to it. All file I/O degrades
    gracefully (missing/corrupt/unwritable → in-memory + retry).
  • Docs: README "Witness protocol (STH)" section + LOOPCTL_STH_STATE_PATH;
    version bump 2.32.0 → 2.33.0 + CHANGELOG entry.
  • Tests: mcp-server/test/witness_sth.test.js — 412-with-STH triggers a single
    transparent retry that succeeds; persisted STH loaded on startup + sent on the
    first request; corrupt/missing/unwritable state file degrades to bootstrap +
    retry; retry bounded to one attempt. 99/99 node tests pass.

Part B — server-side protocol hardening (security-neutral)

In lib/loopctl_web/plugs/validate_witness_header.ex:

  • The witness_bootstrap_already_consumed 412 now reuses the STH already read
    (successfully, inside the rescue-guarded block) before the atomic consume — no
    second DB round-trip — so it can never become a 500 and ALWAYS carries a
    valid x-loopctl-current-sth header
    for the client's one-shot retry. This is
    the linchpin that makes retry-once reliable.
  • Adds a machine-readable retry contract (error.remediation.retry +
    current_sth) so future clients don't rediscover the protocol.
  • Adds a distinct info log + [:loopctl, :witness, :bootstrap_already_consumed]
    telemetry event
    so operators can meter consumed-grace 412s separately from the
    divergence-alerting path.

Why this does NOT weaken chain-of-custody

The one-time bootstrap grace (api_keys.sth_bootstrap_consumed_at, custody-03 /
GHSA-36g5-mcrh-rcrm) is unchanged: the atomic conditional UPDATE guarded by
is_nil(sth_bootstrap_consumed_at) still lets exactly one first-request win, and
a consumed key's request is still rejected and halted (no operation runs). We
only make the already-present resync path (the STH was always returned in the
412 header) reliable, self-documenting, and observable. The STH handed back is
already public via GET /api/v1/audit/sth/:tenant_id. No role, RLS, or trust
boundary changed.

Deliberately NOT done (would re-open custody-03): making the grace repeatable,
per-request, unbounded, or letting a client skip witness enforcement. If reducing
412s further required weakening the gate, this PR stops at the safe changes above.

Dispatch-based (v2) clients are already unaffected — each dispatch mints a
fresh ephemeral key that gets its own clean one-time bootstrap. The affected
population is exactly the long-lived-env-key clients (the MCP server), which
Part A fixes.

Verification

  • mix precommit green (compile --warnings-as-errors, format, credo --strict,
    deps.audit, dialyzer 0, 3655 tests, 0 failures) — incl. the witness/custody
    suite (validate_witness_header_test.exs, +4 new tests).
  • mcp-server node suite: 99/99 pass.

Draft — do not merge.

mkreyman added 3 commits July 3, 2026 19:49
…always carries current STH

MCP server (client robustness, #298):
- Persist the learned Signed Tree Head (STH) to a small per-server state file
  (default under the OS temp dir; override via LOOPCTL_STH_STATE_PATH) so a fresh
  MCP process loads it and sends X-Loopctl-Last-Known-STH on its first request,
  never tripping the one-time bootstrap-grace 412.
- Transparently retry the SAME request exactly ONCE when a 412 carries an
  x-loopctl-current-sth header: cache the STH and resend with the real header, so
  the tool call succeeds instead of surfacing witness_bootstrap_already_consumed.
  Bounded to a single attempt (never a loop). Safe because the witness plug halts
  before the operation runs, so the rejected request had no side effect.
- Retry + persistence logic extracted to mcp-server/lib/witness-sth.js (shared
  with the test suite); apiCall delegates to it. All file I/O degrades gracefully
  (missing/corrupt/unwritable -> in-memory + retry). Node test suite extended.
- Docs: README "Witness protocol (STH)" section + LOOPCTL_STH_STATE_PATH; version
  bump 2.32.0 -> 2.33.0 with CHANGELOG entry.

Server (protocol hardening, security-neutral):
- The witness_bootstrap_already_consumed 412 now reuses the STH already read
  before the atomic consume (no second DB round-trip), so it can never become a
  500 and ALWAYS carries a valid x-loopctl-current-sth header for the client's
  one-shot retry.
- Adds a machine-readable retry contract to error.remediation.retry and echoes
  current_sth in the body so future clients don't rediscover the protocol.
- Adds a distinct info log + [:loopctl, :witness, :bootstrap_already_consumed]
  telemetry event so operators can meter consumed-grace 412s separately from the
  divergence-alerting path.

The one-time bootstrap grace (api_keys.sth_bootstrap_consumed_at, custody-03 /
GHSA-36g5-mcrh-rcrm) is unchanged: the request is still rejected, the grace is
still one-time and non-repeatable. Only the already-present resync path is made
reliable, self-documenting, and observable. Dispatch-based (v2) clients that mint
a fresh key per dispatch were already unaffected.

Fixes #298.
…-through + cold-start coalescing

Addresses the 10 enhanced-review findings on #300 (custody-03 stays intact).

MCP server (client):
- HIGH-1 (CWE-59): STH state file is written atomically + symlink-safely — a
  private 0600 temp file opened with O_EXCL, then renamed over the target, so a
  pre-planted symlink is replaced, never followed (no arbitrary-file clobber);
  loads lstat the path and refuse a symlinked/foreign-owned file. Fixes the
  torn-file race too.
- HIGH-2: cache file AND in-memory client are keyed by sha256(serverUrl + ":" +
  apiKey) — distinct keys/tenants on one host no longer collide (which caused
  spurious 409s + false divergence telemetry). Only a non-secret hash of the key
  is used; the key never hits disk. README/CHANGELOG corrected: dispatch v2
  clients now genuinely get a clean per-key bootstrap.
- MEDIUM-6: concurrent cold-start sends coalesce into ONE bootstrap (singleflight).
- MEDIUM-4 / LOW-8: retry stays 412-only and now anchors to
  error.code == witness_bootstrap_already_consumed; 409 witness_divergence is
  documented as intentionally surfaced (genuine-fork signal), not auto-retried.

Server:
- HIGH-3: fresh-tenant zero-STH 409 loop fixed — while a tenant has no sealed STH
  (the ~60s pre-seal window), the all-zero placeholder the bootstrap issued is a
  pass-through (nothing to diverge from), gated strictly on no sealed STH + the
  exact zero placeholder. Regression + discriminating tests added.
- LOW-9: the bootstrap-consume rescue is narrowed to the STH read + UPDATE only
  (consume_grace/2); on a DB error it re-checks with a fresh read before choosing
  the branch (already-consumed 412 vs retryable missing_header).
- MEDIUM-7: test asserts the already-consumed 412 reads audit_signed_tree_heads
  exactly once (reuse, no re-query) — proving the no-second-round-trip claim.
- MEDIUM-5: pipeline ordering documented — the +1 rate-limit tick / last_seen
  write on a retry is accepted (retry fires at most once per cold-start; keeping
  RateLimiter ahead of the witness plug protects its DB work from a DoS flood).
- LOW-10: the weak zero-STH-header test folded into the MEDIUM-7 / HIGH-3 tests.

None of this weakens custody-03: the one-time grace column and its atomic
conditional UPDATE are unchanged; a consumed key's request is still rejected.

Verify: mix precommit green (3658 tests, 0 failures, dialyzer 0); mcp-server node
suite 104/104.

Refs #298.
…ap-412

# Conflicts:
#	mcp-server/CHANGELOG.md
#	mcp-server/package.json
@mkreyman mkreyman marked this pull request as ready for review July 4, 2026 02:56
@mkreyman mkreyman merged commit 996f452 into master Jul 4, 2026
10 checks passed
@mkreyman mkreyman deleted the fix/witness-bootstrap-412 branch July 4, 2026 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

witness protocol: consumed bootstrap grace makes every fresh client process 412 — persist STH in the MCP server and document the retry contract

1 participant