Harden edgezero #269 runtime config-store load (HTTP layer)#783
Harden edgezero #269 runtime config-store load (HTTP layer)#783prk-Jr wants to merge 28 commits into
Conversation
14a91cc to
bd85e3e
Compare
Replace the custom trusted-server CLI lifecycle and config payload plumbing with a thin EdgeZero delegation layer using typed config push/validate flows. Add TrustedServerAppConfig wrapper in core with deploy-time validation and move blob reconstruction into runtime helpers. Drop flattened config entry publishing and route app-config through EdgeZero blob envelope handling while keeping edgezero flag reads in trusted_server_config. Update CLI and architecture docs for the new model and adjust fastly adapter store selection.
76031c1 to
84f35fe
Compare
79911d6 to
bdb9284
Compare
Reads (blob key + each chunk) map to ConfigStoreUnavailable (503); envelope/ chunk verification and settings validation stay Configuration (500). Covers the blob/chunk load model on the updated ts-cli-audit base.
84f35fe to
854edef
Compare
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #783 against feature/ts-cli-audit, focusing on the runtime config-store load path, error-stack context/status behavior, Fastly/EdgeZero HTTP error rendering, public error-body leakage, and the added regression tests. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues in the changed code.
Findings
No blocking findings.
CI / Existing Reviews
- GitHub checks currently show
prepare integration artifactsfailing, with downstream jobs skipped. The failure is dependency-lock parity drift incrates/trusted-server-integration-tests/Cargo.lock(for exampleanyhow,cssparser,html5ever,scraper,uuid, and related transitive dependencies). This should be resolved or confirmed as inherited from the stacked/base branch before merge. - Existing PR reviews: none returned by the GitHub API. Existing inline review comments: none.
- Local focused check passed:
cargo test -p trusted-server-core settings_data --target wasm32-wasip1. A local Fastly adapter-focused test could not complete because this environment hasviceroy 0.16.4while the repo pins0.17.0.
dfa8a29 to
20f8674
Compare
aram356
left a comment
There was a problem hiding this comment.
Summary
Scoped, well-tested runtime hardening: a config-store read failure now maps to TrustedServerError::ConfigStoreUnavailable → 503 (retryable), while verify/parse failures stay 500. The real change is 4 files / +279 −13; verified locally at head (all new core + adapter tests pass). The two blocking items concern the PR's base/CI plumbing, not the code itself.
1 of the inline comments below carries a one-click GitHub
suggestion. The remaining findings are body-level because they concern the PR base, CI, or code outside the diff hunks and can't be auto-applied.
Blocking
🔧 wrench
- Wrong PR base inflates the diff to 66 files — see Cross-cutting below.
- CI
prepare integration artifactsis failing — see Cross-cutting below.
Non-blocking
⛏ nitpick
- Chunk-read failures reuse the top-level "not seeded" hint — see inline at
crates/trusted-server-core/src/settings_data.rs:95.
🤔 thinking / 🌱 seedling
- 503 body says "An internal error occurred" — see Cross-cutting.
with_capacity(envelope_len)trusts an unvalidated length — see Cross-cutting.
Cross-cutting / body-level findings
-
🔧 Wrong PR base → unreviewable diff & merge risk. The PR targets
feature/ts-cli-audit, but the branch is stacked onfeature/ts-cli-next(per the PR body) and is diverged from both (vsts-cli-audit: 18 ahead / 25 behind). GitHub shows 66 files / +9,550 −1,772; the actual change is 4 files / +279 −13 (the 4 topmost commits: theConfigStoreUnavailablevariant, the read-failure reclassification, the adapter 503 lock, and the design doc). Merging against this base risks pulling ~60 unrelated files into the target, and the design-doc header ("stacked onfeature/ts-cli-audit") disagrees with the PR body ("stacked onfeature/ts-cli-next"). Recommend rebasing onto the true parent's current tip (so it's a clean ancestor), then retargeting the base — ormainonce the stack lands — so the PR shows only the 4 files. -
🔧 CI
prepare integration artifactsis FAILING. Transitive-dependency parity drift (markup5ever,match_token,scraper,selectors,uuid,wasm-bindgen-futures,web-sys) between the workspace andtrusted-server-integration-tests. The failure originates in the stacked base (integration-tests lock), not the 4-file 503 change, and isn't branch-protection-required here (base is a feature branch) — but it's red on the PR and CLAUDE.md treats it as a gate. It should clear once the branch sits on the correct, lock-reconciled parent (see above). -
🌱
String::with_capacity(pointer.envelope_len)(settings_data.rs:119) pre-allocates from an unvalidated length. Pre-existing and outside this diff, but squarely in this PR's "harden the config-store load" theme: a corrupted pointer with a hugeenvelope_lentriggers a large allocation on the per-request settings-load path — a potential trap/abort instead of the graceful 503/500 this PR adds. Consider boundingenvelope_len(and cross-checking it against the sum ofchunk.len) before pre-allocating. Follow-up, not this PR's job. -
🤔 503 body reads "An internal error occurred". The generic catch-all
user_message()describes a 500, but 503 is retryable. The generic body is deliberate (design §4 — nouser_messagearm added, which correctly avoids leaking internals), so this is only a suggestion: a 503-appropriate phrase (e.g. "Service temporarily unavailable") would let clients/monitoring distinguish retryable from terminal without leaking detail.
CI Status
- integration tests: SKIPPED
- integration tests (EdgeZero entry point): SKIPPED
- browser integration tests: SKIPPED
- prepare integration artifacts: FAIL
- .github/dependabot.yml: PASS
20f8674 to
8c3c3fc
Compare
Resolve 22 conflicts from main's EdgeZero adapter/canary evolution against the branch's config-store 503 work. - settings_data.rs: union the branch's ConfigStoreUnavailable (503) read classification with main's Fastly chunk-length hardening; keep both test sets. - adapter-fastly/main.rs: adopt main's rollout-percentage canary and settings_snapshot finalize refactor (branch 503 logic lives in error.rs). - CLI audit: adopt main's #800 implementation over the branch's parallel copy. - Manifests/lockfiles: take main's per-platform adapters and edgezero rev e483. - proxy.rs: drop a duplicate request_body_bytes left by a false auto-merge. Verified: fmt, clippy (5 targets), all adapter + core + cli test suites, integration parity, and JS vitest all pass.
- read_config_entry: reword to cover unseeded/missing/transient reads, since it is the shared seam for both the blob key and each chunk key. - user_message: return 'Service temporarily unavailable' for the retryable 503 variants (ConfigStoreUnavailable, KvStore) instead of the 500-flavored generic body; still leaks no internal detail. Tests updated.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #783 against main, focusing on the config-store settings load path, HTTP status/body mapping, Fastly EdgeZero routing behavior, CI coverage, and the previously raised review threads. The core 503 classification is generally well covered and I did not find a blocking correctness, security, data-loss, authorization, or severe compatibility issue. I did find a few non-blocking issues worth addressing before merge, especially an env-config compatibility regression in store-name resolution.
Findings
P0 / Blockers
None.
P1 / High
- Store-name env resolution no longer honors EdgeZero fallback semantics — see inline comment on
crates/trusted-server-core/src/settings_data.rs.
P2 / Medium
- The JA4 debug preflight still returns 500 for config-store read failures —
crates/trusted-server-adapter-fastly/src/main.rs:401. This line is outside the PR diff, so I could not leave it inline, but the newConfigStoreUnavailablebehavior is incomplete on this HTTP path:GET /_ts/debug/ja4callsload_settings_from_config_store()before app construction and hard-codes500 Internal Server Erroron every load failure instead of usinge.current_context().status_code()/user_message(). In an unseeded or transient config-store case, normal EdgeZero startup responses now return 503 while this debug endpoint still returns 500. Suggested fix: map theReport<TrustedServerError>through the sameIntoHttpResponsemethods used by the startup/error response path. - The EdgeZero integration job can pass without proving the EdgeZero entry point was used — see inline comment on
crates/trusted-server-integration-tests/tests/common/ec.rs.
P3 / Low
- The implemented 503 body behavior contradicts the new design doc — see inline comment on
docs/superpowers/specs/2026-06-27-edgezero-http-config-503-design.md. - Duplicate CI/local verification steps were added — the diff adds exact duplicate host CLI clippy/test steps in
.github/workflows/format.ymland.github/workflows/test.yml, duplicate Viceroy config generation in.github/workflows/integration-tests.yml/scripts/integration-tests.sh, and duplicate CLI-test docs inCLAUDE.md. This is not correctness-blocking, but it increases CI time and maintenance noise.
CI / Existing Reviews
- GitHub checks are currently passing.
- Existing review context: earlier automated reviews/comments raised the chunk-read hint and 503 body wording; the latest commits appear to have addressed those specific comments.
- Local focused checks run during this review:
cargo test -p trusted-server-core settings_data --target x86_64-unknown-linux-gnu,cargo test -p trusted-server-core config_store_unavailable --target x86_64-unknown-linux-gnu, andgit diff --check $(git merge-base HEAD origin/main)...HEADall passed.
…arity - Route default_config_store_name through EnvConfig::store_name so a blank, whitespace-only, or control-character EDGEZERO__STORES__CONFIG__APP_CONFIG__NAME override falls back to the logical id instead of opening an empty store name; add regression tests via the new config_store_name_from seam. - Render EdgeZero JA4 preflight settings-load failures through to_error_response so an unseeded/transient config store returns 503 there like every other path, instead of a hard-coded 500. - Update the design doc: 503 body is now the shared retry-flavored user_message arm; note the local chunk resolver duplicates edgezero's crate-private chunked_config and the upstream-export path to delete it.
The merge of main doubled the host-target CLI clippy/test steps in format.yml and test.yml, the Viceroy config generation step in integration-tests.yml and scripts/integration-tests.sh, and the host-target CLI test snippet in CLAUDE.md. Keep one copy of each.
The config store the HTTP layer bootstraps from is opened by a hardcoded logical id (`app_config` = DEFAULT_CONFIG_STORE_ID/CONFIG_BLOB_KEY) before Settings load, and `ts config push` targets `[stores.config].default`. The unified-manifest rewrite had renamed the config store to `trusted_server_config`, so push and provision would target a store the runtime never reads, 503-ing every request on a fresh deploy. Restore the config id to `app_config` (matching main, the runtime, and integration seeding) and add a guard comment. Revert the kv/secrets ids to main (`ec_identity_store`, `secrets`) too, updating the Spin manifest test lock. Also drop drift the branch introduced incidentally: - Re-exclude `trusted-server-openrtb-codegen` from the workspace (main excludes it; dropping the exclude orphaned it so its codegen `cargo run` failed with "believes it's in a workspace when it's not"). - Remove the unused `get_settings_from_services` (no callers; not on main). - Revert `body_as_reader` in proxy.rs/publisher.rs to main's signature; the branch had wrapped it in an infallible `Result`, inconsistent with the other `into_bytes().unwrap_or_default()` sites.
The Spin adapter compiled the example config in at build time (`include_str!`), so it ignored `ts config push --adapter spin` and could never serve real operator config. Load Settings at startup from a Spin key-value store instead, matching the Fastly/Axum config-store path. - Add `SpinKvConfigStore`, a Spin key-value-backed `PlatformConfigStore` (async `spin_sdk::key_value`, wasm+spin gated). This is distinct from the variable-backed `ConfigStoreHandleAdapter`: a multi-kilobyte app-config blob does not fit Spin's flat variable namespace. - `build_state` now loads via `get_settings_from_config_store` on the Spin runtime (store id and blob key both `app_config`); native test builds keep the embedded example config since Spin host KV is unavailable there. - Grant the component the `app_config` KV label in spin.toml and declare its backend in a new runtime-config.toml; `serve` passes `--runtime-config-file`. - Unseeded/unreadable store fails closed (503) via the startup error router. - Ignore `.spin/` local runtime state; update the architecture doc. Verified: wasm build, native tests, clippy (wasm + native), and a live `spin up` unseeded run (503). The seeded read is verified on every local layer (the pushed blob is a valid envelope at store/key `app_config`) but not live end-to-end: the installed Spin CLI (4.0) is newer than the Spin 2-3 SQLite KV schema that `ts config push --local` writes, so a Spin-2/3 runtime (or an edgezero-cli update) is needed to confirm seeded -> 200.
Two gaps surfaced while validating the Spin config-store load path, both in how the Spin adapter reports a failed startup: - The adapter had no `log` backend (`edgezero_adapter_spin` only calls a no-op `init_logger`), so every `log::error!` — including the startup-error chain — was silently dropped, making a config-load failure undebuggable. Install a minimal stderr `log` backend before dispatch; Spin captures component stderr to `.spin/logs/` and the Fermyon Cloud log stream. Confirmed live: the config-store read error now prints with its full error-stack chain. - `startup_error_router` hard-coded 503 for every startup failure, so a blob that reads but fails envelope/settings verification (a 500-class `Configuration` error) was misreported as a retryable 503. Map the response to the error's `status_code()` with the generic `user_message()` body, matching the Fastly/Axum startup routers. Add a test that a verify failure maps to 500, and make the existing method-coverage test use a realistic `ConfigStoreUnavailable` (it previously built a 400 but asserted 503, which only passed because of the hard-code).
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Automated review:
Review Summary
Reviewed PR #783 at head 03fc21c00528544717b398d481556842e3757026 against main, focusing on the config-store load/error-classification changes, Fastly/Axum/Spin HTTP response behavior, Spin runtime config-store integration, manifests, integration workflow changes, and prior review feedback. Overall risk looks moderate: CI is green and the main 503-vs-500 split is well covered, but I found one non-blocking correctness gap in the new Spin KV-backed app-config path where corrupt stored bytes are reported as retryable store unavailability.
Findings
P0 / Blockers
None.
P1 / High
None.
P2 / Medium
- Spin app-config values that are present but not UTF-8 are classified as retryable 503s — see inline comment on
crates/trusted-server-adapter-spin/src/platform.rs.
P3 / Low
None.
CI / Existing Reviews
- GitHub checks are currently passing.
- Existing review context: the previously raised store-name fallback, JA4 preflight status mapping, design-doc body mismatch, duplicate CI-step drift, and 503 body wording comments appear addressed by later commits. The earlier EdgeZero-entry-point integration probe comment remains a non-blocking coverage concern, but I am not duplicating it here.
- I did not run additional local test commands beyond inspecting the diff and CI results.
| "key `{key}` not found in Spin key-value store `{label}`" | ||
| )) | ||
| })?; | ||
| String::from_utf8(bytes).map_err(|e| { |
There was a problem hiding this comment.
Automated review: Present-but-corrupt Spin app-config bytes are reported as retryable store unavailability.
This String::from_utf8 failure means the app_config key was successfully opened and read, but the stored value is corrupt/not a valid text config blob. Because the error is returned as PlatformError::ConfigStore, read_config_entry() wraps it in TrustedServerError::ConfigStoreUnavailable, so Spin returns the new retryable 503 path instead of the intended 500-class "read succeeded but reconstruct/verify failed" path. In practice, an operator who accidentally seeds binary/corrupt bytes gets misleading retryable behavior and clients may keep retrying a terminal bad config.
Suggested fix: keep open/get/missing-key failures on the 503 path, but make a present value that cannot be decoded feed the verification/configuration path (for example, have this adapter convert invalid UTF-8 into a TrustedServerError::Configuration, or decode lossily/otherwise pass bytes forward so settings_from_config_blob fails as a 500), and add a Spin-specific regression test for non-UTF-8 bytes.
Summary
stackpop/edgezero#269. Stacked onfeature/ts-cli-next(which carries the #269 repin,Bodyfixes, the Fastly adapter migration, and config-store-backedSettingsload). Draft — base is unmerged; retarget tomainoncets-cli-nextlands.TrustedServerError::ConfigStoreUnavailable→ 503, while a reconstruct/verify failure (settings_from_config_entries: hash mismatch / unparseable) stays 500. One new error variant; no platform-layer change.run \ts config push``) goes to server logs (error chain); the public 503 body stays generic by design.Base note: diff is against
feature/ts-cli-next, so it shows only this branch's work. Againstmainit would include all ofts-cli-next.Changes
crates/trusted-server-core/src/error.rsConfigStoreUnavailable { store_name, message }variant → 503 (+ exhaustiveness guard, unit test)crates/trusted-server-core/src/settings_data.rsread_config_entryread failures →ConfigStoreUnavailable; tests (unseeded→503, malformed-hash→500, missing-listed-key→503, hint-in-chain)crates/trusted-server-adapter-fastly/src/error.rsConfigStoreUnavailablerenders 503 to client viato_error_responsecrates/integration-tests/Cargo.lockCloses
n/a — issue linking skipped by request.
Test plan
cargo test --workspace(core 1376 / adapter 39, 0 fail)cargo clippy --workspace --all-targets --all-features -- -D warningscargo fmt --all -- --checkcargo build -p trusted-server-adapter-fastly --release --target wasm32-wasip1Checklist
log, colocated tests)unwrap()in production code