Fix build saga crash recovery so resumed builds complete after a restart#880
Conversation
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 51 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis change defers build-saga recovery until after server boot dependencies (OCI registry, cluster.local resolution) are available, rather than performing it during Coordinator/SagaBuilder initialization. SagaBuilder.Init is split into a registration-only Init and a separate Recover method; Coordinator retains a SagaBuilder reference and exposes RecoverBuildSagas, invoked in a background goroutine from server startup. Separately, ListIncomplete in both saga storage implementations now deduplicates saga IDs when combining pending, running, and undoing status results to avoid double-processing the same execution. Sequence Diagram(s)sequenceDiagram
participant Server as cli/commands/server.go
participant Coordinator
participant SagaBuilder
Coordinator->>SagaBuilder: Init() during Start (registers saga definitions)
Coordinator->>Coordinator: store sagaBuilder field
Server->>Server: registry listening, cluster.local resolves
Server->>Coordinator: RecoverBuildSagas(ctx) in goroutine
Coordinator->>SagaBuilder: Recover(ctx)
SagaBuilder-->>Coordinator: error (logged, non-fatal)
Related PRs: None identified. Suggested labels: enhancement, bug Suggested reviewers: None identified. Poem Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cli/commands/server.go`:
- Around line 946-954: The background saga recovery in co.RecoverBuildSagas is
using the parent ctx instead of the shutdown-scoped sub context, so it can
outlive the server teardown path. Update the goroutine in server startup to pass
sub into co.RecoverBuildSagas so recovery is canceled with the same lifecycle as
the other errgroup-managed background tasks and stops during drain/restart.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7ee364fd-da90-4a6a-aeb3-8c2237ece958
📒 Files selected for processing (5)
cli/commands/server.gocomponents/coordinate/coordinate.gopkg/saga/eac_storage.gopkg/saga/storage.goservers/build/saga_builder.go
Recovered build sagas resumed the image push before the cluster registry and the cluster.local name mapping existed, because recovery ran inside co.Start() while both are established later in the server boot sequence. The push failed with "no such host" and the saga rolled back, so a crash-recovered build never completed (MIR-1285). Run build saga recovery from the boot sequence after the registry and name mapping are ready, rather than inside co.Start(). Recovery is split out of SagaBuilder.Init (now register-only) into a Recover method the coordinator drives via RecoverBuildSagas, and it runs in the background so a recovered build, which re-runs to completion and can take minutes, doesn't block boot. Also dedupe saga.ListIncomplete: it combined the pending/running/undoing status-index IDs without dedup, so a saga lingering in more than one index recovered twice, and the second pass collided creating the config version. Masked before because recovery always failed fast at the push; it surfaced once recovery started succeeding. Verified in the dev environment: deploy with an onbuild sleep, SIGKILL the coordinator mid-build, restart; the recovered build now completes and activates, the server boots promptly, and recovery runs exactly once.
6485bb1 to
21a173d
Compare
Saga crash recovery for builds worked in the narrow sense that the executor faithfully recovered an in-flight
build-from-tarsaga from etcd on restart and resumed thebuild-imageaction. What it couldn't do was finish. The recovered build would try to push its image tocluster.local:5000and fail withdial tcp: lookup cluster.local ... no such host, burn through buildkit's fast internal retries in about a second, and roll the whole thing back. A fresh deploy minutes later worked fine, which was the tell that this was an ordering problem, not a broken registry.Mapping the boot sequence made the cause clear, and it's more of a structural inversion than a race. Build saga recovery ran synchronously inside
co.Start(), but the things a push depends on (the flannel-derived router address, thecluster.localhost mapping,SetRegistryIP, and the registry listener itself) are all established later in the server boot path, afterco.Start()returns. So on every restart with an in-flight build, recovery resumed the push into a world where the push target didn't resolve yet.The fix is to run recovery from the boot sequence after those dependencies exist, rather than in the middle of coordinator startup. Recovery is split out of
SagaBuilder.Init(which now just registers the saga definition) into aRecovermethod the coordinator drives viaRecoverBuildSagas, called once the registry and name mapping are up.Verifying it end to end (deploy with an
onbuild = ["sleep 180"]window, SIGKILL the coordinator mid-build, restart) turned up two more things, which is why the diff is a little bigger than a pure reorder.First, once recovery actually succeeds it re-runs the whole build to completion, which takes minutes, and doing that synchronously wedged boot: the server never signaled ready and health checks would time out. The old code got away with synchronous recovery only because it always failed in about a second. So recovery now runs in the background, and boot completes promptly while the recovered build finishes on its own.
Second, with recovery finally getting past the push, it started colliding on
create-config-version. It turned outsaga.ListIncompletecombined the pending, running, and undoing status-index IDs without deduplicating, so a saga that lingered in more than one index got recovered twice, and the second pass re-ran already-completed actions and conflicted on an entity the first pass had created. This bug predates the change; it was just invisible while recovery always died at the push. Fixed by deduplicating in both saga storage backends.After all three, the repro is clean: the coordinator boots in a few seconds, recovery runs exactly once, and the resumed build completes and activates the new version, with zero DNS errors and zero conflicts.
The stale deployment lock a crashed deploy leaves behind (the other symptom in the original issue) is intentionally out of scope here and stays with MIR-681, where server-owned deployment lifecycle is the right home for it. The broader "boot ordering is implicit and fragile" problem this exposed is written up separately as an RFD (Boot Dependency Graph).
Closes MIR-1285