Skip to content

Fix build saga crash recovery so resumed builds complete after a restart#880

Open
phinze wants to merge 1 commit into
mainfrom
phinze/mir-1285-build-saga-crash-recovery-resumed-build-cant-complete-after
Open

Fix build saga crash recovery so resumed builds complete after a restart#880
phinze wants to merge 1 commit into
mainfrom
phinze/mir-1285-build-saga-crash-recovery-resumed-build-cant-complete-after

Conversation

@phinze

@phinze phinze commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Saga crash recovery for builds worked in the narrow sense that the executor faithfully recovered an in-flight build-from-tar saga from etcd on restart and resumed the build-image action. What it couldn't do was finish. The recovered build would try to push its image to cluster.local:5000 and fail with dial tcp: lookup cluster.local ... no such host, burn through buildkit's fast internal retries in about a second, and roll the whole thing back. A fresh deploy minutes later worked fine, which was the tell that this was an ordering problem, not a broken registry.

Mapping the boot sequence made the cause clear, and it's more of a structural inversion than a race. Build saga recovery ran synchronously inside co.Start(), but the things a push depends on (the flannel-derived router address, the cluster.local host mapping, SetRegistryIP, and the registry listener itself) are all established later in the server boot path, after co.Start() returns. So on every restart with an in-flight build, recovery resumed the push into a world where the push target didn't resolve yet.

The fix is to run recovery from the boot sequence after those dependencies exist, rather than in the middle of coordinator startup. Recovery is split out of SagaBuilder.Init (which now just registers the saga definition) into a Recover method the coordinator drives via RecoverBuildSagas, called once the registry and name mapping are up.

Verifying it end to end (deploy with an onbuild = ["sleep 180"] window, SIGKILL the coordinator mid-build, restart) turned up two more things, which is why the diff is a little bigger than a pure reorder.

First, once recovery actually succeeds it re-runs the whole build to completion, which takes minutes, and doing that synchronously wedged boot: the server never signaled ready and health checks would time out. The old code got away with synchronous recovery only because it always failed in about a second. So recovery now runs in the background, and boot completes promptly while the recovered build finishes on its own.

Second, with recovery finally getting past the push, it started colliding on create-config-version. It turned out saga.ListIncomplete combined the pending, running, and undoing status-index IDs without deduplicating, so a saga that lingered in more than one index got recovered twice, and the second pass re-ran already-completed actions and conflicted on an entity the first pass had created. This bug predates the change; it was just invisible while recovery always died at the push. Fixed by deduplicating in both saga storage backends.

After all three, the repro is clean: the coordinator boots in a few seconds, recovery runs exactly once, and the resumed build completes and activates the new version, with zero DNS errors and zero conflicts.

The stale deployment lock a crashed deploy leaves behind (the other symptom in the original issue) is intentionally out of scope here and stays with MIR-681, where server-owned deployment lifecycle is the right home for it. The broader "boot ordering is implicit and fragile" problem this exposed is written up separately as an RFD (Boot Dependency Graph).

Closes MIR-1285

@phinze phinze requested a review from a team as a code owner July 2, 2026 19:30
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 51 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aef6c186-0eb3-4b39-b073-10b42364ef25

📥 Commits

Reviewing files that changed from the base of the PR and between 6485bb1 and 21a173d.

📒 Files selected for processing (5)
  • cli/commands/server.go
  • components/coordinate/coordinate.go
  • pkg/saga/eac_storage.go
  • pkg/saga/storage.go
  • servers/build/saga_builder.go
📝 Walkthrough

Walkthrough

This change defers build-saga recovery until after server boot dependencies (OCI registry, cluster.local resolution) are available, rather than performing it during Coordinator/SagaBuilder initialization. SagaBuilder.Init is split into a registration-only Init and a separate Recover method; Coordinator retains a SagaBuilder reference and exposes RecoverBuildSagas, invoked in a background goroutine from server startup. Separately, ListIncomplete in both saga storage implementations now deduplicates saga IDs when combining pending, running, and undoing status results to avoid double-processing the same execution.

Sequence Diagram(s)

sequenceDiagram
  participant Server as cli/commands/server.go
  participant Coordinator
  participant SagaBuilder

  Coordinator->>SagaBuilder: Init() during Start (registers saga definitions)
  Coordinator->>Coordinator: store sagaBuilder field
  Server->>Server: registry listening, cluster.local resolves
  Server->>Coordinator: RecoverBuildSagas(ctx) in goroutine
  Coordinator->>SagaBuilder: Recover(ctx)
  SagaBuilder-->>Coordinator: error (logged, non-fatal)
Loading

Related PRs: None identified.

Suggested labels: enhancement, bug

Suggested reviewers: None identified.

Poem
A saga once stalled mid-flight,
now waits for registry's green light.
No double IDs to confuse the ledger's trace,
a seen map keeps each execution in its place.
Recovery resumes, quiet, in the night. 🌙


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cli/commands/server.go`:
- Around line 946-954: The background saga recovery in co.RecoverBuildSagas is
using the parent ctx instead of the shutdown-scoped sub context, so it can
outlive the server teardown path. Update the goroutine in server startup to pass
sub into co.RecoverBuildSagas so recovery is canceled with the same lifecycle as
the other errgroup-managed background tasks and stops during drain/restart.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7ee364fd-da90-4a6a-aeb3-8c2237ece958

📥 Commits

Reviewing files that changed from the base of the PR and between af5d3cb and 6485bb1.

📒 Files selected for processing (5)
  • cli/commands/server.go
  • components/coordinate/coordinate.go
  • pkg/saga/eac_storage.go
  • pkg/saga/storage.go
  • servers/build/saga_builder.go

Comment thread cli/commands/server.go
Recovered build sagas resumed the image push before the cluster
registry and the cluster.local name mapping existed, because recovery
ran inside co.Start() while both are established later in the server
boot sequence. The push failed with "no such host" and the saga rolled
back, so a crash-recovered build never completed (MIR-1285).

Run build saga recovery from the boot sequence after the registry and
name mapping are ready, rather than inside co.Start(). Recovery is
split out of SagaBuilder.Init (now register-only) into a Recover method
the coordinator drives via RecoverBuildSagas, and it runs in the
background so a recovered build, which re-runs to completion and can
take minutes, doesn't block boot.

Also dedupe saga.ListIncomplete: it combined the pending/running/undoing
status-index IDs without dedup, so a saga lingering in more than one
index recovered twice, and the second pass collided creating the config
version. Masked before because recovery always failed fast at the push;
it surfaced once recovery started succeeding.

Verified in the dev environment: deploy with an onbuild sleep, SIGKILL
the coordinator mid-build, restart; the recovered build now completes
and activates, the server boots promptly, and recovery runs exactly
once.
@phinze phinze force-pushed the phinze/mir-1285-build-saga-crash-recovery-resumed-build-cant-complete-after branch from 6485bb1 to 21a173d Compare July 2, 2026 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant