Fix build saga crash recovery so resumed builds complete after a restart by phinze · Pull Request #880 · mirendev/runtime

phinze · 2026-07-02T19:30:03Z

Saga crash recovery for builds worked in the narrow sense that the executor faithfully recovered an in-flight build-from-tar saga from etcd on restart and resumed the build-image action. What it couldn't do was finish. The recovered build would try to push its image to cluster.local:5000 and fail with dial tcp: lookup cluster.local ... no such host, burn through buildkit's fast internal retries in about a second, and roll the whole thing back. A fresh deploy minutes later worked fine, which was the tell that this was an ordering problem, not a broken registry.

Mapping the boot sequence made the cause clear, and it's more of a structural inversion than a race. Build saga recovery ran synchronously inside co.Start(), but the things a push depends on (the flannel-derived router address, the cluster.local host mapping, SetRegistryIP, and the registry listener itself) are all established later in the server boot path, after co.Start() returns. So on every restart with an in-flight build, recovery resumed the push into a world where the push target didn't resolve yet.

The fix is to run recovery from the boot sequence after those dependencies exist, rather than in the middle of coordinator startup. Recovery is split out of SagaBuilder.Init (which now just registers the saga definition) into a Recover method the coordinator drives via RecoverBuildSagas, called once the registry and name mapping are up.

Verifying it end to end (deploy with an onbuild = ["sleep 180"] window, SIGKILL the coordinator mid-build, restart) turned up two more things, which is why the diff is a little bigger than a pure reorder.

First, once recovery actually succeeds it re-runs the whole build to completion, which takes minutes, and doing that synchronously wedged boot: the server never signaled ready and health checks would time out. The old code got away with synchronous recovery only because it always failed in about a second. So recovery now runs in the background, and boot completes promptly while the recovered build finishes on its own.

Second, with recovery finally getting past the push, it started colliding on create-config-version. It turned out saga.ListIncomplete combined the pending, running, and undoing status-index IDs without deduplicating, so a saga that lingered in more than one index got recovered twice, and the second pass re-ran already-completed actions and conflicted on an entity the first pass had created. This bug predates the change; it was just invisible while recovery always died at the push. Fixed by deduplicating in both saga storage backends.

After all three, the repro is clean: the coordinator boots in a few seconds, recovery runs exactly once, and the resumed build completes and activates the new version, with zero DNS errors and zero conflicts.

The stale deployment lock a crashed deploy leaves behind (the other symptom in the original issue) is intentionally out of scope here and stays with MIR-681, where server-owned deployment lifecycle is the right home for it. The broader "boot ordering is implicit and fragile" problem this exposed is written up separately as an RFD (Boot Dependency Graph).

Closes MIR-1285

coderabbitai · 2026-07-02T19:35:09Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 51 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aef6c186-0eb3-4b39-b073-10b42364ef25

📥 Commits

Reviewing files that changed from the base of the PR and between 6485bb1 and 21a173d.

📒 Files selected for processing (5)

cli/commands/server.go
components/coordinate/coordinate.go
pkg/saga/eac_storage.go
pkg/saga/storage.go
servers/build/saga_builder.go

📝 Walkthrough

Walkthrough

This change defers build-saga recovery until after server boot dependencies (OCI registry, cluster.local resolution) are available, rather than performing it during Coordinator/SagaBuilder initialization. SagaBuilder.Init is split into a registration-only Init and a separate Recover method; Coordinator retains a SagaBuilder reference and exposes RecoverBuildSagas, invoked in a background goroutine from server startup. Separately, ListIncomplete in both saga storage implementations now deduplicates saga IDs when combining pending, running, and undoing status results to avoid double-processing the same execution.

Sequence Diagram(s)

sequenceDiagram
  participant Server as cli/commands/server.go
  participant Coordinator
  participant SagaBuilder

  Coordinator->>SagaBuilder: Init() during Start (registers saga definitions)
  Coordinator->>Coordinator: store sagaBuilder field
  Server->>Server: registry listening, cluster.local resolves
  Server->>Coordinator: RecoverBuildSagas(ctx) in goroutine
  Coordinator->>SagaBuilder: Recover(ctx)
  SagaBuilder-->>Coordinator: error (logged, non-fatal)

Related PRs: None identified.

Suggested labels: enhancement, bug

Suggested reviewers: None identified.

Poem
A saga once stalled mid-flight,
now waits for registry's green light.
No double IDs to confuse the ledger's trace,
a seen map keeps each execution in its place.
Recovery resumes, quiet, in the night. 🌙

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cli/commands/server.go`:
- Around line 946-954: The background saga recovery in co.RecoverBuildSagas is
using the parent ctx instead of the shutdown-scoped sub context, so it can
outlive the server teardown path. Update the goroutine in server startup to pass
sub into co.RecoverBuildSagas so recovery is canceled with the same lifecycle as
the other errgroup-managed background tasks and stops during drain/restart.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7ee364fd-da90-4a6a-aeb3-8c2237ece958

📥 Commits

Reviewing files that changed from the base of the PR and between af5d3cb and 6485bb1.

📒 Files selected for processing (5)

cli/commands/server.go
components/coordinate/coordinate.go
pkg/saga/eac_storage.go
pkg/saga/storage.go
servers/build/saga_builder.go

Recovered build sagas resumed the image push before the cluster registry and the cluster.local name mapping existed, because recovery ran inside co.Start() while both are established later in the server boot sequence. The push failed with "no such host" and the saga rolled back, so a crash-recovered build never completed (MIR-1285). Run build saga recovery from the boot sequence after the registry and name mapping are ready, rather than inside co.Start(). Recovery is split out of SagaBuilder.Init (now register-only) into a Recover method the coordinator drives via RecoverBuildSagas, and it runs in the background so a recovered build, which re-runs to completion and can take minutes, doesn't block boot. Also dedupe saga.ListIncomplete: it combined the pending/running/undoing status-index IDs without dedup, so a saga lingering in more than one index recovered twice, and the second pass collided creating the config version. Masked before because recovery always failed fast at the push; it surfaced once recovery started succeeding. Verified in the dev environment: deploy with an onbuild sleep, SIGKILL the coordinator mid-build, restart; the recovered build now completes and activates, the server boots promptly, and recovery runs exactly once.

phinze requested a review from a team as a code owner July 2, 2026 19:30

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread cli/commands/server.go

phinze force-pushed the phinze/mir-1285-build-saga-crash-recovery-resumed-build-cant-complete-after branch from 6485bb1 to 21a173d Compare July 2, 2026 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix build saga crash recovery so resumed builds complete after a restart#880

Fix build saga crash recovery so resumed builds complete after a restart#880
phinze wants to merge 1 commit into
mainfrom
phinze/mir-1285-build-saga-crash-recovery-resumed-build-cant-complete-after

phinze commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Review limit reached

Walkthrough

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

phinze commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading