Skip to content

Prune old app versions and reclaim their image blobs#877

Open
phinze wants to merge 1 commit into
mainfrom
phinze/mir-329-app-version-retention-prune-old-versions-beyond-a
Open

Prune old app versions and reclaim their image blobs#877
phinze wants to merge 1 commit into
mainfrom
phinze/mir-329-app-version-retention-prune-old-versions-beyond-a

Conversation

@phinze

@phinze phinze commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

App version retention is something we always figured we'd need eventually, and this is us building it. Every deploy mints a new AppVersion, and until now they lived forever. That unbounded growth had two costs: it bloated the entity store (and made the ephemeral GC's every-5-minute all-versions scan grow with the cluster's whole deploy history), and it kept disk pinned. Each version references an Artifact, and Artifacts reference registry blobs, so as long as old versions stuck around the image and blob GC could never reclaim old images.

So this adds a periodic version retention GC, modeled on the existing artifact GC and wired in beside it. Per app it keeps the most recent retention_count versions or anything newer than retention_period, whichever rule keeps it, always keeps the active version, and skips anything a running or pending sandbox still references. Everything else is hard-deleted along with its 1:1 ConfigVersion.

The interesting part is the disk half. Artifacts are shared many-versions-to-one (they're deduped by manifest digest), so you can't just archive a deleted version's artifact, since another live version might share the same image and would start 404ing. Instead, artifact archiving becomes reference-driven: the artifact GC's old blind age/count retention is replaced with "archive any active artifact that zero surviving versions reference," computed across all versions. That gives the existing image and blob GC chain something to actually reclaim, and a shared artifact naturally survives until its last referencing version is pruned.

While in here, the sandbox-pool cleanup that used to live in the ephemeral delete path got pulled into a shared helper. It turned out the version GC doesn't need it at all: the pool manager already reaps drained, unreferenced pools on its own, and the deploy launcher de-refs superseded pools. So there's now a plain appversion.Delete (a version plus its ConfigVersion) that the version GC uses, and a DeleteWithPools wrapper that only the ephemeral path calls, where prompt teardown actually matters.

Retention is configurable through app_version.retention_count and app_version.retention_period (defaults 10 and 30d), following the duration-string style of the existing victorialogs/buildkit knobs, and documented in the server config reference.

Closes MIR-329

@phinze phinze requested a review from a team as a code owner July 1, 2026 14:44
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 772ad6a2-3250-4109-a809-e9a0a51fde13

📥 Commits

Reviewing files that changed from the base of the PR and between e27dd1a and c261840.

📒 Files selected for processing (17)
  • cli/commands/server.go
  • components/coordinate/coordinate.go
  • controllers/artifact/gc.go
  • controllers/artifact/gc_test.go
  • controllers/ephemeral/gc.go
  • controllers/version/gc.go
  • controllers/version/gc_test.go
  • docs/docs/command/server.md
  • docs/docs/server-config.md
  • pkg/appversion/delete.go
  • pkg/ephemeral/manage.go
  • pkg/serverconfig/cli.gen.go
  • pkg/serverconfig/config.gen.go
  • pkg/serverconfig/defaults.gen.go
  • pkg/serverconfig/loader.gen.go
  • pkg/serverconfig/schema.yml
  • pkg/serverconfig/validation.gen.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/serverconfig/cli.gen.go
  • docs/docs/server-config.md
🚧 Files skipped from review as they are similar to previous changes (12)
  • pkg/serverconfig/schema.yml
  • cli/commands/server.go
  • pkg/serverconfig/validation.gen.go
  • pkg/serverconfig/loader.gen.go
  • pkg/serverconfig/config.gen.go
  • controllers/ephemeral/gc.go
  • controllers/version/gc_test.go
  • controllers/artifact/gc_test.go
  • components/coordinate/coordinate.go
  • controllers/artifact/gc.go
  • controllers/version/gc.go
  • pkg/ephemeral/manage.go

📝 Walkthrough

Walkthrough

Changes

This PR adds configurable AppVersion retention, introduces a version GC controller that deletes old versions by count and period while respecting active and live-sandbox pins, switches artifact GC to archive only unreferenced active artifacts, and centralizes AppVersion deletion into a shared helper used by ephemeral cleanup. The coordinator and server startup now wire the new retention settings into the version GC controller. Documentation and generated server config files were updated for the new app_version settings.

Sequence Diagram(s)

See the embedded diagrams above.

Related PRs: None identified.

Suggested labels: enhancement, garbage-collection, configuration

Suggested reviewers: repository maintainers familiar with coordinator, controller, and serverconfig code

Poem
Versions age, but some remain,
Pinned by actives, live again.
Artifacts now heed references,
Config steers the GC senses.


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/serverconfig/validation.gen.go (1)

62-68: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Stub validation leaves malformed retention_period unvalidated; stale comment doesn't match the field.

Two issues here:

  • The comment Check for port conflicts in AppVersionConfig is copy-pasted boilerplate unrelated to retention settings (there are no ports in this config).
  • RetentionPeriod is a free-form duration string parsed downstream in cli/commands/server.go via units.ParseDuration, whose error is discarded and silently falls back to "use default." Validating the duration format here (and non-negative RetentionCount) would surface misconfiguration at startup instead of silently degrading to defaults.

Since this is a generated file, the underlying fix likely belongs in the schemagen template/schema.yml rather than this file directly.

🛠️ Suggested validation
-// Validate validates AppVersionConfig
-func (c *AppVersionConfig) Validate() error {
-
-	// Check for port conflicts in AppVersionConfig
-
-	return nil
-}
+// Validate validates AppVersionConfig
+func (c *AppVersionConfig) Validate() error {
+	if c.RetentionCount != nil && *c.RetentionCount < 0 {
+		return fmt.Errorf("retention_count must be non-negative, got %d", *c.RetentionCount)
+	}
+	if c.RetentionPeriod != nil && *c.RetentionPeriod != "" {
+		if _, err := units.ParseDuration(*c.RetentionPeriod); err != nil {
+			return fmt.Errorf("invalid retention_period %q: %w", *c.RetentionPeriod, err)
+		}
+	}
+	return nil
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/serverconfig/validation.gen.go` around lines 62 - 68, Update the
AppVersionConfig validation so it matches the retention settings instead of the
stale port-conflict comment. In the generated validation path for
AppVersionConfig, add checks that RetentionPeriod parses as a valid duration
format and that RetentionCount is non-negative, and return an error on invalid
values rather than silently accepting them. Since this is generated code, make
the change in the schemagen template/schema definition that feeds validation for
AppVersionConfig, and ensure the comment/docstring reflects retention validation
instead of ports.
pkg/serverconfig/schema.yml (1)

515-536: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add format validation for retention_period to match sibling configs.

retention_period here has no validation block, unlike VictoriaLogsConfig.retention_period (line 368-377) which validates against '^\d+(ms|s|m|h|d|w|y)$'. Without it, a malformed value (e.g. typo like 30dd) isn't caught at config load; it silently parses to 0 in cli/commands/server.go and falls back to the controller default, masking the misconfiguration.

♻️ Proposed addition
       retention_period:
         type: string
         default: "30d"
         cli:
           long: app-version-retention-period
           description: Retain versions newer than this duration regardless of count (e.g. 30d, 2w)
         env: MIREN_APP_VERSION_RETENTION_PERIOD
         toml: retention_period
+        validation:
+          regex: '^\d+(ms|s|m|h|d|w|y)$'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/serverconfig/schema.yml` around lines 515 - 536, Add a validation block
to AppVersionConfig.retention_period in schema.yml so it matches the sibling
VictoriaLogsConfig.retention_period format check. Update the AppVersionConfig
schema field to validate against the same duration regex used elsewhere (for
example, via the existing validation pattern on retention_period) so malformed
values are rejected at config load instead of silently falling through in
cli/commands/server.go. Locate the change in AppVersionConfig and keep the field
naming and CLI/env/TOML wiring unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/artifact/gc.go`:
- Around line 136-161: The artifact GC flow in
c.archiveArtifact/c.referencedArtifacts computes referenced once and then
archives based on stale state; add a fresh reference check for art.ID
immediately before calling archiveArtifact, or make archiveArtifact verify
current references itself using the artifact ID and revision. Keep the fix
localized around the loop over resp.Values() and the archiveArtifact path so a
newly referenced artifact cannot be marked ARCHIVED.

In `@controllers/version/gc.go`:
- Around line 146-210: Recheck the pin state immediately before deleting in the
GC sweep: the snapshots from activeVersions and versionsWithLiveSandboxes in the
version GC flow can become stale during a long run. In the loop that processes
each version before appversion.Delete, add a final guard using the same
active/live lookup logic (or an entity precondition) so a version that became
active or gained a live sandbox reference is retained instead of hard-deleted.

In `@pkg/appversion/delete.go`:
- Around line 31-39: In delete.go, the Delete flow in the AppVersion cleanup
should remove the parent AppVersion first and only then attempt the best-effort
ConfigVersion deletion. Update the logic around eac.Delete in the
deleteAppVersion path so the version.ID deletion happens before the
version.ConfigVersion cleanup, and keep the warning-only behavior for the config
cleanup after the main delete succeeds.

---

Nitpick comments:
In `@pkg/serverconfig/schema.yml`:
- Around line 515-536: Add a validation block to
AppVersionConfig.retention_period in schema.yml so it matches the sibling
VictoriaLogsConfig.retention_period format check. Update the AppVersionConfig
schema field to validate against the same duration regex used elsewhere (for
example, via the existing validation pattern on retention_period) so malformed
values are rejected at config load instead of silently falling through in
cli/commands/server.go. Locate the change in AppVersionConfig and keep the field
naming and CLI/env/TOML wiring unchanged.

In `@pkg/serverconfig/validation.gen.go`:
- Around line 62-68: Update the AppVersionConfig validation so it matches the
retention settings instead of the stale port-conflict comment. In the generated
validation path for AppVersionConfig, add checks that RetentionPeriod parses as
a valid duration format and that RetentionCount is non-negative, and return an
error on invalid values rather than silently accepting them. Since this is
generated code, make the change in the schemagen template/schema definition that
feeds validation for AppVersionConfig, and ensure the comment/docstring reflects
retention validation instead of ports.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a85ccd2a-7736-4d9a-b03f-38a7f389402b

📥 Commits

Reviewing files that changed from the base of the PR and between a5ae843 and e27dd1a.

📒 Files selected for processing (16)
  • cli/commands/server.go
  • components/coordinate/coordinate.go
  • controllers/artifact/gc.go
  • controllers/artifact/gc_test.go
  • controllers/ephemeral/gc.go
  • controllers/version/gc.go
  • controllers/version/gc_test.go
  • docs/docs/server-config.md
  • pkg/appversion/delete.go
  • pkg/ephemeral/manage.go
  • pkg/serverconfig/cli.gen.go
  • pkg/serverconfig/config.gen.go
  • pkg/serverconfig/defaults.gen.go
  • pkg/serverconfig/loader.gen.go
  • pkg/serverconfig/schema.yml
  • pkg/serverconfig/validation.gen.go

Comment thread controllers/artifact/gc.go
Comment thread controllers/version/gc.go
Comment thread pkg/appversion/delete.go Outdated
@phinze phinze force-pushed the phinze/mir-329-app-version-retention-prune-old-versions-beyond-a branch from e27dd1a to 8dbd29a Compare July 1, 2026 14:56
We always knew app version retention was coming: every deploy mints a
new AppVersion, and until now they lived forever. That unbounded growth
bloated the entity store and made the ephemeral GC's every-5-minute
all-versions scan grow with the cluster's entire deploy history. It
also kept disk pinned, since each version references an Artifact and
Artifacts reference registry blobs, so the image and blob GC could
never reclaim old images.

This adds a periodic version retention GC that keeps, per app, the most
recent N versions or anything newer than a retention period (whichever
keeps it), always keeps the active version, and skips any version a
live sandbox still references. Pruned versions are hard-deleted along
with their 1:1 ConfigVersion.

Artifacts are shared many-versions-to-one (deduped by manifest digest),
so we can't naively archive a deleted version's artifact without
risking a 404 for another version that shares it. So artifact archiving
becomes reference-driven: the artifact GC now archives any active
Artifact that zero surviving versions reference, computed cluster-wide,
instead of its old blind age/count rule. That hands the existing image
and blob GC chain something to reclaim, and shared artifacts survive
until their last referencing version is pruned.

Sandbox-pool teardown was pulled into a shared
appversion.DeleteWithPools that only the ephemeral path needs; the
version GC uses a plain Delete, since pools already self-reap via the
pool manager.

Retention is configurable via app_version.retention_count and
app_version.retention_period (defaults 10 / 30d).

Closes MIR-329
@phinze phinze force-pushed the phinze/mir-329-app-version-retention-prune-old-versions-beyond-a branch from 8dbd29a to c261840 Compare July 1, 2026 15:46
@phinze

phinze commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

On the AppVersionConfig.Validate() nitpick: that empty stub and the "port conflicts" comment are boilerplate the config generator emits for every type, not specific to this change, and the malformed-retention_period case already fails safe (parse error falls back to the default). Filed MIR-1284 to fix the template (drop the stale comment, optionally emit real validation) rather than hand-patch a generated file here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant