Prune old app versions and reclaim their image blobs#877
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (12)
📝 WalkthroughWalkthroughChangesThis PR adds configurable AppVersion retention, introduces a version GC controller that deletes old versions by count and period while respecting active and live-sandbox pins, switches artifact GC to archive only unreferenced active artifacts, and centralizes AppVersion deletion into a shared helper used by ephemeral cleanup. The coordinator and server startup now wire the new retention settings into the version GC controller. Documentation and generated server config files were updated for the new Sequence Diagram(s)See the embedded diagrams above. Related PRs: None identified. Suggested labels: enhancement, garbage-collection, configuration Suggested reviewers: repository maintainers familiar with coordinator, controller, and serverconfig code Poem Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/serverconfig/validation.gen.go (1)
62-68: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winStub validation leaves malformed
retention_periodunvalidated; stale comment doesn't match the field.Two issues here:
- The comment
Check for port conflicts in AppVersionConfigis copy-pasted boilerplate unrelated to retention settings (there are no ports in this config).RetentionPeriodis a free-form duration string parsed downstream incli/commands/server.goviaunits.ParseDuration, whose error is discarded and silently falls back to "use default." Validating the duration format here (and non-negativeRetentionCount) would surface misconfiguration at startup instead of silently degrading to defaults.Since this is a generated file, the underlying fix likely belongs in the schemagen template/schema.yml rather than this file directly.
🛠️ Suggested validation
-// Validate validates AppVersionConfig -func (c *AppVersionConfig) Validate() error { - - // Check for port conflicts in AppVersionConfig - - return nil -} +// Validate validates AppVersionConfig +func (c *AppVersionConfig) Validate() error { + if c.RetentionCount != nil && *c.RetentionCount < 0 { + return fmt.Errorf("retention_count must be non-negative, got %d", *c.RetentionCount) + } + if c.RetentionPeriod != nil && *c.RetentionPeriod != "" { + if _, err := units.ParseDuration(*c.RetentionPeriod); err != nil { + return fmt.Errorf("invalid retention_period %q: %w", *c.RetentionPeriod, err) + } + } + return nil +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/serverconfig/validation.gen.go` around lines 62 - 68, Update the AppVersionConfig validation so it matches the retention settings instead of the stale port-conflict comment. In the generated validation path for AppVersionConfig, add checks that RetentionPeriod parses as a valid duration format and that RetentionCount is non-negative, and return an error on invalid values rather than silently accepting them. Since this is generated code, make the change in the schemagen template/schema definition that feeds validation for AppVersionConfig, and ensure the comment/docstring reflects retention validation instead of ports.pkg/serverconfig/schema.yml (1)
515-536: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd format validation for
retention_periodto match sibling configs.
retention_periodhere has novalidationblock, unlikeVictoriaLogsConfig.retention_period(line 368-377) which validates against'^\d+(ms|s|m|h|d|w|y)$'. Without it, a malformed value (e.g. typo like30dd) isn't caught at config load; it silently parses to0incli/commands/server.goand falls back to the controller default, masking the misconfiguration.♻️ Proposed addition
retention_period: type: string default: "30d" cli: long: app-version-retention-period description: Retain versions newer than this duration regardless of count (e.g. 30d, 2w) env: MIREN_APP_VERSION_RETENTION_PERIOD toml: retention_period + validation: + regex: '^\d+(ms|s|m|h|d|w|y)$'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/serverconfig/schema.yml` around lines 515 - 536, Add a validation block to AppVersionConfig.retention_period in schema.yml so it matches the sibling VictoriaLogsConfig.retention_period format check. Update the AppVersionConfig schema field to validate against the same duration regex used elsewhere (for example, via the existing validation pattern on retention_period) so malformed values are rejected at config load instead of silently falling through in cli/commands/server.go. Locate the change in AppVersionConfig and keep the field naming and CLI/env/TOML wiring unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/artifact/gc.go`:
- Around line 136-161: The artifact GC flow in
c.archiveArtifact/c.referencedArtifacts computes referenced once and then
archives based on stale state; add a fresh reference check for art.ID
immediately before calling archiveArtifact, or make archiveArtifact verify
current references itself using the artifact ID and revision. Keep the fix
localized around the loop over resp.Values() and the archiveArtifact path so a
newly referenced artifact cannot be marked ARCHIVED.
In `@controllers/version/gc.go`:
- Around line 146-210: Recheck the pin state immediately before deleting in the
GC sweep: the snapshots from activeVersions and versionsWithLiveSandboxes in the
version GC flow can become stale during a long run. In the loop that processes
each version before appversion.Delete, add a final guard using the same
active/live lookup logic (or an entity precondition) so a version that became
active or gained a live sandbox reference is retained instead of hard-deleted.
In `@pkg/appversion/delete.go`:
- Around line 31-39: In delete.go, the Delete flow in the AppVersion cleanup
should remove the parent AppVersion first and only then attempt the best-effort
ConfigVersion deletion. Update the logic around eac.Delete in the
deleteAppVersion path so the version.ID deletion happens before the
version.ConfigVersion cleanup, and keep the warning-only behavior for the config
cleanup after the main delete succeeds.
---
Nitpick comments:
In `@pkg/serverconfig/schema.yml`:
- Around line 515-536: Add a validation block to
AppVersionConfig.retention_period in schema.yml so it matches the sibling
VictoriaLogsConfig.retention_period format check. Update the AppVersionConfig
schema field to validate against the same duration regex used elsewhere (for
example, via the existing validation pattern on retention_period) so malformed
values are rejected at config load instead of silently falling through in
cli/commands/server.go. Locate the change in AppVersionConfig and keep the field
naming and CLI/env/TOML wiring unchanged.
In `@pkg/serverconfig/validation.gen.go`:
- Around line 62-68: Update the AppVersionConfig validation so it matches the
retention settings instead of the stale port-conflict comment. In the generated
validation path for AppVersionConfig, add checks that RetentionPeriod parses as
a valid duration format and that RetentionCount is non-negative, and return an
error on invalid values rather than silently accepting them. Since this is
generated code, make the change in the schemagen template/schema definition that
feeds validation for AppVersionConfig, and ensure the comment/docstring reflects
retention validation instead of ports.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a85ccd2a-7736-4d9a-b03f-38a7f389402b
📒 Files selected for processing (16)
cli/commands/server.gocomponents/coordinate/coordinate.gocontrollers/artifact/gc.gocontrollers/artifact/gc_test.gocontrollers/ephemeral/gc.gocontrollers/version/gc.gocontrollers/version/gc_test.godocs/docs/server-config.mdpkg/appversion/delete.gopkg/ephemeral/manage.gopkg/serverconfig/cli.gen.gopkg/serverconfig/config.gen.gopkg/serverconfig/defaults.gen.gopkg/serverconfig/loader.gen.gopkg/serverconfig/schema.ymlpkg/serverconfig/validation.gen.go
e27dd1a to
8dbd29a
Compare
We always knew app version retention was coming: every deploy mints a new AppVersion, and until now they lived forever. That unbounded growth bloated the entity store and made the ephemeral GC's every-5-minute all-versions scan grow with the cluster's entire deploy history. It also kept disk pinned, since each version references an Artifact and Artifacts reference registry blobs, so the image and blob GC could never reclaim old images. This adds a periodic version retention GC that keeps, per app, the most recent N versions or anything newer than a retention period (whichever keeps it), always keeps the active version, and skips any version a live sandbox still references. Pruned versions are hard-deleted along with their 1:1 ConfigVersion. Artifacts are shared many-versions-to-one (deduped by manifest digest), so we can't naively archive a deleted version's artifact without risking a 404 for another version that shares it. So artifact archiving becomes reference-driven: the artifact GC now archives any active Artifact that zero surviving versions reference, computed cluster-wide, instead of its old blind age/count rule. That hands the existing image and blob GC chain something to reclaim, and shared artifacts survive until their last referencing version is pruned. Sandbox-pool teardown was pulled into a shared appversion.DeleteWithPools that only the ephemeral path needs; the version GC uses a plain Delete, since pools already self-reap via the pool manager. Retention is configurable via app_version.retention_count and app_version.retention_period (defaults 10 / 30d). Closes MIR-329
8dbd29a to
c261840
Compare
|
On the |
App version retention is something we always figured we'd need eventually, and this is us building it. Every deploy mints a new AppVersion, and until now they lived forever. That unbounded growth had two costs: it bloated the entity store (and made the ephemeral GC's every-5-minute all-versions scan grow with the cluster's whole deploy history), and it kept disk pinned. Each version references an Artifact, and Artifacts reference registry blobs, so as long as old versions stuck around the image and blob GC could never reclaim old images.
So this adds a periodic version retention GC, modeled on the existing artifact GC and wired in beside it. Per app it keeps the most recent
retention_countversions or anything newer thanretention_period, whichever rule keeps it, always keeps the active version, and skips anything a running or pending sandbox still references. Everything else is hard-deleted along with its 1:1 ConfigVersion.The interesting part is the disk half. Artifacts are shared many-versions-to-one (they're deduped by manifest digest), so you can't just archive a deleted version's artifact, since another live version might share the same image and would start 404ing. Instead, artifact archiving becomes reference-driven: the artifact GC's old blind age/count retention is replaced with "archive any active artifact that zero surviving versions reference," computed across all versions. That gives the existing image and blob GC chain something to actually reclaim, and a shared artifact naturally survives until its last referencing version is pruned.
While in here, the sandbox-pool cleanup that used to live in the ephemeral delete path got pulled into a shared helper. It turned out the version GC doesn't need it at all: the pool manager already reaps drained, unreferenced pools on its own, and the deploy launcher de-refs superseded pools. So there's now a plain
appversion.Delete(a version plus its ConfigVersion) that the version GC uses, and aDeleteWithPoolswrapper that only the ephemeral path calls, where prompt teardown actually matters.Retention is configurable through
app_version.retention_countandapp_version.retention_period(defaults 10 and 30d), following the duration-string style of the existing victorialogs/buildkit knobs, and documented in the server config reference.Closes MIR-329