Enforce BuildKit's GC byte cap with age-less policy tiers#879
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughChangesThis change updates the BuildKit daemon configuration template to replace a single GC policy using Estimated code review effort: High Suggested labels: buildkit, configuration, tests Suggested reviewers: None identified Sequence Diagram(s)Not applicable. Poem: Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
components/buildkit/config_test.go (2)
38-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the final
all = truecap tier too.Line 54 passes even if the last-resort
all = truepolicy is removed, leaving internal/frontend cache outside the enforced ceiling the PR intends to protect.Suggested test tightening
r.GreaterOrEqual(ageless, 1, "at least one tier must enforce maxUsedSpace with no keepDuration") + + last := blocks[len(blocks)-1] + r.Contains(last, "all = true", "last tier should include all cache types") + r.Contains(last, "maxUsedSpace = 10737418240", "last tier should enforce the configured byte cap") + r.NotContains(last, "keepDuration", "last tier must not be age-gated") })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/buildkit/config_test.go` around lines 38 - 55, The gc policy test in the “enforces the byte cap with an age-less tier” case only checks for an ageless maxUsedSpace tier, so it can still pass if the final all=true cap tier is missing. Tighten the assertions around gcPolicyBlocks(config) to also verify the presence of the last-resort all=true policy in addition to the existing maxUsedSpace/keepDuration check, using the same block scan logic in this test.
63-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winCover the full source filter contract.
This currently protects
source.localandexec.cachemount, but droppingsource.git.checkoutwould still pass despite being part of the cheap-to-rebuild tier.Suggested assertion
// A single comma-joined filter string is AND'd and matches no record. r.NotContains(config, "type==source.local,type==exec.cachemount") - r.Contains(config, `"type==source.local"`) - r.Contains(config, `"type==exec.cachemount"`) + r.Contains(config, `filters = [ "type==source.local", "type==exec.cachemount", "type==source.git.checkout" ]`) })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/buildkit/config_test.go` around lines 63 - 68, The current test in config_test.go only verifies the source filter OR-array form for source.local and exec.cachemount, so it still misses source.git.checkout. Update the existing test around the config assertions to also check that source.git.checkout is included as its own array entry in the cheap-to-rebuild source filter contract, using the same config string assertions in the test case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@components/buildkit/config_test.go`:
- Around line 38-55: The gc policy test in the “enforces the byte cap with an
age-less tier” case only checks for an ageless maxUsedSpace tier, so it can
still pass if the final all=true cap tier is missing. Tighten the assertions
around gcPolicyBlocks(config) to also verify the presence of the last-resort
all=true policy in addition to the existing maxUsedSpace/keepDuration check,
using the same block scan logic in this test.
- Around line 63-68: The current test in config_test.go only verifies the source
filter OR-array form for source.local and exec.cachemount, so it still misses
source.git.checkout. Update the existing test around the config assertions to
also check that source.git.checkout is included as its own array entry in the
cheap-to-rebuild source filter contract, using the same config string assertions
in the test case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 549f7f22-9da1-46f6-b5ef-768e401bea05
📒 Files selected for processing (2)
components/buildkit/buildkit.gocomponents/buildkit/config_test.go
d74b04c to
ae42176
Compare
The generated buildkitd.toml used a single gcpolicy carrying keepBytes + keepDuration, and two faults made the 10GB cap a no-op. keepBytes is deprecated and migrates to reservedSpace (a floor, not a ceiling), leaving maxUsedSpace unset, so no ceiling field was ever set. And keepDuration protects not just records used within the window but their entire ancestry, since a parent cannot be deleted while a child still references it. On a constant builder that pins nearly the whole cache, so GC could never bring usage down to the cap: garden grew to 27GB against a 10GB cap and fed the MIR-1279 disk outage. Replace it with a graduated multi-tier policy mirroring BuildKit own default shape: a 48h/512MB tier for cheap-to-rebuild sources, a soft keepDuration tier, and two age-less maxUsedSpace tiers that enforce the ceiling regardless of entry age. Source filters use the OR (array) form; the comma-joined string form ANDs them and matches nothing. Verified on garden: an age-gated prune to 2GB freed 0B while an age-less prune freed 7.74GB. The prune/GC logic is unchanged across BuildKit v0.19 through v0.31, so this is a config fix, not a version bump. MIR-1280
ae42176 to
1690a10
Compare
BuildKit's build cache on miren-garden grew to 27GB against a configured 10GB cap and became a primary contributor to the MIR-1279 disk-pressure outage. The frustrating part: when we pruned it by hand it dropped right back under the limit, so the cache was always collectable. Automatic GC just never did it.
Digging in on the box turned up two compounding faults in the
buildkitd.tomlwe generate. First, no ceiling field was ever actually set. We configuredkeepBytes, which BuildKit v0.19 deprecates and quietly migrates toreservedSpace, and that's a floor ("never prune below this"), not a cap. The real ceiling field,maxUsedSpace, was left at zero.buildctl debug workerson garden confirmed it:{reservedSpace:10GB, maxUsedSpace:0, keepDuration:7d}.Second, and this is the subtle one,
keepDurationprotects far more than it looks. It shields any record used within the window, but because a parent layer can't be deleted while a child still references it, it transitively pins the entire ancestry of anything touched recently. On a builder that constantly rebuilds the same family of base images, that's nearly the whole cache. We measured it on garden: 73 of 73 records older than 7 days were pinned by a younger descendant, zero were freely deletable. So GC ran faithfully after every build and only ever reclaimed the thin tail that had just aged out, while the cap never bound and the pile climbed.We proved the mechanism live with a controlled prune: same 2GB target,
--keep-duration 168hfreed 0 bytes while dropping the age gate freed 7.74GB. And we confirmed this isn't something a version bump fixes: the prune logic is byte-identical from v0.19 (what we run) through v0.31.The fix replaces the single age-gated policy with a graduated set that mirrors BuildKit's own default shape: a 48h/512MB tier for the cheapest-to-rebuild sources, a soft
keepDurationtier, and then two age-lessmaxUsedSpacetiers that enforce the ceiling regardless of age (the age-less pass drops fresh leaves so their pinned ancestors can finally go). Source filters use the array (OR) form, since a single comma-joined string ANDs the types and matches nothing.Follow-ups filed while we were in here: MIR-1286 (the
miren-ociregistry has no retention and has ballooned to 236GB) and MIR-1287 (get BuildKit off the release candidate and align the daemon/client/test versions).Closes MIR-1280