Skip to content

Enforce BuildKit's GC byte cap with age-less policy tiers#879

Open
phinze wants to merge 1 commit into
mainfrom
phinze/mir-1280-buildkit-gc-not-enforcing-its-keepbytes-cap-cache-grew-to
Open

Enforce BuildKit's GC byte cap with age-less policy tiers#879
phinze wants to merge 1 commit into
mainfrom
phinze/mir-1280-buildkit-gc-not-enforcing-its-keepbytes-cap-cache-grew-to

Conversation

@phinze

@phinze phinze commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

BuildKit's build cache on miren-garden grew to 27GB against a configured 10GB cap and became a primary contributor to the MIR-1279 disk-pressure outage. The frustrating part: when we pruned it by hand it dropped right back under the limit, so the cache was always collectable. Automatic GC just never did it.

Digging in on the box turned up two compounding faults in the buildkitd.toml we generate. First, no ceiling field was ever actually set. We configured keepBytes, which BuildKit v0.19 deprecates and quietly migrates to reservedSpace, and that's a floor ("never prune below this"), not a cap. The real ceiling field, maxUsedSpace, was left at zero. buildctl debug workers on garden confirmed it: {reservedSpace:10GB, maxUsedSpace:0, keepDuration:7d}.

Second, and this is the subtle one, keepDuration protects far more than it looks. It shields any record used within the window, but because a parent layer can't be deleted while a child still references it, it transitively pins the entire ancestry of anything touched recently. On a builder that constantly rebuilds the same family of base images, that's nearly the whole cache. We measured it on garden: 73 of 73 records older than 7 days were pinned by a younger descendant, zero were freely deletable. So GC ran faithfully after every build and only ever reclaimed the thin tail that had just aged out, while the cap never bound and the pile climbed.

We proved the mechanism live with a controlled prune: same 2GB target, --keep-duration 168h freed 0 bytes while dropping the age gate freed 7.74GB. And we confirmed this isn't something a version bump fixes: the prune logic is byte-identical from v0.19 (what we run) through v0.31.

The fix replaces the single age-gated policy with a graduated set that mirrors BuildKit's own default shape: a 48h/512MB tier for the cheapest-to-rebuild sources, a soft keepDuration tier, and then two age-less maxUsedSpace tiers that enforce the ceiling regardless of age (the age-less pass drops fresh leaves so their pinned ancestors can finally go). Source filters use the array (OR) form, since a single comma-joined string ANDs the types and matches nothing.

Follow-ups filed while we were in here: MIR-1286 (the miren-oci registry has no retention and has ballooned to 236GB) and MIR-1287 (get BuildKit off the release candidate and align the daemon/client/test versions).

Closes MIR-1280

@phinze phinze requested a review from a team as a code owner July 1, 2026 16:30
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6bf3daa3-c314-4da3-ad35-aec4175bda98

📥 Commits

Reviewing files that changed from the base of the PR and between ae42176 and 1690a10.

📒 Files selected for processing (2)
  • components/buildkit/buildkit.go
  • components/buildkit/config_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/buildkit/config_test.go

📝 Walkthrough

Walkthrough

Changes

This change updates the BuildKit daemon configuration template to replace a single GC policy using keepBytes/keepDuration with a multi-tier [[worker.oci.gcpolicy]] structure using maxUsedSpace and keepDuration, including a final all = true tier. The registry layout in the template is also adjusted. The test file now parses the generated config into individual GC policy blocks and checks tier count, maxUsedSpace usage, removal of keepBytes, and the updated source filter syntax.

Estimated code review effort: High

Suggested labels: buildkit, configuration, tests

Suggested reviewers: None identified

Sequence Diagram(s)

Not applicable.

Poem:
A rabbit dug through cache and stone,
Found tiers of space it called its own.


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
components/buildkit/config_test.go (2)

38-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the final all = true cap tier too.

Line 54 passes even if the last-resort all = true policy is removed, leaving internal/frontend cache outside the enforced ceiling the PR intends to protect.

Suggested test tightening
 		r.GreaterOrEqual(ageless, 1, "at least one tier must enforce maxUsedSpace with no keepDuration")
+
+		last := blocks[len(blocks)-1]
+		r.Contains(last, "all = true", "last tier should include all cache types")
+		r.Contains(last, "maxUsedSpace = 10737418240", "last tier should enforce the configured byte cap")
+		r.NotContains(last, "keepDuration", "last tier must not be age-gated")
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/buildkit/config_test.go` around lines 38 - 55, The gc policy test
in the “enforces the byte cap with an age-less tier” case only checks for an
ageless maxUsedSpace tier, so it can still pass if the final all=true cap tier
is missing. Tighten the assertions around gcPolicyBlocks(config) to also verify
the presence of the last-resort all=true policy in addition to the existing
maxUsedSpace/keepDuration check, using the same block scan logic in this test.

63-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Cover the full source filter contract.

This currently protects source.local and exec.cachemount, but dropping source.git.checkout would still pass despite being part of the cheap-to-rebuild tier.

Suggested assertion
 		// A single comma-joined filter string is AND'd and matches no record.
 		r.NotContains(config, "type==source.local,type==exec.cachemount")
-		r.Contains(config, `"type==source.local"`)
-		r.Contains(config, `"type==exec.cachemount"`)
+		r.Contains(config, `filters = [ "type==source.local", "type==exec.cachemount", "type==source.git.checkout" ]`)
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/buildkit/config_test.go` around lines 63 - 68, The current test in
config_test.go only verifies the source filter OR-array form for source.local
and exec.cachemount, so it still misses source.git.checkout. Update the existing
test around the config assertions to also check that source.git.checkout is
included as its own array entry in the cheap-to-rebuild source filter contract,
using the same config string assertions in the test case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@components/buildkit/config_test.go`:
- Around line 38-55: The gc policy test in the “enforces the byte cap with an
age-less tier” case only checks for an ageless maxUsedSpace tier, so it can
still pass if the final all=true cap tier is missing. Tighten the assertions
around gcPolicyBlocks(config) to also verify the presence of the last-resort
all=true policy in addition to the existing maxUsedSpace/keepDuration check,
using the same block scan logic in this test.
- Around line 63-68: The current test in config_test.go only verifies the source
filter OR-array form for source.local and exec.cachemount, so it still misses
source.git.checkout. Update the existing test around the config assertions to
also check that source.git.checkout is included as its own array entry in the
cheap-to-rebuild source filter contract, using the same config string assertions
in the test case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 549f7f22-9da1-46f6-b5ef-768e401bea05

📥 Commits

Reviewing files that changed from the base of the PR and between a5ae843 and d74b04c.

📒 Files selected for processing (2)
  • components/buildkit/buildkit.go
  • components/buildkit/config_test.go

@phinze phinze force-pushed the phinze/mir-1280-buildkit-gc-not-enforcing-its-keepbytes-cap-cache-grew-to branch from d74b04c to ae42176 Compare July 1, 2026 16:42
The generated buildkitd.toml used a single gcpolicy carrying keepBytes
+ keepDuration, and two faults made the 10GB cap a no-op. keepBytes is
deprecated and migrates to reservedSpace (a floor, not a ceiling),
leaving maxUsedSpace unset, so no ceiling field was ever set. And
keepDuration protects not just records used within the window but their
entire ancestry, since a parent cannot be deleted while a child still
references it. On a constant builder that pins nearly the whole cache,
so GC could never bring usage down to the cap: garden grew to 27GB
against a 10GB cap and fed the MIR-1279 disk outage.

Replace it with a graduated multi-tier policy mirroring BuildKit own
default shape: a 48h/512MB tier for cheap-to-rebuild sources, a soft
keepDuration tier, and two age-less maxUsedSpace tiers that enforce the
ceiling regardless of entry age. Source filters use the OR (array)
form; the comma-joined string form ANDs them and matches nothing.

Verified on garden: an age-gated prune to 2GB freed 0B while an
age-less prune freed 7.74GB. The prune/GC logic is unchanged across
BuildKit v0.19 through v0.31, so this is a config fix, not a version
bump.

MIR-1280
@phinze phinze force-pushed the phinze/mir-1280-buildkit-gc-not-enforcing-its-keepbytes-cap-cache-grew-to branch from ae42176 to 1690a10 Compare July 1, 2026 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant