Fix mtree table-diff silently reporting a diverged cluster as identical (ACE-190)#130
Conversation
The bounded (non-continuous) CDC drain behind `mtree table-diff` and
`mtree update` treated the 1-second idle receive-timeout as "drain
complete" and returned success. Under proto_version '1' the walsender
streams nothing until a transaction's commit is decoded, so when a node
carries a large backlog (e.g. a bulk-loaded divergence) the drain saw no
data within 1s, declared itself drained, and left the Merkle tree frozen
at its previous state. The subsequent comparison then reported the trees
as identical even though millions of rows differed (ACE-190) -- a silent
under-report that could mislead table-repair.
A real bound already exists: UpdateMtree wraps the per-node drain in
context.WithTimeout(cdc_processing_timeout), and a parent-context
cancellation already returns an error. The 1s "drained" shortcut simply
pre-empted it.
Make the bounded drain honest about completion:
- The 1s idle receive-timeout is now a poll, not completion. The drain
stops only once it has actually received data up to targetFlushLSN
(the call-time snapshot), or a keepalive confirms the decoder reached
it. Otherwise it keeps waiting; cdc_processing_timeout bounds the wait
and turns an unreachable target into a clear, actionable error.
- A reachedTarget flag and a post-loop backstop guarantee a bounded
drain never checkpoints and returns success unless it reached the
snapshot. It now fails loud rather than report a possibly-stale tree
as current.
- MaxBufferedChanges (default 1,000,000) caps in-memory buffering of a
single huge transaction -- proto_version '1' buffers a whole
transaction before its commit -- erroring with a recommendation to
rebuild rather than risking OOM or a silent partial drain.
Continuous mode (ListenForChanges) is unchanged. DiffMtree, the CLI, and
the HTTP handler already propagate the drain error, so the new failure
surfaces end to end.
Validated: the 9.9M-row reproduction now errors with an actionable
"rebuild" message instead of reporting the trees identical; the mtree/CDC
integration suite passes, including `go test -race` on the
drain-and-update tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
Next review available in: 51 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR adds CDC timeout and flush-batch configuration wiring, updates bounded CDC draining to buffer primary keys only and flush synchronously in sub-batches, and tightens bounded-drain completion, timeout, and validation behavior. It also adds documentation and an integration test for oversized transactions. ChangesBounded CDC drain correctness and configuration
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Category | Results |
|---|---|
| Complexity | 6 medium |
🟢 Metrics 9 duplication
Metric Results Duplication 9
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 368-371: The keepalive-only drain path in listen.go marks the
target as reached but leaves lastLSN unchanged, so the final status update can
checkpoint the stale start position. Update the replication state in the branch
that detects ServerWALEnd >= targetFlushLSN so lastLSN advances to the reached
WAL end (or target LSN) before the final standby status/metadata write, using
the existing reachedTarget and stopStreaming flow in the keepalive handling
logic.
- Around line 304-309: The timeout branch in listenForChanges is returning early
and skipping the normal teardown path, which can leave started processChanges
workers running past function exit. Update the timeout handling to set
processingErr and then flow through the common cleanup logic so the existing
wg.Wait() and deferred pool close still run; use the listenForChanges loop and
the processingErr/wg teardown path as the main references.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bc101f1c-c25f-4871-b701-43140588d857
📒 Files selected for processing (1)
internal/infra/cdc/listen.go
|
Attaching a suggested patch Makes the bounded CDC drain behind mtree update / mtree table-diff handle large transactions instead of bailing to a full rebuild.
|
A bounded CDC drain previously buffered a whole transaction in memory until
its commit was decoded (proto_version '1' streams nothing before commit), and
refused to continue once buffering exceeded a hard cap -- erroring with a
recommendation to rebuild the tree. That made a routine large transaction
(e.g. a bulk load) force a full Merkle-tree rebuild, and the advice was wrong:
`ace mtree build` is per-table and leaves the publication's start_lsn and slot
untouched, so it does not advance the drain baseline and the next drain re-hits
the same backlog.
Make the bounded drain absorb large transactions instead:
- Apply changes in sub-batches: once the in-flight transaction has accumulated
FlushBatchSize changes, apply them to the Merkle tree and free the buffer
rather than waiting for the commit. Peak memory is bounded by FlushBatchSize
regardless of transaction size. Configurable via cdc_flush_batch_size
(default 10000); the old MaxBufferedChanges cap is removed.
- Buffer only the primary key of each change, not the whole row tuple, since
the tree update needs only the PK (the row is recomputed from the live table
at hash time). Per-change memory scales with key size, not row width.
- Apply synchronously in bounded mode (continuous mode keeps its worker
goroutines). lastLSN -- which drives the slot's confirmed_flush_lsn and the
metadata checkpoint -- then never advances ahead of durably-applied work, so
a crash mid-drain is recovered by re-streaming the in-flight transaction on
the next run. The only cost is re-applying already-flushed sub-batches,
which transiently over-counts per-block split heuristics (reset on the next
tree update) and never misses a change.
- Send a standby status update before each flush so a slow apply on a busy
node cannot let wal_sender_timeout elapse while the receive loop is blocked.
- Memoize PK type-name lookups per drain so a large multi-flush transaction
does not re-query pg_type on every sub-batch.
Timeout and recovery:
- Raise the cdc_processing_timeout default from 30s to 300s. A timeout now
means "re-run or raise" (drain progress is durable), not "rebuild", so the
budget favours absorbing large backlogs and busy-server slowdowns.
- Add a --cdc-timeout flag to `mtree update` and `mtree table-diff` to override
the drain budget per invocation.
- Reword the drain-timeout error to drop the misleading "rebuild" advice:
re-run `ace mtree update`, or raise cdc_processing_timeout to drain in one
pass.
- Fail loud with an actionable error when a tracked table's REPLICA IDENTITY
exposes no primary key (FULL/NOTHING), instead of panicking on an empty PK.
Docs (configuration, mtree-update, mtree-table-diff, CHANGELOG) updated.
Validated: TestCDCDrainHandlesLargeSingleTransaction drains a single
transaction far larger than FlushBatchSize across many flush batches and
confirms every change lands in the leaf counters; the CDC, mtree-update, and
composite/simple-PK integration tests pass under `go test -race`.
…docstrings - Keepalive-driven drain stop now advances lastLSN to the target before the final standby status update and metadata write. Previously a bounded drain that caught up via a keepalive (the no-change / other-tables-only case) left lastLSN at the start LSN, so the checkpoint regressed to a stale position, the slot never advanced, and every subsequent drain re-scanned the same span (and the server retained that WAL). Safe because the synchronous bounded apply means received == applied, and the branch only fires once the decoder has read past the target, so no tracked change <= target is still pending. - Document all top-level declarations in internal/infra/cdc/listen.go to satisfy the docstring-coverage check. CodeRabbit's other comment -- the drain-timeout branch leaking processChanges workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a bounded drain spawns no worker goroutines, so the timeout early-return has nothing in flight to wait for. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/cli/default_config.yaml`:
- Around line 33-37: The generated default config is missing the new
cdc_flush_batch_size setting, so update the cdc block in the default config
template to include it alongside cdc_processing_timeout and
cdc_metadata_flush_seconds. Use the existing cdc section in the default config
generation path to keep ace config init aligned with the settings supported by
pkg/config/config.go and internal/infra/cdc/listen.go.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 800de308-6e14-4873-b4e9-34dfc614e451
📒 Files selected for processing (11)
ace.yamldocs/CHANGELOG.mddocs/commands/mtree/mtree-table-diff.mddocs/commands/mtree/mtree-update.mddocs/configuration.mdinternal/cli/cli.gointernal/cli/default_config.yamlinternal/consistency/mtree/merkle.gointernal/infra/cdc/listen.gopkg/config/config.gotests/integration/cdc_busy_table_test.go
✅ Files skipped from review due to trivial changes (4)
- docs/commands/mtree/mtree-table-diff.md
- docs/commands/mtree/mtree-update.md
- ace.yaml
- docs/CHANGELOG.md
…nt complexity - tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical SQL-injection finding is a false positive -- the table name is a sanitized constant and the bound is a constant int -- and the sibling query already carries the same marker (matching the suite-wide convention). - listen.go (processReplicationStream) and merkle.go (UpdateMtree): add //nolint:gocyclo. Both functions were already far over the cyclomatic threshold before this PR (106 and 51); the bounded-drain changes only touched them, so Codacy re-reports their pre-existing complexity as new. The branching is inherent to the CDC drain protocol loop and the update orchestration; restructuring validated drain code to satisfy the metric is deferred to a dedicated refactor rather than risked under this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mason's patch added the cdc_flush_batch_size setting (field in pkg/config/config.go, read in internal/infra/cdc/listen.go, documented in docs/configuration.md) but did not list it in the YAML config templates, so `ace config init` emitted a config missing a supported setting. Add it with the code/doc default (10000, matching cdc.FlushBatchSize) to internal/cli/default_config.yaml, and to ace.yaml so the example config stays in sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docstrings - Keepalive-driven drain stop now advances lastLSN to the target before the final standby status update and metadata write. Previously a bounded drain that caught up via a keepalive (the no-change / other-tables-only case) left lastLSN at the start LSN, so the checkpoint regressed to a stale position, the slot never advanced, and every subsequent drain re-scanned the same span (and the server retained that WAL). Safe because the synchronous bounded apply means received == applied, and the branch only fires once the decoder has read past the target, so no tracked change <= target is still pending. - Document all top-level declarations in internal/infra/cdc/listen.go to satisfy the docstring-coverage check. CodeRabbit's other comment -- the drain-timeout branch leaking processChanges workers past wg.Wait() -- is moot under the synchronous bounded-drain apply: a bounded drain spawns no worker goroutines, so the timeout early-return has nothing in flight to wait for. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt complexity - tests/integration/cdc_busy_table_test.go: mark the seeding DELETE in TestCDCDrainHandlesLargeSingleTransaction `// nosemgrep`. Codacy's critical SQL-injection finding is a false positive -- the table name is a sanitized constant and the bound is a constant int -- and the sibling query already carries the same marker (matching the suite-wide convention). - listen.go (processReplicationStream) and merkle.go (UpdateMtree): add //nolint:gocyclo. Both functions were already far over the cyclomatic threshold before this PR (106 and 51); the bounded-drain changes only touched them, so Codacy re-reports their pre-existing complexity as new. The branching is inherent to the CDC drain protocol loop and the update orchestration; restructuring validated drain code to satisfy the metric is deferred to a dedicated refactor rather than risked under this fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
On a cluster where one node has a large, un-replicated divergence (e.g. n1 has ~9.9M rows that n2 does not),
mtree table-diffreportsMerkle trees are identical— zero differences — while plaintable-diffcorrectly finds the divergence. The mtree result is a silent false negative that can misleadtable-repair. Reproduced exactly from the field report: build the tree while both nodes are in sync, bulk-insert ~9.9M rows on one node in a single transaction, then runmtree table-diff→ "No updates needed" → "Merkle trees are identical". Small divergences (≲100k) work; large ones do not.Root cause
mtree table-diff(andmtree update) compare cached Merkle trees, so they first reconcile each node's tree by draining its CDC stream. The bounded (non-continuous) drain ininternal/infra/cdc/listen.gotreated the 1-second idle receive-timeout as "drain complete" and returned success.Under
proto_version '1'the walsender streams nothing until a transaction's commit is decoded. A ~9.9M-row insert is a single transaction (~1.8 GB of WAL) whose decode takes well over a second, during which the consumer receives no data. The drain hit its 1s idle timeout, declared itself drained, and left the tree frozen at its previous state — so the comparison saw two unchanged trees and called them identical. This is why a small backlog (decodes in <1s, streams before the timeout) works and a large single transaction does not.A real bound already existed but never engaged:
UpdateMtreewraps the per-node drain incontext.WithTimeout(cdc_processing_timeout)and a parent-context cancellation already returns an error — the 1s "drained" shortcut simply pre-empted it.The fix
All changes are in
internal/infra/cdc/listen.go, scoped to the bounded (!continuous) drain:targetFlushLSN(the call-time snapshot), or a keepalive confirms the decoder reached it. Otherwise it keeps waiting;cdc_processing_timeoutbounds the wait and turns an unreachable target into a clear, actionable error.reachedTargetflag + post-loop backstop. A bounded drain never checkpoints and returns success unless it actually reached the snapshot. It now fails loud rather than report a possibly-stale tree as current.MaxBufferedChanges(default 1,000,000). Caps in-memory buffering of a single huge transaction (proto_version '1'buffers a whole transaction before its commit), erroring with a recommendation to rebuild rather than risking OOM or a silent partial drain.Continuous mode (
ListenForChanges) is unchanged.DiffMtree, the CLI, and the HTTP handler already propagate the drain error, so the new failure surfaces end to end.Behaviour after the fix
cdc_processing_timeout) → hard error recommendingace mtree build. This is the data-safe trade-off: incremental CDC drain of ~10M rows is the wrong tool; rebuild is.cdc_processing_timeoutnow actually bites — before, the 1s shortcut pre-empted it, so it was effectively dead.The default
mtree table-diffis now either authoritative (drained to the snapshot) or it errors — never a silent under-report. This restores the premise the bounded-drain proposal (--quick/--max-drain) assumed but which was actually broken; that opt-in mode can now layer on a correct default.🤖 Generated with Claude Code