Skip to content

Allow table-diff/update to run while mtree listen is active#131

Merged
mason-sharp merged 2 commits into
mainfrom
fix/ACE-199/listen
Jul 1, 2026
Merged

Allow table-diff/update to run while mtree listen is active#131
mason-sharp merged 2 commits into
mainfrom
fix/ACE-199/listen

Conversation

@mason-sharp

Copy link
Copy Markdown
Member

Previously mtree table-diff and mtree update failed with SQLSTATE 55006 when mtree listen was running: both consumed the same logical replication slot, and PostgreSQL allows only one active consumer per slot.

Now, when a node's slot is already held by a running listener, the bounded CDC drain returns a sentinel error and UpdateMtree skips that node's drain (best-effort), comparing against the tree listen keeps warm, instead of aborting. Nodes without an active listener still get a full, guaranteed drain.

  • cdc: add ErrSlotBusy sentinel + 55006 (object in use) classifier; the bounded drain detects a busy slot via an up-front active-pid check plus a StartReplication backstop
  • queries: extract a reusable GetActiveSlotPID helper
  • mtree: skip the CDC drain per-node on a busy slot with a warning, and surface cdc_skipped_nodes on both the update and table-diff paths
  • tests: integration test asserting diff/update succeed (and that the skip actually occurs) while listen holds the slot
  • docs: document the concurrent behavior and best-effort freshness caveat

…ve (ACE-199)

Previously `mtree table-diff` and `mtree update` failed with SQLSTATE 55006
when `mtree listen` was running: both consumed the same logical replication
slot, and PostgreSQL allows only one active consumer per slot.

Now, when a node's slot is already held by a running listener, the bounded
CDC drain returns a sentinel error and UpdateMtree skips that node's drain
(best-effort), comparing against the tree listen keeps warm, instead of
aborting. Nodes without an active listener still get a full, guaranteed drain.

- cdc: add ErrSlotBusy sentinel + 55006 classifier; the bounded drain detects
  a busy slot via an up-front active-pid check plus a StartReplication backstop
- queries: extract a reusable GetActiveSlotPID helper
- mtree: skip the CDC drain per-node on a busy slot with a warning, and surface
  cdc_skipped_nodes on both the update and table-diff paths
- tests: integration test asserting diff/update succeed (and that the skip
  actually occurs) while listen holds the slot
- docs: document the concurrent behavior and best-effort freshness caveat
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@mason-sharp, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 35 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e682f67-c215-41e8-bab9-263fa4e88db2

📥 Commits

Reviewing files that changed from the base of the PR and between 48abdaf and 3b006d6.

📒 Files selected for processing (4)
  • docs/commands/mtree/index.md
  • docs/commands/mtree/mtree-table-diff.md
  • internal/consistency/mtree/merkle.go
  • pkg/types/types.go
📝 Walkthrough

Walkthrough

Adds a GetActiveSlotPID query helper used by DropReplicationSlot and a new pre-check in CDC listen startup. Introduces ErrSlotBusy detection for SQLSTATE 55006. MerkleTreeTask now tracks CDCSkippedNodes when a node's replication slot is busy during updates, surfacing them in diffs. Docs and tests updated accordingly.

Changes

Slot Busy Detection and CDC-Skip Diff Behavior

Layer / File(s) Summary
Active slot PID lookup
db/queries/queries.go
Adds GetActiveSlotPID to query the active consumer PID for a replication slot, returning nil on no rows; DropReplicationSlot refactored to call it.
ErrSlotBusy detection and pre-check
internal/infra/cdc/listen.go, internal/infra/cdc/listen_test.go
Adds ErrSlotBusy sentinel and isSlotBusyErr SQLSTATE-55006 detection, wraps StartReplication failures, adds an up-front slot-PID check before bounded drains, and covers detection logic with unit tests.
Merkle task skip tracking
internal/consistency/mtree/merkle.go
Adds CDCSkippedNodes field; UpdateMtree marks nodes skipped on ErrSlotBusy instead of failing and aggregates skipped nodes into result context; DiffMtree surfaces stored skipped nodes.
Docs and concurrency integration test
docs/commands/mtree/index.md, docs/commands/mtree/mtree-listen.md, docs/commands/mtree/mtree-table-diff.md, tests/integration/merkle_tree_test.go
Documents concurrent listen/diff/update behavior and adds a DiffWhileListenActive integration test asserting successful diff/update with skipped nodes recorded while listen holds the slot.

> Twitch of nose, a slot held tight,
Listen hums on through the night.
Diff hops round the busy gate,
Skips one node, won't sit and wait. 🐇
Warnings squeak, but nothing breaks —
Merkle trees still count the stakes! 🌳✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main change: allowing table-diff and update to run while mtree listen is active.
Description check ✅ Passed The description accurately summarizes the slot-busy handling, mtree behavior, tests, and docs changes in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/ACE-199/listen

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codacy-production

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 21 complexity · 0 duplication

Metric Results
Complexity 21
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/consistency/mtree/merkle.go (1)

1648-1657: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Reset CDCSkippedNodes on every update run.

If one call hits the busy-slot path, a later UpdateMtree on the same MerkleTreeTask with no skips leaves m.CDCSkippedNodes unchanged. Line 2036 then reuses that stale value, so a later diff can report cdc_skipped_nodes even after a fully drained run. Clear the field at the start of UpdateMtree, or assign it unconditionally from skippedNodes (including nil).

Proposed fix
 func (m *MerkleTreeTask) UpdateMtree(skipAllChecks bool) (err error) { //nolint:gocyclo // orchestrates per-node CDC drain + tree rehash/rebalance; cyclomatic complexity is inherent to the staged flow
 	resultCtx := map[string]any{
 		"rebalance":       m.Rebalance,
 		"skip_all_checks": skipAllChecks,
 	}
+	m.CDCSkippedNodes = nil
@@
-		if len(skippedNodes) > 0 {
-			resultCtx["cdc_skipped_nodes"] = skippedNodes
-			m.CDCSkippedNodes = skippedNodes
-		}
+		m.CDCSkippedNodes = skippedNodes
+		if len(skippedNodes) > 0 {
+			resultCtx["cdc_skipped_nodes"] = skippedNodes
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/consistency/mtree/merkle.go` around lines 1648 - 1657, Reset the
stale CDC skip state in UpdateMtree: `MerkleTreeTask.CDCSkippedNodes` is only
set when `skippedNodes` is non-empty, so a later run can reuse the previous
value. Update the `UpdateMtree` flow in `merkle.go` so `CDCSkippedNodes` is
cleared at the start of each run or assigned unconditionally from `skippedNodes`
(including nil), and keep the `resultCtx["cdc_skipped_nodes"]` population in
sync with that field.
tests/integration/merkle_tree_test.go (1)

809-819: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the skip marker before the second update call.

Right now this still passes if DiffMtree() stops surfacing the busy-slot path and only the explicit UpdateMtree(true) sets CDCSkippedNodes. Check it immediately after Line 809, then clear it before Line 813 if you want to verify both paths independently.

Suggested test tightening
 	err = mtreeTask.DiffMtree()
 	require.NoError(t, err, "table-diff should succeed while listen is active")
+	require.NotEmpty(t, mtreeTask.CDCSkippedNodes,
+		"expected table-diff to record a skipped CDC node while listen holds the slot")
+
+	mtreeTask.CDCSkippedNodes = nil
 
 	// update must likewise not fail with a slot-active error.
 	require.NoError(t, mtreeTask.UpdateMtree(true),
 		"update should succeed while listen is active")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/merkle_tree_test.go` around lines 809 - 819, In the merkle
tree integration test, the busy-slot skip assertion is too late and can be
satisfied by UpdateMtree(true) alone. In the test around DiffMtree, UpdateMtree,
and CDCSkippedNodes, assert the skip marker immediately after DiffMtree()
succeeds, then clear CDCSkippedNodes before calling UpdateMtree(true) if you
want to verify both paths independently; this keeps the test tied to the
DiffMtree busy-slot path instead of the later update call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/infra/cdc/listen.go`:
- Around line 229-250: The active-slot check in listen startup is over-broad: it
returns ErrSlotBusy for any consumer, which lets non-listener flows like bounded
update/table-diff skip incorrectly. Update the logic in listen.go around the
GetActiveSlotPID path and the StartReplication/ isSlotBusyErr handling so only
the actual mtree listen ownership case maps to ErrSlotBusy. Keep generic
busy-slot or PID-occupied failures fatal, or introduce a listener-specific
marker before returning the skip sentinel.

---

Nitpick comments:
In `@internal/consistency/mtree/merkle.go`:
- Around line 1648-1657: Reset the stale CDC skip state in UpdateMtree:
`MerkleTreeTask.CDCSkippedNodes` is only set when `skippedNodes` is non-empty,
so a later run can reuse the previous value. Update the `UpdateMtree` flow in
`merkle.go` so `CDCSkippedNodes` is cleared at the start of each run or assigned
unconditionally from `skippedNodes` (including nil), and keep the
`resultCtx["cdc_skipped_nodes"]` population in sync with that field.

In `@tests/integration/merkle_tree_test.go`:
- Around line 809-819: In the merkle tree integration test, the busy-slot skip
assertion is too late and can be satisfied by UpdateMtree(true) alone. In the
test around DiffMtree, UpdateMtree, and CDCSkippedNodes, assert the skip marker
immediately after DiffMtree() succeeds, then clear CDCSkippedNodes before
calling UpdateMtree(true) if you want to verify both paths independently; this
keeps the test tied to the DiffMtree busy-slot path instead of the later update
call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a5bbc15-2d5c-4955-8f49-d74dc954a922

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc2f4c and 48abdaf.

📒 Files selected for processing (8)
  • db/queries/queries.go
  • docs/commands/mtree/index.md
  • docs/commands/mtree/mtree-listen.md
  • docs/commands/mtree/mtree-table-diff.md
  • internal/consistency/mtree/merkle.go
  • internal/infra/cdc/listen.go
  • internal/infra/cdc/listen_test.go
  • tests/integration/merkle_tree_test.go

Comment thread internal/infra/cdc/listen.go
A busy slot can be held by any consumer (including a concurrent bounded
mtree op sharing the node's slot), not only mtree listen. Make the skip
visible instead of implying a guaranteed comparison:

- add cdc_skipped_nodes to DiffSummary so it appears in the diff report
- warn at end of diff when nodes were skipped (fires even if trees match)
- reword the per-node warning to not assert the holder is listen and to
  note divergence can be under-reported
- docs: shared per-node slot, concurrent-op trigger, best-effort caveat
@danolivo danolivo self-requested a review July 1, 2026 10:45
@mason-sharp mason-sharp merged commit fc4af60 into main Jul 1, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants