fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286) by lmoresi · Pull Request #313 · underworldcode/underworld3

lmoresi · 2026-07-04T03:26:27Z

Track-0 remediation (2026-07 quality audit, docs/reviews/2026-07/REMEDIATION-WORKLIST.md):
BF-02 (SWARM-01/02/17), BF-05 (SWARM-04), BF-06 (SWARM-07), BF-07 (SWARM-03),
BF-08 (LE-03 = SWARM-05, issue #215 Bug 3), plus the freshly diagnosed #286.

Closes #289
Closes #286

Defects and fixes

BF-02 — stale caches after no-move migrate / particle addition / populate (SWARM-01/02/17)

Swarm.migrate() early-returned when no particle was globally unclaimed without
invalidating the canonical .data caches, the particle kd-tree, or the proxy
staleness flags. Reproduced (serial): after add_particles_with_global_coordinates
the DMSwarm held 974 particles while every cached .data still had 972 rows; after
an in-place coordinate mirror + migrate(), rbf_interpolate returned garbage from
the poisoned no-copy kd-tree; after populate() a pre-existing cache stayed at 0
rows. This is also the root cause of issue #289: swarm.advection() ends with a
no-move migrate() in serial, so _proxy_stale stayed False and the material
proxy froze at the launch positions.

Fix: invalidate before the early return; explicit invalidation in
add_particles_with_global_coordinates (mirroring the #216 fix in
add_particles_with_coordinates) and in populate() (SWARM-17 verified at
runtime, folded in).

Collateral defect exposed by this fix (same #216 stale-cache class):
Swarm.apply_snapshot_payload() wrote restored variable data into a detached
np.asarray view of the canonical cache — the DMSwarm field never received the
restored values, and the restore only "worked" because the SWARM-01 bug preserved
the stale canonical copy. With the invalidation fixed, test_0007_snapshot_inmemory
continuation tests caught it (restored material became zeros after the first
advection). Fixed by writing through the canonical array so the PETSc pack fires.

#286 — AMR: kd-tree rebuilt against stale `_nav_coords`

In nuke_coords_and_rebuild() the point-location kd-tree was rebuilt before
self._nav_coords was refreshed, so the rebuild indexed an old-sized coordinate
array with new-mesh point ranges: IndexError: index 91 is out of bounds for axis 0 with size 80 (reproduced via tests/test_0810_amr_swarm_migration_regression.py).
Fix: refresh _nav_coords (volume and manifold branches) before
_build_kd_tree_index().

BF-05 — writes inside `migration_disabled()` silently discarded (SWARM-04)

Both PETSc-sync callbacks early-returned while _migration_disabled was set and
nothing ever re-packed; migrate() then read the stale PETSc coordinates and its
trailing invalidation destroyed the only copy of the writes. Reproduced (serial):
after writing 42.0 inside the context, the PETSc field still held 0.0.
Fix: separate "suppress migration" from "suppress PETSc sync" — writes are recorded
and flushed to the DMSwarm by Swarm._flush_pending_petsc_sync() at context exit
(and defensively at migrate() entry); only the migration itself is deferred
(default) or skipped (disable=True). The vacuous assertions in
tests/parallel/test_0755_swarm_global_stats.py (the perturbation never reached the
DMSwarm) now assert that it does.

BF-06 — empty/starved ranks: silent zeros or hard crash (SWARM-07)

rbf_interpolate silently returned zeros for ranks with <= 1 particles and
_rbf_to_meshVar wrote them into the proxy; IndexSwarmVariable's proxy update
reached unguarded KDTree construction on a 0-particle rank —
IndexError: Out of bounds on buffer access (reproduced) inside a collective
update, i.e. an MPI abort/hang.
Fix: starved ranks leave their proxy nodal values untouched and warn (never silent
zeros); guard the KDTree/nearest-neighbour machinery on local_size <= 1 while
keeping every rank inside the same sequence of collective reads/writes/contexts
(MeshVariable syncs are collective — a naive early-return would deadlock).
IndexSwarmVariable gains a proper _update_proxy_if_stale() override (its
proxies live in _meshLevelSetVars, so the base implementation was a no-op for it).

The update_type=0 level-set projection was also restructured to compute into a
local buffer and issue exactly ONE MeshVariable write per level set per rank
(serially bit-identical): the previous formulation issued a data-dependent number
of writes whose deferred collective syncs mismatched across ranks — the np4
starved-rank test deadlocked on it (each write's ghost sync is collective).

BF-07 — modern coordinate writes never migrate (SWARM-03)

The class docstring promised automatic migration, but swarm.coords /
swarm._particle_coordinates.data writes only packed to PETSc; only the
deprecated points path migrated. Fix (maintainer-approved shape): coordinate
writes mark swarm._needs_migration; the collective migrate() runs DEFERRED at
the next collective point — migration-control context exit or solve entry — never
per-write (uneven writes would deadlock). The flag is combined across ranks with a
MAX reduction before acting, so uneven writes are safe. Docstrings updated to state
the actual contract.

BF-08 — stale proxy consumed by solvers (LE-03 = SWARM-05, issue #215 Bug 3)

Proxy refresh only fired via the lazy .sym accessor, but solvers pull the proxy
DM directly through mesh.update_lvec(). Reproduced (serial): write
material.data, re-solve a captured Projection — the solve consumed the old
values. Fix (maintainer-approved shape): a single eager, collective refresh at
solve entry — Swarm._sync_before_assembly() invoked from Mesh.update_lvec()
before its staleness check (the refresh itself is what sets _stale_lvec). It
also performs BF-07's deferred migration. Flag-guarded: repeated calls are no-ops,
so nothing is re-interpolated per access (the test_0006 memory-leak constraint).
The TODO(BUG) at the old swarm.py:1075 is resolved and removed.
Together with BF-02 this closes #289 (verified with the issue reproducer: the
proxy centroid now tracks the advecting particle centroid).

Tests

tests/test_0113_swarm_stale_cache_regression.py (new, level_1/tier_a):
SWARM-01 (both migrate= variants), SWARM-17, SWARM-02 (kd-tree invalidation
contract + behavioral mirror probe), the swarm.advection() leaves proxy mesh variables stale (frozen material) — auto-update hook orphaned in deprecated access path #289 reproducer as a regression test,
and the [BUG] - Three stale-cache issues after swarm particle addition #215-Bug-3 solve-freshness test.
tests/parallel/test_0756_swarm_migration_semantics.py (new): np2
migration-suppressed write survival, np2 deferred migration at context exit and
at solve entry, np4 starved-rank proxy test (all particles on one rank).
tests/parallel/test_0755_swarm_global_stats.py: repaired the vacuous
perturbation block.
tests/test_0810_amr_swarm_migration_regression.py (AMR: swarm migration after mesh.adapt() raises IndexError -- k-d tree rebuilt before nav_coords refresh #286) now passes.

Gate results

All runs from the fix worktree, env amr-dev (AMR-enabled PETSc build, needed
for the #286 test), after ./uw build.

Serial level_1 and tier_a: BEFORE = 295 passed, 6 failed -- the 6 failures
are exactly the new regression tests, each demonstrating its bug on the pristine
baseline. AFTER = 301 passed, 0 failed (1 skipped, 4 xfailed, 1 xpassed --
unchanged from baseline).
np2 (mpirun -np 2, test_0755 + test_0756 + test_0760 + test_0765):
19 passed, 1 skipped (the np4-only test).
np4 (mpirun -np 4, test_0755 + test_0756 + test_0765): 17 passed.
test_0006_memory_leak: passed (the BF-08 gate -- the solve-entry refresh is
flag-guarded, nothing re-interpolates per access).
test_0810_amr_swarm_migration_regression (AMR: swarm migration after mesh.adapt() raises IndexError -- k-d tree rebuilt before nav_coords refresh #286): failed with the reported
IndexError before the fix; passes after.
Issue swarm.advection() leaves proxy mesh variables stale (frozen material) — auto-update hook orphaned in deprecated access path #289 reproducer: particle centroid 0.1068 -> 0.1268, proxy centroid
0.1071 -> 0.1270 (was frozen at 0.1071); also enshrined as a regression test.

Pre-existing np4 defect found while gating (NOT addressed here):
tests/parallel/test_0760_swarm_cache_migration.py hangs at mpirun -np 4 on the
pristine baseline (verified by stashing this branch's src changes and
rebuilding): with rank-biased coordinates, global_evaluate leaves some ranks with
zero interior points, and those ranks diverge inside the DMInterpolation
machinery in petsc_interpolate while the others block in its collectives. It
passes at np2 (CI runs --p 2). Per-rank stack traces confirm all of this branch's
collective hooks enter and exit symmetrically before the hang. This should be
tracked as its own issue (parallel-evaluation empty-rank asymmetry, related to the
audit's SWARM-15/16 provisional findings).

Underworld development team with AI support from Claude Code

…#286) In nuke_coords_and_rebuild() the point-location kd-tree was rebuilt before self._nav_coords was refreshed from the rebuilt DM, so on an adapted mesh the rebuild indexed an old-sized coordinate array with new-mesh point ranges. When the mesh grew this raised IndexError (index 91 out of bounds for axis 0 with size 80) from swarm migration after mesh.adapt(); when it shrank it silently mislocated points. The navigation-coordinate refresh (both the volume-mesh and manifold branches) now runs before _build_kd_tree_index(). Reproduced by tests/test_0810_amr_swarm_migration_regression.py, which now passes. Underworld development team with AI support from Claude Code

…tion, and populate (BF-02) Swarm.migrate() early-returned when no particle was globally unclaimed without invalidating the canonical .data caches, the particle kd-tree, or the proxy staleness flags (SWARM-01/SWARM-02, 2026-07 audit). In serial or whenever no particle changes rank - the common case - every cached array kept the old particle count and the kd-tree kept a poisoned no-copy view of mutated coordinates. This is the root cause of issue #289: advection ends with a no-move migrate, so material proxies froze at the launch positions and time-stepped models silently used stationary material. The early return now performs the same invalidation as the fall-through path. add_particles_with_global_coordinates() gains an explicit invalidation (mirroring the #216 fix in add_particles_with_coordinates - with migrate=False nothing else invalidates), and populate() invalidates caches created before the swarm had particles (SWARM-17, verified at runtime). Collateral defect exposed by this fix, same #216 stale-cache class: Swarm.apply_snapshot_payload() wrote restored variable data into a detached np.asarray view of the canonical cache, so the DMSwarm field never received the restored values - the restore only worked because the stale canonical copy survived. It now writes through the canonical array so the PETSc pack fires (caught by test_0007_snapshot_inmemory). Regression tests: tests/test_0113_swarm_stale_cache_regression.py (SWARM-01 both migrate= variants, SWARM-17, SWARM-02 kd-tree contract and behavioural mirror probe, and the issue #289 reproducer). Underworld development team with AI support from Claude Code

…f discarding them (BF-05) Variable and coordinate writes made inside migration_disabled() / migration_control() were silently discarded (SWARM-04, 2026-07 audit): both PETSc-sync callbacks early-returned while _migration_disabled was set, nothing ever re-packed the canonical arrays, migrate() then read the stale PETSc coordinates, and its trailing invalidation destroyed the only copy of the writes. The docstrings explicitly recommended writing inside the context, and the parallel test covering the pattern asserted vacuously. Suppressing migration now defers only the PETSc pack, never the data: writes made while the flag is set are recorded per variable and flushed into the DMSwarm by Swarm._flush_pending_petsc_sync() at context exit (and defensively at migrate() entry). Both context-manager modes flush; only the migrate() call itself is deferred (default) or skipped (disable=True). Docstrings updated to state the actual contract. This commit also introduces Swarm._sync_before_assembly() (adjacent to the flush helper): the collective solve-entry synchronisation used by the deferred-migration and stale-proxy fixes. It is wired into the mesh assembly path in the follow-up commit. tests/parallel/test_0755_swarm_global_stats.py's perturbation block now asserts the write actually reaches the DMSwarm (previously vacuous), and tests/parallel/test_0756_swarm_migration_semantics.py (added in the follow-up commit) covers write survival at np2. Underworld development team with AI support from Claude Code

…or zeroing (BF-06) Ranks holding <= 1 particles either hard-crashed or silently corrupted their proxy mesh variables (SWARM-07, 2026-07 audit): rbf_interpolate() returned silent zeros that _rbf_to_meshVar wrote straight into the proxy nodal values, and IndexSwarmVariable's proxy update reached unguarded KDTree construction on an empty coordinate array, raising IndexError inside a collective update - an MPI abort or hang. Starved ranks now leave their proxy nodal values untouched and warn (warnings are suppressed for a swarm that has never been populated, since proxied variables legitimately touch this path at creation time). The guards are written to be collective-safe: MeshVariable reads/writes perform collective ghost synchronisation, so every rank executes the same read-then-write sequence and only the values differ per rank. The IndexSwarmVariable update_type=0 projection is restructured to compute into a local buffer and issue exactly ONE MeshVariable write per level set per rank (serially bit-identical to the old formulation): the previous in-context version issued a data-dependent number of writes whose deferred collective syncs mismatched across ranks - the np4 starved-rank test deadlocked on it. IndexSwarmVariable also gains a real _update_proxy_if_stale() override (its proxies live in _meshLevelSetVars, so the base implementation was a no-op for it) and routes .sym through it; this is also what the solve-entry refresh uses. np4 regression test (all particles seeded on one rank, sentinel survives on starved ranks, no crash) lands with the parallel test file in the follow-up commit. Underworld development team with AI support from Claude Code

…refresh at solve entry (BF-07, BF-08) BF-07 (SWARM-03, 2026-07 audit): the Swarm docstring promised automatic migration, but coordinate writes through the modern interface (swarm.coords setter / swarm._particle_coordinates.data) only packed to PETSc - only the deprecated points path migrated, leaving particles on the wrong rank in parallel. Coordinate writes now mark swarm._needs_migration; the collective migrate() runs DEFERRED at the next collective point - migration-control context exit or solve entry - never per-write, which would deadlock when ranks write unevenly (maintainer-approved fix shape). advection() suspends the deferred migration around its substep loop (a migrate firing from its velocity evaluations would reorder particle rows between captured arrays) and runs its own migrate() at the end, which clears the suspension. BF-08 (LE-03 = SWARM-05, issue #215 Bug 3): proxy refresh fired only via the lazy .sym accessor, but solvers pull the proxy DM directly through mesh.update_lvec(), consuming stale data after a material.data write. Mesh.update_lvec() now runs Swarm._sync_before_assembly() for each registered swarm before its staleness check: a single eager, collective refresh at solve entry (the refresh itself is what sets _stale_lvec). Rank-local flags are combined with MAX reductions so uneven writes are safe; repeated calls are flag-guarded no-ops, preserving the test_0006_memory_leak constraint. The refresh calls inside petsc_interpolate pass swarm_sync=False - only a subset of ranks reaches that function (zero-interior-point ranks skip it), and the hook's reductions must not run on a subset; those sites already rely on the all-ranks update_lvec() in evaluate() for freshness. The resolved TODO(BUG) at the old swarm.py:1075 is removed. Together with the BF-02 invalidation fix this closes issue #289: the issue reproducer (regression test in test_0113) shows the material proxy tracking the advected particles, and a Projection re-solve consumes fresh particle data without touching .sym. Adds tests/parallel/test_0756_swarm_migration_semantics.py: np2 migration-suppressed write survival, np2 deferred migration at context exit and at solve entry, np4 starved-rank proxy behaviour. Underworld development team with AI support from Claude Code

Copilot

Pull request overview

This PR implements Track-0 remediation for Underworld3’s swarm migration / cache invalidation / proxy freshness semantics, and fixes an AMR point-location regression (#286) by reordering cache rebuilds. It also adds/repairs regression tests to prevent stale-cache, stale-proxy, and starved-rank MPI failure modes from reappearing.

Changes:

Fix swarm cache invalidation and deferred-migration/proxy-refresh semantics (including solve-entry swarm sync) to prevent stale .data caches, stale kd-trees, and stale proxy mesh variables.
Fix AMR mesh rebuild ordering so _nav_coords is refreshed before rebuilding the navigation kd-tree index (#286).
Add new serial + MPI regression tests covering the Track-0 defects and repair a previously vacuous parallel test assertion.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_0113_swarm_stale_cache_regression.py	Adds serial regression coverage for stale-cache, kd-tree invalidation, proxy freshness through advection, and solve-entry proxy consumption.
tests/parallel/test_0756_swarm_migration_semantics.py	Adds MPI regression coverage for migration-control semantics, deferred migration, and starved-rank proxy safety.
tests/parallel/test_0755_swarm_global_stats.py	Strengthens an MPI test to ensure coordinate writes inside `migration_disabled()` actually reach PETSc (non-vacuous).
src/underworld3/swarm.py	Implements deferred PETSc sync flushing, deferred migration flagging, solve-entry proxy refresh, starved-rank guards, cache invalidation fixes, and snapshot restore correctness.
src/underworld3/function/_function.pyx	Avoids running swarm-sync logic from `petsc_interpolate()` on only a subset of ranks by adding `swarm_sync=False` for those paths.
src/underworld3/discretisation/discretisation_mesh.py	Refreshes `_nav_coords` before rebuilding kd-tree index and adds a `swarm_sync` control path to `update_lvec()` to prevent subset-rank deadlocks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            # manifold mesh: the nav clone carries its own (ghosted) coords;
+            # refresh them from the rebuilt main DM where possible.
+            try:
+                self._nav_dm.setCoordinatesLocal(self.dm.getCoordinatesLocal())
+                self._nav_coords = numpy.asarray(
+                    self._nav_dm.getCoordinatesLocal().array
+                ).reshape(-1, self.cdim)
+            except Exception:
+                pass


        elif self.update_type == 1:
+            # NOTE: this branch performs data-dependent MeshVariable writes
+            # outside any access context, which is not parallel-safe
+            # independently of the starved-rank issue (pre-existing).
+            # The guard here only prevents the empty-rank KDTree crash.
+            if starved:
+                return


+        # Suspend the deferred (solve-entry) migration for the duration of
+        # the substep loop: the velocity evaluations below pass through
+        # Mesh.update_lvec(), and a migrate() firing there would reorder
+        # particle rows between the coordinate array and the velocity array
+        # captured from it. advection() performs its own migrate() at the end.
+        self._deferred_migration_suspended = True
+


lmoresi added 5 commits July 4, 2026 13:23

Copilot AI review requested due to automatic review settings July 4, 2026 03:26

Copilot started reviewing on behalf of lmoresi July 4, 2026 03:26 View session

lmoresi mentioned this pull request Jul 4, 2026

global_evaluate hangs at np4 when a rank has zero interior points (DMInterpolation empty-rank asymmetry) #314

Open

Copilot AI reviewed Jul 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313

fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313
lmoresi wants to merge 5 commits into
developmentfrom
bugfix/track0-swarm-caches

lmoresi commented Jul 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

lmoresi commented Jul 4, 2026

Defects and fixes

BF-02 — stale caches after no-move migrate / particle addition / populate (SWARM-01/02/17)

#286 — AMR: kd-tree rebuilt against stale _nav_coords

BF-05 — writes inside migration_disabled() silently discarded (SWARM-04)

BF-06 — empty/starved ranks: silent zeros or hard crash (SWARM-07)

BF-07 — modern coordinate writes never migrate (SWARM-03)

BF-08 — stale proxy consumed by solvers (LE-03 = SWARM-05, issue #215 Bug 3)

Tests

Gate results

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

#286 — AMR: kd-tree rebuilt against stale `_nav_coords`

BF-05 — writes inside `migration_disabled()` silently discarded (SWARM-04)