fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313
Open
lmoresi wants to merge 5 commits into
Open
fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313lmoresi wants to merge 5 commits into
lmoresi wants to merge 5 commits into
Conversation
…#286) In nuke_coords_and_rebuild() the point-location kd-tree was rebuilt before self._nav_coords was refreshed from the rebuilt DM, so on an adapted mesh the rebuild indexed an old-sized coordinate array with new-mesh point ranges. When the mesh grew this raised IndexError (index 91 out of bounds for axis 0 with size 80) from swarm migration after mesh.adapt(); when it shrank it silently mislocated points. The navigation-coordinate refresh (both the volume-mesh and manifold branches) now runs before _build_kd_tree_index(). Reproduced by tests/test_0810_amr_swarm_migration_regression.py, which now passes. Underworld development team with AI support from Claude Code
…tion, and populate (BF-02) Swarm.migrate() early-returned when no particle was globally unclaimed without invalidating the canonical .data caches, the particle kd-tree, or the proxy staleness flags (SWARM-01/SWARM-02, 2026-07 audit). In serial or whenever no particle changes rank - the common case - every cached array kept the old particle count and the kd-tree kept a poisoned no-copy view of mutated coordinates. This is the root cause of issue #289: advection ends with a no-move migrate, so material proxies froze at the launch positions and time-stepped models silently used stationary material. The early return now performs the same invalidation as the fall-through path. add_particles_with_global_coordinates() gains an explicit invalidation (mirroring the #216 fix in add_particles_with_coordinates - with migrate=False nothing else invalidates), and populate() invalidates caches created before the swarm had particles (SWARM-17, verified at runtime). Collateral defect exposed by this fix, same #216 stale-cache class: Swarm.apply_snapshot_payload() wrote restored variable data into a detached np.asarray view of the canonical cache, so the DMSwarm field never received the restored values - the restore only worked because the stale canonical copy survived. It now writes through the canonical array so the PETSc pack fires (caught by test_0007_snapshot_inmemory). Regression tests: tests/test_0113_swarm_stale_cache_regression.py (SWARM-01 both migrate= variants, SWARM-17, SWARM-02 kd-tree contract and behavioural mirror probe, and the issue #289 reproducer). Underworld development team with AI support from Claude Code
…f discarding them (BF-05) Variable and coordinate writes made inside migration_disabled() / migration_control() were silently discarded (SWARM-04, 2026-07 audit): both PETSc-sync callbacks early-returned while _migration_disabled was set, nothing ever re-packed the canonical arrays, migrate() then read the stale PETSc coordinates, and its trailing invalidation destroyed the only copy of the writes. The docstrings explicitly recommended writing inside the context, and the parallel test covering the pattern asserted vacuously. Suppressing migration now defers only the PETSc pack, never the data: writes made while the flag is set are recorded per variable and flushed into the DMSwarm by Swarm._flush_pending_petsc_sync() at context exit (and defensively at migrate() entry). Both context-manager modes flush; only the migrate() call itself is deferred (default) or skipped (disable=True). Docstrings updated to state the actual contract. This commit also introduces Swarm._sync_before_assembly() (adjacent to the flush helper): the collective solve-entry synchronisation used by the deferred-migration and stale-proxy fixes. It is wired into the mesh assembly path in the follow-up commit. tests/parallel/test_0755_swarm_global_stats.py's perturbation block now asserts the write actually reaches the DMSwarm (previously vacuous), and tests/parallel/test_0756_swarm_migration_semantics.py (added in the follow-up commit) covers write survival at np2. Underworld development team with AI support from Claude Code
…or zeroing (BF-06) Ranks holding <= 1 particles either hard-crashed or silently corrupted their proxy mesh variables (SWARM-07, 2026-07 audit): rbf_interpolate() returned silent zeros that _rbf_to_meshVar wrote straight into the proxy nodal values, and IndexSwarmVariable's proxy update reached unguarded KDTree construction on an empty coordinate array, raising IndexError inside a collective update - an MPI abort or hang. Starved ranks now leave their proxy nodal values untouched and warn (warnings are suppressed for a swarm that has never been populated, since proxied variables legitimately touch this path at creation time). The guards are written to be collective-safe: MeshVariable reads/writes perform collective ghost synchronisation, so every rank executes the same read-then-write sequence and only the values differ per rank. The IndexSwarmVariable update_type=0 projection is restructured to compute into a local buffer and issue exactly ONE MeshVariable write per level set per rank (serially bit-identical to the old formulation): the previous in-context version issued a data-dependent number of writes whose deferred collective syncs mismatched across ranks - the np4 starved-rank test deadlocked on it. IndexSwarmVariable also gains a real _update_proxy_if_stale() override (its proxies live in _meshLevelSetVars, so the base implementation was a no-op for it) and routes .sym through it; this is also what the solve-entry refresh uses. np4 regression test (all particles seeded on one rank, sentinel survives on starved ranks, no crash) lands with the parallel test file in the follow-up commit. Underworld development team with AI support from Claude Code
…refresh at solve entry (BF-07, BF-08) BF-07 (SWARM-03, 2026-07 audit): the Swarm docstring promised automatic migration, but coordinate writes through the modern interface (swarm.coords setter / swarm._particle_coordinates.data) only packed to PETSc - only the deprecated points path migrated, leaving particles on the wrong rank in parallel. Coordinate writes now mark swarm._needs_migration; the collective migrate() runs DEFERRED at the next collective point - migration-control context exit or solve entry - never per-write, which would deadlock when ranks write unevenly (maintainer-approved fix shape). advection() suspends the deferred migration around its substep loop (a migrate firing from its velocity evaluations would reorder particle rows between captured arrays) and runs its own migrate() at the end, which clears the suspension. BF-08 (LE-03 = SWARM-05, issue #215 Bug 3): proxy refresh fired only via the lazy .sym accessor, but solvers pull the proxy DM directly through mesh.update_lvec(), consuming stale data after a material.data write. Mesh.update_lvec() now runs Swarm._sync_before_assembly() for each registered swarm before its staleness check: a single eager, collective refresh at solve entry (the refresh itself is what sets _stale_lvec). Rank-local flags are combined with MAX reductions so uneven writes are safe; repeated calls are flag-guarded no-ops, preserving the test_0006_memory_leak constraint. The refresh calls inside petsc_interpolate pass swarm_sync=False - only a subset of ranks reaches that function (zero-interior-point ranks skip it), and the hook's reductions must not run on a subset; those sites already rely on the all-ranks update_lvec() in evaluate() for freshness. The resolved TODO(BUG) at the old swarm.py:1075 is removed. Together with the BF-02 invalidation fix this closes issue #289: the issue reproducer (regression test in test_0113) shows the material proxy tracking the advected particles, and a Projection re-solve consumes fresh particle data without touching .sym. Adds tests/parallel/test_0756_swarm_migration_semantics.py: np2 migration-suppressed write survival, np2 deferred migration at context exit and at solve entry, np4 starved-rank proxy behaviour. Underworld development team with AI support from Claude Code
Contributor
There was a problem hiding this comment.
Pull request overview
This PR implements Track-0 remediation for Underworld3’s swarm migration / cache invalidation / proxy freshness semantics, and fixes an AMR point-location regression (#286) by reordering cache rebuilds. It also adds/repairs regression tests to prevent stale-cache, stale-proxy, and starved-rank MPI failure modes from reappearing.
Changes:
- Fix swarm cache invalidation and deferred-migration/proxy-refresh semantics (including solve-entry swarm sync) to prevent stale
.datacaches, stale kd-trees, and stale proxy mesh variables. - Fix AMR mesh rebuild ordering so
_nav_coordsis refreshed before rebuilding the navigation kd-tree index (#286). - Add new serial + MPI regression tests covering the Track-0 defects and repair a previously vacuous parallel test assertion.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_0113_swarm_stale_cache_regression.py | Adds serial regression coverage for stale-cache, kd-tree invalidation, proxy freshness through advection, and solve-entry proxy consumption. |
| tests/parallel/test_0756_swarm_migration_semantics.py | Adds MPI regression coverage for migration-control semantics, deferred migration, and starved-rank proxy safety. |
| tests/parallel/test_0755_swarm_global_stats.py | Strengthens an MPI test to ensure coordinate writes inside migration_disabled() actually reach PETSc (non-vacuous). |
| src/underworld3/swarm.py | Implements deferred PETSc sync flushing, deferred migration flagging, solve-entry proxy refresh, starved-rank guards, cache invalidation fixes, and snapshot restore correctness. |
| src/underworld3/function/_function.pyx | Avoids running swarm-sync logic from petsc_interpolate() on only a subset of ranks by adding swarm_sync=False for those paths. |
| src/underworld3/discretisation/discretisation_mesh.py | Refreshes _nav_coords before rebuilding kd-tree index and adds a swarm_sync control path to update_lvec() to prevent subset-rank deadlocks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2361
to
+2369
| # manifold mesh: the nav clone carries its own (ghosted) coords; | ||
| # refresh them from the rebuilt main DM where possible. | ||
| try: | ||
| self._nav_dm.setCoordinatesLocal(self.dm.getCoordinatesLocal()) | ||
| self._nav_coords = numpy.asarray( | ||
| self._nav_dm.getCoordinatesLocal().array | ||
| ).reshape(-1, self.cdim) | ||
| except Exception: | ||
| pass |
Comment on lines
2536
to
+2542
| elif self.update_type == 1: | ||
| # NOTE: this branch performs data-dependent MeshVariable writes | ||
| # outside any access context, which is not parallel-safe | ||
| # independently of the starved-rank issue (pre-existing). | ||
| # The guard here only prevents the empty-rank KDTree crash. | ||
| if starved: | ||
| return |
Comment on lines
+4943
to
+4949
| # Suspend the deferred (solve-entry) migration for the duration of | ||
| # the substep loop: the velocity evaluations below pass through | ||
| # Mesh.update_lvec(), and a migrate() firing there would reorder | ||
| # particle rows between the coordinate array and the velocity array | ||
| # captured from it. advection() performs its own migrate() at the end. | ||
| self._deferred_migration_suspended = True | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Track-0 remediation (2026-07 quality audit,
docs/reviews/2026-07/REMEDIATION-WORKLIST.md):BF-02 (SWARM-01/02/17), BF-05 (SWARM-04), BF-06 (SWARM-07), BF-07 (SWARM-03),
BF-08 (LE-03 = SWARM-05, issue #215 Bug 3), plus the freshly diagnosed #286.
Closes #289
Closes #286
Defects and fixes
BF-02 — stale caches after no-move migrate / particle addition / populate (SWARM-01/02/17)
Swarm.migrate()early-returned when no particle was globally unclaimed withoutinvalidating the canonical
.datacaches, the particle kd-tree, or the proxystaleness flags. Reproduced (serial): after
add_particles_with_global_coordinatesthe DMSwarm held 974 particles while every cached
.datastill had 972 rows; afteran in-place coordinate mirror +
migrate(),rbf_interpolatereturned garbage fromthe poisoned no-copy kd-tree; after
populate()a pre-existing cache stayed at 0rows. This is also the root cause of issue #289:
swarm.advection()ends with ano-move
migrate()in serial, so_proxy_stalestayedFalseand the materialproxy froze at the launch positions.
Fix: invalidate before the early return; explicit invalidation in
add_particles_with_global_coordinates(mirroring the #216 fix inadd_particles_with_coordinates) and inpopulate()(SWARM-17 verified atruntime, folded in).
Collateral defect exposed by this fix (same #216 stale-cache class):
Swarm.apply_snapshot_payload()wrote restored variable data into a detachednp.asarrayview of the canonical cache — the DMSwarm field never received therestored values, and the restore only "worked" because the SWARM-01 bug preserved
the stale canonical copy. With the invalidation fixed,
test_0007_snapshot_inmemorycontinuation tests caught it (restored material became zeros after the first
advection). Fixed by writing through the canonical array so the PETSc pack fires.
#286 — AMR: kd-tree rebuilt against stale
_nav_coordsIn
nuke_coords_and_rebuild()the point-location kd-tree was rebuilt beforeself._nav_coordswas refreshed, so the rebuild indexed an old-sized coordinatearray with new-mesh point ranges:
IndexError: index 91 is out of bounds for axis 0 with size 80(reproduced viatests/test_0810_amr_swarm_migration_regression.py).Fix: refresh
_nav_coords(volume and manifold branches) before_build_kd_tree_index().BF-05 — writes inside
migration_disabled()silently discarded (SWARM-04)Both PETSc-sync callbacks early-returned while
_migration_disabledwas set andnothing ever re-packed;
migrate()then read the stale PETSc coordinates and itstrailing invalidation destroyed the only copy of the writes. Reproduced (serial):
after writing 42.0 inside the context, the PETSc field still held 0.0.
Fix: separate "suppress migration" from "suppress PETSc sync" — writes are recorded
and flushed to the DMSwarm by
Swarm._flush_pending_petsc_sync()at context exit(and defensively at
migrate()entry); only the migration itself is deferred(default) or skipped (
disable=True). The vacuous assertions intests/parallel/test_0755_swarm_global_stats.py(the perturbation never reached theDMSwarm) now assert that it does.
BF-06 — empty/starved ranks: silent zeros or hard crash (SWARM-07)
rbf_interpolatesilently returned zeros for ranks with <= 1 particles and_rbf_to_meshVarwrote them into the proxy;IndexSwarmVariable's proxy updatereached unguarded
KDTreeconstruction on a 0-particle rank —IndexError: Out of bounds on buffer access(reproduced) inside a collectiveupdate, i.e. an MPI abort/hang.
Fix: starved ranks leave their proxy nodal values untouched and warn (never silent
zeros); guard the KDTree/nearest-neighbour machinery on
local_size <= 1whilekeeping every rank inside the same sequence of collective reads/writes/contexts
(MeshVariable syncs are collective — a naive early-return would deadlock).
IndexSwarmVariablegains a proper_update_proxy_if_stale()override (itsproxies live in
_meshLevelSetVars, so the base implementation was a no-op for it).The
update_type=0level-set projection was also restructured to compute into alocal buffer and issue exactly ONE MeshVariable write per level set per rank
(serially bit-identical): the previous formulation issued a data-dependent number
of writes whose deferred collective syncs mismatched across ranks — the np4
starved-rank test deadlocked on it (each write's ghost sync is collective).
BF-07 — modern coordinate writes never migrate (SWARM-03)
The class docstring promised automatic migration, but
swarm.coords/swarm._particle_coordinates.datawrites only packed to PETSc; only thedeprecated
pointspath migrated. Fix (maintainer-approved shape): coordinatewrites mark
swarm._needs_migration; the collectivemigrate()runs DEFERRED atthe next collective point — migration-control context exit or solve entry — never
per-write (uneven writes would deadlock). The flag is combined across ranks with a
MAX reduction before acting, so uneven writes are safe. Docstrings updated to state
the actual contract.
BF-08 — stale proxy consumed by solvers (LE-03 = SWARM-05, issue #215 Bug 3)
Proxy refresh only fired via the lazy
.symaccessor, but solvers pull the proxyDM directly through
mesh.update_lvec(). Reproduced (serial): writematerial.data, re-solve a capturedProjection— the solve consumed the oldvalues. Fix (maintainer-approved shape): a single eager, collective refresh at
solve entry —
Swarm._sync_before_assembly()invoked fromMesh.update_lvec()before its staleness check (the refresh itself is what sets
_stale_lvec). Italso performs BF-07's deferred migration. Flag-guarded: repeated calls are no-ops,
so nothing is re-interpolated per access (the
test_0006memory-leak constraint).The
TODO(BUG)at the oldswarm.py:1075is resolved and removed.Together with BF-02 this closes #289 (verified with the issue reproducer: the
proxy centroid now tracks the advecting particle centroid).
Tests
tests/test_0113_swarm_stale_cache_regression.py(new, level_1/tier_a):SWARM-01 (both
migrate=variants), SWARM-17, SWARM-02 (kd-tree invalidationcontract + behavioral mirror probe), the swarm.advection() leaves proxy mesh variables stale (frozen material) — auto-update hook orphaned in deprecated access path #289 reproducer as a regression test,
and the [BUG] - Three stale-cache issues after swarm particle addition #215-Bug-3 solve-freshness test.
tests/parallel/test_0756_swarm_migration_semantics.py(new): np2migration-suppressed write survival, np2 deferred migration at context exit and
at solve entry, np4 starved-rank proxy test (all particles on one rank).
tests/parallel/test_0755_swarm_global_stats.py: repaired the vacuousperturbation block.
tests/test_0810_amr_swarm_migration_regression.py(AMR: swarm migration after mesh.adapt() raises IndexError -- k-d tree rebuilt before nav_coords refresh #286) now passes.Gate results
All runs from the fix worktree, env
amr-dev(AMR-enabled PETSc build, neededfor the #286 test), after
./uw build.level_1 and tier_a: BEFORE = 295 passed, 6 failed -- the 6 failuresare exactly the new regression tests, each demonstrating its bug on the pristine
baseline. AFTER = 301 passed, 0 failed (1 skipped, 4 xfailed, 1 xpassed --
unchanged from baseline).
mpirun -np 2,test_0755+test_0756+test_0760+test_0765):19 passed, 1 skipped (the np4-only test).
mpirun -np 4,test_0755+test_0756+test_0765): 17 passed.test_0006_memory_leak: passed (the BF-08 gate -- the solve-entry refresh isflag-guarded, nothing re-interpolates per access).
test_0810_amr_swarm_migration_regression(AMR: swarm migration after mesh.adapt() raises IndexError -- k-d tree rebuilt before nav_coords refresh #286): failed with the reportedIndexError before the fix; passes after.
0.1071 -> 0.1270 (was frozen at 0.1071); also enshrined as a regression test.
Pre-existing np4 defect found while gating (NOT addressed here):
tests/parallel/test_0760_swarm_cache_migration.pyhangs atmpirun -np 4on thepristine baseline (verified by stashing this branch's src changes and
rebuilding): with rank-biased coordinates,
global_evaluateleaves some ranks withzero interior points, and those ranks diverge inside the
DMInterpolationmachinery in
petsc_interpolatewhile the others block in its collectives. Itpasses at np2 (CI runs
--p 2). Per-rank stack traces confirm all of this branch'scollective hooks enter and exit symmetrically before the hang. This should be
tracked as its own issue (parallel-evaluation empty-rank asymmetry, related to the
audit's SWARM-15/16 provisional findings).
Underworld development team with AI support from Claude Code