Skip to content

fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313

Open
lmoresi wants to merge 5 commits into
developmentfrom
bugfix/track0-swarm-caches
Open

fix(swarm): Track-0 stale-cache and migration-semantics fixes (BF-02/05/06/07/08 + #286)#313
lmoresi wants to merge 5 commits into
developmentfrom
bugfix/track0-swarm-caches

Conversation

@lmoresi

@lmoresi lmoresi commented Jul 4, 2026

Copy link
Copy Markdown
Member

Track-0 remediation (2026-07 quality audit, docs/reviews/2026-07/REMEDIATION-WORKLIST.md):
BF-02 (SWARM-01/02/17), BF-05 (SWARM-04), BF-06 (SWARM-07), BF-07 (SWARM-03),
BF-08 (LE-03 = SWARM-05, issue #215 Bug 3), plus the freshly diagnosed #286.

Closes #289
Closes #286

Defects and fixes

BF-02 — stale caches after no-move migrate / particle addition / populate (SWARM-01/02/17)

Swarm.migrate() early-returned when no particle was globally unclaimed without
invalidating the canonical .data caches, the particle kd-tree, or the proxy
staleness flags. Reproduced (serial): after add_particles_with_global_coordinates
the DMSwarm held 974 particles while every cached .data still had 972 rows; after
an in-place coordinate mirror + migrate(), rbf_interpolate returned garbage from
the poisoned no-copy kd-tree; after populate() a pre-existing cache stayed at 0
rows. This is also the root cause of issue #289: swarm.advection() ends with a
no-move migrate() in serial, so _proxy_stale stayed False and the material
proxy froze at the launch positions.

Fix: invalidate before the early return; explicit invalidation in
add_particles_with_global_coordinates (mirroring the #216 fix in
add_particles_with_coordinates) and in populate() (SWARM-17 verified at
runtime, folded in).

Collateral defect exposed by this fix (same #216 stale-cache class):
Swarm.apply_snapshot_payload() wrote restored variable data into a detached
np.asarray view of the canonical cache — the DMSwarm field never received the
restored values, and the restore only "worked" because the SWARM-01 bug preserved
the stale canonical copy. With the invalidation fixed, test_0007_snapshot_inmemory
continuation tests caught it (restored material became zeros after the first
advection). Fixed by writing through the canonical array so the PETSc pack fires.

#286 — AMR: kd-tree rebuilt against stale _nav_coords

In nuke_coords_and_rebuild() the point-location kd-tree was rebuilt before
self._nav_coords was refreshed, so the rebuild indexed an old-sized coordinate
array with new-mesh point ranges: IndexError: index 91 is out of bounds for axis 0 with size 80 (reproduced via tests/test_0810_amr_swarm_migration_regression.py).
Fix: refresh _nav_coords (volume and manifold branches) before
_build_kd_tree_index().

BF-05 — writes inside migration_disabled() silently discarded (SWARM-04)

Both PETSc-sync callbacks early-returned while _migration_disabled was set and
nothing ever re-packed; migrate() then read the stale PETSc coordinates and its
trailing invalidation destroyed the only copy of the writes. Reproduced (serial):
after writing 42.0 inside the context, the PETSc field still held 0.0.
Fix: separate "suppress migration" from "suppress PETSc sync" — writes are recorded
and flushed to the DMSwarm by Swarm._flush_pending_petsc_sync() at context exit
(and defensively at migrate() entry); only the migration itself is deferred
(default) or skipped (disable=True). The vacuous assertions in
tests/parallel/test_0755_swarm_global_stats.py (the perturbation never reached the
DMSwarm) now assert that it does.

BF-06 — empty/starved ranks: silent zeros or hard crash (SWARM-07)

rbf_interpolate silently returned zeros for ranks with <= 1 particles and
_rbf_to_meshVar wrote them into the proxy; IndexSwarmVariable's proxy update
reached unguarded KDTree construction on a 0-particle rank —
IndexError: Out of bounds on buffer access (reproduced) inside a collective
update, i.e. an MPI abort/hang.
Fix: starved ranks leave their proxy nodal values untouched and warn (never silent
zeros); guard the KDTree/nearest-neighbour machinery on local_size <= 1 while
keeping every rank inside the same sequence of collective reads/writes/contexts
(MeshVariable syncs are collective — a naive early-return would deadlock).
IndexSwarmVariable gains a proper _update_proxy_if_stale() override (its
proxies live in _meshLevelSetVars, so the base implementation was a no-op for it).

The update_type=0 level-set projection was also restructured to compute into a
local buffer and issue exactly ONE MeshVariable write per level set per rank
(serially bit-identical): the previous formulation issued a data-dependent number
of writes whose deferred collective syncs mismatched across ranks — the np4
starved-rank test deadlocked on it (each write's ghost sync is collective).

BF-07 — modern coordinate writes never migrate (SWARM-03)

The class docstring promised automatic migration, but swarm.coords /
swarm._particle_coordinates.data writes only packed to PETSc; only the
deprecated points path migrated. Fix (maintainer-approved shape): coordinate
writes mark swarm._needs_migration; the collective migrate() runs DEFERRED at
the next collective point — migration-control context exit or solve entry — never
per-write (uneven writes would deadlock). The flag is combined across ranks with a
MAX reduction before acting, so uneven writes are safe. Docstrings updated to state
the actual contract.

BF-08 — stale proxy consumed by solvers (LE-03 = SWARM-05, issue #215 Bug 3)

Proxy refresh only fired via the lazy .sym accessor, but solvers pull the proxy
DM directly through mesh.update_lvec(). Reproduced (serial): write
material.data, re-solve a captured Projection — the solve consumed the old
values. Fix (maintainer-approved shape): a single eager, collective refresh at
solve entry — Swarm._sync_before_assembly() invoked from Mesh.update_lvec()
before its staleness check (the refresh itself is what sets _stale_lvec). It
also performs BF-07's deferred migration. Flag-guarded: repeated calls are no-ops,
so nothing is re-interpolated per access (the test_0006 memory-leak constraint).
The TODO(BUG) at the old swarm.py:1075 is resolved and removed.
Together with BF-02 this closes #289 (verified with the issue reproducer: the
proxy centroid now tracks the advecting particle centroid).

Tests

Gate results

All runs from the fix worktree, env amr-dev (AMR-enabled PETSc build, needed
for the #286 test), after ./uw build.

Pre-existing np4 defect found while gating (NOT addressed here):
tests/parallel/test_0760_swarm_cache_migration.py hangs at mpirun -np 4 on the
pristine baseline (verified by stashing this branch's src changes and
rebuilding): with rank-biased coordinates, global_evaluate leaves some ranks with
zero interior points, and those ranks diverge inside the DMInterpolation
machinery in petsc_interpolate while the others block in its collectives. It
passes at np2 (CI runs --p 2). Per-rank stack traces confirm all of this branch's
collective hooks enter and exit symmetrically before the hang. This should be
tracked as its own issue (parallel-evaluation empty-rank asymmetry, related to the
audit's SWARM-15/16 provisional findings).

Underworld development team with AI support from Claude Code

lmoresi added 5 commits July 4, 2026 13:23
…#286)

In nuke_coords_and_rebuild() the point-location kd-tree was rebuilt before
self._nav_coords was refreshed from the rebuilt DM, so on an adapted mesh
the rebuild indexed an old-sized coordinate array with new-mesh point
ranges. When the mesh grew this raised IndexError (index 91 out of bounds
for axis 0 with size 80) from swarm migration after mesh.adapt(); when it
shrank it silently mislocated points. The navigation-coordinate refresh
(both the volume-mesh and manifold branches) now runs before
_build_kd_tree_index().

Reproduced by tests/test_0810_amr_swarm_migration_regression.py, which
now passes.

Underworld development team with AI support from Claude Code
…tion, and populate (BF-02)

Swarm.migrate() early-returned when no particle was globally unclaimed
without invalidating the canonical .data caches, the particle kd-tree, or
the proxy staleness flags (SWARM-01/SWARM-02, 2026-07 audit). In serial or
whenever no particle changes rank - the common case - every cached array
kept the old particle count and the kd-tree kept a poisoned no-copy view
of mutated coordinates. This is the root cause of issue #289: advection
ends with a no-move migrate, so material proxies froze at the launch
positions and time-stepped models silently used stationary material.

The early return now performs the same invalidation as the fall-through
path. add_particles_with_global_coordinates() gains an explicit
invalidation (mirroring the #216 fix in add_particles_with_coordinates -
with migrate=False nothing else invalidates), and populate() invalidates
caches created before the swarm had particles (SWARM-17, verified at
runtime).

Collateral defect exposed by this fix, same #216 stale-cache class:
Swarm.apply_snapshot_payload() wrote restored variable data into a
detached np.asarray view of the canonical cache, so the DMSwarm field
never received the restored values - the restore only worked because the
stale canonical copy survived. It now writes through the canonical array
so the PETSc pack fires (caught by test_0007_snapshot_inmemory).

Regression tests: tests/test_0113_swarm_stale_cache_regression.py
(SWARM-01 both migrate= variants, SWARM-17, SWARM-02 kd-tree contract and
behavioural mirror probe, and the issue #289 reproducer).

Underworld development team with AI support from Claude Code
…f discarding them (BF-05)

Variable and coordinate writes made inside migration_disabled() /
migration_control() were silently discarded (SWARM-04, 2026-07 audit):
both PETSc-sync callbacks early-returned while _migration_disabled was
set, nothing ever re-packed the canonical arrays, migrate() then read the
stale PETSc coordinates, and its trailing invalidation destroyed the only
copy of the writes. The docstrings explicitly recommended writing inside
the context, and the parallel test covering the pattern asserted
vacuously.

Suppressing migration now defers only the PETSc pack, never the data:
writes made while the flag is set are recorded per variable and flushed
into the DMSwarm by Swarm._flush_pending_petsc_sync() at context exit
(and defensively at migrate() entry). Both context-manager modes flush;
only the migrate() call itself is deferred (default) or skipped
(disable=True). Docstrings updated to state the actual contract.

This commit also introduces Swarm._sync_before_assembly() (adjacent to
the flush helper): the collective solve-entry synchronisation used by the
deferred-migration and stale-proxy fixes. It is wired into the mesh
assembly path in the follow-up commit.

tests/parallel/test_0755_swarm_global_stats.py's perturbation block now
asserts the write actually reaches the DMSwarm (previously vacuous), and
tests/parallel/test_0756_swarm_migration_semantics.py (added in the
follow-up commit) covers write survival at np2.

Underworld development team with AI support from Claude Code
…or zeroing (BF-06)

Ranks holding <= 1 particles either hard-crashed or silently corrupted
their proxy mesh variables (SWARM-07, 2026-07 audit):
rbf_interpolate() returned silent zeros that _rbf_to_meshVar wrote
straight into the proxy nodal values, and IndexSwarmVariable's proxy
update reached unguarded KDTree construction on an empty coordinate
array, raising IndexError inside a collective update - an MPI abort or
hang.

Starved ranks now leave their proxy nodal values untouched and warn
(warnings are suppressed for a swarm that has never been populated, since
proxied variables legitimately touch this path at creation time).
The guards are written to be collective-safe: MeshVariable reads/writes
perform collective ghost synchronisation, so every rank executes the same
read-then-write sequence and only the values differ per rank. The
IndexSwarmVariable update_type=0 projection is restructured to compute
into a local buffer and issue exactly ONE MeshVariable write per level
set per rank (serially bit-identical to the old formulation): the
previous in-context version issued a data-dependent number of writes
whose deferred collective syncs mismatched across ranks - the np4
starved-rank test deadlocked on it.

IndexSwarmVariable also gains a real _update_proxy_if_stale() override
(its proxies live in _meshLevelSetVars, so the base implementation was a
no-op for it) and routes .sym through it; this is also what the
solve-entry refresh uses.

np4 regression test (all particles seeded on one rank, sentinel survives
on starved ranks, no crash) lands with the parallel test file in the
follow-up commit.

Underworld development team with AI support from Claude Code
…refresh at solve entry (BF-07, BF-08)

BF-07 (SWARM-03, 2026-07 audit): the Swarm docstring promised automatic
migration, but coordinate writes through the modern interface
(swarm.coords setter / swarm._particle_coordinates.data) only packed to
PETSc - only the deprecated points path migrated, leaving particles on
the wrong rank in parallel. Coordinate writes now mark
swarm._needs_migration; the collective migrate() runs DEFERRED at the
next collective point - migration-control context exit or solve entry -
never per-write, which would deadlock when ranks write unevenly
(maintainer-approved fix shape). advection() suspends the deferred
migration around its substep loop (a migrate firing from its velocity
evaluations would reorder particle rows between captured arrays) and
runs its own migrate() at the end, which clears the suspension.

BF-08 (LE-03 = SWARM-05, issue #215 Bug 3): proxy refresh fired only via
the lazy .sym accessor, but solvers pull the proxy DM directly through
mesh.update_lvec(), consuming stale data after a material.data write.
Mesh.update_lvec() now runs Swarm._sync_before_assembly() for each
registered swarm before its staleness check: a single eager, collective
refresh at solve entry (the refresh itself is what sets _stale_lvec).
Rank-local flags are combined with MAX reductions so uneven writes are
safe; repeated calls are flag-guarded no-ops, preserving the
test_0006_memory_leak constraint. The refresh calls inside
petsc_interpolate pass swarm_sync=False - only a subset of ranks reaches
that function (zero-interior-point ranks skip it), and the hook's
reductions must not run on a subset; those sites already rely on the
all-ranks update_lvec() in evaluate() for freshness. The resolved
TODO(BUG) at the old swarm.py:1075 is removed.

Together with the BF-02 invalidation fix this closes issue #289: the
issue reproducer (regression test in test_0113) shows the material proxy
tracking the advected particles, and a Projection re-solve consumes fresh
particle data without touching .sym.

Adds tests/parallel/test_0756_swarm_migration_semantics.py: np2
migration-suppressed write survival, np2 deferred migration at context
exit and at solve entry, np4 starved-rank proxy behaviour.

Underworld development team with AI support from Claude Code

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements Track-0 remediation for Underworld3’s swarm migration / cache invalidation / proxy freshness semantics, and fixes an AMR point-location regression (#286) by reordering cache rebuilds. It also adds/repairs regression tests to prevent stale-cache, stale-proxy, and starved-rank MPI failure modes from reappearing.

Changes:

  • Fix swarm cache invalidation and deferred-migration/proxy-refresh semantics (including solve-entry swarm sync) to prevent stale .data caches, stale kd-trees, and stale proxy mesh variables.
  • Fix AMR mesh rebuild ordering so _nav_coords is refreshed before rebuilding the navigation kd-tree index (#286).
  • Add new serial + MPI regression tests covering the Track-0 defects and repair a previously vacuous parallel test assertion.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_0113_swarm_stale_cache_regression.py Adds serial regression coverage for stale-cache, kd-tree invalidation, proxy freshness through advection, and solve-entry proxy consumption.
tests/parallel/test_0756_swarm_migration_semantics.py Adds MPI regression coverage for migration-control semantics, deferred migration, and starved-rank proxy safety.
tests/parallel/test_0755_swarm_global_stats.py Strengthens an MPI test to ensure coordinate writes inside migration_disabled() actually reach PETSc (non-vacuous).
src/underworld3/swarm.py Implements deferred PETSc sync flushing, deferred migration flagging, solve-entry proxy refresh, starved-rank guards, cache invalidation fixes, and snapshot restore correctness.
src/underworld3/function/_function.pyx Avoids running swarm-sync logic from petsc_interpolate() on only a subset of ranks by adding swarm_sync=False for those paths.
src/underworld3/discretisation/discretisation_mesh.py Refreshes _nav_coords before rebuilding kd-tree index and adds a swarm_sync control path to update_lvec() to prevent subset-rank deadlocks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2361 to +2369
# manifold mesh: the nav clone carries its own (ghosted) coords;
# refresh them from the rebuilt main DM where possible.
try:
self._nav_dm.setCoordinatesLocal(self.dm.getCoordinatesLocal())
self._nav_coords = numpy.asarray(
self._nav_dm.getCoordinatesLocal().array
).reshape(-1, self.cdim)
except Exception:
pass
Comment thread src/underworld3/swarm.py
Comment on lines 2536 to +2542
elif self.update_type == 1:
# NOTE: this branch performs data-dependent MeshVariable writes
# outside any access context, which is not parallel-safe
# independently of the starved-rank issue (pre-existing).
# The guard here only prevents the empty-rank KDTree crash.
if starved:
return
Comment thread src/underworld3/swarm.py
Comment on lines +4943 to +4949
# Suspend the deferred (solve-entry) migration for the duration of
# the substep loop: the velocity evaluations below pass through
# Mesh.update_lvec(), and a migrate() firing there would reorder
# particle rows between the coordinate array and the velocity array
# captured from it. advection() performs its own migrate() at the end.
self._deferred_migration_suspended = True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants