feat(pt_expt): graph-native dpa1 — .pt2 export + compiled training + C++ inference single & multi-rank (NeighborGraph PR-B)#5604
Conversation
… graph .pt2 export - Add make_model.forward_common_lower_graph_exportable: make_fx trace of forward_common_lower_graph with edge_vec as the autograd leaf; symbolic tracing enabled via two data-dependent shape fixes. - Add EnergyModel.forward_lower_graph_exportable: wraps the above with a second make_fx pass that translates internal keys to public names (atom_energy, energy, force, virial, atom_virial). - Fix edge_transform_output.py: replace int(n_node.sum()) with next(iter(fit_ret.values())).shape[0] (static shape attr, trace-safe); thread node_capacity through edge_energy_deriv -> edge_force_virial so that segment_sum sizes are static under make_fx symbolic mode. - Add edge_energy_deriv node_capacity param (None = eager fallback). - Test: source/tests/pt_expt/model/test_graph_export.py — TDD RED->GREEN, verifies traced module reproduces eager energy_redu (rtol=1e-10).
… do_atomic_virial branches), dedup key-translation, drop redundant detach
…r_graph Adds test_graph_static_capacity.py (4 tests): shape (2,64), real-edge count matches dynamic, real prefix identity + masked tail, overflow raises. No source change needed -- builder.py already threads layout.edge_capacity into pad_and_guard_edges.
…raph .pt2 metadata for the C++ hub (B2)
…pbc, 1e-10) Add a graph-form .pt2 DeepEval dispatch and parity test (NeighborGraph PR-B Phase B1.4). DeepEval: route lower_input_kind=="graph" archives through a new _eval_model_graph that builds a carry-all NeighborGraph (padded to the static edge_capacity baked in metadata), feeds the positional schema (atype, n_node, edge_index, edge_vec, edge_mask, fparam, aparam, charge_spin) to the AOTI forward, and reshapes the LOCAL public outputs by category (no communicate_extended_output, since the graph path is ghost-free). The dense (nlist) path is untouched. Fix dynamic-shape generalization of the graph .pt2 export: under symbolic make_fx/torch.export, int() on a SymInt SPECIALIZES the axis, baking the trace-time sample size (N=14, nf=2) into the autograd-derived force/energy/ virial outputs. Drop int() on node_capacity (edge_force_virial) and on n_node.shape[0] (fit_output_to_model_output_graph), and derive nf-1 in frame_id_from_n_node as a runtime 0-d tensor instead of xp.asarray(shape-1). The exported graph now generalizes across arbitrary N and nf. Test: dpa1(attn_layer=0) energy model exported with lower_kind="graph", evaluated through the pt_expt DeepPot and compared to the eager dense (sel-capped, neighbor_graph_method="legacy") reference at rtol=atol=1e-10 for pbc and nopbc. The fixture is a sparse 8-atom cluster and asserts non-binding sel (max neighbors < 30) so the carry-all/sel-capped neighbor sets coincide (anti-vacuity).
…); drop force_legacy_descriptor Retarget _CompiledModel to compile forward_common_lower_graph for graph-eligible descriptors (dpa1 attn_layer==0), gated by the same mixed_types()+uses_graph_lower() predicate the eager default-flip uses; se_e2_a/dpa2/dpa3 keep compiling the dense forward_lower. _trace_and_compile_graph builds a synthetic NeighborGraph with prime-distinct nf/N/E axes (no make_fx duck-shape merge) and edge_vec as the autograd leaf; _forward_graph builds the carry-all graph eagerly and unravels flat (N,*) node outputs to (nf,nloc,*). cpp.simdlen=0 for the graph compile avoids an inductor CPU scatter-vectorizer crash on the per-frame virial atomic_add. Also fixes an eager autograd bug in dpa1 call_graph: xp.asarray(type_embedding, device=dev) DETACHES under torch, so the type-embedding weights never trained in the graph path (grad None despite a real finite-diff dependency). make_fx traced through it, so compiled != eager and the optimizer diverged after step 0. Use type_embedding directly (mirrors the dense path); the tebd net now trains and eager==compiled to 1e-10 across the varying-natoms trajectory. Drops the force_legacy_descriptor workaround + uses_graph_lower monkeypatch.
… + legacy-gate assumption in the graph compile path
for more information, see https://pre-commit.ci
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a graph-native NeighborGraph lowering path across export, inference, training compilation, C++ runtime, and tests. It also updates dpmodel and neighbor-graph helpers to preserve gradients and symbolic shapes during graph tracing. ChangesGraph-form lower-forward export and runtime
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| DeepPot, | ||
| ) | ||
| from deepmd.pt_expt.utils.env import ( | ||
| DEVICE, |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deepmd/pt_expt/utils/serialization.py (1)
750-789: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winReject unknown
lower_kindvalues instead of falling back to nlist.A typo such as
lower_kind="grpah"currently reaches the dense path and recordslower_input_kind="nlist", producing the wrong artifact without any signal.Proposed fix
def deserialize_to_file( @@ """ + if lower_kind not in {"nlist", "graph"}: + raise ValueError( + f"Unknown lower_kind {lower_kind!r}; expected 'nlist' or 'graph'." + ) if model_file.endswith(".pt2"):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/utils/serialization.py` around lines 750 - 789, The deserialize entrypoint in serialization.py currently accepts any lower_kind value and silently falls back to the nlist path, which can produce the wrong artifact for typos. Update deserialize_to_file and the lower-path selection logic in _deserialize_to_file_pt2 / _deserialize_to_file_pte to validate lower_kind explicitly against the supported values ("nlist" and "graph") and raise a clear error for anything else instead of defaulting to nlist, so the recorded lower_input_kind always matches the requested schema.source/tests/pt_expt/test_training.py (1)
1355-1374: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winAssert that the DPA1 no-attn case really takes the graph lower.
These changes now document graph-lower routing, but the test still only proves eager/compiled parity. If
_CompiledModelsilently falls back to the dense lower and the sampled systems stay non-binding, this case still passes, so the dispatch change introduced by this PR is untested. Please add an explicit assertion for the DPA1 no-attn branch that the compiled model is graph-eligible / using the graph lower, while keeping the dense expectation for the other descriptors.Also applies to: 1391-1392, 1466-1479
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/pt_expt/test_training.py` around lines 1355 - 1374, The current test only checks eager/compiled parity in _check_varying_natoms, so the DPA1 no-attn routing to the graph lower is not explicitly verified. Add an assertion in the DPA1 branch that the compiled model is graph-eligible and actually uses forward_common_lower_graph, while preserving the dense forward_lower expectation for the non-eligible descriptors. Extend the same explicit dispatch check in the related test sections that cover the same descriptor routing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 968-970: The graph-compiled forward path is dropping the
do_atomic_virial request, so atom_virial cannot be returned from compiled
models. Update forward and _forward_graph in training.py to thread
do_atomic_virial through the graph-eligible dispatch, and make
_trace_and_compile_graph cache/trace separately based on that flag so the
compiled lower path matches forward_common_lower_graph behavior.
- Around line 1283-1293: The reshaping logic in the result loop currently relies
on the `N != nframes` heuristic, which fails for `nloc == 1` and leaves
node-level outputs flat. Update the `training.py` postprocessing in the
`result.items()` loop to use explicit key categories from the graph/lower-level
public keys instead of shape inference, and reshape known node-level outputs
like `atom_energy` and `force` to `(nf, 1, *)` even when `N == nframes`.
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 447-448: The dynamic-shape spec in _build_graph_dynamic_shapes
still only marks the frame axis for aparam, so its atom axis remains fixed to
the sample nloc. Update the aparam entry in the shape mapping to include both
the frame axis and the atom axis as dynamic, matching the intended (nf, nloc,
nda) behavior. Keep the change localized to the graph shape spec alongside the
existing fparam and aparam handling so exported graphs no longer specialize to a
single atom count.
---
Outside diff comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 750-789: The deserialize entrypoint in serialization.py currently
accepts any lower_kind value and silently falls back to the nlist path, which
can produce the wrong artifact for typos. Update deserialize_to_file and the
lower-path selection logic in _deserialize_to_file_pt2 /
_deserialize_to_file_pte to validate lower_kind explicitly against the supported
values ("nlist" and "graph") and raise a clear error for anything else instead
of defaulting to nlist, so the recorded lower_input_kind always matches the
requested schema.
In `@source/tests/pt_expt/test_training.py`:
- Around line 1355-1374: The current test only checks eager/compiled parity in
_check_varying_natoms, so the DPA1 no-attn routing to the graph lower is not
explicitly verified. Add an assertion in the DPA1 branch that the compiled model
is graph-eligible and actually uses forward_common_lower_graph, while preserving
the dense forward_lower expectation for the non-eligible descriptors. Extend the
same explicit dispatch check in the related test sections that cover the same
descriptor routing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 58e2933c-1a05-4d30-bdfa-0cec147b8f6a
📒 Files selected for processing (14)
deepmd/dpmodel/descriptor/dpa1.pydeepmd/dpmodel/utils/neighbor_graph/derivatives.pydeepmd/dpmodel/utils/neighbor_graph/graph.pydeepmd/pt_expt/infer/deep_eval.pydeepmd/pt_expt/model/edge_transform_output.pydeepmd/pt_expt/model/ener_model.pydeepmd/pt_expt/model/make_model.pydeepmd/pt_expt/train/training.pydeepmd/pt_expt/utils/serialization.pysource/tests/common/dpmodel/test_graph_static_capacity.pysource/tests/pt_expt/infer/test_graph_deepeval.pysource/tests/pt_expt/model/test_graph_export.pysource/tests/pt_expt/test_training.pysource/tests/pt_expt/utils/test_graph_pt2_metadata.py
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #5604 +/- ##
==========================================
- Coverage 82.37% 81.85% -0.52%
==========================================
Files 902 960 +58
Lines 101527 106256 +4729
Branches 4056 4134 +78
==========================================
+ Hits 83630 86976 +3346
- Misses 16432 17787 +1355
- Partials 1465 1493 +28 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
… static edge_capacity
The graph-form .pt2 export now marks the edge axis dynamic
(Dim("nedge", min=2)) instead of baking a static E_max=ceil(1.25*nloc*nnei)
capacity. The AOTI artifact accepts any system size with no capacity ceiling.
- _build_graph_dynamic_shapes: edge_index dim1 / edge_vec dim0 / edge_mask dim0
are now dynamic; mirrors the dense Dim("nnei", min=...) precedent.
- _trace_and_export graph branch: drop the metadata["edge_capacity"] write;
apply _strip_shape_assertions to neutralise the SIGFPE-prone deferred
shape guards on the dynamic E axis (same handling the spin dense path uses).
- deep_eval._eval_model_graph: build the carry-all graph at its tight edge
count (no edge_capacity padding).
- test_graph_deepeval: eval TWO different-size systems (8- and 20-atom,
56 vs 380 real edges) through the SAME exported artifact; both match eager
dense dpa1 at 1e-10 pbc+nopbc, both non-binding. The 20-atom system (380 > 263)
would have overflowed the B1 static artifact -> the RED.
- test_graph_pt2_metadata: graph metadata no longer carries edge_capacity.
…e graph (dynamic-E) caller + edge-axis safety
…) generator Section B of gen_dpa1.py produces deeppot_dpa1_graph.pt2 (lower_kind="graph", do_atomic_virial=True) and the accompanying deeppot_dpa1_graph.expected sidecar. Reference values come from an independent nlist .pt2 eval (NOT the graph .pt2) so the C++ gtest (B2.5) validates the graph AOTI path against a known-good reference. Sanity check: graph .pt2 vs nlist ref force diff 1.3e-18 (machine precision; sel=30 >> actual neighbors, so both paths see identical neighbor sets). Forces non-degenerate: max |F| ~5.3e-4 (PBC). Config: type_map=[O,H], sel=30, rcut=6.0, attn_layer=0, neuron=[2,4,8], axis_neuron=4, fitting neuron=[5,5,5], resnet_dt=True, seed=1 — mirrors DPA1_CONFIG in source/tests/pt_expt/utils/test_graph_pt2_metadata.py.
B2.2: read lower_input_kind="graph" -> lower_input_is_graph_. B2.3: run_model_graph with NeighborGraph AOTI input order (atype, n_node, edge_index, edge_vec, edge_mask, [fparam], [aparam], [charge_spin]); no coord / edge_scatter_index. B2.4: GraphTensorPack + buildGraphTensors (delegates to createEdgeTensors+compactEdgeTensors for the rcut filter, dynamic edge count and 2 masked dummy edges; drops edge_index_ext, adds n_node=[nloc], node types from atype_ext[0:nloc]) + compute_inner & standalone dispatch branches. Multi-rank graph fails fast (PR-B3).
… path Local std::vector<int64_t> mapping was declared in compute_inner and populated only inside if(ago==0). The graph branch called buildGraphTensors with this local vector on every step, causing an OOB heap read on ago>0 (mapping.size()==0). Fix: promote mapping to a member mapping_ (parallel to mapping_tensor) so it persists across steps. Edge-path (createEdgeTensors) and dense-path (mapping_tensor) are unaffected in behavior; only the vector source changes from local to member.
…xtraction
Add test_deeppot_dpa1_graph_ptexpt.cc — the first runtime exercise of the
NeighborGraph .pt2 C++ ingestion (B2.2-4). Four cases x {double,float}:
build-nlist parity vs the committed .expected, a second 12-atom system through
the same model (dynamic edge axis), the LAMMPS InputNlist+ago=0/ago=1 path
(compute_inner + cached mapping_), and a tiny no-edge system (nedge_min=2 guard).
The first run surfaced a B2.2-4 bug: the graph forward emits LOCAL flat-N PUBLIC
keys (atom_energy/energy/force/virial/atom_virial) but compute()'s output
extraction read the dense INTERNAL keys (energy_redu/energy_derv_r/...), so
output_map["energy_redu"].view() threw on an undefined tensor. The graph branch
had only ever been compiled, never run.
Fix: remap_graph_outputs_to_dense_keys() in commonPT.h, called after
extract_outputs in both compute overloads (gated on lower_input_is_graph_).
Rewrites the public keys into the dense internal-key layout; per-atom
force/atom_virial are local (nloc) and zero-padded up to nall so the existing
fold-back is a no-op on ghost rows (local rows already carry the folded ghost
contributions). Dense/edge paths untouched.
gen_dpa1.py now persists the dense nlist-ref .pt2 (deeppot_dpa1_graph_nlist_ref.pt2)
as a live graph-vs-dense oracle for the dynamic-edge cases.
…ring (nlist .pt2, 1e-5)
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
source/api_cc/src/DeepPotPTExpt.cc (1)
508-539: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winRequire
lmp_list.mappingfor graph artifacts with ghosts.The graph branch uses
mapping_inbuildGraphTensorsto fold ghost neighbors onto local owners, but the current guard only requires an atom map forhas_message_passing_models. A DPA1 graph artifact can therefore run PBC LAMMPS withoutatom_modify map yes, fall back to identity mapping, and feed invalid/wrong folded graph indices.Proposed fix
if (lower_input_is_graph_ && multi_rank) { throw deepmd::deepmd_exception( "Multi-rank graph (NeighborGraph) .pt2 inference is not yet " "supported (PR-B3). Run single-rank, or use a dense/edge .pt2 for " "multi-rank LAMMPS."); } + if (lower_input_is_graph_ && nghost > 0 && !atom_map_present) { + throw deepmd::deepmd_exception( + "Graph (NeighborGraph) .pt2 inference with ghost atoms requires " + "`atom_modify map yes` so ghost neighbors can be folded onto their " + "local owners. Re-run with an atom map enabled, or use a dense .pt2."); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 508 - 539, The graph-artifact path in DeepPotPTExpt::buildGraphTensors is using mapping_ to fold ghost neighbors, but the current validation only enforces atom-map presence for has_message_passing_ models. Add a guard for lower_input_is_graph_ with ghosts so graph artifacts also require lmp_list.mapping (or atom_modify map yes) before compute(), and fail fast when it is missing instead of falling back to identity mapping.deepmd/pt_expt/utils/serialization.py (1)
793-799: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winReject unknown
lower_kindvalues before dispatch.A typo currently falls through to the dense nlist export while recording nlist metadata, producing a valid-looking artifact with the wrong ABI.
Proposed fix
""" + if lower_kind not in {"nlist", "graph"}: + raise ValueError( + f"Unsupported lower_kind={lower_kind!r}; expected 'nlist' or 'graph'." + ) if model_file.endswith(".pt2"): _deserialize_to_file_pt2(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pt_expt/utils/serialization.py` around lines 793 - 799, The dispatch in serialization.py currently lets unknown lower_kind values fall through to the dense nlist export path, which can produce a misleading artifact with incorrect ABI metadata. Add validation before the model_file.endswith(".pt2") branch so only the expected lower_kind values are accepted, and raise an error for anything else. Keep the check close to the dispatch logic in the deserialization/export flow that calls _deserialize_to_file_pt2 and _deserialize_to_file_pte.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 793-799: The dispatch in serialization.py currently lets unknown
lower_kind values fall through to the dense nlist export path, which can produce
a misleading artifact with incorrect ABI metadata. Add validation before the
model_file.endswith(".pt2") branch so only the expected lower_kind values are
accepted, and raise an error for anything else. Keep the check close to the
dispatch logic in the deserialization/export flow that calls
_deserialize_to_file_pt2 and _deserialize_to_file_pte.
In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 508-539: The graph-artifact path in
DeepPotPTExpt::buildGraphTensors is using mapping_ to fold ghost neighbors, but
the current validation only enforces atom-map presence for has_message_passing_
models. Add a guard for lower_input_is_graph_ with ghosts so graph artifacts
also require lmp_list.mapping (or atom_modify map yes) before compute(), and
fail fast when it is missing instead of falling back to identity mapping.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c4b21156-2b33-4392-b6b0-4819a492ea77
📒 Files selected for processing (9)
deepmd/pt_expt/infer/deep_eval.pydeepmd/pt_expt/utils/serialization.pysource/api_cc/include/DeepPotPTExpt.hsource/api_cc/include/commonPT.hsource/api_cc/src/DeepPotPTExpt.ccsource/api_cc/tests/test_deeppot_dpa1_graph_ptexpt.ccsource/tests/infer/gen_dpa1.pysource/tests/pt_expt/infer/test_graph_deepeval.pysource/tests/pt_expt/utils/test_graph_pt2_metadata.py
🚧 Files skipped from review as they are similar to previous changes (2)
- deepmd/pt_expt/infer/deep_eval.py
- source/tests/pt_expt/infer/test_graph_deepeval.py
…st (B2.5 follow-ups)
…-comm; no with-comm)
…-assert on graph .pt2 export)
The graph forward_common_lower_graph .pt2 AOTI export device-asserted on
CUDA ('index out of bounds: 0 <= tmp8 < ks0') in the edge_force_virial
scatter. A padding/guard edge's node index reaches the index_add address
computation BEFORE edge_mask zeroes its (zero) contribution; CPU does not
bounds-check the address (so all dev + CI, which are CPU-only, were green),
CUDA device-asserts. Clamp src/dst into [0, n_out) and edge_frame into
[0, nf) before the scatters -- padding edges carry w_edge==0, so a clamped
out-of-range index scatters zero (numerically harmless). Verified on Tesla
T4: graph .pt2 export + DeepEval parity (small_8/large_20) + full graph
suite + consistency all pass; parity confirms the clamp does not corrupt
(N bound was correct, the stray index was genuine padding).
Also bind the per-node scatter to the input atype.shape[0] (node_capacity)
rather than the re-derived fit_ret.shape[0] -- hardening.
Fix: CUDA device-assert in graph
|
The xp.clip clamp added in the previous commit breaks under torch.export:
the bound n_out is a SymInt and array_api_compat's clip reads .shape on it
('SymInt' object has no attribute 'shape'), failing every graph .pt2 export
on CI/local (it only passed on the GPU box's newer array_api_compat). Replace
with modulo (src % n_out, dst % n_out, edge_frame % nf) -- pure arithmetic, so
torch.export-safe, and a no-op on in-range real indices. Also fixes the case a
mask-multiply missed: the out-of-range index occurs on an edge_mask==1 edge
(ks0=n_out binds to a smaller symbol than the live node count at AOTI runtime),
which only a clamp-every-index (clip or modulo), not a mask-zero, neutralizes.
Out-of-range edges carry ~zero w_edge, so wrapping them is numerically harmless.
Verified on Tesla T4: graph .pt2 export (gen_dpa1) + DeepEval parity
(small_8/large_20) + C++ TestInferDpa1GraphPtExpt 10/10 all pass; CPU export
suite 12/12 (the SymInt failure the prior commit would have hit in CI).
…#5604 - training.py: add explanatory comments to four pass-only except clauses (CodeQL empty-except); replace the nloc==1-fragile 'N != nframes' shape heuristic in the compiled-graph unravel with an explicit node-level key set so single-atom-per-frame outputs reshape to (nf, 1, *) (CodeRabbit). - serialization.py: mark aparam's atom axis dynamic in the graph export dynamic-shape spec ({0: nframes, 1: nloc}), matching the dense path, so a dim_aparam>0 graph export no longer specializes nloc (CodeRabbit). - test_lammps_dpa1_graph_pt2.py: skip the two reference-comparison tests when the gen_dpa1 .expected fixture is absent (clean skip vs TypeError on None); force 'processors 2 1 1' in the empty-subdomain MPI test so the empty-rank branch is genuinely exercised (CodeRabbit). Skipped as invalid: CodeQL 'import DEVICE' (house pattern, benign in test); CodeRabbit do_atomic_virial-in-compiled-graph (the dense compiled path drops it identically; training never requests atom_virial). Validated: CPU export 8/8 + compiled-varying-natoms 5/5 (incl. dpa1_no_attn).
…port
Add a user-facing entry point to the graph C++ inference path. Before this,
the graph lower was reachable only via the internal API / gen_dpa1.py test
fixture, so a user-frozen dpa1 .pt2 always used the dense (nlist) lower and
the tested C++ graph path was unreachable from the CLI.
- main.py: add --lower-kind {nlist,graph} to the freeze subparser (default
nlist; PyTorch-Exportable backend only, same convention as --head/--node-names).
- entrypoints/main.py: thread lower_kind into freeze() -> deserialize_to_file.
Fail fast (ValueError) when 'graph' is requested for a non-graph-eligible
model (reuses _model_uses_graph_lower; currently dpa1 attn_layer==0 only).
Enable do_atomic_virial for the graph form -- near-free there (one extra
scatter off the shared single backward).
- test_dp_freeze.py: graph-eligibility rejection (se_e2_a) + a public-CLI
graph freeze of dpa1(attn_layer=0) asserting metadata lower_input_kind=graph.
Opt-in by design; auto-selecting graph for eligible models (mirroring training)
is deferred until the graph path covers attention/angles/MP. Both tests pass.
Pushed: CUDA-fix correction + AI-review responses +
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 734-741: The output suffix handling in main() still forces
unsuffixed exports to .pte even when freeze() is running with lower_kind set to
graph. Update the logic around the FLAGS.output normalization so it chooses the
default suffix based on the effective lower_kind (graph should default to .pt2,
otherwise .pte), while still preserving explicit user-provided .pte/.pt2
outputs. Use the freeze() call and the lower_kind lookup as the key symbols to
locate the fix.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5a56748f-b853-426d-96fb-a1bb13a56f0f
📒 Files selected for processing (7)
deepmd/dpmodel/utils/neighbor_graph/derivatives.pydeepmd/main.pydeepmd/pt_expt/entrypoints/main.pydeepmd/pt_expt/train/training.pydeepmd/pt_expt/utils/serialization.pysource/lmp/tests/test_lammps_dpa1_graph_pt2.pysource/tests/pt_expt/test_dp_freeze.py
🚧 Files skipped from review as they are similar to previous changes (3)
- deepmd/dpmodel/utils/neighbor_graph/derivatives.py
- source/lmp/tests/test_lammps_dpa1_graph_pt2.py
- deepmd/pt_expt/utils/serialization.py
Wire the dpa1 graph .pt2 (attn_layer=0, lower_input_kind=graph) into the parametrized VariantDeepPotTest as case dpa1_graph_ptexpt (Backend::PTExpt, ref deeppot_dpa1_graph.expected, pbc/nopbc, tol 1e-10/1e-4). Flags mirror dpa1_pytorch_pt2 exactly; all 19 enabled subtests pass on remote GPU (FiniteDifference, LmpNlist*, CutoffTwice, TypeSel, NoPbc*). The dedicated test_deeppot_dpa1_graph_ptexpt.cc is retained for graph-unique coverage.
… modulo-clamp invariant test - entrypoints/main.py: 'dp --pt_expt freeze --lower-kind graph' now defaults a suffix-less output to .pt2 (the AOTI archive the C++ graph path consumes) instead of .pte; explicit .pte/.pt2 suffixes are preserved for both kinds. New test (freeze mocked) covers all four suffix/kind combinations. - test_edge_force_virial.py: add an invariant test that the in-bounds index clamp (% n_out) NEVER alters a real (edge_mask==True) edge -- a boundary real edge on node n_out-1 scatters unwrapped, and a masked guard edge with deliberately out-of-range indices contributes nothing; the result equals the real-edges-only reference. Directly answers iProzd's ask to prove the modulo cannot silently remap real edges (only zero-weight guard edges). Both new tests pass.
for more information, see https://pre-commit.ci
…t fix + root cause Reframe the comment on the % n_out index guard: it is the PERMANENT fix, not an interim workaround. Records the GPU-confirmed root cause -- the dynamic-edge graph torch.export path traces the node count as several equal-but-distinct symbols tied only by aten._assert_scalar(Eq(...)) nodes, which _strip_shape_assertions neutralises wholesale (needed for export to trace), dropping those equalities so inductor can no longer prove idx < n_out and emits a device_assert. The upstream alternative (making the shared, spin-critical _strip_shape_assertions selective) is deliberately not taken -- it risks the torch.export bugs that helper bypasses and the spin .pt2 path. Comment-only.
OutisLi
left a comment
There was a problem hiding this comment.
One performance issue on the C++ graph LAMMPS path (per-step host-side topology rebuild). Details inline.
OutisLi
left a comment
There was a problem hiding this comment.
One maintainability issue: the graph trace/export sample builders are duplicated across training.py and serialization.py. Please consolidate. Details inline.
…eview) Merge the two near-identical synthetic-graph builders _make_graph_sample_inputs (serialization.py) and _make_graph_trace_inputs (training.py) into one build_synthetic_graph_inputs in serialization.py, parameterized by dtype (export=float64 ABI, training=GLOBAL_PT precision) and device (export passes cpu explicitly instead of mutating env.DEVICE) plus the want_fparam/aparam/charge_spin gating. Also factor the duplicated prime-collision _forbidden scan into _forbidden_dims_from_model, shared by the dense and graph _trace_and_compile. Removes the desync risk between the training and export graph traces flagged in review.
…nk (OutisLi review) Comment 1 (perf): the graph LAMMPS path called buildGraphTensors every timestep, whose createEdgeTensors stage is an O(E) host loop + H2D copy that rebuilds the edge topology from scratch even when LAMMPS has not refreshed the neighbor list. Mirror the edge path: at ago==0 cache the skin topology via createEdgeTensors(with_geometry=false, fold_to_local=!multi_rank) into edge_index_tensor / edge_index_ext_tensor; each step run only the on-device compactEdgeTensors (geometry recompute + rcut filter) and assemble the cheap n_node / node-atype tensors. Topology now rebuilds only on a neighbor-list rebuild, consistent with the edge path. Comment 2 (empty-rank nit): a truly-empty rank (nall_real == 0, no local atoms AND no ghosts) would feed N == 0 into the graph, and edge_force_virial's edge_index % node_capacity would SIGFPE (div-by-zero). Early-return zero outputs for that rank. The tested nloc==0 empty-subdomain case has nall_real>0 (ghosts within rcut) and still runs normally. Validated: 29/33 graph gtests (4 pre-existing NoPBC skips) + 5/5 test_lammps_dpa1_graph_pt2 (single-rank multi-step ref, multi-rank, mpirun -n 2 == single-rank, empty-subdomain).
for more information, see https://pre-commit.ci
NeighborGraph PR-B — graph
.pt2export, compiled training, and C++ inference (single & multi-rank)This PR spans the full PR-B: B1 (Python: graph
.pt2export + compiled training on the graph lower), B2 (C++ single-rank inference of the graph.pt2, dynamic edge axis), and B3 (C++/LAMMPS multi-rank). Built on the merged PR-A (#5583). Scope: dpa1,attn_layer=0, pt_expt.B1 — graph
.pt2export + compiled training (Python)forward_common_lower_graph_exportabletrace target;serialization.pygraph export branch (lower_kind="graph",lower_input_kindmetadata);_eval_model_graphDeepEval dispatch (parity vs eager dpa1 1e-10 pbc+nopbc).force_legacy_descriptordeleted. Root cause was a real dpa1call_graphautograd detach bug (xp.asarray(tebd, device=)drops the tebd-net gradient under torch); fixed.B2 — C++ graph ingestion (dynamic edge axis, single-rank)
.pt2uses a dynamic edge axis (Dim("nedge", min=2)) — one artifact evals any system size (proven across 56- and 380-edge systems at 1e-10), no C++ capacity ceiling.DeepPotPTExpt:lower_input_is_graph_+run_model_graph(NeighborGraph ABI:atype, n_node, edge_index, edge_vec, edge_mask, …) +buildGraphTensors(mirrors the refactor(pt/dpa4): use edge schema for dpa4 and ignore sel #5562 edge path; node types fromatype_ext);remap_graph_outputs_to_dense_keys(single-rank).ago>0, tiny system, atomic-overload). The review process caught two bugs that would otherwise have shipped: anago>0heap-OOB (by inspection) and a public-vs-internal output-key mismatch (at runtime).B3 — multi-rank C++ / LAMMPS (non-MP)
border_op/with-comm artifact (that is a message-passing concern, deferred to PR-G). Multi-rank reuses the same single-rank graph.pt2, fed an extended-region graph (buildGraphTensors(fold_to_local=false),N=nall, ghost node types fromatype_extincl. halo), with owned energy =sum(atom_energy[0:nloc])and the extended force folded to owners through the existing denseselect_mapreverse-comm. The fail-fast forgraph && multi_rank && has_message_passingis retained.test_lammps_dpa1_graph_pt2.py— single-rank vs reference,mpirun -n 2≡ single-rank (energy + per-atom force + virial, atol 1e-8), plus an empty-subdomain (nloc=0) corner. Single-rank gtests stay 10/10 (multi-rank is purely additive). Multi-rank matched single-rank on the first run.Tests / known limitations
nonzero, eager-only), PR-D attention, PR-E angles, PR-F jax graph force, PR-G dpa2/3 message-passing (forward halo + with-comm). CUDA multi-rank unvalidated locally. Carried code-cleanup follow-ups: a ~60-line DRY duplication intraining.py; the multi-rank atomic output branch has no direct gtest (covered indirectly by the mpirun per-atom-virial assertion, since a single-process gtest can't setnprocs>1).Summary by CodeRabbit
Summary
New Features
lower_kind="graph"export path, including CLI support and new graph-form inference handling.Bug Fixes
Tests
.pt2, plus metadata checks forlower_input_kind.