Skip to content

feat(pt_expt): graph-native dpa1 — .pt2 export + compiled training + C++ inference single & multi-rank (NeighborGraph PR-B)#5604

Open
wanghan-iapcm wants to merge 33 commits into
deepmodeling:masterfrom
wanghan-iapcm:feat-graph-pt2-prB
Open

feat(pt_expt): graph-native dpa1 — .pt2 export + compiled training + C++ inference single & multi-rank (NeighborGraph PR-B)#5604
wanghan-iapcm wants to merge 33 commits into
deepmodeling:masterfrom
wanghan-iapcm:feat-graph-pt2-prB

Conversation

@wanghan-iapcm

@wanghan-iapcm wanghan-iapcm commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

NeighborGraph PR-B — graph .pt2 export, compiled training, and C++ inference (single & multi-rank)

This PR spans the full PR-B: B1 (Python: graph .pt2 export + compiled training on the graph lower), B2 (C++ single-rank inference of the graph .pt2, dynamic edge axis), and B3 (C++/LAMMPS multi-rank). Built on the merged PR-A (#5583). Scope: dpa1, attn_layer=0, pt_expt.

B1 — graph .pt2 export + compiled training (Python)

  • forward_common_lower_graph_exportable trace target; serialization.py graph export branch (lower_kind="graph", lower_input_kind metadata); _eval_model_graph DeepEval dispatch (parity vs eager dpa1 1e-10 pbc+nopbc).
  • Compiled training retargeted to the graph lower so eager == compiled (the MUST-FIX) → force_legacy_descriptor deleted. Root cause was a real dpa1 call_graph autograd detach bug (xp.asarray(tebd, device=) drops the tebd-net gradient under torch); fixed.

B2 — C++ graph ingestion (dynamic edge axis, single-rank)

  • Graph .pt2 uses a dynamic edge axis (Dim("nedge", min=2)) — one artifact evals any system size (proven across 56- and 380-edge systems at 1e-10), no C++ capacity ceiling.
  • C++ DeepPotPTExpt: lower_input_is_graph_ + run_model_graph (NeighborGraph ABI: atype, n_node, edge_index, edge_vec, edge_mask, …) + buildGraphTensors (mirrors the refactor(pt/dpa4): use edge schema for dpa4 and ignore sel #5562 edge path; node types from atype_ext); remap_graph_outputs_to_dense_keys (single-rank).
  • gtest: 5 cases × {double,float} = 10/10 (build-nlist parity, dynamic-E 2nd size, ago>0, tiny system, atomic-overload). The review process caught two bugs that would otherwise have shipped: an ago>0 heap-OOB (by inspection) and a public-vs-internal output-key mismatch (at runtime).

B3 — multi-rank C++ / LAMMPS (non-MP)

  • dpa1 is non-message-passing ⇒ multi-rank needs NO border_op/with-comm artifact (that is a message-passing concern, deferred to PR-G). Multi-rank reuses the same single-rank graph .pt2, fed an extended-region graph (buildGraphTensors(fold_to_local=false), N=nall, ghost node types from atype_ext incl. halo), with owned energy = sum(atom_energy[0:nloc]) and the extended force folded to owners through the existing dense select_map reverse-comm. The fail-fast for graph && multi_rank && has_message_passing is retained.
  • Validated locally on multi-CPU (no GPU needed for correctness): test_lammps_dpa1_graph_pt2.py — single-rank vs reference, mpirun -n 2 ≡ single-rank (energy + per-atom force + virial, atol 1e-8), plus an empty-subdomain (nloc=0) corner. Single-rank gtests stay 10/10 (multi-rank is purely additive). Multi-rank matched single-rank on the first run.

Tests / known limitations

  • Per-task + whole-phase reviews all Ready-to-merge.
  • pt_expt-only; dpa1 (non-MP) only. Follow-ons: PR-C vesin/nv O(N) builders (carry-all builders still use nonzero, eager-only), PR-D attention, PR-E angles, PR-F jax graph force, PR-G dpa2/3 message-passing (forward halo + with-comm). CUDA multi-rank unvalidated locally. Carried code-cleanup follow-ups: a ~60-line DRY duplication in training.py; the multi-rank atomic output branch has no direct gtest (covered indirectly by the mpirun per-atom-virial assertion, since a single-process gtest can't set nprocs>1).

Summary by CodeRabbit

Summary

  • New Features

    • Added support for graph-schema (NeighborGraph) model archives with a selectable lower_kind="graph" export path, including CLI support and new graph-form inference handling.
    • Added static edge-capacity support during graph construction.
  • Bug Fixes

    • Improved gradient continuity for type embeddings in graph mode.
    • Enhanced trace/export stability by preventing out-of-range graph indices/frame IDs and making scatter/frame sizing more consistent.
  • Tests

    • Added/extended parity, export metadata, training, and LAMMPS single-/multi-rank validation for graph-form .pt2, plus metadata checks for lower_input_kind.

Han Wang added 8 commits June 29, 2026 18:42
… graph .pt2 export

- Add make_model.forward_common_lower_graph_exportable: make_fx trace of
  forward_common_lower_graph with edge_vec as the autograd leaf; symbolic
  tracing enabled via two data-dependent shape fixes.
- Add EnergyModel.forward_lower_graph_exportable: wraps the above with a
  second make_fx pass that translates internal keys to public names
  (atom_energy, energy, force, virial, atom_virial).
- Fix edge_transform_output.py: replace int(n_node.sum()) with
  next(iter(fit_ret.values())).shape[0] (static shape attr, trace-safe);
  thread node_capacity through edge_energy_deriv -> edge_force_virial so
  that segment_sum sizes are static under make_fx symbolic mode.
- Add edge_energy_deriv node_capacity param (None = eager fallback).
- Test: source/tests/pt_expt/model/test_graph_export.py — TDD RED->GREEN,
  verifies traced module reproduces eager energy_redu (rtol=1e-10).
… do_atomic_virial branches), dedup key-translation, drop redundant detach
…r_graph

Adds test_graph_static_capacity.py (4 tests): shape (2,64), real-edge
count matches dynamic, real prefix identity + masked tail, overflow raises.
No source change needed -- builder.py already threads layout.edge_capacity
into pad_and_guard_edges.
…pbc, 1e-10)

Add a graph-form .pt2 DeepEval dispatch and parity test (NeighborGraph PR-B
Phase B1.4).

DeepEval: route lower_input_kind=="graph" archives through a new
_eval_model_graph that builds a carry-all NeighborGraph (padded to the static
edge_capacity baked in metadata), feeds the positional schema
(atype, n_node, edge_index, edge_vec, edge_mask, fparam, aparam, charge_spin)
to the AOTI forward, and reshapes the LOCAL public outputs by category (no
communicate_extended_output, since the graph path is ghost-free). The dense
(nlist) path is untouched.

Fix dynamic-shape generalization of the graph .pt2 export: under symbolic
make_fx/torch.export, int() on a SymInt SPECIALIZES the axis, baking the
trace-time sample size (N=14, nf=2) into the autograd-derived force/energy/
virial outputs. Drop int() on node_capacity (edge_force_virial) and on
n_node.shape[0] (fit_output_to_model_output_graph), and derive nf-1 in
frame_id_from_n_node as a runtime 0-d tensor instead of xp.asarray(shape-1).
The exported graph now generalizes across arbitrary N and nf.

Test: dpa1(attn_layer=0) energy model exported with lower_kind="graph",
evaluated through the pt_expt DeepPot and compared to the eager dense
(sel-capped, neighbor_graph_method="legacy") reference at rtol=atol=1e-10 for
pbc and nopbc. The fixture is a sparse 8-atom cluster and asserts non-binding
sel (max neighbors < 30) so the carry-all/sel-capped neighbor sets coincide
(anti-vacuity).
…); drop force_legacy_descriptor

Retarget _CompiledModel to compile forward_common_lower_graph for graph-eligible
descriptors (dpa1 attn_layer==0), gated by the same mixed_types()+uses_graph_lower()
predicate the eager default-flip uses; se_e2_a/dpa2/dpa3 keep compiling the dense
forward_lower. _trace_and_compile_graph builds a synthetic NeighborGraph with
prime-distinct nf/N/E axes (no make_fx duck-shape merge) and edge_vec as the
autograd leaf; _forward_graph builds the carry-all graph eagerly and unravels flat
(N,*) node outputs to (nf,nloc,*). cpp.simdlen=0 for the graph compile avoids an
inductor CPU scatter-vectorizer crash on the per-frame virial atomic_add.

Also fixes an eager autograd bug in dpa1 call_graph: xp.asarray(type_embedding,
device=dev) DETACHES under torch, so the type-embedding weights never trained in
the graph path (grad None despite a real finite-diff dependency). make_fx traced
through it, so compiled != eager and the optimizer diverged after step 0. Use
type_embedding directly (mirrors the dense path); the tebd net now trains and
eager==compiled to 1e-10 across the varying-natoms trajectory.

Drops the force_legacy_descriptor workaround + uses_graph_lower monkeypatch.
… + legacy-gate assumption in the graph compile path
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a graph-native NeighborGraph lowering path across export, inference, training compilation, C++ runtime, and tests. It also updates dpmodel and neighbor-graph helpers to preserve gradients and symbolic shapes during graph tracing.

Changes

Graph-form lower-forward export and runtime

Layer / File(s) Summary
Trace-safe graph utilities
deepmd/dpmodel/descriptor/dpa1.py, deepmd/dpmodel/utils/neighbor_graph/graph.py
type_embedding is used directly in graph descriptor paths, and frame_id_from_n_node uses a runtime clamp value for padding-node frame IDs.
Trace-stable edge transforms
deepmd/pt_expt/model/edge_transform_output.py, deepmd/pt_expt/model/make_model.py, deepmd/pt_expt/model/ener_model.py
edge_energy_deriv and graph fitting accept node_capacity, graph export entry points pass the node-axis capacity through the traced lower path, and graph/dense export key translation is centralized.
Graph export metadata and DeepEval dispatch
deepmd/pt_expt/utils/serialization.py, deepmd/pt_expt/infer/deep_eval.py
deserialize_to_file accepts lower_kind, graph exports build dynamic graph inputs and record lower_input_kind, and DeepEval routes graph archives through _eval_model_graph with graph-key output mapping.
GRAPH compiled lower
deepmd/pt_expt/train/training.py
Graph eligibility detection, graph tracing and compilation helpers, a cached graph forward path, and graph-shaped output unraveling are added to compiled training execution.
C++ graph runtime path
source/api_cc/include/commonPT.h, source/api_cc/include/DeepPotPTExpt.h, source/api_cc/src/DeepPotPTExpt.cc
C++ support adds graph tensor packing, graph model execution, graph metadata dispatch, dense-key remapping, and graph-path cache handling.
Graph export and runtime tests
source/tests/common/dpmodel/test_graph_static_capacity.py, source/tests/pt_expt/model/test_graph_export.py, source/tests/pt_expt/infer/test_graph_deepeval.py, source/tests/pt_expt/utils/test_graph_pt2_metadata.py, source/tests/pt_expt/test_training.py, source/api_cc/tests/test_deeppot_dpa1_graph_ptexpt.cc, source/tests/infer/gen_dpa1.py, source/lmp/tests/test_lammps_dpa1_graph_pt2.py, source/tests/pt_expt/test_dp_freeze.py
Tests cover static edge capacity, graph export tracing, metadata, DeepEval parity, C++ inference, graph artifact generation, the compiled varying-natoms graph path, LAMMPS graph .pt2 inference, and freeze-path validation.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

Core

Suggested reviewers

  • OutisLi
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: graph-native dpa1 pt2 export, compiled training, and C++ inference across single and multi-rank paths.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

DeepPot,
)
from deepmd.pt_expt.utils.env import (
DEVICE,
Comment thread deepmd/pt_expt/train/training.py Fixed
Comment thread deepmd/pt_expt/train/training.py Fixed
Comment thread deepmd/pt_expt/train/training.py Fixed
Comment thread deepmd/pt_expt/train/training.py Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deepmd/pt_expt/utils/serialization.py (1)

750-789: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject unknown lower_kind values instead of falling back to nlist.

A typo such as lower_kind="grpah" currently reaches the dense path and records lower_input_kind="nlist", producing the wrong artifact without any signal.

Proposed fix
 def deserialize_to_file(
@@
     """
+    if lower_kind not in {"nlist", "graph"}:
+        raise ValueError(
+            f"Unknown lower_kind {lower_kind!r}; expected 'nlist' or 'graph'."
+        )
     if model_file.endswith(".pt2"):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/utils/serialization.py` around lines 750 - 789, The
deserialize entrypoint in serialization.py currently accepts any lower_kind
value and silently falls back to the nlist path, which can produce the wrong
artifact for typos. Update deserialize_to_file and the lower-path selection
logic in _deserialize_to_file_pt2 / _deserialize_to_file_pte to validate
lower_kind explicitly against the supported values ("nlist" and "graph") and
raise a clear error for anything else instead of defaulting to nlist, so the
recorded lower_input_kind always matches the requested schema.
source/tests/pt_expt/test_training.py (1)

1355-1374: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert that the DPA1 no-attn case really takes the graph lower.

These changes now document graph-lower routing, but the test still only proves eager/compiled parity. If _CompiledModel silently falls back to the dense lower and the sampled systems stay non-binding, this case still passes, so the dispatch change introduced by this PR is untested. Please add an explicit assertion for the DPA1 no-attn branch that the compiled model is graph-eligible / using the graph lower, while keeping the dense expectation for the other descriptors.

Also applies to: 1391-1392, 1466-1479

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt_expt/test_training.py` around lines 1355 - 1374, The current
test only checks eager/compiled parity in _check_varying_natoms, so the DPA1
no-attn routing to the graph lower is not explicitly verified. Add an assertion
in the DPA1 branch that the compiled model is graph-eligible and actually uses
forward_common_lower_graph, while preserving the dense forward_lower expectation
for the non-eligible descriptors. Extend the same explicit dispatch check in the
related test sections that cover the same descriptor routing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt_expt/train/training.py`:
- Around line 968-970: The graph-compiled forward path is dropping the
do_atomic_virial request, so atom_virial cannot be returned from compiled
models. Update forward and _forward_graph in training.py to thread
do_atomic_virial through the graph-eligible dispatch, and make
_trace_and_compile_graph cache/trace separately based on that flag so the
compiled lower path matches forward_common_lower_graph behavior.
- Around line 1283-1293: The reshaping logic in the result loop currently relies
on the `N != nframes` heuristic, which fails for `nloc == 1` and leaves
node-level outputs flat. Update the `training.py` postprocessing in the
`result.items()` loop to use explicit key categories from the graph/lower-level
public keys instead of shape inference, and reshape known node-level outputs
like `atom_energy` and `force` to `(nf, 1, *)` even when `N == nframes`.

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 447-448: The dynamic-shape spec in _build_graph_dynamic_shapes
still only marks the frame axis for aparam, so its atom axis remains fixed to
the sample nloc. Update the aparam entry in the shape mapping to include both
the frame axis and the atom axis as dynamic, matching the intended (nf, nloc,
nda) behavior. Keep the change localized to the graph shape spec alongside the
existing fparam and aparam handling so exported graphs no longer specialize to a
single atom count.

---

Outside diff comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 750-789: The deserialize entrypoint in serialization.py currently
accepts any lower_kind value and silently falls back to the nlist path, which
can produce the wrong artifact for typos. Update deserialize_to_file and the
lower-path selection logic in _deserialize_to_file_pt2 /
_deserialize_to_file_pte to validate lower_kind explicitly against the supported
values ("nlist" and "graph") and raise a clear error for anything else instead
of defaulting to nlist, so the recorded lower_input_kind always matches the
requested schema.

In `@source/tests/pt_expt/test_training.py`:
- Around line 1355-1374: The current test only checks eager/compiled parity in
_check_varying_natoms, so the DPA1 no-attn routing to the graph lower is not
explicitly verified. Add an assertion in the DPA1 branch that the compiled model
is graph-eligible and actually uses forward_common_lower_graph, while preserving
the dense forward_lower expectation for the non-eligible descriptors. Extend the
same explicit dispatch check in the related test sections that cover the same
descriptor routing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 58e2933c-1a05-4d30-bdfa-0cec147b8f6a

📥 Commits

Reviewing files that changed from the base of the PR and between 58bef11 and b046874.

📒 Files selected for processing (14)
  • deepmd/dpmodel/descriptor/dpa1.py
  • deepmd/dpmodel/utils/neighbor_graph/derivatives.py
  • deepmd/dpmodel/utils/neighbor_graph/graph.py
  • deepmd/pt_expt/infer/deep_eval.py
  • deepmd/pt_expt/model/edge_transform_output.py
  • deepmd/pt_expt/model/ener_model.py
  • deepmd/pt_expt/model/make_model.py
  • deepmd/pt_expt/train/training.py
  • deepmd/pt_expt/utils/serialization.py
  • source/tests/common/dpmodel/test_graph_static_capacity.py
  • source/tests/pt_expt/infer/test_graph_deepeval.py
  • source/tests/pt_expt/model/test_graph_export.py
  • source/tests/pt_expt/test_training.py
  • source/tests/pt_expt/utils/test_graph_pt2_metadata.py

Comment thread deepmd/pt_expt/train/training.py
Comment thread deepmd/pt_expt/train/training.py
Comment thread deepmd/pt_expt/utils/serialization.py Outdated
@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 82.12389% with 101 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.85%. Comparing base (5082854) to head (70b02fe).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
...rce/api_cc/tests/test_deeppot_dpa1_graph_ptexpt.cc 74.34% 39 Missing ⚠️
deepmd/pt_expt/train/training.py 80.26% 30 Missing ⚠️
source/api_cc/src/DeepPotPTExpt.cc 65.45% 13 Missing and 6 partials ⚠️
deepmd/pt_expt/utils/serialization.py 90.62% 6 Missing ⚠️
source/api_cc/include/commonPT.h 92.72% 2 Missing and 2 partials ⚠️
deepmd/pt_expt/infer/deep_eval.py 91.17% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5604      +/-   ##
==========================================
- Coverage   82.37%   81.85%   -0.52%     
==========================================
  Files         902      960      +58     
  Lines      101527   106256    +4729     
  Branches     4056     4134      +78     
==========================================
+ Hits        83630    86976    +3346     
- Misses      16432    17787    +1355     
- Partials     1465     1493      +28     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Han Wang added 7 commits June 30, 2026 01:21
… static edge_capacity

The graph-form .pt2 export now marks the edge axis dynamic
(Dim("nedge", min=2)) instead of baking a static E_max=ceil(1.25*nloc*nnei)
capacity. The AOTI artifact accepts any system size with no capacity ceiling.

- _build_graph_dynamic_shapes: edge_index dim1 / edge_vec dim0 / edge_mask dim0
  are now dynamic; mirrors the dense Dim("nnei", min=...) precedent.
- _trace_and_export graph branch: drop the metadata["edge_capacity"] write;
  apply _strip_shape_assertions to neutralise the SIGFPE-prone deferred
  shape guards on the dynamic E axis (same handling the spin dense path uses).
- deep_eval._eval_model_graph: build the carry-all graph at its tight edge
  count (no edge_capacity padding).
- test_graph_deepeval: eval TWO different-size systems (8- and 20-atom,
  56 vs 380 real edges) through the SAME exported artifact; both match eager
  dense dpa1 at 1e-10 pbc+nopbc, both non-binding. The 20-atom system (380 > 263)
  would have overflowed the B1 static artifact -> the RED.
- test_graph_pt2_metadata: graph metadata no longer carries edge_capacity.
…e graph (dynamic-E) caller + edge-axis safety
…) generator

Section B of gen_dpa1.py produces deeppot_dpa1_graph.pt2 (lower_kind="graph",
do_atomic_virial=True) and the accompanying deeppot_dpa1_graph.expected sidecar.

Reference values come from an independent nlist .pt2 eval (NOT the graph .pt2)
so the C++ gtest (B2.5) validates the graph AOTI path against a known-good
reference. Sanity check: graph .pt2 vs nlist ref force diff 1.3e-18 (machine
precision; sel=30 >> actual neighbors, so both paths see identical neighbor
sets). Forces non-degenerate: max |F| ~5.3e-4 (PBC).

Config: type_map=[O,H], sel=30, rcut=6.0, attn_layer=0, neuron=[2,4,8],
axis_neuron=4, fitting neuron=[5,5,5], resnet_dt=True, seed=1 — mirrors
DPA1_CONFIG in source/tests/pt_expt/utils/test_graph_pt2_metadata.py.
B2.2: read lower_input_kind="graph" -> lower_input_is_graph_.
B2.3: run_model_graph with NeighborGraph AOTI input order
  (atype, n_node, edge_index, edge_vec, edge_mask, [fparam], [aparam],
  [charge_spin]); no coord / edge_scatter_index.
B2.4: GraphTensorPack + buildGraphTensors (delegates to
  createEdgeTensors+compactEdgeTensors for the rcut filter, dynamic edge
  count and 2 masked dummy edges; drops edge_index_ext, adds n_node=[nloc],
  node types from atype_ext[0:nloc]) + compute_inner & standalone dispatch
  branches. Multi-rank graph fails fast (PR-B3).
… path

Local std::vector<int64_t> mapping was declared in compute_inner and populated
only inside if(ago==0). The graph branch called buildGraphTensors with this
local vector on every step, causing an OOB heap read on ago>0 (mapping.size()==0).

Fix: promote mapping to a member mapping_ (parallel to mapping_tensor) so it
persists across steps. Edge-path (createEdgeTensors) and dense-path (mapping_tensor)
are unaffected in behavior; only the vector source changes from local to member.
…xtraction

Add test_deeppot_dpa1_graph_ptexpt.cc — the first runtime exercise of the
NeighborGraph .pt2 C++ ingestion (B2.2-4). Four cases x {double,float}:
build-nlist parity vs the committed .expected, a second 12-atom system through
the same model (dynamic edge axis), the LAMMPS InputNlist+ago=0/ago=1 path
(compute_inner + cached mapping_), and a tiny no-edge system (nedge_min=2 guard).

The first run surfaced a B2.2-4 bug: the graph forward emits LOCAL flat-N PUBLIC
keys (atom_energy/energy/force/virial/atom_virial) but compute()'s output
extraction read the dense INTERNAL keys (energy_redu/energy_derv_r/...), so
output_map["energy_redu"].view() threw on an undefined tensor. The graph branch
had only ever been compiled, never run.

Fix: remap_graph_outputs_to_dense_keys() in commonPT.h, called after
extract_outputs in both compute overloads (gated on lower_input_is_graph_).
Rewrites the public keys into the dense internal-key layout; per-atom
force/atom_virial are local (nloc) and zero-padded up to nall so the existing
fold-back is a no-op on ghost rows (local rows already carry the folded ghost
contributions). Dense/edge paths untouched.

gen_dpa1.py now persists the dense nlist-ref .pt2 (deeppot_dpa1_graph_nlist_ref.pt2)
as a live graph-vs-dense oracle for the dynamic-edge cases.
@github-actions github-actions Bot added the C++ label Jun 29, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
source/api_cc/src/DeepPotPTExpt.cc (1)

508-539: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Require lmp_list.mapping for graph artifacts with ghosts.

The graph branch uses mapping_ in buildGraphTensors to fold ghost neighbors onto local owners, but the current guard only requires an atom map for has_message_passing_ models. A DPA1 graph artifact can therefore run PBC LAMMPS without atom_modify map yes, fall back to identity mapping, and feed invalid/wrong folded graph indices.

Proposed fix
   if (lower_input_is_graph_ && multi_rank) {
     throw deepmd::deepmd_exception(
         "Multi-rank graph (NeighborGraph) .pt2 inference is not yet "
         "supported (PR-B3). Run single-rank, or use a dense/edge .pt2 for "
         "multi-rank LAMMPS.");
   }
+  if (lower_input_is_graph_ && nghost > 0 && !atom_map_present) {
+    throw deepmd::deepmd_exception(
+        "Graph (NeighborGraph) .pt2 inference with ghost atoms requires "
+        "`atom_modify map yes` so ghost neighbors can be folded onto their "
+        "local owners. Re-run with an atom map enabled, or use a dense .pt2.");
+  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/api_cc/src/DeepPotPTExpt.cc` around lines 508 - 539, The
graph-artifact path in DeepPotPTExpt::buildGraphTensors is using mapping_ to
fold ghost neighbors, but the current validation only enforces atom-map presence
for has_message_passing_ models. Add a guard for lower_input_is_graph_ with
ghosts so graph artifacts also require lmp_list.mapping (or atom_modify map yes)
before compute(), and fail fast when it is missing instead of falling back to
identity mapping.
deepmd/pt_expt/utils/serialization.py (1)

793-799: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject unknown lower_kind values before dispatch.

A typo currently falls through to the dense nlist export while recording nlist metadata, producing a valid-looking artifact with the wrong ABI.

Proposed fix
     """
+    if lower_kind not in {"nlist", "graph"}:
+        raise ValueError(
+            f"Unsupported lower_kind={lower_kind!r}; expected 'nlist' or 'graph'."
+        )
     if model_file.endswith(".pt2"):
         _deserialize_to_file_pt2(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt_expt/utils/serialization.py` around lines 793 - 799, The dispatch
in serialization.py currently lets unknown lower_kind values fall through to the
dense nlist export path, which can produce a misleading artifact with incorrect
ABI metadata. Add validation before the model_file.endswith(".pt2") branch so
only the expected lower_kind values are accepted, and raise an error for
anything else. Keep the check close to the dispatch logic in the
deserialization/export flow that calls _deserialize_to_file_pt2 and
_deserialize_to_file_pte.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 793-799: The dispatch in serialization.py currently lets unknown
lower_kind values fall through to the dense nlist export path, which can produce
a misleading artifact with incorrect ABI metadata. Add validation before the
model_file.endswith(".pt2") branch so only the expected lower_kind values are
accepted, and raise an error for anything else. Keep the check close to the
dispatch logic in the deserialization/export flow that calls
_deserialize_to_file_pt2 and _deserialize_to_file_pte.

In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 508-539: The graph-artifact path in
DeepPotPTExpt::buildGraphTensors is using mapping_ to fold ghost neighbors, but
the current validation only enforces atom-map presence for has_message_passing_
models. Add a guard for lower_input_is_graph_ with ghosts so graph artifacts
also require lmp_list.mapping (or atom_modify map yes) before compute(), and
fail fast when it is missing instead of falling back to identity mapping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c4b21156-2b33-4392-b6b0-4819a492ea77

📥 Commits

Reviewing files that changed from the base of the PR and between b046874 and b25fdfc.

📒 Files selected for processing (9)
  • deepmd/pt_expt/infer/deep_eval.py
  • deepmd/pt_expt/utils/serialization.py
  • source/api_cc/include/DeepPotPTExpt.h
  • source/api_cc/include/commonPT.h
  • source/api_cc/src/DeepPotPTExpt.cc
  • source/api_cc/tests/test_deeppot_dpa1_graph_ptexpt.cc
  • source/tests/infer/gen_dpa1.py
  • source/tests/pt_expt/infer/test_graph_deepeval.py
  • source/tests/pt_expt/utils/test_graph_pt2_metadata.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deepmd/pt_expt/infer/deep_eval.py
  • source/tests/pt_expt/infer/test_graph_deepeval.py

@wanghan-iapcm wanghan-iapcm changed the title feat(pt_expt): export the graph lower to .pt2 + compiled training on the graph lower (NeighborGraph PR-B1) feat(pt_expt): graph-native dpa1 — .pt2 export + compiled training + C++ inference (NeighborGraph PR-B1+B2) Jun 29, 2026
…-assert on graph .pt2 export)

The graph forward_common_lower_graph .pt2 AOTI export device-asserted on
CUDA ('index out of bounds: 0 <= tmp8 < ks0') in the edge_force_virial
scatter. A padding/guard edge's node index reaches the index_add address
computation BEFORE edge_mask zeroes its (zero) contribution; CPU does not
bounds-check the address (so all dev + CI, which are CPU-only, were green),
CUDA device-asserts. Clamp src/dst into [0, n_out) and edge_frame into
[0, nf) before the scatters -- padding edges carry w_edge==0, so a clamped
out-of-range index scatters zero (numerically harmless). Verified on Tesla
T4: graph .pt2 export + DeepEval parity (small_8/large_20) + full graph
suite + consistency all pass; parity confirms the clamp does not corrupt
(N bound was correct, the stray index was genuine padding).

Also bind the per-node scatter to the input atype.shape[0] (node_capacity)
rather than the re-derived fit_ret.shape[0] -- hardening.
@wanghan-iapcm

Copy link
Copy Markdown
Collaborator Author

Fix: CUDA device-assert in graph .pt2 export (surfaced by the first GPU CI run)

The "Test Python and C++ on CUDA" job (skipped on earlier runs, first executed on e2e07f160) hit a CUDA-only failure: the graph forward_common_lower_graph .pt2 AOTI export device-asserts (index out of bounds: 0 <= tmp8 < ks0) in the edge_force_virial scatter, which also aborts gen_dpa1.py and cascade-poisons the CUDA context for the rest of pytest source/tests.

Cause: a padding/guard edge's node index reaches the index_add address computation before edge_mask zeroes its (zero-valued) contribution. CPU does not bounds-check the scatter address, so all prior dev + CI (CPU-only) were green; CUDA device-asserts.

Fix (7658091dd): clamp the scatter indices into range in edge_force_virialsrc/dst[0, n_out), edge_frame[0, nf). Padding edges carry w_edge == 0, so a clamped out-of-range index scatters zero (numerically harmless). Also binds the per-node scatter bound to the input atype.shape[0] (hardening). Plus 942de1f4f: a small edge_energy_deriv arg-order cleanup.

Validated on a Tesla T4 (same GPU class as the CI runner): graph .pt2 export ✅, DeepEval parity small_8 + large_20 ✅ (the correctness arbiter — confirms the clamp does not corrupt: the N bound was correct and the stray index was genuine padding), full graph Python suite + test_dpa1 consistency ✅, C++ TestInferDpa1GraphPtExpt 10/10 ✅, LAMMPS test_lammps_dpa1_graph_pt2 5/5 (single + virial + mpirun -n 2 + multi≡single + empty-subdomain) ✅.

@wanghan-iapcm wanghan-iapcm requested review from OutisLi and iProzd June 30, 2026 06:13
Han Wang added 3 commits June 30, 2026 15:24
The xp.clip clamp added in the previous commit breaks under torch.export:
the bound n_out is a SymInt and array_api_compat's clip reads .shape on it
('SymInt' object has no attribute 'shape'), failing every graph .pt2 export
on CI/local (it only passed on the GPU box's newer array_api_compat). Replace
with modulo (src % n_out, dst % n_out, edge_frame % nf) -- pure arithmetic, so
torch.export-safe, and a no-op on in-range real indices. Also fixes the case a
mask-multiply missed: the out-of-range index occurs on an edge_mask==1 edge
(ks0=n_out binds to a smaller symbol than the live node count at AOTI runtime),
which only a clamp-every-index (clip or modulo), not a mask-zero, neutralizes.
Out-of-range edges carry ~zero w_edge, so wrapping them is numerically harmless.

Verified on Tesla T4: graph .pt2 export (gen_dpa1) + DeepEval parity
(small_8/large_20) + C++ TestInferDpa1GraphPtExpt 10/10 all pass; CPU export
suite 12/12 (the SymInt failure the prior commit would have hit in CI).
…#5604

- training.py: add explanatory comments to four pass-only except clauses
  (CodeQL empty-except); replace the nloc==1-fragile 'N != nframes' shape
  heuristic in the compiled-graph unravel with an explicit node-level key set
  so single-atom-per-frame outputs reshape to (nf, 1, *) (CodeRabbit).
- serialization.py: mark aparam's atom axis dynamic in the graph export
  dynamic-shape spec ({0: nframes, 1: nloc}), matching the dense path, so a
  dim_aparam>0 graph export no longer specializes nloc (CodeRabbit).
- test_lammps_dpa1_graph_pt2.py: skip the two reference-comparison tests when
  the gen_dpa1 .expected fixture is absent (clean skip vs TypeError on None);
  force 'processors 2 1 1' in the empty-subdomain MPI test so the empty-rank
  branch is genuinely exercised (CodeRabbit).

Skipped as invalid: CodeQL 'import DEVICE' (house pattern, benign in test);
CodeRabbit do_atomic_virial-in-compiled-graph (the dense compiled path drops it
identically; training never requests atom_virial).

Validated: CPU export 8/8 + compiled-varying-natoms 5/5 (incl. dpa1_no_attn).
…port

Add a user-facing entry point to the graph C++ inference path. Before this,
the graph lower was reachable only via the internal API / gen_dpa1.py test
fixture, so a user-frozen dpa1 .pt2 always used the dense (nlist) lower and
the tested C++ graph path was unreachable from the CLI.

- main.py: add --lower-kind {nlist,graph} to the freeze subparser (default
  nlist; PyTorch-Exportable backend only, same convention as --head/--node-names).
- entrypoints/main.py: thread lower_kind into freeze() -> deserialize_to_file.
  Fail fast (ValueError) when 'graph' is requested for a non-graph-eligible
  model (reuses _model_uses_graph_lower; currently dpa1 attn_layer==0 only).
  Enable do_atomic_virial for the graph form -- near-free there (one extra
  scatter off the shared single backward).
- test_dp_freeze.py: graph-eligibility rejection (se_e2_a) + a public-CLI
  graph freeze of dpa1(attn_layer=0) asserting metadata lower_input_kind=graph.

Opt-in by design; auto-selecting graph for eligible models (mirroring training)
is deferred until the graph path covers attention/angles/MP. Both tests pass.
@wanghan-iapcm

Copy link
Copy Markdown
Collaborator Author

Pushed: CUDA-fix correction + AI-review responses + freeze --lower-kind graph

0f4373172 — corrects the prior clamp. The xp.clip I added in 7658091dd broke under torch.export (its SymInt bound trips array_api_compat's clip'SymInt' object has no attribute 'shape'); it only passed on the GPU box's newer array_api_compat and would have failed CI's graph-export tests. Replaced with modulo (% n_out/% nf) — pure arithmetic, export-safe, no-op on in-range indices. GPU-revalidated (gen + DeepEval parity small_8/large_20 + C++ TestInferDpa1GraphPtExpt 10/10) and CPU export suite 12/12.

afda4c7b3 — AI review. Addressed: CodeQL empty-except ×4 (comments); CodeRabbit nloc==1 unravel (explicit node-level key set), aparam atom-axis dynamic in graph export (matches dense), LAMMPS test robustness (skip-on-missing-fixture + forced processors 2 1 1 for the empty-subdomain branch). Declined (with reasons): CodeQL import DEVICE (house pattern, benign in test); CodeRabbit do_atomic_virial in compiled-graph (the dense compiled path drops it identically; training never requests atom_virial).

0ea2c3435dp --pt_expt freeze --lower-kind {nlist,graph}. Closes the gap that the graph C++ path was tested only via the gen_dpa1.py fixture and unreachable from the CLI. Default nlist; graph fails fast for non-eligible models and enables the near-free per-atom virial. New tests: eligibility rejection + a public-CLI dpa1(attn_layer=0) graph freeze asserting metadata.lower_input_kind=='graph'. Opt-in by design; auto-select (mirroring training) deferred until graph covers attention/angles/MP.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt_expt/entrypoints/main.py`:
- Around line 734-741: The output suffix handling in main() still forces
unsuffixed exports to .pte even when freeze() is running with lower_kind set to
graph. Update the logic around the FLAGS.output normalization so it chooses the
default suffix based on the effective lower_kind (graph should default to .pt2,
otherwise .pte), while still preserving explicit user-provided .pte/.pt2
outputs. Use the freeze() call and the lower_kind lookup as the key symbols to
locate the fix.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5a56748f-b853-426d-96fb-a1bb13a56f0f

📥 Commits

Reviewing files that changed from the base of the PR and between 7658091 and 0ea2c34.

📒 Files selected for processing (7)
  • deepmd/dpmodel/utils/neighbor_graph/derivatives.py
  • deepmd/main.py
  • deepmd/pt_expt/entrypoints/main.py
  • deepmd/pt_expt/train/training.py
  • deepmd/pt_expt/utils/serialization.py
  • source/lmp/tests/test_lammps_dpa1_graph_pt2.py
  • source/tests/pt_expt/test_dp_freeze.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • deepmd/dpmodel/utils/neighbor_graph/derivatives.py
  • source/lmp/tests/test_lammps_dpa1_graph_pt2.py
  • deepmd/pt_expt/utils/serialization.py

Comment thread deepmd/pt_expt/entrypoints/main.py
@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label Jun 30, 2026
Wire the dpa1 graph .pt2 (attn_layer=0, lower_input_kind=graph) into the
parametrized VariantDeepPotTest as case dpa1_graph_ptexpt (Backend::PTExpt,
ref deeppot_dpa1_graph.expected, pbc/nopbc, tol 1e-10/1e-4). Flags mirror
dpa1_pytorch_pt2 exactly; all 19 enabled subtests pass on remote GPU
(FiniteDifference, LmpNlist*, CutoffTwice, TypeSel, NoPbc*). The dedicated
test_deeppot_dpa1_graph_ptexpt.cc is retained for graph-unique coverage.
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label Jun 30, 2026
@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label Jun 30, 2026
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label Jun 30, 2026
Comment thread deepmd/pt_expt/entrypoints/main.py
Comment thread deepmd/dpmodel/utils/neighbor_graph/derivatives.py Outdated
Han Wang and others added 2 commits July 1, 2026 08:43
… modulo-clamp invariant test

- entrypoints/main.py: 'dp --pt_expt freeze --lower-kind graph' now defaults a
  suffix-less output to .pt2 (the AOTI archive the C++ graph path consumes)
  instead of .pte; explicit .pte/.pt2 suffixes are preserved for both kinds.
  New test (freeze mocked) covers all four suffix/kind combinations.
- test_edge_force_virial.py: add an invariant test that the in-bounds index
  clamp (% n_out) NEVER alters a real (edge_mask==True) edge -- a boundary real
  edge on node n_out-1 scatters unwrapped, and a masked guard edge with
  deliberately out-of-range indices contributes nothing; the result equals the
  real-edges-only reference. Directly answers iProzd's ask to prove the modulo
  cannot silently remap real edges (only zero-weight guard edges).

Both new tests pass.
@wanghan-iapcm wanghan-iapcm requested a review from iProzd July 1, 2026 03:41
…t fix + root cause

Reframe the comment on the % n_out index guard: it is the PERMANENT fix, not an
interim workaround. Records the GPU-confirmed root cause -- the dynamic-edge
graph torch.export path traces the node count as several equal-but-distinct
symbols tied only by aten._assert_scalar(Eq(...)) nodes, which
_strip_shape_assertions neutralises wholesale (needed for export to trace),
dropping those equalities so inductor can no longer prove idx < n_out and emits
a device_assert. The upstream alternative (making the shared, spin-critical
_strip_shape_assertions selective) is deliberately not taken -- it risks the
torch.export bugs that helper bypasses and the spin .pt2 path. Comment-only.

@OutisLi OutisLi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One performance issue on the C++ graph LAMMPS path (per-step host-side topology rebuild). Details inline.

Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated
Comment thread source/api_cc/src/DeepPotPTExpt.cc Outdated

@OutisLi OutisLi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One maintainability issue: the graph trace/export sample builders are duplicated across training.py and serialization.py. Please consolidate. Details inline.

Comment thread deepmd/pt_expt/utils/serialization.py Outdated
Han Wang and others added 3 commits July 1, 2026 14:36
…eview)

Merge the two near-identical synthetic-graph builders
_make_graph_sample_inputs (serialization.py) and _make_graph_trace_inputs
(training.py) into one build_synthetic_graph_inputs in serialization.py,
parameterized by dtype (export=float64 ABI, training=GLOBAL_PT precision) and
device (export passes cpu explicitly instead of mutating env.DEVICE) plus the
want_fparam/aparam/charge_spin gating. Also factor the duplicated prime-collision
_forbidden scan into _forbidden_dims_from_model, shared by the dense and graph
_trace_and_compile. Removes the desync risk between the training and export
graph traces flagged in review.
…nk (OutisLi review)

Comment 1 (perf): the graph LAMMPS path called buildGraphTensors every timestep,
whose createEdgeTensors stage is an O(E) host loop + H2D copy that rebuilds the
edge topology from scratch even when LAMMPS has not refreshed the neighbor list.
Mirror the edge path: at ago==0 cache the skin topology via
createEdgeTensors(with_geometry=false, fold_to_local=!multi_rank) into
edge_index_tensor / edge_index_ext_tensor; each step run only the on-device
compactEdgeTensors (geometry recompute + rcut filter) and assemble the cheap
n_node / node-atype tensors. Topology now rebuilds only on a neighbor-list
rebuild, consistent with the edge path.

Comment 2 (empty-rank nit): a truly-empty rank (nall_real == 0, no local atoms
AND no ghosts) would feed N == 0 into the graph, and edge_force_virial's
edge_index % node_capacity would SIGFPE (div-by-zero). Early-return zero outputs
for that rank. The tested nloc==0 empty-subdomain case has nall_real>0 (ghosts
within rcut) and still runs normally.

Validated: 29/33 graph gtests (4 pre-existing NoPBC skips) + 5/5
test_lammps_dpa1_graph_pt2 (single-rank multi-step ref, multi-rank,
mpirun -n 2 == single-rank, empty-subdomain).
@wanghan-iapcm wanghan-iapcm requested a review from OutisLi July 1, 2026 06:56

@iProzd iProzd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now.

@wanghan-iapcm wanghan-iapcm enabled auto-merge July 1, 2026 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants