[Common] Fix int32 overflow in multi_tensor_apply tensor sizes for numel > INT_MAX by javierdejesusda · Pull Request #3136 · NVIDIA/TransformerEngine

javierdejesusda · 2026-06-21T16:20:59Z

Description

TensorListMetadataBase::sizes in multi_tensor_apply.cuh is declared as int (int32) but populated from Tensor::numel(), which is 64-bit. A tensor with numel() > INT_MAX (2^31 - 1, about 2.147B elements) is truncated to 32 bits on that assignment, which yields a negative size for any count below 2^32. The consumer kernels then compute out-of-bounds element offsets from it, raising an illegal memory access at the next CUDA sync. This is hit by any model with a single parameter above that limit (large-vocab embeddings, tied output layers) feeding TE's multi-tensor utilities, e.g. through Megatron's calc_params_l2_norm.
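As a rough illustration of the truncation described above, the host-only sketch below narrows a 64-bit element count to a 32-bit int. The helper names `store_as_int32` and `store_as_int64` are hypothetical and are not part of TransformerEngine. The wrap-around result is guaranteed from C++20 onward; earlier standards leave it implementation-defined, although mainstream compilers behave the same way.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper: mimics storing a 64-bit element count in a 32-bit
// field, as the old `int sizes[...]` declaration effectively did.
inline int store_as_int32(int64_t numel) {
  return static_cast<int>(numel);  // keeps only the low 32 bits
}

// Hypothetical helper: the widened field keeps the full 64-bit value.
inline int64_t store_as_int64(int64_t numel) {
  return numel;
}
```

With these helpers, a 3,000,000,000-element count comes back negative from `store_as_int32` and unchanged from `store_as_int64`.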

Fixes #2918

Type of change

Bug fix (non-breaking change which fixes an issue)

Changes

Store TensorListMetadataBase::sizes as int64_t in multi_tensor_apply.cuh.
Read each size into an int64_t (or the existing templated index_t in the Adam kernels) before the n -= chunk_idx * chunk_size subtraction in adam.cu, compute_scale.cu, l2norm.cu, scale.cu, and sgd.cu.
Widen the chunk_size kernel argument to int64_t in every consumer that took it as int (the AdamCapturable/AdamCapturableMaster functors plus the non-Adam kernels) so the chunk_idx * chunk_size element offset is computed in 64-bit. The already-templated Adam kernels keep index_t chunk_size.

No public API change. The widening is confined to the multi_tensor metadata and the kernel index arithmetic.
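The per-chunk arithmetic that these changes protect can be sketched on the host as follows. This is a minimal model, not the kernel code: the function name `elements_in_chunk` is hypothetical, and the chunk size of 65536 used in the usage note is an assumption chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-chunk bookkeeping: given a tensor's
// element count, return how many elements the chunk at chunk_idx covers.
inline int64_t elements_in_chunk(int64_t numel, int64_t chunk_size, int chunk_idx) {
  int64_t n = numel;            // read the size into 64-bit before subtracting
  n -= chunk_idx * chunk_size;  // int * int64_t is evaluated as int64_t
  if (n <= 0) return 0;         // chunk lies past the end of the tensor
  return n < chunk_size ? n : chunk_size;
}
```

For a 3,000,000,000-element tensor with a chunk size of 65536, the last non-empty chunk has index 45776, and its element offset (2,999,975,936) already exceeds INT_MAX, which is why both the size and the `chunk_idx * chunk_size` product need 64-bit arithmetic.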

Testing

Verified statically on a box without a GPU: the size and chunk_size paths are now 64-bit, so values above INT_MAX no longer wrap negative. I could not run the runtime repro here, since reproducing the numel > INT_MAX path needs a multi-GB GPU allocation. Happy to add a guarded large-tensor regression test for multi_tensor_l2norm if maintainers want one gated behind sufficient device memory.
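To give a sense of what "sufficient device memory" could mean for such a guarded test, here is a hedged sketch of the gating arithmetic. The names `required_bytes` and `should_run_large_tensor_test` and the 1 GiB default headroom are assumptions for illustration; a real test would obtain the free-memory figure from the runtime (for example via `cudaMemGetInfo`) and would follow whatever skip mechanism the project's test suite uses.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Smallest element count that exercises the numel > INT_MAX path.
constexpr int64_t kMinOverflowNumel = int64_t{INT_MAX} + 1;

// Bytes needed for one tensor of that size with the given element width.
constexpr int64_t required_bytes(int64_t elem_size) {
  return kMinOverflowNumel * elem_size;
}

// Hypothetical gate: run the test only if free device memory covers the
// tensor plus some headroom (the headroom default is an assumption).
inline bool should_run_large_tensor_test(int64_t free_bytes, int64_t elem_size,
                                         int64_t headroom_bytes = int64_t{1} << 30) {
  return free_bytes >= required_bytes(elem_size) + headroom_bytes;
}
```

Under these assumptions a single float32 tensor on the overflow path needs 8 GiB and a 2-byte dtype needs 4 GiB, before counting any workspace or output buffers.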

Signed-off-by: Javier de Jesus <javier.dejesusj9@gmail.com>

TensorListMetadataBase::sizes was declared int32 but populated from Tensor::numel(), so a tensor with numel > INT_MAX truncated to a negative size and the consumer kernels then computed out-of-bounds offsets, hitting an illegal memory access at the next sync.

Store sizes as int64_t and read them into an int64_t (or the existing index_t on the already-templated Adam kernels) before the n -= chunk_idx * chunk_size subtraction. Widen the chunk_size kernel argument in the non-Adam consumers to int64_t as well so the chunk_idx * chunk_size element offset is computed in 64-bit.

Fixes NVIDIA#2918

Signed-off-by: Javier de Jesus <javier.dejesusj9@gmail.com>

greptile-apps · 2026-06-21T16:25:42Z

Greptile Summary

This PR fixes a silent int32 overflow in TensorListMetadataBase::sizes — the field was declared as int but populated from Tensor::numel() (64-bit), causing any tensor with more than 2.14B elements to store a negative size and produce out-of-bounds CUDA memory accesses in all multi-tensor utilities.

TensorListMetadataBase::sizes widened from int to int64_t in multi_tensor_apply.cuh, with all consumers (l2norm.cu, scale.cu, compute_scale.cu, sgd.cu) updating their n and chunk_size variables to int64_t correspondingly.
AdamCapturableFunctor and AdamCapturableMasterFunctor in adam.cu updated to int64_t chunk_size and int64_t n; the already-templated AdamFunctor/AdamFunctorMaster retain their index_t parameter and dispatch to int64_t when requires_64bit_indexing is detected.
The int i_start / int i inner-loop variables inside each kernel are bounded by chunk_size (always a small constant in practice), so no overflow risk remains there.

Confidence Score: 4/5

Safe to merge — the overflow fix is correct and complete across all six changed files, with no public API changes.

The core fix is sound: sizes is now int64_t end-to-end from population through all functor consumers. The one minor concern is that the int32_t-instantiated paths of AdamFunctor/AdamFunctorMaster now implicitly narrow int64_t sizes[...] back to index_t n without an explicit cast — safe by construction (the 32-bit path is only taken when all tensor sizes are guaranteed to fit), but likely to produce compiler narrowing warnings that weren't present before.

adam.cu — the templated AdamFunctor/AdamFunctorMaster implicit narrowing at the index_t n = tl.sizes[tensor_loc] assignment (lines 74 and 312) deserves a quick look to confirm the NVCC build stays warning-clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/multi_tensor/multi_tensor_apply.cuh | Root fix: `sizes` field widened from `int` to `int64_t`. Host-side population from `tensor->numel()` and all consumer reads are now consistent 64-bit. Struct still fits within the ~4KB CUDA kernel argument limit. |
| transformer_engine/common/multi_tensor/adam.cu | AdamCapturableFunctor and AdamCapturableMasterFunctor correctly updated (`n` and `chunk_size` both int64_t). The templated AdamFunctor/AdamFunctorMaster retain `index_t` with correct 32/64-bit dispatch, but the `int32_t` path now has an implicit narrowing from `int64_t sizes` to `index_t n` that is safe by design but will trigger compiler warnings. |
| transformer_engine/common/multi_tensor/l2norm.cu | All three functors (L2Norm, UnscaleL2Norm, MaxNorm) correctly updated: `chunk_size` and `n` widened to `int64_t`. No issues found. |
| transformer_engine/common/multi_tensor/scale.cu | `scale_chunk` helper and both ScaleFunctor/ScalePtrFunctor correctly updated to `int64_t chunk_size` and `int64_t n`. Clean change. |
| transformer_engine/common/multi_tensor/compute_scale.cu | Both ComputeScaleAndScaleInvFunctor and ComputeScaleInvE8M0Functor correctly updated. No issues. |
| transformer_engine/common/multi_tensor/sgd.cu | SGDFunctor correctly updated to `int64_t chunk_size` and `int64_t n`. No issues. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Host as Host (multi_tensor_apply)
    participant Meta as TensorListMetadataBase
    participant Kernel as multi_tensor_apply_kernel
    participant Functor as Consumer Functor (e.g. L2NormFunctor)

    Host->>Meta: "tl.sizes[i] = tensor->numel()  [int64_t <- int64_t, no truncation]"
    Host->>Kernel: launch(int64_t chunk_size, ..., tl, callable)
    Kernel->>Functor: callable(int64_t chunk_size, noop_flag, tl, ...)
    Functor->>Functor: "int64_t n = tl.sizes[tensor_loc]"
    Functor->>Functor: "ptr += chunk_idx * chunk_size  [int64_t offset]"
    Functor->>Functor: "n -= chunk_idx * chunk_size    [int64_t subtraction]"
    Functor->>Functor: iterate over [0, min(n, chunk_size)) elements

Comments Outside Diff (1)

transformer_engine/common/multi_tensor/adam.cu, line 312

Implicit narrowing from int64_t to index_t in int32 path

tl.sizes[tensor_loc] is now int64_t (after this PR). In the int32_t path (requires_64bit_indexing == false), index_t n = tl.sizes[tensor_loc] silently narrows a 64-bit value to 32-bit. The narrowing is safe in practice because the 32-bit dispatch path is only reached when all tensor sizes fit in int32, but the conversion is implicit and will likely produce a compiler narrowing warning under -Wconversion. The same pattern appears at line 74 in AdamFunctorMaster (also templated as int32_t in the non-large-tensor path). An explicit cast like static_cast<index_t>(tl.sizes[tensor_loc]) would suppress the warning and document the intent.
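The suggested explicit cast can be sketched in isolation as below. `FakeTensorList`, `read_size`, and `needs_64bit_indexing` are hypothetical stand-ins written for this example; the actual metadata struct and the real dispatch check in TransformerEngine may look different and may consider more than the per-tensor sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the metadata struct: sizes are 64-bit after the PR.
struct FakeTensorList {
  int64_t sizes[4];
};

// Mirrors the templated read: index_t is int32_t or int64_t depending on dispatch.
template <typename index_t>
index_t read_size(const FakeTensorList &tl, int tensor_loc) {
  // The explicit cast documents that narrowing is intentional on the int32_t path.
  return static_cast<index_t>(tl.sizes[tensor_loc]);
}

// Hypothetical host-side check corresponding to the 32/64-bit dispatch decision.
inline bool needs_64bit_indexing(const FakeTensorList &tl, int count) {
  for (int i = 0; i < count; ++i) {
    if (tl.sizes[i] > std::numeric_limits<int32_t>::max()) return true;
  }
  return false;
}
```

The cast only records intent; correctness on the `int32_t` instantiation still depends on the dispatch check rejecting any list that contains a size above INT_MAX.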

Reviews (1): Last reviewed commit: "Widen multi_tensor_apply tensor sizes to..."

vthumbe1503 · 2026-06-22T04:19:54Z

/te-ci

github-actions Bot added the community-contribution label (PRs from external contributors outside the core maintainers, representing community-driven work) on Jun 21, 2026

javierdejesusda wants to merge 1 commit into NVIDIA:main from javierdejesusda:fix/2918-multi-tensor-int64-sizes