[Common] Fix int32 overflow in multi_tensor_apply tensor sizes for numel > INT_MAX#3136

Open
javierdejesusda wants to merge 1 commit into NVIDIA:main from javierdejesusda:fix/2918-multi-tensor-int64-sizes

@javierdejesusda

Description

TensorListMetadataBase::sizes in multi_tensor_apply.cuh is declared as int (int32) but populated from Tensor::numel(), which is 64-bit. A tensor with numel() > INT_MAX (2,147,483,647) is truncated on store, yielding a negative size for any count below 2^32, and the consumer kernels then compute out-of-bounds element offsets from it, which surfaces as an illegal memory access at the next CUDA sync. Any model with a single parameter of more than about 2.147B elements (large-vocab embeddings, tied output layers) hits this when it feeds TE's multi-tensor utilities, e.g. through Megatron's calc_params_l2_norm.

Fixes #2918

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

  • Store TensorListMetadataBase::sizes as int64_t in multi_tensor_apply.cuh.
  • Read each size into an int64_t (or the existing templated index_t in the Adam kernels) before the n -= chunk_idx * chunk_size subtraction in adam.cu, compute_scale.cu, l2norm.cu, scale.cu, and sgd.cu.
  • Widen the chunk_size kernel argument to int64_t in every consumer that took it as int (the AdamCapturable/AdamCapturableMaster functors plus the non-Adam kernels) so the chunk_idx * chunk_size element offset is computed in 64-bit. The already-templated Adam kernels keep index_t chunk_size.

No public API change. The widening is confined to the multi_tensor metadata and the kernel index arithmetic.
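The index arithmetic described in the bullets above can be sketched on the host in plain C++. This is an illustration, not the kernel code: the helper names are hypothetical, and it assumes `chunk_idx` stays a 32-bit `int` while `chunk_size` and the stored size are 64-bit, so the product is promoted to 64-bit before use.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: element offset of a chunk inside its tensor.
// int * int64_t is evaluated in int64_t, so the product cannot wrap at INT_MAX.
inline std::int64_t chunk_offset(int chunk_idx, std::int64_t chunk_size) {
  return chunk_idx * chunk_size;
}

// Hypothetical helper: number of elements the block assigned to `chunk_idx`
// should process, mirroring the `n -= chunk_idx * chunk_size` pattern.
inline std::int64_t elements_in_chunk(std::int64_t numel, int chunk_idx,
                                      std::int64_t chunk_size) {
  std::int64_t n = numel;                      // read the size as 64-bit first
  n -= chunk_offset(chunk_idx, chunk_size);    // 64-bit subtraction
  return std::min(n, chunk_size);              // the last chunk may be partial
}
```

For a 3,000,000,000-element tensor with a chunk size of 65,536, the last chunk (index 45,776) starts at offset 2,999,975,936, which already exceeds INT_MAX, and contains 24,064 elements.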

Testing

Verified statically on a box without a GPU: the size and chunk_size paths are now 64-bit, so values above INT_MAX no longer wrap negative. I could not run the runtime repro here, since reproducing the numel > INT_MAX path needs a multi-GB GPU allocation. Happy to add a guarded large-tensor regression test for multi_tensor_l2norm if maintainers want one gated behind sufficient device memory.
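As a rough guide to what "sufficient device memory" would mean for such a test, the sketch below computes the size of the smallest input tensor that crosses INT_MAX. The names are hypothetical, and the guard ignores any headroom needed for outputs or workspace, so treat it as a lower bound only.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Smallest element count that no longer fits in a 32-bit signed size.
constexpr std::int64_t kMinOverflowNumel =
    static_cast<std::int64_t>(std::numeric_limits<std::int32_t>::max()) + 1;

// Bytes needed for one input tensor of `numel` elements of `elem_size` bytes.
constexpr std::uint64_t required_bytes(std::int64_t numel, std::size_t elem_size) {
  return static_cast<std::uint64_t>(numel) * elem_size;
}

// Hypothetical guard: run the large-tensor test only if the input alone fits.
constexpr bool can_run_large_tensor_test(std::uint64_t free_device_bytes,
                                         std::size_t elem_size) {
  return free_device_bytes >= required_bytes(kMinOverflowNumel, elem_size);
}
```

By this estimate the input alone needs 8 GiB in fp32 or 4 GiB in a 2-byte type. The free-memory figure would have to come from the runtime, for example `cudaMemGetInfo`.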

Signed-off-by: Javier de Jesus javier.dejesusj9@gmail.com

TensorListMetadataBase::sizes was declared int32 but populated from
Tensor::numel(), so a tensor with numel > INT_MAX truncated to a negative
size and the consumer kernels then computed out-of-bounds offsets, hitting
an illegal memory access at the next sync.

Store sizes as int64_t and read them into an int64_t (or the existing
index_t on the already-templated Adam kernels) before the
n -= chunk_idx * chunk_size subtraction. Widen the chunk_size kernel
argument in the non-Adam consumers to int64_t as well so the
chunk_idx * chunk_size element offset is computed in 64-bit.

Fixes NVIDIA#2918

Signed-off-by: Javier de Jesus <javier.dejesusj9@gmail.com>
@github-actions github-actions Bot added the community-contribution label Jun 21, 2026
@greptile-apps

greptile-apps Bot commented Jun 21, 2026

Contributor

Greptile Summary

This PR fixes a silent int32 overflow in TensorListMetadataBase::sizes: the field was declared as int but populated from Tensor::numel() (64-bit), so any tensor with more than INT_MAX (about 2.147B) elements stored a truncated, typically negative, size and produced out-of-bounds CUDA memory accesses in the multi-tensor utilities.

  • TensorListMetadataBase::sizes widened from int to int64_t in multi_tensor_apply.cuh, with all consumers (l2norm.cu, scale.cu, compute_scale.cu, sgd.cu) updating their n and chunk_size variables to int64_t correspondingly.
  • AdamCapturableFunctor and AdamCapturableMasterFunctor in adam.cu updated to int64_t chunk_size and int64_t n; the already-templated AdamFunctor/AdamFunctorMaster retain their index_t parameter and dispatch to int64_t when requires_64bit_indexing is detected.
  • The int i_start / int i inner-loop variables inside each kernel are bounded by chunk_size (always a small constant in practice), so no overflow risk remains there.
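The 32/64-bit dispatch mentioned for the templated Adam functors can be illustrated with a small host-side C++ sketch. All names here are hypothetical, and the exact condition Transformer Engine uses to set requires_64bit_indexing is an assumption; the sketch only shows the general shape of choosing `index_t` once on the host and instantiating the same templated body for either width.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed condition: 64-bit indexing is needed if any tensor exceeds INT_MAX.
inline bool needs_64bit_indexing(const std::vector<std::int64_t>& numels) {
  for (std::int64_t n : numels) {
    if (n > std::numeric_limits<std::int32_t>::max()) return true;
  }
  return false;
}

// Templated body: the same arithmetic runs with 32-bit or 64-bit indices.
template <typename index_t>
std::int64_t remaining_in_tensor(std::int64_t stored_size, int chunk_idx,
                                 index_t chunk_size) {
  index_t n = static_cast<index_t>(stored_size);
  n -= static_cast<index_t>(chunk_idx) * chunk_size;
  return static_cast<std::int64_t>(n);
}

// Host-side dispatch: pick the index width once for the whole tensor list.
inline std::int64_t dispatch_remaining(const std::vector<std::int64_t>& numels,
                                       std::size_t tensor_loc, int chunk_idx,
                                       std::int64_t chunk_size) {
  if (needs_64bit_indexing(numels)) {
    return remaining_in_tensor<std::int64_t>(numels[tensor_loc], chunk_idx, chunk_size);
  }
  // Safe only because every size in the list fits in int32 on this path.
  return remaining_in_tensor<std::int32_t>(numels[tensor_loc], chunk_idx,
                                           static_cast<std::int32_t>(chunk_size));
}
```

Keeping the 32-bit instantiation for lists of small tensors is a design choice that avoids paying for 64-bit index arithmetic where it is not needed.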

Confidence Score: 4/5

Safe to merge — the overflow fix is correct and complete across all six changed files, with no public API changes.

The core fix is sound: sizes is now int64_t end-to-end from population through all functor consumers. The one minor concern is that the int32_t-instantiated paths of AdamFunctor/AdamFunctorMaster now implicitly narrow int64_t sizes[...] back to index_t n without an explicit cast — safe by construction (the 32-bit path is only taken when all tensor sizes are guaranteed to fit), but likely to produce compiler narrowing warnings that weren't present before.

adam.cu — the templated AdamFunctor/AdamFunctorMaster implicit narrowing at the index_t n = tl.sizes[tensor_loc] assignment (lines 74 and 312) deserves a quick look to confirm the NVCC build stays warning-clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/multi_tensor/multi_tensor_apply.cuh | Root fix: sizes field widened from int to int64_t. Host-side population from tensor->numel() and all consumer reads are now consistent 64-bit. Struct still fits within the ~4KB CUDA kernel argument limit. |
| transformer_engine/common/multi_tensor/adam.cu | AdamCapturableFunctor and AdamCapturableMasterFunctor correctly updated (n and chunk_size both int64_t). The templated AdamFunctor/AdamFunctorMaster retain index_t with correct 32/64-bit dispatch, but the int32_t path now has an implicit narrowing from int64_t sizes to index_t n that is safe by design but will trigger compiler warnings. |
| transformer_engine/common/multi_tensor/l2norm.cu | All three functors (L2Norm, UnscaleL2Norm, MaxNorm) correctly updated: chunk_size and n widened to int64_t. No issues found. |
| transformer_engine/common/multi_tensor/scale.cu | scale_chunk helper and both ScaleFunctor/ScalePtrFunctor correctly updated to int64_t chunk_size and int64_t n. Clean change. |
| transformer_engine/common/multi_tensor/compute_scale.cu | Both ComputeScaleAndScaleInvFunctor and ComputeScaleInvE8M0Functor correctly updated. No issues. |
| transformer_engine/common/multi_tensor/sgd.cu | SGDFunctor correctly updated to int64_t chunk_size and int64_t n. No issues. |
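The remark that the metadata struct still fits within the kernel argument limit could be enforced at compile time. The sketch below is an assumption-laden illustration: the field layout and the capacities (110 tensors, 320 blocks, depth 1) are hypothetical stand-ins, not the values from multi_tensor_apply.cuh, and the 4096-byte figure is taken from the "~4KB" in the review.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical capacities; the real constants may differ.
constexpr int kMaxTensors = 110;
constexpr int kMaxBlocks = 320;
constexpr std::size_t kKernelArgLimitBytes = 4096;  // "~4KB" per the review

// Hypothetical layout, parameterized on the element type of `sizes`.
template <typename size_type>
struct MetadataSketch {
  void* addresses[1][kMaxTensors];
  size_type sizes[kMaxTensors];
  unsigned char block_to_tensor[kMaxBlocks];
  int block_to_chunk[kMaxBlocks];
  int start_tensor_this_launch;
};

// Widening `sizes` grows the struct; this fails the build if it grows too far.
static_assert(sizeof(MetadataSketch<std::int64_t>) <= kKernelArgLimitBytes,
              "metadata struct no longer fits in the kernel argument space");
```

With these made-up capacities, widening `sizes` adds 4 bytes per tensor slot, and the check passes with room to spare.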

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Host as Host (multi_tensor_apply)
    participant Meta as TensorListMetadataBase
    participant Kernel as multi_tensor_apply_kernel
    participant Functor as Consumer Functor (e.g. L2NormFunctor)

    Host->>Meta: "tl.sizes[i] = tensor->numel()  [int64_t <- int64_t, no truncation]"
    Host->>Kernel: launch(int64_t chunk_size, ..., tl, callable)
    Kernel->>Functor: callable(int64_t chunk_size, noop_flag, tl, ...)
    Functor->>Functor: "int64_t n = tl.sizes[tensor_loc]"
    Functor->>Functor: "ptr += chunk_idx * chunk_size  [int64_t offset]"
    Functor->>Functor: "n -= chunk_idx * chunk_size    [int64_t subtraction]"
    Functor->>Functor: iterate over [0, min(n, chunk_size)) elements

Comments Outside Diff (1)

  1. transformer_engine/common/multi_tensor/adam.cu, line 312

    P2 Implicit narrowing from int64_t to index_t in int32 path

    tl.sizes[tensor_loc] is now int64_t (after this PR). In the int32_t path (requires_64bit_indexing == false), index_t n = tl.sizes[tensor_loc] silently narrows a 64-bit value to 32-bit. The narrowing is safe in practice because the 32-bit dispatch path is only reached when all tensor sizes fit in int32, but the conversion is implicit and will likely produce a compiler narrowing warning under -Wconversion. The same pattern appears at line 74 in AdamFunctorMaster (also templated as int32_t in the non-large-tensor path). An explicit cast like static_cast<index_t>(tl.sizes[tensor_loc]) would suppress the warning and document the intent.
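One way to make that narrowing explicit is a small helper around the suggested static_cast. The helper name `narrow_size` is hypothetical, and the range assertion goes beyond what the review asks for; it is included only to document the invariant that the 32-bit path relies on.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: converts a stored 64-bit size to the functor's index_t.
// The assert documents (and, in debug builds, checks) that the value fits.
template <typename index_t>
index_t narrow_size(std::int64_t size) {
  static_assert(std::is_integral<index_t>::value, "index_t must be an integer type");
  assert(size >= 0 &&
         size <= static_cast<std::int64_t>(std::numeric_limits<index_t>::max()));
  return static_cast<index_t>(size);  // explicit, so no implicit-narrowing warning
}
```

A call such as `index_t n = narrow_size<index_t>(tl.sizes[tensor_loc]);` would then read the same in both instantiations, with the cast being a no-op when index_t is int64_t.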

Reviews (1): Last reviewed commit: "Widen multi_tensor_apply tensor sizes to..."

@vthumbe1503

Collaborator

/te-ci

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Development

Successfully merging this pull request may close these issues.

[BUG] multi_tensor_apply: int32 overflow in TensorListMetadata::sizes causes illegal memory access for tensors with numel > INT_MAX

2 participants