[Common] Fix int32 overflow in multi_tensor_apply tensor sizes for numel > INT_MAX #3136
TensorListMetadataBase::sizes was declared int32 but populated from Tensor::numel(), so a tensor with numel > INT_MAX truncated to a negative size. The consumer kernels then computed out-of-bounds offsets and hit an illegal memory access at the next sync.

Store sizes as int64_t and read them into an int64_t (or the existing index_t on the already-templated Adam kernels) before the n -= chunk_idx * chunk_size subtraction. Widen the chunk_size kernel argument in the non-Adam consumers to int64_t as well so the chunk_idx * chunk_size element offset is computed in 64-bit.

Fixes NVIDIA#2918

Signed-off-by: Javier de Jesus <javier.dejesusj9@gmail.com>
Greptile Summary
This PR fixes a silent int32 overflow in
Confidence Score: 4/5
Safe to merge: the overflow fix is correct and complete across all six changed files, with no public API changes.
The core fix is sound: adam.cu, the templated AdamFunctor/AdamFunctorMaster implicit narrowing at the
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Host as Host (multi_tensor_apply)
participant Meta as TensorListMetadataBase
participant Kernel as multi_tensor_apply_kernel
participant Functor as Consumer Functor (e.g. L2NormFunctor)
Host->>Meta: "tl.sizes[i] = tensor->numel() [int64_t <- int64_t, no truncation]"
Host->>Kernel: launch(int64_t chunk_size, ..., tl, callable)
Kernel->>Functor: callable(int64_t chunk_size, noop_flag, tl, ...)
Functor->>Functor: "int64_t n = tl.sizes[tensor_loc]"
Functor->>Functor: "ptr += chunk_idx * chunk_size [int64_t offset]"
Functor->>Functor: "n -= chunk_idx * chunk_size [int64_t subtraction]"
Functor->>Functor: iterate over [0, min(n, chunk_size)) elements
/te-ci
Description
TensorListMetadataBase::sizes in multi_tensor_apply.cuh is declared as int (int32) but populated from Tensor::numel(), which is 64-bit. A tensor with numel() > INT_MAX truncates to a negative size, and the consumer kernels then compute out-of-bounds element offsets from it, raising an illegal memory access at the next CUDA sync. This is hit by any model with a single parameter of more than INT_MAX (2,147,483,647) elements, such as large-vocab embeddings or tied output layers, feeding TE's multi-tensor utilities, e.g. through Megatron's calc_params_l2_norm.

Fixes #2918
Type of change
Changes
- Store TensorListMetadataBase::sizes as int64_t in multi_tensor_apply.cuh.
- Read the size into an int64_t (or the existing templated index_t in the Adam kernels) before the n -= chunk_idx * chunk_size subtraction in adam.cu, compute_scale.cu, l2norm.cu, scale.cu, and sgd.cu.
- Widen the chunk_size kernel argument to int64_t in every consumer that took it as int (the AdamCapturable/AdamCapturableMaster functors plus the non-Adam kernels) so the chunk_idx * chunk_size element offset is computed in 64-bit. The already-templated Adam kernels keep index_t chunk_size.

No public API change. The widening is confined to the multi_tensor metadata and the kernel index arithmetic.
Testing
Verified statically on a box without a GPU: the size and chunk_size paths are now 64-bit, so values above INT_MAX no longer wrap negative. I could not run the runtime repro here, since reproducing the numel > INT_MAX path needs a multi-GB GPU allocation. Happy to add a guarded large-tensor regression test for multi_tensor_l2norm if maintainers want one gated behind sufficient device memory.

Signed-off-by: Javier de Jesus <javier.dejesusj9@gmail.com>