feat(compute): native bf16 GPU transpose kernels (capture-safe) #160
Merged
Extends the ADR-091 PyTorch-oracle / gradcheck harness to the GroupNorm op class (zerfoo E127/T127.1.0a, first of six new diffusion-DiT op classes). GroupNorm composes entirely from existing engine reduce/elementwise ops: reshape [N,C] -> [N*groups, C/groups], normalize the last axis exactly like the LayerNorm node, reshape back, apply a per-channel affine. No new engine kernel. Adds the node (gradcheck/ops.go), the registry entry + dispatch (registry.go, dim=4 groups=2), and the torch replay + tolerance (torchmap.go, torch.nn.functional.group_norm). Verified: TestRegistry/GroupNorm gradcheck passes (analytic backward vs finite-difference); full gradcheck + oracle registry<->torchmap lockstep green. Unlocks the convolutional-VAE/UNet GroupNorm primitive for the diffusion class.
GPUEngine.Transpose routed every non-float32 type to the CPU engine, whose host
memcpy breaks CUDA-graph capture -- so any bf16 transpose under capture failed
("operation would make the legacy stream depend on a capturing blocking stream",
e.g. node QKL2Norm's Transpose). This forced the bf16 CrossAsset GPU bench to run
with capture DISABLED (~190 s/epoch).
Add native bf16 (16-bit) transpose kernels. Transpose is pure data movement, so
the kernels operate on `unsigned short` -- a bitwise element copy independent of
the bf16 numeric interpretation (no bf16 math, no new headers):
- transpose.cu: kernel_transpose_2d_bf16 + kernel_transpose_nd_bf16 (+ launchers)
- cuda/kernels: Transpose2DBF16 / TransposeNDBF16 (cuda + purego builds) and
dlopen symbol registration
- gpuapi: optional BFloat16Transposer extension + CUDAKernels impl
- GPUEngine.Transpose: for bf16 with GPU-resident input and a backend that
implements BFloat16Transposer, transpose on-device via the bf16 kernels;
otherwise fall back to the CPU engine as before. The f32 path is byte-for-byte
unchanged (element size threaded only through the byteSize/kernel-select).
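The dispatch rule in the last bullet can be sketched in Go as an optional-interface check. This is a minimal illustration, not the repository's code: `Backend`, `cudaLikeBackend`, `basicBackend`, `transposeBF16`, and the `Transpose2DBF16` signature shown here are all hypothetical stand-ins, and the "device" path is simulated on the host.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the base kernel interface.
type Backend interface{ Name() string }

// BF16Transposer models the optional extension: a backend may or may not
// implement it. The method signature is an assumption for illustration.
type BF16Transposer interface {
	Backend
	Transpose2DBF16(src []uint16, rows, cols int) []uint16
}

type basicBackend struct{} // no bf16 transpose support

func (basicBackend) Name() string { return "basic" }

type cudaLikeBackend struct{} // pretends to have a native bf16 kernel

func (cudaLikeBackend) Name() string { return "cuda-like" }
func (cudaLikeBackend) Transpose2DBF16(src []uint16, rows, cols int) []uint16 {
	return transpose2D(src, rows, cols) // host simulation of the kernel
}

// transpose2D copies 16-bit elements from row-major [rows, cols]
// into row-major [cols, rows].
func transpose2D(src []uint16, rows, cols int) []uint16 {
	dst := make([]uint16, len(src))
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			dst[c*rows+r] = src[r*cols+c]
		}
	}
	return dst
}

// transposeBF16 takes the native path only when the input is device-resident
// and the backend implements the extension; otherwise it falls back.
func transposeBF16(b Backend, onDevice bool, src []uint16, rows, cols int) ([]uint16, string) {
	if t, ok := b.(BF16Transposer); ok && onDevice {
		return t.Transpose2DBF16(src, rows, cols), "device"
	}
	return transpose2D(src, rows, cols), "cpu-fallback"
}

func main() {
	src := []uint16{1, 2, 3, 4, 5, 6}
	for _, b := range []Backend{cudaLikeBackend{}, basicBackend{}} {
		out, path := transposeBF16(b, true, src, 2, 3)
		fmt.Println(b.Name(), path, out)
	}
}
```

Keeping the extension optional means backends without a bf16 kernel keep their previous behaviour, since the type assertion simply fails and the fallback runs.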
CUDA-gated parity tests: 2D + 3D[0,2,1] (the QKL2Norm shape), exact-match.
Lets the bf16 CrossAsset GPU bench run with CUDA-graph capture ON. ADR-075 L4.
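
For readers who want to see why an exact-match check is a reasonable parity criterion, here is a host-side sketch of a permuted copy over raw 16-bit words. It is not the CUDA kernel or the repository's test: `permute16` and `toBF16Bits` are hypothetical helpers, and `toBF16Bits` truncates rather than rounds, which is only good enough for producing test data. Because the copy never interprets the bits, special patterns such as negative zero or a NaN payload pass through unchanged.

```go
package main

import (
	"fmt"
	"math"
)

// toBF16Bits keeps the upper 16 bits of a float32 (truncation, no rounding).
func toBF16Bits(f float32) uint16 { return uint16(math.Float32bits(f) >> 16) }

// permute16 copies row-major src with the given shape into a new row-major
// buffer whose axis i is the input axis perm[i]. Elements are opaque words.
func permute16(src []uint16, shape, perm []int) ([]uint16, []int) {
	n := len(shape)
	outShape := make([]int, n)
	for i, p := range perm {
		outShape[i] = shape[p]
	}
	inStride := make([]int, n) // row-major strides of the input
	s := 1
	for i := n - 1; i >= 0; i-- {
		inStride[i] = s
		s *= shape[i]
	}
	dst := make([]uint16, len(src))
	idx := make([]int, n)
	for o := range dst {
		rem := o // decompose the flat output index into a multi-index
		for i := n - 1; i >= 0; i-- {
			idx[i] = rem % outShape[i]
			rem /= outShape[i]
		}
		in := 0
		for i, p := range perm {
			in += idx[i] * inStride[p]
		}
		dst[o] = src[in]
	}
	return dst, outShape
}

func main() {
	shape := []int{2, 3, 4}
	src := make([]uint16, 2*3*4)
	for i := range src {
		src[i] = toBF16Bits(float32(i) * 0.5)
	}
	src[0], src[1] = 0x8000, 0x7FC1 // negative zero and a NaN payload

	perm := []int{0, 2, 1} // swap the last two axes
	once, midShape := permute16(src, shape, perm)
	twice, _ := permute16(once, midShape, perm) // [0,2,1] is its own inverse

	exact := true
	for i := range src {
		exact = exact && src[i] == twice[i]
	}
	fmt.Println("shape after one transpose:", midShape, "round trip exact:", exact)
}
```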
What
Native bf16 (16-bit) GPU transpose kernels so bf16 transposes stay on-device and are CUDA-graph-capturable.
Why
`GPUEngine.Transpose` routed every non-float32 type to the CPU engine, whose host memcpy breaks CUDA-graph capture: any bf16 transpose under capture failed with `operation would make the legacy stream depend on a capturing blocking stream` (e.g. QKL2Norm's `Transpose`). This forced the bf16 CrossAsset GPU bench to run with capture disabled (~190 s/epoch), an unrepresentative speed number.
What changed
Transpose is pure data movement, so the bf16 kernels operate on `unsigned short`, a bitwise element copy independent of the bf16 numeric interpretation (no bf16 math, no new headers):
- `transpose.cu`: `kernel_transpose_2d_bf16` + `kernel_transpose_nd_bf16` (+ launchers)
- `internal/cuda/kernels`: `Transpose2DBF16` / `TransposeNDBF16` (cuda + purego builds) + dlopen symbol registration
- `gpuapi`: optional `BFloat16Transposer` extension + `CUDAKernels` impl + assertion
- `GPUEngine.Transpose`: for bf16 with GPU-resident input and a backend implementing `BFloat16Transposer`, transpose on-device via the bf16 kernels; otherwise the CPU fallback as before. The f32 path is byte-for-byte unchanged (element size threaded only through `byteSize` + kernel selection).
Verification
- `compute/gpuapi` tests green.
- `TestGPUBF16_TransposeParity` (2D + 3D [0,2,1], the QKL2Norm shape), exact-match (`ok compute 3.521s`).

Lets the bf16 CrossAsset GPU bench run with CUDA-graph capture ON (representative s/epoch). Final piece of the bf16 GPU backward chain (ztensor v1.16.0 NT/TN + zerfoo v1.53.1 grad-accum). ADR-075 lever L4.