Add staged (tiled_reduce_staged) multi-vector MaxSim kernel POC #1200
Draft
suri-kumkaran wants to merge 1 commit into
Introduce an experimental, generic cache-tiled reduction driver,
`tiled_reduce_staged`, that computes multi-vector MaxSim/Chamfer for any
element type and quantization by swapping four pluggable stages
(StagedKernel, Postprocess, Reducer, StagedConvert) instead of forking the
tiled loop nest per datatype.
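
To make the idea concrete, here is a minimal sketch of a tiled MaxSim loop with one pluggable scoring stage. It is not the PR's API: `PairKernel`, `DotF32`, and `maxsim_tiled` are hypothetical names, the four real stages are collapsed into a single trait, and the flat row-major layout is an assumption. It only illustrates how the loop nest can stay fixed while the per-pair kernel is swapped.

```rust
/// Hypothetical stand-in for a pluggable stage: scores one query vector
/// against one document vector. Not the PR's `StagedKernel` trait.
trait PairKernel {
    type Elem;
    fn score(&self, q: &[Self::Elem], d: &[Self::Elem]) -> f32;
}

/// Plain f32 inner product as the similarity.
struct DotF32;

impl PairKernel for DotF32 {
    type Elem = f32;
    fn score(&self, q: &[f32], d: &[f32]) -> f32 {
        q.iter().zip(d).map(|(a, b)| a * b).sum()
    }
}

/// MaxSim: for each query vector take the best score over all document
/// vectors, then sum those maxima. Vectors are stored row-major in flat
/// slices of `dim` elements each (an assumption of this sketch).
/// Document vectors are visited in tiles of `tile` rows so that one tile
/// is reused across every query vector before moving on.
fn maxsim_tiled<K: PairKernel>(
    kernel: &K,
    query: &[K::Elem],
    doc: &[K::Elem],
    dim: usize,
    tile: usize,
) -> f32 {
    assert!(dim > 0 && tile > 0);
    assert!(query.len() % dim == 0 && doc.len() % dim == 0);
    let nq = query.len() / dim;
    let nd = doc.len() / dim;

    // Running per-query maxima; this is the only scratch the sketch needs.
    let mut best = vec![f32::NEG_INFINITY; nq];

    for d0 in (0..nd).step_by(tile) {
        let d1 = (d0 + tile).min(nd);
        for qi in 0..nq {
            let q = &query[qi * dim..(qi + 1) * dim];
            for di in d0..d1 {
                let d = &doc[di * dim..(di + 1) * dim];
                let s = kernel.score(q, d);
                if s > best[qi] {
                    best[qi] = s;
                }
            }
        }
    }
    best.iter().sum()
}
```

Supporting another element type in this sketch means adding another `PairKernel` implementation; `maxsim_tiled` itself is unchanged, and the tile size affects only the visiting order, not which pairs are scored.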
Validated with two instantiations:
- f32: bit-identical to and on par with the hand-fused V3 kernel
  (selectable for A/B as MaxSimIsa::X86_64_V3_Staged).
- 4-bit MinMax-quantized i8: a new datatype added by swapping Stage A + the
  Acc type + Stage B only; 1.5-4.1x over the per-pair SIMD reference.
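
As a rough illustration of why a quantized type needs only a different inner kernel, accumulator, and post-processing step, the sketch below assumes "MinMax" means the usual affine scheme x ≈ min + scale · code with codes in 0..=15. The names `MinMax4`, `quantize`, `integer_sums`, and `approx_dot` are hypothetical, and the one-code-per-byte layout is an assumption; the PR's actual encoding may differ.

```rust
/// Hypothetical 4-bit min-max encoded vector: x[i] ~ min + scale * codes[i].
/// Codes are stored one per byte (unpacked), which is an assumption here.
struct MinMax4 {
    codes: Vec<u8>, // each value in 0..=15
    min: f32,
    scale: f32,
}

/// Encode a non-empty vector by mapping [min, max] linearly onto 0..=15.
fn quantize(x: &[f32]) -> MinMax4 {
    assert!(!x.is_empty());
    let min = x.iter().copied().fold(f32::INFINITY, f32::min);
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // A constant vector has zero range; any positive scale works then.
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };
    let codes = x
        .iter()
        .map(|&v| ((v - min) / scale).round().clamp(0.0, 15.0) as u8)
        .collect();
    MinMax4 { codes, min, scale }
}

/// Integer-only inner loop: returns (sum ca*cb, sum ca, sum cb) in i32.
fn integer_sums(a: &MinMax4, b: &MinMax4) -> (i32, i32, i32) {
    assert_eq!(a.codes.len(), b.codes.len());
    let (mut prod, mut sa, mut sb) = (0i32, 0i32, 0i32);
    for (&ca, &cb) in a.codes.iter().zip(&b.codes) {
        prod += ca as i32 * cb as i32;
        sa += ca as i32;
        sb += cb as i32;
    }
    (prod, sa, sb)
}

/// Post-processing: expand sum((ma + sa*ca) * (mb + sb*cb)) into four
/// terms so the floating-point work happens once per pair, not per element.
fn approx_dot(a: &MinMax4, b: &MinMax4) -> f32 {
    let (prod, sa, sb) = integer_sums(a, b);
    let n = a.codes.len() as f32;
    n * a.min * b.min
        + a.min * b.scale * sb as f32
        + b.min * a.scale * sa as f32
        + a.scale * b.scale * prod as f32
}
```

The split between `integer_sums` and `approx_dot` loosely mirrors the description above: the inner loop and its accumulator change with the datatype, and a separate step converts the integer accumulators back to a float score.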
The driver owns all scratch, allocated from a caller-supplied ScopedAllocator;
zero-allocation steady state is provided by a single-owner resettable bump
arena (ResettableArena). A crisp design overview lives in the staged module
README (diskann-quantization/.../kernels/staged/README.md).
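
The sketch below shows one way a single-owner, resettable bump arena can give a zero-allocation steady state in safe Rust. It is an illustration under assumptions, not the PR's `ResettableArena` or `ScopedAllocator`: `ScratchArena` and `ScratchScope` are hypothetical names, and the arena is fixed to `f32` for brevity.

```rust
/// Hypothetical scratch arena: one backing buffer, allocated once.
struct ScratchArena {
    buf: Vec<f32>,
}

/// Hypothetical scope that carves slices off the front of the buffer.
/// It mutably borrows the arena, so only one scope can exist at a time.
struct ScratchScope<'a> {
    rest: &'a mut [f32],
}

impl ScratchArena {
    /// The only heap allocation in this sketch happens here.
    fn with_capacity(n: usize) -> Self {
        ScratchArena { buf: vec![0.0; n] }
    }

    fn capacity(&self) -> usize {
        self.buf.len()
    }

    /// Starting a new scope is the "reset": the bump position returns to
    /// the start of the buffer. Old contents are not cleared.
    fn scope(&mut self) -> ScratchScope<'_> {
        ScratchScope { rest: &mut self.buf }
    }
}

impl<'a> ScratchScope<'a> {
    /// Bump-allocate `n` elements, or return `None` if the arena is full.
    /// Returned slices are disjoint, so several can be live at once.
    fn alloc(&mut self, n: usize) -> Option<&'a mut [f32]> {
        if n > self.rest.len() {
            return None;
        }
        // Temporarily move the remaining slice out so it can be split.
        let rest = std::mem::take(&mut self.rest);
        let (head, tail) = rest.split_at_mut(n);
        self.rest = tail;
        Some(head)
    }
}
```

A caller would create the arena once, then open a fresh scope per kernel invocation. Because each scope borrows the arena mutably, the borrow checker enforces the single-owner rule, and reuse across calls never touches the allocator.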
Also adds benchmark examples (multi-vector-{staged,quant,3way}.json) and the
quantized multi-vector benchmark backend.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>