Add llama.cpp / GGUF runtime for Fun-ASR-Nano (CPU / edge) #115
Merged
Run Fun-ASR-Nano (SenseVoice SAN-M encoder + adaptor + Qwen3-0.6B) entirely on the llama.cpp / ggml stack: CPU / edge, single binary, no Python at runtime. Audio embeddings are injected into the LLM via llama_decode embedding input (the llava/mtmd mechanism). Includes the ggml encoder forward, a GGUF export script, and an integrated WAV->text CLI. Validated against PyTorch (encoder cosine 1.0; aggregate CER matches within 0.02% under identical conditions).
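The injection step described above can be pictured as a three-segment prompt in which the audio embeddings sit between two runs of ordinary tokens. The sketch below only plans the layout and positions; `Segment`, `SegKind`, and `plan_prompt` are hypothetical names for illustration, not identifiers from this PR. It assumes each segment would be submitted in its own `llama_decode` call, since a `llama_batch` carries either token ids or embeddings.

```cpp
#include <vector>

// Hypothetical types for illustration; not taken from the PR.
enum class SegKind { Tokens, Embeddings };

struct Segment {
    SegKind kind;  // token ids or precomputed audio embeddings
    int     n;     // number of positions the segment occupies
    int     pos0;  // position of the segment's first element in the sequence
};

// Lay out [prefix tokens | audio embeds | suffix tokens] with contiguous
// positions. Empty segments are skipped so no zero-length batch is produced.
std::vector<Segment> plan_prompt(int n_prefix, int n_audio, int n_suffix) {
    std::vector<Segment> segs;
    int pos = 0;
    auto push = [&](SegKind kind, int n) {
        if (n <= 0) return;
        segs.push_back({kind, n, pos});
        pos += n;
    };
    push(SegKind::Tokens,     n_prefix);
    push(SegKind::Embeddings, n_audio);
    push(SegKind::Tokens,     n_suffix);
    return segs;
}
```

With such a plan, a token segment would fill the batch's `token` array and the embedding segment its `embd` array, with `pos` counting up from `pos0` in both cases. How the PR actually batches these segments is not shown in this page, so treat the one-call-per-segment split as an assumption.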
…tart, accuracy, gotchas)
…alidation, gotchas)
…eak checks) Fixes gemini-code-assist findings: validate channels/bits in the WAV reader (reject non-16-bit/zero-channel instead of dividing by zero), guard chunk size >=1, null-check fopen/llama_context, and fclose/ggml_free on all paths.
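As an illustration of the WAV-reader fix described in this commit, the following sketch parses the body of a `fmt ` chunk and rejects zero-channel or non-16-bit input before any division by the frame size. `WavFormat`, `parse_fmt_chunk`, `validate_wav_format`, and `frame_count` are hypothetical names; the PR's actual reader is not shown in this page and may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical holder for the fields of a WAV "fmt " chunk.
struct WavFormat {
    uint16_t audio_format    = 0;  // 1 means uncompressed PCM
    uint16_t channels        = 0;
    uint32_t sample_rate     = 0;
    uint16_t bits_per_sample = 0;
};

static uint16_t read_u16le(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

static uint32_t read_u32le(const uint8_t* p) {
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Parse the 16-byte body of a "fmt " chunk; returns false if it is too short.
bool parse_fmt_chunk(const uint8_t* body, size_t len, WavFormat& out) {
    if (body == nullptr || len < 16) return false;
    out.audio_format    = read_u16le(body + 0);
    out.channels        = read_u16le(body + 2);
    out.sample_rate     = read_u32le(body + 4);
    // Bytes 8..13 hold the byte rate and block align, which are unused here.
    out.bits_per_sample = read_u16le(body + 14);
    return true;
}

// Returns an empty string when the format is acceptable, else a reason.
std::string validate_wav_format(const WavFormat& f) {
    if (f.channels == 0)         return "zero channels";
    if (f.bits_per_sample != 16) return "only 16-bit samples are supported";
    return "";
}

// Frames (samples per channel) in a data chunk of data_bytes bytes.
// Returns 0 for a rejected format, so the divisor below is never zero.
size_t frame_count(const WavFormat& f, size_t data_bytes) {
    if (!validate_wav_format(f).empty()) return 0;
    const size_t bytes_per_frame = static_cast<size_t>(f.channels) * (f.bits_per_sample / 8u);
    return data_bytes / bytes_per_frame;
}
```

Validating first and dividing second is the point: with `channels == 0` or `bits_per_sample < 8`, the frame size would be zero and the division undefined.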
Fun-ASR-Nano on llama.cpp / GGUF
Run Fun-ASR-Nano (SenseVoice SAN-M encoder + adaptor + Qwen3-0.6B) entirely on the
llama.cpp / ggml stack — CPU / edge, a single binary, no Python at runtime. Like
whisper.cpp, but for Fun-ASR.
What's added (runtime/llama.cpp/)
funasr-cli: integrated WAV → transcription binary
funasr-encoder / funasr-embd: the ggml encoder+adaptor and the LLM-from-embeds tools (validation)
export_encoder_gguf.py: export the audio encoder + adaptor to GGUF (f32 / f16)
How it works
fbank (C++) → SAN-M encoder + adaptor (ggml) → low-frame-rate truncation → [prefix tokens | audio embeds | suffix tokens] → Qwen3-0.6B (llama.cpp). The audio embeddings are injected via llama_decode's embedding-input path (the llava/mtmd mechanism).
Validation (184-file benchmark)
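For readers who want to reproduce this kind of comparison, the two metrics named in the PR description (encoder cosine similarity and aggregate CER) can be computed roughly as follows. This is a generic sketch, not the benchmark harness used for this PR. `CerTotals` is a hypothetical helper, and the aggregate CER here is assumed to be total character edits divided by total reference characters; text normalization is left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Cosine similarity of two equal-length vectors; 0.0 for mismatched,
// empty, or all-zero input.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size() || a.empty()) return 0.0;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Levenshtein distance over code points. std::u32string is used so that
// one multi-byte character (e.g. a CJK character) counts as one unit.
size_t edit_distance(const std::u32string& ref, const std::u32string& hyp) {
    std::vector<size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= hyp.size(); ++j) {
            const size_t sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1] ? 1 : 0);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[hyp.size()];
}

// Hypothetical accumulator: aggregate CER = total edits / total reference chars.
struct CerTotals {
    size_t edits     = 0;
    size_t ref_chars = 0;

    void add(const std::u32string& ref, const std::u32string& hyp) {
        edits     += edit_distance(ref, hyp);
        ref_chars += ref.size();
    }

    double cer() const {
        return ref_chars == 0 ? 0.0 : static_cast<double>(edits) / ref_chars;
    }
};
```

Summing edits and reference lengths across files, rather than averaging per-file CER, keeps short utterances from dominating the aggregate figure.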
Notes
--chunk 15 for long audio. WAV input assumes 16 kHz mono PCM16 for now.
runtime/llama.cpp/.
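To make the long-audio note concrete, here is a minimal sketch of fixed-length chunking. It assumes the `--chunk` value is a duration in seconds and that chunks do not overlap; neither detail is confirmed by this page. `split_chunks` is a hypothetical helper, and the clamp to at least one sample mirrors the "guard chunk size >=1" fix mentioned in the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split n_samples into consecutive [begin, end) ranges of chunk_sec seconds.
// Assumption: chunk_sec is in seconds and sample_rate defaults to 16 kHz.
std::vector<std::pair<size_t, size_t>> split_chunks(size_t n_samples, int chunk_sec,
                                                    int sample_rate = 16000) {
    // Clamp negatives to zero, then force a chunk of at least one sample so
    // the loop below always makes progress.
    const size_t sec   = static_cast<size_t>(std::max(chunk_sec, 0));
    const size_t rate  = static_cast<size_t>(std::max(sample_rate, 0));
    const size_t chunk = std::max<size_t>(1, sec * rate);

    std::vector<std::pair<size_t, size_t>> ranges;
    for (size_t begin = 0; begin < n_samples; begin += chunk) {
        ranges.emplace_back(begin, std::min(begin + chunk, n_samples));
    }
    return ranges;
}
```

Under these assumptions, 40 seconds of 16 kHz audio with a 15-second chunk yields three ranges, the last one covering the remaining 10 seconds.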