fix: `encode()` and `decode()` read vocab arrays in wrong direction by CodeWithKyrian · Pull Request #7 · CodeWithKyrian/tokenizers-php

CodeWithKyrian · 2026-06-17T20:44:58Z

This PR fixes a bug where FallbackModel::encode() and FallbackModel::decode() read from the wrong internal arrays, causing every lookup to miss and fall back to the unknown-token placeholder.

Motivation and context

The FallbackModel stores two vocabularies: a forward mapping from tokens to IDs, and a reverse mapping from IDs back to tokens. The class stored these two arrays under variable names that suggested the opposite direction, so encode() was reading from the reverse map and decode() was reading from the forward map. For CTC tokenizers like Wav2Vec2, where token strings are purely alphabetic characters (e.g. 'e' → 5) and IDs are integers, a token string never matches an integer key and an integer ID never matches a string key. Both lookups therefore missed without raising an error and fell back to <pad> or an empty string, which left every Wav2Vec2 pipeline with empty ASR output.
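
The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not the library's code: the variable names, the four-entry vocabulary, and the choice of `<pad>` (ID 0) and an empty string as fallbacks are all assumptions made for illustration.

```php
<?php
// Hypothetical vocabulary; the tokens and IDs are illustrative only.
$tokenToId = ['<pad>' => 0, '<unk>' => 1, 'e' => 5, 't' => 6];

// Reverse map built from the forward map: [0 => '<pad>', 1 => '<unk>', 5 => 'e', 6 => 't'].
$idToToken = array_flip($tokenToId);

// Assumed fallback for a missed encode lookup.
$padId = $tokenToId['<pad>'];

// Wrong direction: the token string is looked up in the ID→token map.
// 'e' is not one of the integer keys, so `??` quietly returns the fallback.
$buggyEncode = fn($token) => $idToToken[$token] ?? $padId;

// Wrong direction: the integer ID is looked up in the token→ID map.
// 5 is not one of the string keys, so `??` quietly returns ''.
$buggyDecode = fn($id) => $tokenToId[$id] ?? '';

// Correct direction, for comparison.
$encode = fn($token) => $tokenToId[$token] ?? $padId;
$decode = fn($id) => $idToToken[$id] ?? '';

var_dump($buggyEncode('e'), $buggyDecode(5)); // the fallbacks
var_dump($encode('e'), $decode(5));           // the real entries
```

Because the null coalescing operator suppresses the missing-key notice, neither wrong lookup produces a warning, which is consistent with the bug going unnoticed until the output was inspected.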

What's changed

Fixed encode() to read from the token→ID mapping instead of the reverse map
Fixed decode() to read from the ID→token mapping instead of the forward map
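
As an illustration of the corrected direction, the sketch below shows each method reading from the map whose keys match its input type. `SketchVocabModel`, its property names, and the `<unk>` fallback are hypothetical stand-ins and may differ from the actual FallbackModel.

```php
<?php
// Hypothetical stand-in for illustration; not the library's FallbackModel.
final class SketchVocabModel
{
    /** @var array<string,int> token → ID */
    private array $tokenToId;

    /** @var array<int,string> ID → token */
    private array $idToToken;

    private string $unkToken;

    public function __construct(array $tokenToId, string $unkToken = '<unk>')
    {
        $this->tokenToId = $tokenToId;
        $this->idToToken = array_flip($tokenToId);
        $this->unkToken = $unkToken;
    }

    /** Maps token strings to IDs using the token → ID map. */
    public function encode(array $tokens): array
    {
        $unkId = $this->tokenToId[$this->unkToken] ?? 0;

        return array_map(
            fn(string $token): int => $this->tokenToId[$token] ?? $unkId,
            $tokens
        );
    }

    /** Maps IDs back to token strings using the ID → token map. */
    public function decode(array $ids): array
    {
        return array_map(
            fn(int $id): string => $this->idToToken[$id] ?? $this->unkToken,
            $ids
        );
    }
}

$model = new SketchVocabModel(['<pad>' => 0, '<unk>' => 1, 'e' => 5, 't' => 6]);
$ids = $model->encode(['t', 'e', 'x']);   // 'x' is out of vocabulary
$tokens = $model->decode([6, 5, 99]);     // 99 is out of vocabulary
```

A round-trip check of this kind (encode a known token, decode the result, compare) is one way a regression test could catch a swapped direction.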

Breaking changes

None.

CodeWithKyrian force-pushed the fix/fallback-vocab-direction branch 2 times, most recently from 5fa04e2 to d807539, on June 17, 2026 20:54

fix: swap encode/decode to read correct vocab direction

82a3ce4

CodeWithKyrian force-pushed the fix/fallback-vocab-direction branch from d807539 to 82a3ce4 on June 17, 2026 20:58

CodeWithKyrian merged commit fb2b30d into main Jun 17, 2026
5 checks passed

CodeWithKyrian deleted the fix/fallback-vocab-direction branch June 17, 2026 21:01

CodeWithKyrian merged 1 commit into main from fix/fallback-vocab-direction