perf: skip grapheme regex for ASCII in utf8_clusters by AmitMY · Pull Request #38 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-26T14:37:39Z

utf8_clusters splits every pretokenized word into grapheme clusters with regex.findall(r'\X', s). ASCII text has no multi-codepoint grapheme clusters (no combining marks), so list(s) is identical there and skips the regex entirely:

clusters = list(s) if s.isascii() else regex.findall(r'\X', s)

Non-ASCII strings still use \X, so output is unchanged — Hebrew/Chinese tests pass unchanged.

Output identical, 137 tests pass. Measured back-to-back (50 texts / 200 merges; best of 3):

	main	this PR
BPE	0.427s	0.415s (−2.8%)
BNE n=4	0.836s	0.825s (−1.3%)
Boundless	0.512s	0.499s (−2.5%)

Memory unchanged. (Build is ~4% of a training run, so this is most of that fraction.)

🤖 Generated with Claude Code

utf8_clusters splits every word with regex.findall(r'\X', ...) to find grapheme clusters. ASCII text has no multi-codepoint graphemes (no combining marks), so for ASCII strings list(s) is identical and skips the regex. Non-ASCII still uses \X, so output is unchanged. ~1-3% faster on text-heavy training; memory unchanged. 137 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY · 2026-06-26T14:52:20Z

Closing. The original change was subtly incorrect: ASCII does contain one multi-codepoint grapheme — CR LF (UAX#29 GB3) — so list(s) diverges from regex.findall(r'\X', s) on text with Windows line endings (e.g. ' \r\n' gave merges {b'\r\n', b' \r'} instead of {b'\r\n'}). The wikitext benchmark is Unix-LF, so it never surfaced.

Adding a correct guard (s.isascii() and '\r' not in s) fixes it, but then the change is flat — the extra guard scan costs about as much as the \X regex it avoids. The apparent −2–3% was just the incorrect unguarded path. Not worth shipping.

AmitMY closed this Jun 26, 2026

AmitMY deleted the perf/ascii-grapheme-fastpath branch June 26, 2026 14:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: skip grapheme regex for ASCII in utf8_clusters#38

perf: skip grapheme regex for ASCII in utf8_clusters#38
AmitMY wants to merge 1 commit into
mainfrom
perf/ascii-grapheme-fastpath

AmitMY commented Jun 26, 2026

Uh oh!

AmitMY commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

AmitMY commented Jun 26, 2026

Uh oh!

AmitMY commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant