Skip to content

perf: skip grapheme regex for ASCII in utf8_clusters#38

Closed
AmitMY wants to merge 1 commit into
mainfrom
perf/ascii-grapheme-fastpath
Closed

perf: skip grapheme regex for ASCII in utf8_clusters#38
AmitMY wants to merge 1 commit into
mainfrom
perf/ascii-grapheme-fastpath

Conversation

@AmitMY

@AmitMY AmitMY commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

utf8_clusters splits every pretokenized word into grapheme clusters with regex.findall(r'\X', s). ASCII text has no multi-codepoint grapheme clusters (no combining marks), so list(s) is identical there and skips the regex entirely:

clusters = list(s) if s.isascii() else regex.findall(r'\X', s)

Non-ASCII strings still use \X, so output is unchanged — Hebrew/Chinese tests pass unchanged.

Output identical, 137 tests pass. Measured back-to-back (50 texts / 200 merges; best of 3):

main this PR
BPE 0.427s 0.415s (−2.8%)
BNE n=4 0.836s 0.825s (−1.3%)
Boundless 0.512s 0.499s (−2.5%)

Memory unchanged. (Build is ~4% of a training run, so this is most of that fraction.)

🤖 Generated with Claude Code

utf8_clusters splits every word with regex.findall(r'\X', ...) to find
grapheme clusters. ASCII text has no multi-codepoint graphemes (no
combining marks), so for ASCII strings list(s) is identical and skips the
regex. Non-ASCII still uses \X, so output is unchanged.

~1-3% faster on text-heavy training; memory unchanged. 137 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AmitMY

AmitMY commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Closing. The original change was subtly incorrect: ASCII does contain one multi-codepoint grapheme — CR LF (UAX#29 GB3) — so list(s) diverges from regex.findall(r'\X', s) on text with Windows line endings (e.g. ' \r\n' gave merges {b'\r\n', b' \r'} instead of {b'\r\n'}). The wikitext benchmark is Unix-LF, so it never surfaced.

Adding a correct guard (s.isascii() and '\r' not in s) fixes it, but then the change is flat — the extra guard scan costs about as much as the \X regex it avoids. The apparent −2–3% was just the incorrect unguarded path. Not worth shipping.

@AmitMY AmitMY closed this Jun 26, 2026
@AmitMY AmitMY deleted the perf/ascii-grapheme-fastpath branch June 26, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant