perf: intern per-character Nodes in characters() #39

Merged: AmitMY merged 1 commit into main from perf/intern-char-nodes on Jun 26, 2026

AmitMY commented on Jun 26, 2026

Contributor

The characters() unit function allocated a fresh Node for every character, just as the utf8 byte layer did before #36. This PR instead shares one immutable Node per character via a lazy cache (the unbounded analogue of _BYTE_NODES):

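from functools import lru_cache

@lru_cache(maxsize=None)
def _char_node(char: str) -> Node:
    # Built once per distinct character, then shared on every later call.
    return Node(char.encode("utf-8"))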

Nodes are frozen, so sharing is safe, and unlike caching NodesSequence/graph builders there's no memo/GraphSettings dependency (a Node is just bytes). It wins in the same two ways #36 did:

  • Memory — repeated characters collapse to shared objects.
  • Speed — equal characters are the same object, so merge's nodes[i:i+m] == merge comparisons hit CPython's per-element identity short-circuit.
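
The identity short-circuit behind the speed point can be observed directly. The sketch below is an illustration under assumptions, not the project's merge code: `CountingNode` and `count_eq_calls` are hypothetical helpers that count how often `__eq__` runs while sliding a window over the nodes and comparing it to a merge pair as tuples.

```python
class CountingNode:
    """Hypothetical node whose __eq__ counts how often it is called."""
    eq_calls = 0

    def __init__(self, value: bytes):
        self.value = value

    def __eq__(self, other):
        CountingNode.eq_calls += 1
        return isinstance(other, CountingNode) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


def count_eq_calls(nodes, merge):
    """Slide a window over nodes; return (matches, number of __eq__ calls)."""
    CountingNode.eq_calls = 0
    m = len(merge)
    hits = sum(1 for i in range(len(nodes) - m + 1) if nodes[i:i + m] == merge)
    return hits, CountingNode.eq_calls


text = "abab"

# Fresh object per character: each element comparison has to call __eq__.
fresh = tuple(CountingNode(c.encode()) for c in text)
fresh_merge = (CountingNode(b"a"), CountingNode(b"b"))
fresh_hits, fresh_calls = count_eq_calls(fresh, fresh_merge)

# Interned: equal characters are the same object, so CPython's container
# comparison accepts them by identity without calling __eq__.
pool = {}
interned = tuple(pool.setdefault(c, CountingNode(c.encode())) for c in text)
interned_merge = (pool["a"], pool["b"])
interned_hits, interned_calls = count_eq_calls(interned, interned_merge)

print(fresh_hits, fresh_calls, interned_hits, interned_calls)
```

Both runs find the same matches; only the number of `__eq__` calls differs, since identical objects are accepted before `__eq__` is consulted.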

Output is identical, and 137 tests pass. Measured on repeated multilingual text with units="characters" (best of 3):

            main      this PR
time        0.114s    0.073s (−36%)
peak mem    2.26MB    1.25MB (−45%)

Only affects the characters unit path (not the default utf8_clusters), so the default BPE benchmark is unchanged — but for codepoint-level tokenization it's a sizable win.
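
For readers who want to take similar measurements, one possible harness is sketched below. It is not the benchmark script used for the numbers above: `best_time`, `peak_memory`, the sample text, and the `workload` function are all hypothetical, and the workload should be swapped for the tokenizer call under test. Timing and memory are measured in separate runs because tracemalloc slows execution.

```python
import time
import tracemalloc


def best_time(fn, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def peak_memory(fn):
    """Peak bytes allocated during one call, as traced by tracemalloc."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Hypothetical workload; replace with the tokenizer call being measured.
sample = "שלום hello 你好 " * 2_000


def workload():
    return [ch.encode("utf-8") for ch in sample]


seconds = best_time(workload)
peak = peak_memory(workload)
print(f"time {seconds:.3f}s, peak mem {peak / 1e6:.2f}MB")
```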

🤖 Generated with Claude Code

characters() allocated a fresh Node per character. Share one immutable
Node per character via a lazy cache (the unbounded analogue of _BYTE_NODES
for bytes). Nodes are frozen, so sharing is safe and has no memo/settings
dependency.

Dedups repeated characters (memory) and lets equal characters compare by
identity, so merge's tuple comparisons hit CPython's identity short-circuit
(speed), the same effect #36 gave the utf8 byte layer. On repeated
multilingual text with units="characters": about 36% less time and about
45% less peak memory. Output identical; 137 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AmitMY merged commit 5865493 into main on Jun 26, 2026
3 checks passed