perf: intern per-character Nodes in characters() by AmitMY · Pull Request #39 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-26T14:46:06Z

The characters() unit function allocated a fresh Node for every character, the same pattern the utf8 byte layer had before #36. This PR shares one immutable Node per character instead, via a lazy cache (the unbounded analogue of _BYTE_NODES):

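from functools import lru_cache

@lru_cache(maxsize=None)
def _char_node(char: str) -> Node:
    # One shared, immutable Node per distinct character, built on first use.
    return Node(char.encode("utf-8"))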

Nodes are frozen, so sharing is safe, and unlike caching NodesSequence/graph builders there's no memo/GraphSettings dependency (a Node is just bytes). Wins the same two ways #36 did:

- Memory: repeated characters collapse to shared objects.
- Speed: equal characters are the same object, so merge's nodes[i:i+m] == merge comparisons hit CPython's per-element identity short-circuit.
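The identity short-circuit can be observed directly. The sketch below uses a hypothetical `Counted` class (not part of the project) that counts `__eq__` calls. On CPython, sequence comparison checks `is` on each pair of elements before falling back to `__eq__`.

```python
class Counted:
    """Hypothetical element type that counts how often __eq__ runs."""
    calls = 0

    def __init__(self, value: bytes) -> None:
        self.value = value

    def __eq__(self, other: object) -> bool:
        Counted.calls += 1
        return isinstance(other, Counted) and self.value == other.value


def eq_calls(left: tuple, right: tuple) -> int:
    # Reset the counter, compare, and report how many __eq__ calls were needed.
    Counted.calls = 0
    assert left == right
    return Counted.calls


shared = Counted(b"a")
# Same object in every position: identity matches, __eq__ is skipped.
interned = eq_calls((shared, shared, shared), (shared, shared, shared))
# Equal but distinct objects: every pair falls back to __eq__.
fresh = eq_calls(
    tuple(Counted(b"a") for _ in range(3)),
    tuple(Counted(b"a") for _ in range(3)),
)
print(interned, fresh)  # 0 3 on CPython
```

With interned Nodes, every matching position in a merge comparison takes the cheap identity path. Only mismatching positions still pay for a full `__eq__` call.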

Output identical, 137 tests pass. Measured on repeated multilingual text with units="characters" (best of 3):

            main      this PR
time        0.114s    0.073s (−36%)
peak mem    2.26MB    1.25MB (−45%)
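The PR does not include the benchmark script. As an illustration only, numbers of this kind could be collected with a harness along these lines, using `time.perf_counter` and `tracemalloc` from the standard library. The `workload` function is a placeholder assumption, to be replaced with the tokenizer call being measured.

```python
import time
import tracemalloc
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def peak_memory(fn: Callable[[], object]) -> int:
    """Peak bytes allocated by Python while `fn` runs, as seen by tracemalloc."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def workload() -> list[bytes]:
    # Placeholder workload (an assumption); swap in the real tokenizer call.
    text = "hello שלום 你好 " * 1_000
    return [ch.encode("utf-8") for ch in text]


seconds = best_time(workload)
peak = peak_memory(workload)
print(f"time {seconds:.3f}s, peak mem {peak / 1e6:.2f}MB")
```

Time and memory are measured in separate runs because tracing allocations slows execution noticeably.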

Only affects the characters unit path (not the default utf8_clusters), so the default BPE benchmark is unchanged — but for codepoint-level tokenization it's a sizable win.

🤖 Generated with Claude Code

characters() allocated a fresh Node per character. Share one immutable Node per character via a lazy cache (the unbounded analogue of _BYTE_NODES for bytes). Nodes are frozen, so sharing is safe and has no memo/settings dependency. Dedups repeated characters (memory) and lets equal characters compare by identity, so merge's tuple comparisons hit CPython's identity short-circuit (speed), the same effect #36 gave the utf8 byte layer.

On repeated multilingual text with units="characters": ~-36% time, ~-45% peak memory. Output identical; 137 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY merged commit 5865493 into main Jun 26, 2026
3 checks passed

AmitMY merged 1 commit into main from perf/intern-char-nodes
