perf: skip merge on subtrees the memo proves are unaffected by AmitMY · Pull Request #35 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-26T13:20:22Z

Reworked from the earlier lazy-out version (which read messy and only gave ~5%) to reuse the get_merges memo as an "is this subtree affected?" check, per review.

_merges (when memoized — the default) already lists every mergeable subsequence in the subtree, including nested ones (yield from child.get_merges()). So if the chosen merge isn't among them, this subtree contains it nowhere and merge() can return self immediately — skipping the scan, the child recursion, and the rebuild in one membership check:

cached = getattr(self, "_merges", None)
if cached is not None and merge not in cached:
    return self
# ... otherwise the plain #29 scan/rebuild ...

Falls back to the scan when the node isn't memoized (e.g. TRADE_MEMORY_FOR_SPEED=False, or pre-seeded merges before any get_merges), so behavior is unchanged there. Output identical; 137 tests pass (rebased on current main, which now has #32/#33/#34/#36).

Measured back-to-back, 3 identical rounds (50 texts / 200 merges; time = best of 3):

	main	this PR	Δ time	Δ mem
BPE	0.655s / 1.89MB	0.428s / 2.14MB	−35%	+0.25MB
BNE n=4	0.966s / 4.80MB	0.845s / 3.63MB	−13%	−1.17MB
Boundless	0.765s / 1.99MB	0.510s / 2.28MB	−33%	+0.29MB

Far bigger than the original lazy-out (−4–7%): an unaffected word now skips the scan, recursion, and rebuild in a single C-level in check. Memory is mixed and small (BNE −1.2MB from avoided transient rebuilds; BPE/Boundless +~0.25MB).

🤖 Generated with Claude Code

AmitMY · 2026-06-26T13:31:01Z

+            out.extend(nodes[i:])
+            if len(out) == 1:
+                return out[0]
+            merged_nodes = tuple(node.merge(token, merge) for node in out)


i love the idea, but think we might need a better implementation - this is a bit messy and not straightforward

AmitMY · 2026-06-26T13:42:25Z

Fair — I tried two cleaner rewrites and benchmarked both against main; both lose the speedup:

	main	inline (current)	`_contains` pre-check	extracted `_replace_top`
BPE	0.841s	0.796s (−6.9%)	0.856s (~flat)	0.840s (~flat)
BNE	1.443s	1.377s (−4.8%)	1.440s (~flat)	1.448s (~flat)
Boundless	0.991s	0.907s (−4.0%)	0.957s (−3%)	0.945s (~flat)

The win comes specifically from the single inline pass: the _contains version double-scans the words that do contain the merge, and extracting a helper adds a method call per word per merge — both erase the gain.

So it's a real tension: the speedup needs the inline scan, which is the part that reads as messy. Given this is the smallest of the five wins and resists a clean expression, my recommendation is to close this one rather than add code that's either messy or doesn't help. The other four (slots, skip-flatten, argmax-items, interning) carry the bulk of the improvement. Happy to close it unless you'd rather keep the inline version as-is.

When get_merges is memoized, _merges lists every mergeable subsequence in the subtree. If the chosen merge isn't among them, this subtree contains it nowhere, so merge() returns self immediately — skipping the scan, the child recursion, and the rebuild. Falls back to the plain scan when the node isn't memoized. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY commented Jun 26, 2026

View reviewed changes

AmitMY force-pushed the perf/merge-lazy-out branch from 5babe23 to c4d8cee Compare June 26, 2026 13:56

AmitMY changed the title ~~perf: defer out-list construction in merge until first match~~ perf: skip merge on subtrees the memo proves are unaffected Jun 26, 2026

AmitMY merged commit 2f6cdf0 into main Jun 26, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: skip merge on subtrees the memo proves are unaffected#35

perf: skip merge on subtrees the memo proves are unaffected#35
AmitMY merged 1 commit into
mainfrom
perf/merge-lazy-out

AmitMY commented Jun 26, 2026 •

edited

Loading

Uh oh!

AmitMY Jun 26, 2026

Uh oh!

AmitMY commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

AmitMY commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AmitMY Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

AmitMY commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AmitMY commented Jun 26, 2026 •

edited

Loading