perf: deduplicate identical word graphs within a build by AmitMY · Pull Request #40 · sign-language-processing/complex-tokenization

AmitMY · 2026-06-26T14:57:27Z

_build_graphs built a fresh graph for every word occurrence. In natural text most words repeat, so this allocates many identical subgraphs (and, during training, an independent get_merges memo for each). Dedup them: repeated words share one immutable subgraph.

def _build_graphs(self, texts):
    if self.cache_maxsize == 0:
        units = self.units                                    # dedup disabled
    else:
        units = lru_cache(maxsize=self.cache_maxsize)(self.units)
    return tuple(words(text, ..., units=units) for text in texts)

Configurable via Tokenizer(cache_maxsize=None), plumbed through all subclasses: None (the default) leaves the cache unbounded, a number bounds the cache for huge vocabularies, and 0 disables dedup.
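As a rough illustration of the three cache_maxsize modes, here is a minimal standalone sketch. WordGraph, build_word_graph, and build_graphs are hypothetical stand-ins, not the repository's classes; only functools.lru_cache is the real API.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class WordGraph:
    """Hypothetical immutable stand-in for a per-word subgraph."""
    nodes: tuple


def build_word_graph(word: str) -> WordGraph:
    # Hypothetical builder: one node per UTF-8 byte of the word.
    return WordGraph(tuple(word.encode("utf-8")))


def build_graphs(texts, cache_maxsize=None):
    """Build one graph per word occurrence, sharing graphs of repeated words."""
    if cache_maxsize == 0:
        units = build_word_graph  # dedup disabled
    else:
        # A new cache is created on every call, so it can be
        # garbage-collected as soon as the build returns.
        units = lru_cache(maxsize=cache_maxsize)(build_word_graph)
    return tuple(tuple(units(word) for word in text.split()) for text in texts)


shared = build_graphs(["the cat saw the dog"])
fresh = build_graphs(["the cat saw the dog"], cache_maxsize=0)

print(shared[0][0] is shared[0][3])  # True: both "the" share one object
print(fresh[0][0] is fresh[0][3])    # False: one object per occurrence
print(shared == fresh)               # True: the content is identical
```

Because the cached callable is a local variable, nothing outside the build keeps a reference to the cache once the function returns.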

Why the cache is build-local (not set once in __init__): scoping it to one build matters in two ways:

- It's freed before training, so it doesn't pin the pre-merge word graphs in memory while the trainer merges past them. (Holding it for the whole run drops the memory win from −38% to ~−4%.)
- A word graph is a NodesSequence whose memoized get_merges depends on GraphSettings; a build-local cache lives under one settings regime and can't serve a stale memo to a later run. That is the hazard that makes applying @cache to chinese_character_to_graph incorrect.
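The stale-memo hazard can be reproduced in a few lines. This is a simplified sketch under assumptions: the settings dict stands in for GraphSettings, and describe stands in for any function whose result depends on settings that are not part of the cache key. Neither name comes from the repository.

```python
from functools import cache, lru_cache

# Hypothetical stand-in for GraphSettings.
settings = {"max_merge_size": 2}


def describe(word):
    # The result depends on global settings that are NOT part of the cache key.
    return (word, settings["max_merge_size"])


# Process-wide cache: it outlives any single build.
global_describe = cache(describe)


def build(words):
    # Build-local cache: created and discarded within one build.
    local_describe = lru_cache(maxsize=None)(describe)
    return [local_describe(w) for w in words]


global_describe("ab")
build(["ab"])

settings["max_merge_size"] = 4  # a later run with different settings

stale = global_describe("ab")
fresh_local = build(["ab"])

print(stale)        # ('ab', 2): served from the old settings regime
print(fresh_local)  # [('ab', 4)]: recomputed under the new settings
```

The process-wide cache returns the entry computed under the old settings, while the build-local cache never sees more than one settings regime.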

Output is identical: the new test test_cache_maxsize_does_not_change_merges pins that cache_maxsize=0 and cache_maxsize=10 produce the same merges, the full merge digests are unchanged, and 138 tests pass.

Measured back-to-back (50 texts / 200 merges; best of 3):

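| Tokenizer | main (time / peak memory) | this PR (time / peak memory, change vs. main) |
|---|---|---|
| BPE | 0.414s / 2.14MB | 0.401s / 1.33MB (−3% / −38%) |
| BNE n=4 | 0.806s / 3.63MB | 0.786s / 2.63MB (−2% / −28%) |
| Boundless | 0.494s / 2.28MB | 0.482s / 1.42MB (−2% / −38%) |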
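The PR does not say how these numbers were collected. One possible way to take a best-of-3 time and peak-memory reading, assuming the standard-library tracemalloc and time modules, is sketched below. The workload function is a hypothetical placeholder for a tokenizer training run.

```python
import time
import tracemalloc


def measure(fn, repeats=3):
    """Return (best wall-clock seconds, lowest peak traced bytes) over `repeats` runs."""
    best_time = float("inf")
    best_peak = float("inf")
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()  # clears traces, so the next run starts from zero
        best_time = min(best_time, elapsed)
        best_peak = min(best_peak, peak)
    return best_time, best_peak


def workload():
    # Hypothetical workload standing in for a tokenizer training run.
    return [str(i) * 10 for i in range(10_000)]


seconds, peak_bytes = measure(workload)
print(f"{seconds:.3f}s / {peak_bytes / 1e6:.2f}MB")
```

Tracing allocations slows execution, so timings taken this way are best compared against each other rather than read as absolute numbers.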

Shared subgraphs also share their get_merges memo (computed once per unique word instead of per occurrence) — the source of the speed bonus.
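To make the "once per unique word" effect concrete, the sketch below counts how often a memoized get_merges actually computes, with and without sharing. The WordGraph class here is a hypothetical toy, not the repository's NodesSequence, and its merge computation is only a placeholder.

```python
from functools import lru_cache


class WordGraph:
    """Hypothetical stand-in for a word graph with a memoized get_merges."""

    computations = 0  # counts non-memoized computations across all instances

    def __init__(self, word):
        self.word = word
        self._merges = None

    def get_merges(self):
        if self._merges is None:
            WordGraph.computations += 1
            # Placeholder computation: adjacent character pairs.
            self._merges = tuple(zip(self.word, self.word[1:]))
        return self._merges


def count_computations(words, dedup):
    WordGraph.computations = 0
    # With dedup, repeated words map to the same WordGraph instance.
    make = lru_cache(maxsize=None)(WordGraph) if dedup else WordGraph
    graphs = [make(w) for w in words]
    for graph in graphs:
        graph.get_merges()
    return WordGraph.computations


words = "the cat and the dog and the bird".split()
print(count_computations(words, dedup=False))  # 8: one per occurrence
print(count_computations(words, dedup=True))   # 5: one per unique word
```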

🤖 Generated with Claude Code

_build_graphs rebuilt a fresh graph for every word occurrence. Repeated words (most of natural text) now share one immutable subgraph, and its get_merges memo, via an lru_cache local to the build.

Scoping it to the build matters two ways: it's freed before training (so it doesn't pin the pre-merge word graphs while the trainer merges past them, keeping the memory win), and it can't leak a settings/script-dependent graph to a later run with different GraphSettings.

Configurable via Tokenizer(cache_maxsize=None): None (default) unbounded, a number bounds it for huge vocabularies, 0 disables dedup. New test pins that merges are identical with the cache off vs on.

-28% to -38% peak memory and ~2-4% faster across BPE/BNE/Boundless; output identical (merge digests unchanged). 138 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY force-pushed the perf/dedup-word-graphs branch 3 times, most recently from 4b1526c to 70215a9 (June 26, 2026 17:19)

AmitMY force-pushed the perf/dedup-word-graphs branch from 70215a9 to 1538a48 (June 26, 2026 17:27)

AmitMY wants to merge 1 commit into main from perf/dedup-word-graphs
