
perf: deduplicate identical word graphs within a build #40

Open
AmitMY wants to merge 1 commit into main from perf/dedup-word-graphs


@AmitMY AmitMY commented Jun 26, 2026


_build_graphs built a fresh graph for every word occurrence. In natural text most words repeat, so this allocates many identical subgraphs (and, during training, an independent get_merges memo for each). Dedup them: repeated words share one immutable subgraph.

def _build_graphs(self, texts):
    if self.cache_maxsize == 0:
        units = self.units                                    # dedup disabled
    else:
        units = lru_cache(maxsize=self.cache_maxsize)(self.units)
    return tuple(words(text, ..., units=units) for text in texts)

Configurable via Tokenizer(cache_maxsize=None) (plumbed through all subclasses): None (the default) leaves the cache unbounded, a positive number bounds it for huge vocabularies, and 0 disables dedup.
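A minimal, self-contained sketch of the dedup idea, not the PR's actual code: `build_word_graph` and `build_graphs` are hypothetical stand-ins for the project's per-word builder and `_build_graphs`, and a tuple of bytes stands in for a word graph. It shows that wrapping the builder in a build-local `lru_cache` makes repeated words share one object while the output stays equal.

```python
from functools import lru_cache


def build_word_graph(word: str) -> tuple:
    # Hypothetical stand-in for the real per-word graph builder:
    # every call returns a new immutable tuple.
    return tuple(word.encode("utf-8"))


def build_graphs(texts, cache_maxsize=None):
    # Mirrors the PR's switch: None = unbounded, N = bounded, 0 = no dedup.
    if cache_maxsize == 0:
        units = build_word_graph
    else:
        units = lru_cache(maxsize=cache_maxsize)(build_word_graph)
    return tuple(tuple(units(w) for w in text.split()) for text in texts)


texts = ["the cat saw the dog", "the dog ran"]
shared = build_graphs(texts)                     # dedup on (default)
unshared = build_graphs(texts, cache_maxsize=0)  # dedup off

# Count distinct objects: one per unique word vs. one per occurrence.
unique_shared = len({id(g) for doc in shared for g in doc})
unique_unshared = len({id(g) for doc in unshared for g in doc})
```

Sharing is only safe because the stand-in graphs are immutable; if a caller could mutate a shared graph, every occurrence of that word would see the change.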

Why the cache is build-local (not set once in __init__): scoping it to one build matters two ways —

  1. It's freed before training, so it doesn't pin the pre-merge word graphs in memory while the trainer merges past them. (Holding it for the whole run drops the memory win from −38% to ~−4%.)
  2. A word graph is a NodesSequence whose memoized get_merges depends on GraphSettings. A build-local cache lives under one settings regime, so it can't serve a stale memo to a later run with different settings. That is the same hazard that makes putting @cache on chinese_character_to_graph incorrect.
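Both points can be illustrated with a small sketch under stated assumptions: `WordGraph`, `SETTINGS`, `build_local` and `build_global` are hypothetical names, and a plain dict stands in for GraphSettings. It compares a cache created inside the build call with one that lives for the whole process.

```python
import gc
import weakref
from functools import lru_cache

SETTINGS = {"script": "latin"}  # hypothetical stand-in for GraphSettings


class WordGraph:
    """Hypothetical word graph; records the settings it was built under."""

    def __init__(self, word):
        self.word = word
        self.script = SETTINGS["script"]


def make_graph(word):
    return WordGraph(word)


def build_local(words):
    # Cache created per build: it is dropped when this call returns.
    cached = lru_cache(maxsize=None)(make_graph)
    return [cached(w) for w in words]


global_cached = lru_cache(maxsize=None)(make_graph)  # lives for the whole process


def build_global(words):
    return [global_cached(w) for w in words]


def freed_after_drop(build):
    graphs = build(["the", "cat", "the"])
    probe = weakref.ref(graphs[0])
    del graphs  # the caller no longer needs the pre-merge graphs
    gc.collect()
    return probe() is None


local_freed = freed_after_drop(build_local)    # nothing else holds the graphs
global_freed = freed_after_drop(build_global)  # the long-lived cache pins them

SETTINGS["script"] = "han"  # a later run under different settings
stale = build_global(["the"])[0].script  # served from the old cache entry
fresh = build_local(["the"])[0].script   # rebuilt under the current settings
```

The long-lived cache could be made safe by keying on the settings or clearing it between runs, but the build-local scope avoids both problems without extra bookkeeping.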

Output identical — new test test_cache_maxsize_does_not_change_merges pins that 0 and 10 produce the same merges; full digests unchanged; 138 tests pass.

Measured back-to-back (50 texts / 200 merges; best of 3):

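| Tokenizer | main (time / peak memory) | this PR (time / peak memory) | Change (time / memory) |
|-----------|---------------------------|------------------------------|------------------------|
| BPE       | 0.414s / 2.14MB           | 0.401s / 1.33MB              | −3% / −38%             |
| BNE n=4   | 0.806s / 3.63MB           | 0.786s / 2.63MB              | −2% / −28%             |
| Boundless | 0.494s / 2.28MB           | 0.482s / 1.42MB              | −2% / −38%             |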

Shared subgraphs also share their get_merges memo (computed once per unique word instead of per occurrence) — the source of the speed bonus.
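A rough sketch of why sharing the object also shares the memo, assuming a hypothetical `WordGraph` whose `merges` property stands in for the memoized get_merges (here just adjacent symbol pairs, not the project's real merge logic). A counter records how many times the computation actually runs.

```python
from functools import cached_property, lru_cache

calls = {"n": 0}  # counts how often the expensive computation actually runs


class WordGraph:
    """Hypothetical stand-in for a word graph; treated as immutable once built."""

    def __init__(self, word):
        self.word = word

    @cached_property
    def merges(self):
        # Stand-in for the memoized get_merges: adjacent symbol pairs.
        calls["n"] += 1
        return tuple(zip(self.word, self.word[1:]))


def make_graph(word):
    return WordGraph(word)


def merge_computations(words, dedup):
    calls["n"] = 0
    make = lru_cache(maxsize=None)(make_graph) if dedup else make_graph
    graphs = [make(w) for w in words]
    for g in graphs:
        g.merges  # first access computes; later accesses on the same object reuse it
    return calls["n"]


words = "the cat saw the dog the dog ran".split()
per_occurrence = merge_computations(words, dedup=False)
per_unique_word = merge_computations(words, dedup=True)
```

With 8 occurrences of 5 unique words, the computation runs 8 times without dedup and 5 times with it. The saving grows with how often words repeat in the corpus.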

🤖 Generated with Claude Code

@AmitMY AmitMY force-pushed the perf/dedup-word-graphs branch 3 times, most recently from 4b1526c to 70215a9 Compare June 26, 2026 17:19
_build_graphs rebuilt a fresh graph for every word occurrence. Repeated
words (most of natural text) now share one immutable subgraph — and its
get_merges memo — via an lru_cache local to the build. Scoping it to the
build matters two ways: it's freed before training (so it doesn't pin the
pre-merge word graphs while the trainer merges past them — keeping the
memory win), and it can't leak a settings/script-dependent graph to a
later run with different GraphSettings.

Configurable via Tokenizer(cache_maxsize=None): None (default) unbounded,
a number bounds it for huge vocabularies, 0 disables dedup. New test pins
that merges are identical with the cache off vs on.

-28% to -38% peak memory and ~2-3% faster across BPE/BNE/Boundless;
output identical (merge digests unchanged). 138 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AmitMY AmitMY force-pushed the perf/dedup-word-graphs branch from 70215a9 to 1538a48 Compare June 26, 2026 17:27