perf: deduplicate identical word graphs within a build#40
Open
AmitMY wants to merge 1 commit into
Open
Conversation
4b1526c to
70215a9
Compare
_build_graphs rebuilt a fresh graph for every word occurrence. Repeated words (most of natural text) now share one immutable subgraph — and its get_merges memo — via an lru_cache local to the build. Scoping it to the build matters two ways: it's freed before training (so it doesn't pin the pre-merge word graphs while the trainer merges past them — keeping the memory win), and it can't leak a settings/script-dependent graph to a later run with different GraphSettings. Configurable via Tokenizer(cache_maxsize=None): None (default) unbounded, a number bounds it for huge vocabularies, 0 disables dedup. New test pins that merges are identical with the cache off vs on. -28% to -38% peak memory and ~2-4% faster across BPE/BNE/Boundless; output identical (merge digests unchanged). 138 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
70215a9 to
1538a48
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
_build_graphsbuilt a fresh graph for every word occurrence. In natural text most words repeat, so this allocates many identical subgraphs (and, during training, an independentget_mergesmemo for each). Dedup them: repeated words share one immutable subgraph.Configurable via
Tokenizer(cache_maxsize=None)(plumbed through all subclasses):None(default) unbounded, a number bounds the cache for huge vocabularies,0disables dedup.Why the cache is build-local (not set once in
__init__): scoping it to one build matters two ways —NodesSequencewhose memoizedget_mergesdepends onGraphSettings; a build-local cache lives under one settings regime and can't serve a stale memo to a later run — the hazard that makes@cache chinese_character_to_graphincorrect.Output identical — new test
test_cache_maxsize_does_not_change_mergespins that0and10produce the same merges; full digests unchanged; 138 tests pass.Measured back-to-back (50 texts / 200 merges; best of 3):
Shared subgraphs also share their
get_mergesmemo (computed once per unique word instead of per occurrence) — the source of the speed bonus.🤖 Generated with Claude Code