fix(tracing): harden offline buffer for concurrency and high volume#653
Open
viniciusdsmello wants to merge 1 commit into
Open
fix(tracing): harden offline buffer for concurrency and high volume#653viniciusdsmello wants to merge 1 commit into
viniciusdsmello wants to merge 1 commit into
Conversation
The offline buffer could lose or over-retain data under multi-process and high-volume use. Fixes mirror the openlayer-ts SDK: - store_trace: evict in a batch down to max_buffer_size instead of dropping a single file per call, so the cap holds even when multiple processes share the buffer directory and evict concurrently. Stat each candidate defensively so a file removed by another writer mid-scan doesn't abort the store (which previously dropped the trace). - Use the full UUID in filenames instead of an 8-char slice; worker threads share a PID, so a truncated id could collide and silently overwrite a buffered trace at high volume. - max_buffer_size=0 is now honored (store nothing) instead of being coerced to the 1000 default by 'or'. - clear_buffer isolates each removal so one failure no longer aborts the rest, and returns the count actually removed. - replay_buffered_traces always attempts each trace at least once, so max_retries=0 is no longer a silent no-op. Adds regression tests (each verified to fail on the prior implementation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7ed19a5 to
def7e06
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens the offline trace buffer against multi-process / multi-thread use and high volume, where it could silently lose data or grow unbounded past its configured cap. These issues were found while porting the feature to the TypeScript SDK and validating it end-to-end (worker threads + child processes + real HTTP); the same root causes exist here. (Note: the
configure()"replace-all" footgun is already fixed onmainvia the merge-semanticsinit(), so it is not included.)Fixes
Eviction cap not enforced across processes —
store_traceremoved only one oldest file per call. With multiple processes sharing the default~/.openlayer/buffer, concurrent writers each drop one while many add, so the cap is never enforced (reproduced: 4 processes × 800 writes to a cap-200 buffer left 220+ files). Now evicts the full excess in one batch down tomax_buffer_size. Also stats each candidate defensively, so a file removed by another writer mid-scan no longer raises and aborts the store (which previously dropped that trace).Filename collisions at high volume — filenames used
str(uuid.uuid4())[:8](32 bits). Across processes the PID disambiguates, but worker threads share a PID, so birthday collisions become likely at tens of thousands of buffered traces → silent overwrite. Now uses the full UUID (122 bits).max_buffer_size=0coerced to 1000 —max_buffer_size or 1000treated an explicit0as falsy. Now uses an explicitNonecheck and honors0(store nothing).clear_bufferaborted on the first failed unlink and could mis-report the count (returned 0 even after deleting some). Now isolates each removal and returns the count actually removed.replay_buffered_traces(max_retries=0)was a silent no-op —range(0)never ran the loop, so traces stayed buffered with a misleading 0/0 result. Now always attempts each trace at least once.What is NOT changed
get_buffered_tracesalready usessorted(key=...)(onestatper file) and per-filetry/except, so no perf/robustness change was needed there (the TS port needed it because JS sort calls the comparator repeatedly).Tests
Adds 4 regression tests in
tests/test_offline_buffering.py(batch eviction,max_buffer_size=0, resilientclear_buffer,replay(max_retries=0)) — each verified to fail on the prior implementation and pass with the fix. Full tracer test suite (test_offline_buffering,test_tracer_configuration,test_tracing_core) passes (83 tests).ruff check/ruff formatclean;mypyintroduces no new errors vs. baseline.🤖 Generated with Claude Code