Skip to content

fix(tracing): harden offline buffer for concurrency and high volume#653

Open
viniciusdsmello wants to merge 1 commit into
mainfrom
feat/offline-buffering-hardening
Open

fix(tracing): harden offline buffer for concurrency and high volume#653
viniciusdsmello wants to merge 1 commit into
mainfrom
feat/offline-buffering-hardening

Conversation

@viniciusdsmello

Copy link
Copy Markdown
Contributor

Summary

Hardens the offline trace buffer against multi-process / multi-thread use and high volume, where it could silently lose data or grow unbounded past its configured cap. These issues were found while porting the feature to the TypeScript SDK and validating it end-to-end (worker threads + child processes + real HTTP); the same root causes exist here. (Note: the configure() "replace-all" footgun is already fixed on main via the merge-semantics init(), so it is not included.)

Fixes

  1. Eviction cap not enforced across processesstore_trace removed only one oldest file per call. With multiple processes sharing the default ~/.openlayer/buffer, concurrent writers each drop one while many add, so the cap is never enforced (reproduced: 4 processes × 800 writes to a cap-200 buffer left 220+ files). Now evicts the full excess in one batch down to max_buffer_size. Also stats each candidate defensively, so a file removed by another writer mid-scan no longer raises and aborts the store (which previously dropped that trace).

  2. Filename collisions at high volume — filenames used str(uuid.uuid4())[:8] (32 bits). Across processes the PID disambiguates, but worker threads share a PID, so birthday collisions become likely at tens of thousands of buffered traces → silent overwrite. Now uses the full UUID (122 bits).

  3. max_buffer_size=0 coerced to 1000max_buffer_size or 1000 treated an explicit 0 as falsy. Now uses an explicit None check and honors 0 (store nothing).

  4. clear_buffer aborted on the first failed unlink and could mis-report the count (returned 0 even after deleting some). Now isolates each removal and returns the count actually removed.

  5. replay_buffered_traces(max_retries=0) was a silent no-oprange(0) never ran the loop, so traces stayed buffered with a misleading 0/0 result. Now always attempts each trace at least once.

What is NOT changed

  • get_buffered_traces already uses sorted(key=...) (one stat per file) and per-file try/except, so no perf/robustness change was needed there (the TS port needed it because JS sort calls the comparator repeatedly).
  • Replay still relies on the SDK client's built-in retry/backoff between attempts (unchanged).

Tests

Adds 4 regression tests in tests/test_offline_buffering.py (batch eviction, max_buffer_size=0, resilient clear_buffer, replay(max_retries=0)) — each verified to fail on the prior implementation and pass with the fix. Full tracer test suite (test_offline_buffering, test_tracer_configuration, test_tracing_core) passes (83 tests). ruff check/ruff format clean; mypy introduces no new errors vs. baseline.

🤖 Generated with Claude Code

The offline buffer could lose or over-retain data under multi-process and
high-volume use. Fixes mirror the openlayer-ts SDK:

- store_trace: evict in a batch down to max_buffer_size instead of dropping
  a single file per call, so the cap holds even when multiple processes
  share the buffer directory and evict concurrently. Stat each candidate
  defensively so a file removed by another writer mid-scan doesn't abort
  the store (which previously dropped the trace).
- Use the full UUID in filenames instead of an 8-char slice; worker threads
  share a PID, so a truncated id could collide and silently overwrite a
  buffered trace at high volume.
- max_buffer_size=0 is now honored (store nothing) instead of being
  coerced to the 1000 default by 'or'.
- clear_buffer isolates each removal so one failure no longer aborts the
  rest, and returns the count actually removed.
- replay_buffered_traces always attempts each trace at least once, so
  max_retries=0 is no longer a silent no-op.

Adds regression tests (each verified to fail on the prior implementation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@viniciusdsmello viniciusdsmello force-pushed the feat/offline-buffering-hardening branch from 7ed19a5 to def7e06 Compare June 16, 2026 01:12
@viniciusdsmello viniciusdsmello self-assigned this Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant