Skip to content

Add GET /telemetry/events to read archived telemetry from S2#285

Open
archandatta wants to merge 14 commits into
mainfrom
archand/kernel-1338/add-s2-events-get
Open

Add GET /telemetry/events to read archived telemetry from S2#285
archandatta wants to merge 14 commits into
mainfrom
archand/kernel-1338/add-s2-events-get

Conversation

@archandatta

@archandatta archandatta commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a paginated read path for browser telemetry archived in S2, so events stay available after they age out of the live SSE ring buffer (GET /telemetry/stream).

  • GET /telemetry/events — reads archived telemetry envelopes for the browser in ascending sequence order. Returns a top-level JSON array of envelopes plus X-Has-More / X-Next-Offset pagination headers. Empty result serializes as [], never null.
  • Cursor pagination over S2offset is an S2 sequence cursor (pass the previous response's X-Next-Offset); since/until bound the time window; category filters; limit is the page size (default 100, max 1000). has_more is derived from the S2 read's next-position vs the stream tail, so it is correct whether a page stops on the count, byte, or time bound. Negative since/until/offset are clamped to 0 (minimum: 0 in the spec; no request-validation middleware is mounted).
  • events.S2Reader — one-shot, bounded S2 reader injected into ApiService; builds its S2 client once and reuses it across pages.
  • Configurable batcherS2_BATCHER_LINGER / S2_BATCHER_MAX_RECORDS env vars wire into the existing S2Config (defaults unchanged: 100ms / 50).
  • Category schema unified — the telemetry category enum is promoted to a shared TelemetryEventCategory schema; both the read (TelemetryEvent) and write (PublishEventRequest) sides reference it, so adding a category can't drift the two apart.
  • Fix: telemetry read endpoints no longer self-emit api_call events. The per-request api_call middleware was firing for ReadTelemetryEvents / StreamTelemetryEvents, appending to the very stream being read — a feedback loop that, at page size 1, prevented a paginated read from ever catching the tail. Reads are now side-effect-free.

Envelope/event schemas are unchanged. The response is encoded directly from events.Envelope (matching the SSE stream and publish endpoints), so all telemetry endpoints emit an identical envelope shape.

Consumers paginate via X-Next-Offset (an S2 stream position) — not the envelope's seq field; the param/header docs call this out.

This endpoint is harmless until a consumer calls it, so it can ship independently. A follow-up on the control-plane API will mirror it and wire Stainless offset_pagination so the SDK auto-paginates.

Note on category-filtered pagination

The category filter is applied after the cursor-bounded S2 read (S2 has no server-side filtering), so a filtered page can come back empty while X-Has-More is true. has_more is authoritative — a client pages until it is false, not until a page is empty. Verified end-to-end (see below).

Test plan

  • Unit: handler (S2-disabled empty path), buildReadOptions (offset-over-since precedence, negative-bound clamping, Count mapping), pageSize clamping, filterByCategory, and a middleware regression asserting the telemetry read endpoints emit no api_call.
  • e2e (server/e2e, real S2 via KI_E2E_BACKEND=docker): round-trip, category filter, empty-window [], multi-page cursor walk (ascending, no gaps/dupes), category-filtered pagination across empty intermediate pages, and a read-is-side-effect-free assertion. Auto-skips without S2_BASIN/S2_ACCESS_TOKEN/S2_STREAM.
  • go build, go vet, package tests green; make oapi-generate output committed.

Live runs

Ran against a real S2 stream (freshly built headless image, KI_E2E_BACKEND=docker):

--- PASS: TestReadTelemetryEvents
--- PASS: TestReadTelemetryEventsPagination
--- PASS: TestReadTelemetryEventsFilteredPagination
ok  github.com/kernel/kernel-images/server/e2e

Also exercised by hand over HTTP at scale — published ~1000 events and paged the full stream at limit=50:

pages walked: 19
total_records=950  min_seq=1  max_seq=950  duplicates=0  gaps=0
RESULT: PASS — full stream 1..950 returned exactly once, contiguous, in order

limit=1 terminates cleanly (the feedback-loop fix), and a windowed (until) read terminates even while the stream is actively appended (one benign empty boundary page).

🤖 Generated with Claude Code


Note

Medium Risk
New external read surface and S2 pagination semantics (especially category-filtered empty pages) need correct client cursor handling; middleware skip fixes a real pagination bug on read paths.

Overview
Adds paginated read access to telemetry archived in S2 via GET /telemetry/events, so clients can recover events after they leave the live SSE ring buffer. When S2 is not configured, the handler returns 200 with [] (never null). Pages use X-Has-More / X-Next-Offset (S2 stream cursor, not envelope seq), with since/until, limit (default 100, max 1000), and post-read category filtering.

events.S2Reader performs bounded, reusable-client S2 reads; ApiService takes an optional reader and buildReadOptions maps query params (including negative-bound clamping and offset over since).

S2 append batching is tunable via S2_BATCHER_LINGER / S2_BATCHER_MAX_RECORDS env vars passed into the existing writer.

OpenAPI promotes a shared TelemetryEventCategory for publish and read paths (replacing PublishEventRequestCategory).

Telemetry HTTP middleware skips ReadTelemetryEvents and StreamTelemetryEvents so reads do not append api_call events into the stream being consumed.

E2e tests cover round-trip, pagination, filtered empty pages, and read side-effect-freedom against real S2 when creds are set.

Reviewed by Cursor Bugbot for commit 4c4e6ff. Bugbot is set up for automated code reviews on this repo. Configure here.

@archandatta archandatta force-pushed the archand/kernel-1338/add-s2-events-get branch from bb9b965 to 0f44c61 Compare June 11, 2026 12:55
archandatta and others added 13 commits June 15, 2026 16:55
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… limit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reading telemetry emitted its own api_call event back into the stream,
a feedback loop that prevents a paginated read from catching the tail.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A fresh s2.New() per request re-handshakes TLS/HTTP2 on every page of a
paginated walk; build the client once in NewS2Reader and reuse it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pins the telemetryReadOps skip against operationId drift; without it a
rename silently revives the read feedback loop with all fast tests green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- unify the category enum: PublishEventRequest now references the shared
  TelemetryEventCategory schema instead of inlining it (no drift on add)
- clamp negative since/until/offset to 0 rather than wrapping to a huge
  uint64; add minimum:0 to the since/until schema
- drop the comment claiming a pagination "convention" that does not exist,
  and stop leaking the internal ring-buffer size into the replay doc
- cover category-filtered pagination (empty intermediate pages) e2e and
  the negative-bound clamping unit case

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@archandatta archandatta force-pushed the archand/kernel-1338/add-s2-events-get branch from 49d0a07 to 48e3e12 Compare June 15, 2026 16:56
@archandatta archandatta marked this pull request as ready for review June 15, 2026 16:56
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Browser telemetry events that age out of the live SSE stream are now accessible via a new paginated read endpoint, so consumers can retrieve the full history of archived events rather than only live-streamed ones. Also fixes a feedback loop where the per-request api_call middleware was self-emitting events for telemetry read endpoints, which could prevent paginated reads from ever catching the stream tail at small page sizes.

Intended effect:

  • GET /telemetry/events first traffic: baseline 0 requests (new endpoint); confirmed if first requests land with HTTP 200 and no sustained 5xx.
  • StreamTelemetryEvents feedback-loop fix: "copy telemetry stream" ERROR log baseline is 25–47/hr; confirmed if it stays at or below that level after deploy.

Risks:

  • New endpoint 500s from S2 read failure — HTTP 500s on /browsers/{id}/telemetry/events; alert if any sustained 5xx appear on this path.
  • Feedback-loop regression — "copy telemetry stream" ERROR log count; alert if it exceeds 100/hr (baseline 25–47/hr active hours).
  • Middleware skips unintended operations — Railway HTTP 5xx rate; alert if it exceeds 0.1% sustained (baseline 0–0.018%).
  • S2Reader connection errors — any new "s2storage" ERROR log lines that weren't present pre-deploy; alert on first occurrence.
  • Category-filtered empty-page loop — if a consumer exits on empty page rather than checking X-Has-More, they stop paginating early; manual checkpoint on first consumer integration.

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@archandatta archandatta requested review from Sayan- and rgarcia June 15, 2026 18:14
Reflection-assert the skip-set keys are real StrictServerInterface methods
so an operationId rename can't silently revive the read feedback loop, and
skip-with-warn an unparseable archived record so one bad record can't wedge
every paginated read past its seqnum.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@rgarcia rgarcia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed — solid PR. core pagination mechanics are correct (i traced the S2 cursor/has_more termination through the SDK) and match the main API's offset_pagination convention; unit + e2e coverage is thorough. main things are spec-surface wording and a couple of consistency items before this gets relayed to the public API. no correctness bugs.

API surface — keep implementation out of the spec (this gets relayed to the public API)

  • server/openapi.yaml:1383 (summary) — "Read archived telemetry events from durable storage" is implementation-heavy → something like "Read telemetry events for a browser session"
  • server/openapi.yaml:1385-1390 (description) — drop the S2 / "age out of the live SSE ring buffer" / "durable storage is not configured" detail → e.g. "Reads telemetry event data for the browser session."
  • server/openapi.yaml:1398 (offset param) — "This is a stream position" leaks implementation; reword as an opaque pagination cursor. keep the useful guidance (pass back X-Next-Offset; don't derive it from the response body / it's not the envelope seq).

Pagination consistency (vs kernel/kernel packages/api/openapi.yaml)

  • core mechanics MATCH: header-based offset pagination, top-level array body, offset/limit params, X-Has-More/X-Next-Offset headers — i.e. the Stainless offset_pagination model. the planned SDK wiring should work as-is.
  • server/openapi.yaml:1428 (limit max) — max is 1000; every main-API list endpoint caps limit at 100 (default 20). align to 100, or call out the higher cap as a deliberate telemetry exception.
  • server/openapi.yaml:1406 / :1414 (since/until) — typed as int64 unix ms. the main API's since is a string (RFC-3339 or a duration like 5m, e.g. packages/api/openapi.yaml:8346) and there's no until anywhere. prefer RFC-3339/duration for since to stay consistent for the public relay, and treat until as a documented telemetry-specific addition.
  • offset taking precedence over since is new behavior (no main-API endpoint combines a cursor with a time filter) — worth an explicit doc note.

Handler

  • server/cmd/api/api/events.go:142 (defaultReadWindow) — only sets the start (now−5m); the end is bounded by count, not a window. reword the name/comment to "default start" to avoid implying an end bound.
  • server/cmd/api/api/events.go:190-195 (buildReadOptions) — until-only requests (no since/offset) still anchor the start at now−5m via the default branch, so until < now−5m silently yields an empty page. consistent with the documented since default, but a mild footgun — confirm intended, or special-case until-only to start from the beginning.

Test coverage

  • the core S2Reader.Read + has_more/cursor logic (the subtlest code here) is exercised only by the e2e tests, which auto-skip without S2 creds + docker. worth confirming CI runs them with creds; otherwise an interface seam over the S2 read session for a mocked unit test would give fast-feedback coverage of the cursor math on every run.
  • windowed (until) pagination termination on an actively-appended stream is hand-verified only (the "one benign empty boundary page" in the description), not in a committed test. my read of the SDK agrees it terminates (a zero-record page → NextReadPosition()==nilhas_more=false), but consider an e2e guard.

Nits (optional)

  • duplicate "is S2 configured?" predicate — the explicit three-field if config.S2Basin != "" && … in main.go for the writer vs NewS2Reader's internal copy of the same check; a small config.S2Enabled() helper would keep them from drifting.
  • config defaults (100ms/50) live in both the config struct tags and the s2storage.go zero-value fallbacks (already acknowledged in a comment) — just duplication to be aware of.
  • S2Reader.Read unbounded-read guard (Count==nil && Until==nil → Until=now) is dead from the only caller (the handler always sets Count) and untested — fine as defensive library code, just noting it's unreachable from this path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants