Fix kmsg timestamps#288
Conversation
|
Created a monitoring plan for this PR. What this PR does: Fixes OOM kill events ( Intended effect:
Risks:
Status updates will be posted automatically on this PR as monitoring progresses. |
rgarcia
left a comment
There was a problem hiding this comment.
approving — clean, low-risk fix. traced the chain end to end: time.Now() (kmsg_linux.go:48) → KmsgMessage.Timestamp → feed() (sysmon.go:116) → OomInstance.TimeOfDeath (kmsg.go:190) → Event.Ts (sysmon.go:178), so the wall-clock stamp genuinely reaches the published OOM event. the SeekEnd() precondition the comment relies on holds (kmsg_linux.go:23), and the new test passes locally.
a few doc/test nits, none blocking:
nits
AGENTS.md:52(gotcha #13) — lists the supervisord shim as an in-process producer that "must stamptime.Now()themselves", but it's actually an HTTP producer:cmd/supervisord-shimPOSTs aPublishEventRequestwith nots(main.go:236), so it gets stamped at ingest byTelemetrySession.publishLocked(telemetry.go:76-77) — i.e. the doc's other clause. consider moving it to the "HTTP-published … stamped by the API handler" side, or swapping in a genuine in-process example (e.g. cdpmonitor, which stampstime.Now().UnixMicro()itself).AGENTS.md:52— sincepublishLockedalready defaults a zeroTsto wall-clock, the precise hazard #13 guards against is a producer settingTsto a non-zero, non-wall-clock value (exactly the envelope bug here). wording it that way would make the "why" land harder.kmsg_linux_test.go:52— the!Equal(envelopeTime)assert is redundant with the[before, after]bracket (beforeis decades after the envelope time, so!Before(before)already excludes it). harmless and documents intent — optional to drop.
nice touch documenting the invariant on Event.Ts itself (event.go) so the next producer author sees it at the field.
Overview
system_oom_killevents were stamped with the kmsg envelope timestamp, which isCLOCK_MONOTONIC-derived. That clock freezes while a unikraft VM is suspended (scale-to-zero), so on any VM that had standbyed the OOM timestamps skewed backward by the accumulated suspended duration — sometimes by hours.sysmonkmsg reader now stamps each record with wall-clock read time (time.Now()) instead of the envelope timestamp. We only ever read live records (the source seeks to end first), so read time is an accurate event time.events.Event.Tsand added a gotcha (WebRTC OSS launch #13) inAGENTS.md: all event timestamps must be wall-clock at emit/observe, never a monotonic/source clock.Test
kmsgparserSourcestamps observation time, not the envelope's monotonic timeNote
Low Risk
Localized change to kmsg timestamp stamping and documentation; behavior improves correctness on suspended VMs with minimal blast radius.
Overview
Fixes
system_oom_kill(and related OOM parsing) timestamps that could appear hours in the past on scale-to-zero VMs because kmsg envelope times are monotonic-derived and freeze while the VM is suspended.kmsgparserSourcenow sets eachKmsgMessage.Timestampto wall-clock read time (time.Now()at forward), not the parser’s envelope timestamp; comments onOomInstance.TimeOfDeathandKmsgMessagedescribe that contract.Documents the same rule on
events.Event.Tsand adds AGENTS.md gotcha #13 so in-process telemetry producers stamp at emit/observe. Adds a linux-gated unit test that the forwarded stamp is observation time, not the envelope time.Reviewed by Cursor Bugbot for commit 0b909ed. Bugbot is set up for automated code reviews on this repo. Configure here.