Skip to content

test(qwp): fix flaky cold-start timing in async close-drain test#54

Open
mtopolnik wants to merge 1 commit into
mainfrom
mt_fix-closedrain-coldstart-flake
Open

test(qwp): fix flaky cold-start timing in async close-drain test#54
mtopolnik wants to merge 1 commit into
mainfrom
mt_fix-closedrain-coldstart-flake

Conversation

@mtopolnik

Copy link
Copy Markdown
Contributor

Problem

CloseDrainTest.testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong runs 20
iterations. Each iteration constructs a store-and-forward async Sender, writes
and flushes one row, then closes the sender and asserts that close completes
within a 2500ms budget. That budget targets the latency of close()'s drain,
but the test timed the whole try (Sender sender = Sender.fromConfig(cfg)) { ...; flush(); } block — construction and close — against it.

Iteration 0 builds the first store-and-forward async sender in the JVM, so
fromConfig pays a one-time cold-start cost: class loading, JIT warmup, and the
store-and-forward buffer mmap. On a loaded CI runner that cold start reached
~3.26s and tripped the 2500ms assertion, even though close() itself took
~12ms. The failure is load-dependent, so it surfaced as a rare flake on busy
runners rather than deterministically.

Fix

Move the t0 = System.nanoTime() start inside the try block so the timed region
wraps only sender.close(). This matches the sibling test
testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain, which already times
close() alone. The cold-start cost now sits outside the measured window, where
it belongs: the assertion was always about drain latency, not sender
construction.

QwpWebSocketSender.close() is idempotent — it guards on a closed flag and
returns immediately on the second call — so the try-with-resources auto-close at
the end of the block is a no-op.

Tradeoffs

The test no longer bounds fromConfig/construction time at all. That was never
its intent — the 2500ms budget targets the close/drain path — and construction
runs in every other test in the suite, so this is not a coverage loss here. A
genuine construction-time regression would need a dedicated test rather than
being caught incidentally by this one.

Test plan

  • Ran CloseDrainTest#testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong and
    #testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain locally: 2 tests
    run, 0 failures, 0 errors
  • Confirmed the 20-iteration WasUpAllAlong test passes with the close-only
    timer; every iteration's close() completed within the 2500ms budget
  • Verified QwpWebSocketSender.close() guards on its closed flag, so the
    implicit try-with-resources close is a no-op

testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong timed the whole
try (Sender = fromConfig(...)) { write; flush; } close block against a
2500ms budget, but the assertion budgets close()'s drain latency, not
sender construction. Iteration 0 is the first store-and-forward async
sender in the JVM, so fromConfig pays a one-time cold-start cost (class
loading, JIT, sf buffer mmap). On a loaded CI runner that cold start
reached ~3.26s and tripped the assertion, even though close() itself
took ~12ms.

Move t0 to wrap only sender.close(), matching the sibling test
testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain. close() is
idempotent, so the try-with-resources auto-close is a no-op. Measured
close() latency stays in the single-digit milliseconds across all 20
iterations, well within the 2500ms budget.
@mtopolnik mtopolnik added the bug Something isn't working label Jun 19, 2026
@mtopolnik

Copy link
Copy Markdown
Contributor Author

Level 2. Test-only change: one 7-line move of the timing window in CloseDrainTest.testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong, plus an explicit sender.close() inside the try-with-resources. No production code is touched.

I verified the single load-bearing claim — that QwpWebSocketSender.close() is idempotent — against source, and ran an independent fresh-context adversarial pass. Both confirm the change is correct.

Critical

None.

Moderate

None.

Minor

2. Residual flake surface: 2500ms assertion vs 3000ms drain timeout (in-diff, CloseDrainTest.java:360,373). The assertion threshold (< 2500) sits only 500ms below the configured close_flush_timeout_millis=3000. The PR correctly removes the dominant flake cause (the ~3.26s construction cold-start, now outside the timed window). The remaining theoretical window is a close() whose own drain takes 2500–3000ms — i.e. close() succeeds but trips the assertion. With DelayingAckHandler(0) the drain round-trip is ~12ms, so the practical margin is ~200×; this is not blocking. Worth a one-line awareness note only because the two constants are coupled — if the timeout is ever lowered toward 2500ms, that 500ms gap becomes the flake.

3. (Pre-existing, not introduced by this PR) sf_dir temp directories leak. sfDirOpt() (CloseDrainTest.java:379) mints a fresh qdb-close-drain-<nanotime> dir under java.io.tmpdir each of the 20 iterations, and nothing deletes them. This diff neither adds nor changes that code, so it's out of scope for #54 — flagging only so it isn't mistaken for clean. The sibling test does the same.

Summary

Verdict: approve. The fix is correct and the diff is minimal. The 2500ms budget always targeted drain latency; moving t0 inside the try so it brackets only sender.close() matches that intent and matches the sibling test. The idempotency claim that makes the redundant auto-close safe is verified against QwpWebSocketSender.close(). The PR description is level-headed and honestly states the coverage tradeoff (construction time is no longer bounded — acceptable, as it never was the assertion's purpose). Commit message is well-formed (45-char title, full body, active voice).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant