test(qwp): fix flaky cold-start timing in async close-drain test by mtopolnik · Pull Request #54 · questdb/java-questdb-client

mtopolnik · 2026-06-19T13:43:21Z

Problem

CloseDrainTest.testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong runs 20
iterations. Each iteration constructs a store-and-forward async Sender, writes
and flushes one row, then closes the sender and asserts that close completes
within a 2500ms budget. That budget targets the latency of close()'s drain,
but the test timed the whole try (Sender sender = Sender.fromConfig(cfg)) { ...; flush(); } block — construction and close — against it.

Iteration 0 builds the first store-and-forward async sender in the JVM, so
fromConfig pays a one-time cold-start cost: class loading, JIT warmup, and the
store-and-forward buffer mmap. On a loaded CI runner that cold start reached
~3.26s and tripped the 2500ms assertion, even though close() itself took
~12ms. The failure is load-dependent, so it surfaced as a rare flake on busy
runners rather than deterministically.

Fix

Move the t0 = System.nanoTime() start inside the try block so the timed region
wraps only sender.close(). This matches the sibling test
testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain, which already times
close() alone. The cold-start cost now sits outside the measured window, where
it belongs: the assertion was always about drain latency, not sender
construction.

QwpWebSocketSender.close() is idempotent — it guards on a closed flag and
returns immediately on the second call — so the try-with-resources auto-close at
the end of the block is a no-op.

Tradeoffs

The test no longer bounds fromConfig/construction time at all. That was never
its intent — the 2500ms budget targets the close/drain path — and construction
runs in every other test in the suite, so this is not a coverage loss here. A
genuine construction-time regression would need a dedicated test rather than
being caught incidentally by this one.

Test plan

Ran CloseDrainTest#testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong and
#testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain locally: 2 tests
run, 0 failures, 0 errors
Confirmed the 20-iteration WasUpAllAlong test passes with the close-only
timer; every iteration's close() completed within the 2500ms budget
Verified QwpWebSocketSender.close() guards on its closed flag, so the
implicit try-with-resources close is a no-op

testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong timed the whole try (Sender = fromConfig(...)) { write; flush; } close block against a 2500ms budget, but the assertion budgets close()'s drain latency, not sender construction. Iteration 0 is the first store-and-forward async sender in the JVM, so fromConfig pays a one-time cold-start cost (class loading, JIT, sf buffer mmap). On a loaded CI runner that cold start reached ~3.26s and tripped the assertion, even though close() itself took ~12ms. Move t0 to wrap only sender.close(), matching the sibling test testAsyncCloseDrainSucceedsWhenServerStartsDuringDrain. close() is idempotent, so the try-with-resources auto-close is a no-op. Measured close() latency stays in the single-digit milliseconds across all 20 iterations, well within the 2500ms budget.

mtopolnik · 2026-06-19T13:56:51Z

Level 2. Test-only change: one 7-line move of the timing window in CloseDrainTest.testAsyncCloseDrainSucceedsWhenServerWasUpAllAlong, plus an explicit sender.close() inside the try-with-resources. No production code is touched.

I verified the single load-bearing claim — that QwpWebSocketSender.close() is idempotent — against source, and ran an independent fresh-context adversarial pass. Both confirm the change is correct.

Critical

None.

Moderate

None.

Minor

2. Residual flake surface: 2500ms assertion vs 3000ms drain timeout (in-diff, CloseDrainTest.java:360,373). The assertion threshold (< 2500) sits only 500ms below the configured close_flush_timeout_millis=3000. The PR correctly removes the dominant flake cause (the ~3.26s construction cold-start, now outside the timed window). The remaining theoretical window is a close() whose own drain takes 2500–3000ms — i.e. close() succeeds but trips the assertion. With DelayingAckHandler(0) the drain round-trip is ~12ms, so the practical margin is ~200×; this is not blocking. Worth a one-line awareness note only because the two constants are coupled — if the timeout is ever lowered toward 2500ms, that 500ms gap becomes the flake.

3. (Pre-existing, not introduced by this PR) sf_dir temp directories leak. sfDirOpt() (CloseDrainTest.java:379) mints a fresh qdb-close-drain-<nanotime> dir under java.io.tmpdir each of the 20 iterations, and nothing deletes them. This diff neither adds nor changes that code, so it's out of scope for #54 — flagging only so it isn't mistaken for clean. The sibling test does the same.

Summary

Verdict: approve. The fix is correct and the diff is minimal. The 2500ms budget always targeted drain latency; moving t0 inside the try so it brackets only sender.close() matches that intent and matches the sibling test. The idempotency claim that makes the redundant auto-close safe is verified against QwpWebSocketSender.close(). The PR description is level-headed and honestly states the coverage tradeoff (construction time is no longer bounded — acceptable, as it never was the assertion's purpose). Commit message is well-formed (45-char title, full body, active voice).

mtopolnik added the bug Something isn't working label Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(qwp): fix flaky cold-start timing in async close-drain test#54

test(qwp): fix flaky cold-start timing in async close-drain test#54
mtopolnik wants to merge 1 commit into
mainfrom
mt_fix-closedrain-coldstart-flake

mtopolnik commented Jun 19, 2026

Uh oh!

mtopolnik commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mtopolnik commented Jun 19, 2026

Problem

Fix

Tradeoffs

Test plan

Uh oh!

mtopolnik commented Jun 19, 2026

Critical

Moderate

Minor

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant