stream: reduce allocations on WHATWG streams hot paths#63876
Conversation
MattiasBuelens
left a comment
There was a problem hiding this comment.
Looks pretty sensible to me! 👍
| // No read is in flight. Mirror the buffered fast path of | ||
| // ReadableStreamDefaultReader.read(): when data is already queued | ||
| // in a default controller, resolve immediately without allocating | ||
| // a read request. The result settles synchronously, so leaving | ||
| // state.current undefined matches the state the slow path reaches | ||
| // once its read request callbacks have settled. |
There was a problem hiding this comment.
I wonder if we always have to inline this... 🤔
I ported some of your previous optimizations to web-streams-polyfill, but instead of copying the code around I made defaultReader.read() create a different kind of ReadRequest if it knows that it will be resolved synchronously.
Perhaps we can do the same here, and have nextSteps create a different kind of AsyncIteratorReadRequest if it can be resolved synchronously? (I forgot to do that in my polyfill, it seems. 😅 ) Or would that risk turning readableStreamDefaultReaderRead megamorphic? (Or maybe it already is?)
| queueMicrotask(() => { | ||
| controller[kState].started = true; | ||
| assert(!controller[kState].pulling); | ||
| assert(!controller[kState].pullAgain); | ||
| readableStreamDefaultControllerCallPullIfNeeded(controller); | ||
| }); |
There was a problem hiding this comment.
Nit-pick: maybe pull this callback into a const, so we can reuse it for the PromisePrototypeThen below?
Pure-JavaScript optimizations to lib/internal/webstreams/* that remove per-chunk and per-construction allocations without any observable behavior change. Per-chunk: - Reuse the pull and write promise-reaction closures per controller, created lazily on the first pull/write, instead of allocating two fresh closures per chunk. - Add the buffered fast path to the ReadableStream async iterator, mirroring ReadableStreamDefaultReader.read(): when data is queued in a default controller, dequeue directly and skip the read request, the deferred promise, and the promise chaining on the next call. - Specialize the user-algorithm promise-callback wrappers by arity (0/1/2), replacing the rest-parameter + ReflectApply form that allocated an arguments array per invocation. - Share immutable nil records for the writable closeRequest, inFlightWriteRequest, inFlightCloseRequest and pendingAbortRequest resets; push the PromiseWithResolvers() record directly as the write request instead of rebuilding it. Per-construction: - Run the post-start step through queueMicrotask() when the start algorithm result is not an object, instead of wrapping it in new Promise((r) => r(result)) plus a two-closure reaction. Object and thenable results keep the promise path (adoption timing and .then lookups are observable). - Materialize the reader/writer .closed and writer .ready records, and the stream-level closed-promise interop record, lazily on first observation. Their settlement is fully derivable from the stream/writer state, so construction allocates no promise for them and the settle sites only touch a record once it exists. Identity follows the spec: in-place rejections keep the cached record, the replace cases drop the cache so the next observation derives the new promise. - Remove dead allocations: the never-read close record in the writable stream state, the placeholder reader/writer records and duplicate read-request arrays that setup replaced, the per-stream () => 1 size algorithms, and the kControllerErrorFunction placeholder plus bound function (now a prototype method; byte streams keep their historical no-op there). The benchmark/webstreams/creation.js benchmark is updated to collect garbage before each timed window: at the default n=50000 the window is only ~20-40ms, short enough that garbage from the setup phase is collected inside it and dominates the result (the ReadableStreamBYOBReader row swung from -52% to +13% on the same change purely by varying n). Benchmark results vs main, same-day builds, benchmark/compare.js --runs 10: creation ReadableStreamDefaultReader (n=500k) +164% *** creation ReadableStreamBYOBReader (n=500k) +152% *** creation WritableStream (n=500k) +97% *** creation ReadableStream.tee (n=500k) +61% *** creation TransformStream (n=500k) +59% *** creation ReadableStream (n=500k) +50% *** readable-async-iterator +35% *** pipe-to (16 hwm configs) +4.8..+9.3% *** js_transfer WS / RS / TS +8.0% / +7.9% / +6.5% *** readable-read-buffered bs=1 +8.7% *** readable-read normal / byob +8.0% / parity WPT streams/compression/encoding results are identical to main (1403/338/3822 subtests passed, same 8 expected failures by name) and all webstreams-related parallel tests pass. Assisted-by: Claude Fable 5 <noreply@anthropic.com>
|
Updated benchmarks after the regression fix: |
| // each known call-site arity gets its own wrapper. The exact number of | ||
| // arguments passed through to the user callback is observable and must be | ||
| // preserved. | ||
| function createPromiseCallback0(name, fn, thisArg) { |
There was a problem hiding this comment.
Can we give them more readable names? e.g. createPromiseCallbackNoParams, createPromiseCallback1Param, createPromiseCallback2Params
It would probably makes sense to split this into its own PR
There was a problem hiding this comment.
I would actually prefer not to. This is relatively small, that fighting CI only once would save a significant amount of time.
6d140aa to
92194f3
Compare
Pure-JavaScript allocation reductions on the WHATWG streams hot paths partially based on the findings of #63872 : reused promise-reaction closures per controller (pull/write), a buffered fast path in the async iterator,
queueMicrotask()for non-thenable start results, arity-specialized algorithm wrappers, shared nil state records, and removal of several dead per-instance allocations. No observable behavior change: WPT streams/compression/encoding results are identical tomain(same subtests passing, same 8 expected failures by name).Benchmark results (
benchmark/compare.js --runs 10, both binaries built the same day from the same toolchain):The
creation.jsrows at the stockn=50000measure a 20-40ms window and are unreliable; re-run at--set n=500000: