buffer: more copy optimizations#63892
Open
ronag wants to merge 2 commits into
Open
Conversation
Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte range from one ArrayBufferView to another. Unlike the existing v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the *views*: it resolves each view's data pointer directly (JSTypedArray::DataPtr / DataView::data_pointer) and reads the backing buffer's shared/immutable/ detached flags as plain field loads, without ever materializing or fetching the views' ArrayBuffers. This exists because materializing the ArrayBuffer is expensive. Profiling Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small copies the dominant native cost is ArrayBufferView::Buffer() -> JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call (not just first materialization) and incurred twice (source and target). Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of a small copy is spent turning two views into ArrayBuffers and querying them piecemeal, while the actual memmove is ~4%. Routing the node binding through CopyArrayBufferBytes forces all of that onto the embedder side; a view-level entry point folds it into a single call of cheap field reads. Byte ranges are clamped to both views' byte lengths. Nothing is copied when the source is detached/out-of-bounds or the target is detached/ out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes actually copied is returned. When both views are backed by a SharedArrayBuffer a relaxed-atomic memmove is used, honoring the SharedArrayBuffer memory model; any other combination performs a plain memmove on the backing store (matching CopyArrayBufferBytes). Carried as a floating patch; v8_embedder_string is bumped to -node.22 accordingly. It is the natural sibling of the CopyArrayBufferBytes API added in the preceding floating patch and touches nothing but the public ArrayBuffer/ArrayBufferView API. Refs: nodejs#55422 Signed-off-by: Robert Nagy <ronagy@icloud.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
|
Review requested:
|
Member
Author
|
@targos @addaleax @joyeecheung This is a relatively trivial v8 patch. Would you be comfortable with floating this? I've already spent money on funding work to upstream |
Member
Author
|
I've tried the |
Route the native backing of Buffer.prototype.copy (CopyImpl, the `_copy` binding) through the new v8::ArrayBufferView::CopyArrayBufferViewBytes API instead of v8::ArrayBuffer::CopyArrayBufferBytes. The previous binding had to convert both views to ArrayBuffers (ArrayBufferView::Buffer()), read their byte offsets (ByteOffset()) and test shared-ness (IsSharedArrayBuffer()) before the copy -- around half a dozen separate V8 API calls per copy. The view-level API does all of that internally from the views' own fields in a single call, so the binding now just forwards the two views, the view-relative offsets and the length. Profiling on AMD EPYC 9135 (x86-64) attributed the small-copy cost almost entirely to that view->buffer conversion: ArrayBufferView::Buffer() / JSTypedArray::GetBuffer() alone was ~25% of runtime, paid every call and twice per copy. Resolving the buffer in JS instead (passing source.buffer/target.buffer to the binding) was measured and is worse: the typed-array `.buffer` getter is not JIT-inlined and dispatches through the CEntry trampoline to a C++ builtin, costing ~36%. The view-level copy keeps all existing semantics: byte-range clamping, no-op (0 bytes) on a detached or immutable target, relaxed-atomic memmove when both sides are SharedArrayBuffer-backed, plain memmove otherwise. The JS-side view clamping in copyImpl is retained: V8 clamps to the underlying backing store, which for pooled Buffers is the whole shared pool rather than the individual view. buffer-copy.js, median of 30 interleaved runs, AMD EPYC 9135 x86-64 (all changes p < 0.001, Welch t = 19-50): partial=false bytes=8: 42.2 -> 62.4 Mops/s (+48%) partial=true bytes=8: 42.0 -> 62.8 Mops/s (+49%) partial=false bytes=128: 42.2 -> 61.7 Mops/s (+46%) partial=true bytes=128: 42.0 -> 63.1 Mops/s (+50%) partial=false bytes=1024: 35.1 -> 47.3 Mops/s (+35%) partial=true bytes=1024: 37.8 -> 55.1 Mops/s (+46%) The gain is largest for small/medium copies, where per-call overhead dominates, and tapers for 1024-byte copies as the memmove itself grows. Also inlines the former _copyActual helper (only caller was copyImpl) into copyImpl, folding a redundant target.byteLength read. Refs: nodejs#55422 Signed-off-by: Robert Nagy <ronagy@icloud.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f511275 to
afe87da
Compare
Member
Author
|
So this is basically what we tried to upstream to V8 but was rejected and turned into the slower CopyArrayBufferBytes. |
Member
|
Can you explain why the upstream rejected this patch and why the reason wouldn't apply here? |
Member
Author
|
https://chromium-review.googlesource.com/c/v8/v8/+/7735151 They preferred an implementation that matched the spec. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add v8::ArrayBufferView::CopyArrayBufferViewBytes, which copies a byte
range from one ArrayBufferView to another. Unlike the existing
v8::ArrayBuffer::CopyArrayBufferBytes, it operates on the views: it
resolves each view's data pointer directly (JSTypedArray::DataPtr /
DataView::data_pointer) and reads the backing buffer's shared/immutable/
detached flags as plain field loads, without ever materializing or
fetching the views' ArrayBuffers.
This exists because materializing the ArrayBuffer is expensive. Profiling
Buffer.prototype.copy (perf, AMD EPYC 9135, x86-64) showed that for small
copies the dominant native cost is ArrayBufferView::Buffer() ->
JSTypedArray::GetBuffer() -- ~25% of total runtime, paid on every call
(not just first materialization) and incurred twice (source and target).
Add ByteOffset() (~7%) and IsSharedArrayBuffer() (~6%) and roughly 38% of
a small copy is spent turning two views into ArrayBuffers and querying
them piecemeal, while the actual memmove is ~4%. Routing the node binding
through CopyArrayBufferBytes forces all of that onto the embedder side; a
view-level entry point folds it into a single call of cheap field reads.
Byte ranges are clamped to both views' byte lengths. Nothing is copied
when the source is detached/out-of-bounds or the target is detached/
out-of-bounds or backed by an immutable ArrayBuffer; the number of bytes
actually copied is returned. When both views are backed by a
SharedArrayBuffer a relaxed-atomic memmove is used, honoring the
SharedArrayBuffer memory model; any other combination performs a plain
memmove on the backing store (matching CopyArrayBufferBytes).
Carried as a floating patch; v8_embedder_string is bumped to -node.22
accordingly. It is the natural sibling of the CopyArrayBufferBytes API
added in the preceding floating patch and touches nothing but the public
ArrayBuffer/ArrayBufferView API.