This repository was archived by the owner on Aug 18, 2023. It is now read-only.
Manually initialize GcBox contents post-allocation to reduce memory copying #2
Ideally, when calling
You'd assume `data` to be constructed in place or moved into newly allocated memory (the former isn't really common, as stable Rust lacks placement-new-like features). And with the struct being relatively big, you'd expect the compiler to generate a `memcpy` call to simply move the structure's bytes into place.

The issue currently is that, due to either rustc not being smart enough or the gc-arena code not being optimizer-friendly (or both), the compiler can `memcpy` your Data object several times before actually moving it into its final place.

For example here:
The generated code will first do a `memcpy` to move `t` into the `gc_box` object on the stack, then allocate memory, and then do a second `memcpy` to move the `gc_box` object into heap memory. For some reason, on the wasm target the compiler is even worse at optimizing this; in the worst case, I've seen four `memcpy` calls for a single GC allocation. This can obviously cause unnecessary overhead.

My patch helps the compiler by simplifying the initialization: first we allocate the uninitialized memory, then we manually build the `GcBox` by moving its fields into place. This way the object `t` is moved straight into its final place without being moved into the intermediate stack variable `gc_box`.

I was trying to show a comparison on godbolt, but as soon as I drop some layers of abstraction, rustc catches on and generates better code. This is my best attempt: https://godbolt.org/z/aaK75W . You can see that in `old()` there is one `memcpy` before allocation and one after, but in `new()` there is only one `memcpy`.

Here's a comparison on "production" code, with a decompiled wasm build of https://github.com/ruffle-rs/ruffle/ . In practice, I've seen this cause up to 15-20% speedups in some edge cases.
Before, 4x `memcpy`:

After, just two:
And when rust-lang/rust#82806 gets merged into rustc, with my patch it'll become just one, which is how it's supposed to work :)
I made sure the patch passes tests with miri.
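
For readers who want the general shape of the change without opening the diff, here is a minimal sketch of the two allocation patterns described above. It is not the gc-arena code: `BoxSketch`, `alloc_old` and `alloc_new` are hypothetical names made up for illustration, and the real `GcBox` has different fields. Whether the first version actually produces extra `memcpy` calls depends on the compiler version, target and optimization level.

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::ptr::addr_of_mut;

// Hypothetical stand-in for a GC box: a small header plus the payload.
// (Assumption for illustration; not the real gc-arena `GcBox` layout.)
struct BoxSketch<T> {
    flags: u8,
    value: T,
}

// "Old" shape: build the whole box on the stack, then move it to the heap.
// The compiler may copy `t` into `gc_box` and then copy `gc_box` again.
fn alloc_old<T>(t: T) -> *mut BoxSketch<T> {
    let gc_box = BoxSketch { flags: 0, value: t };
    Box::into_raw(Box::new(gc_box))
}

// "New" shape: allocate uninitialized memory first, then write each field
// directly into its final location, with no intermediate stack value.
fn alloc_new<T>(t: T) -> *mut BoxSketch<T> {
    // The layout is never zero-sized because of the `flags` byte.
    let layout = Layout::new::<BoxSketch<T>>();
    unsafe {
        let p = alloc(layout) as *mut BoxSketch<T>;
        if p.is_null() {
            handle_alloc_error(layout);
        }
        // `addr_of_mut!` yields field pointers without creating a reference
        // to the still-uninitialized struct.
        addr_of_mut!((*p).flags).write(0);
        addr_of_mut!((*p).value).write(t);
        p
    }
}
```

Both functions return a pointer allocated by the global allocator with the layout of `BoxSketch<T>`, so in this sketch either one can be released with `Box::from_raw`. Writing through `addr_of_mut!` field pointers is the kind of thing miri is useful for checking.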