This adds support for a newly defined tail-call ABI for s390x
as well as supporting tail calls themselves.
We now use the tail-call ABI by default for Wasmtime, and
enable tail calls by default.
This also allows getting rid of a number of special cases
and test exclusions for s390x.
Fixes: https://github.com/bytecodealliance/wasmtime/issues/6530
* Cranelift: Stop sign-extending `Imm64` immediates in builder methods
Originally, we sign-extended `Imm64` immediates from their controlling type
var's width to `i64` in the builder methods for `BinaryImm64` and a couple other
instruction formats: https://github.com/bytecodealliance/wasmtime/pull/1687
At some point, we decided that the unused bits in the `Imm64` when the
controlling type var's width is less than 64 should be zero. And we started
asserting that in various places, for example:
87817f38a1/cranelift/codegen/src/machinst/lower.rs (L1237-L1243)
However, we never removed or updated the sign-extension code in the builder
methods. This commit switches the builder methods over to masking the `Imm64` to
just the valid bits, effectively doing a zero extend from the controlling type
var's width instead of a sign extend.
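As a rough illustration (standalone Rust, not the actual Cranelift helper), the masking that the builder methods now perform looks like this:

```rust
/// Keep only the low `ty_bits` of a 64-bit immediate, i.e. zero-extend from
/// the controlling type's width rather than sign-extending as before.
fn mask_imm64_to_width(imm: i64, ty_bits: u32) -> i64 {
    debug_assert!(ty_bits > 0 && ty_bits <= 64);
    if ty_bits == 64 {
        imm
    } else {
        let mask = (1u64 << ty_bits) - 1;
        (imm as u64 & mask) as i64
    }
}

// For example, an `iconst.i32` of -1 is now stored as 0x0000_0000_ffff_ffff
// rather than 0xffff_ffff_ffff_ffff:
// assert_eq!(mask_imm64_to_width(-1, 32), 0xffff_ffff);
```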
To avoid too much churn in tests, I made `display_inst` print the
sign-extended-from-the-controlling-type-variable's-width value of `Imm64`
immediates. I also made the `Display` implementation for `Imm64` always print in
decimal form with a `-` for negative numbers, rather than the hex it would
previously print for large negative numbers, so that `iconst.i32 0x8000_0000`
doesn't display as `iconst.i32 0xffff_ffff_8000_0000`, which is otherwise pretty
confusing.
* Rename condition
* return new immediates instead of mutating
* Update some test expectations
* Add v128.const support to Winch
* Remove next_vr and vector_reg_for methods
* Adjust alignment and slot size for v128
* Forgot to update disas tests
* Update unit tests
* Use 8 byte stack slot sizes
* Fix broken unit tests, add tests for vecs, and use ty size for vecs
Fixes: https://github.com/bytecodealliance/wasmtime/issues/8848
Similar to all the control instructions, any state must be explicitly
saved before emitting the code for `br_if`.
This commit ensures that live locals and registers are explicitly saved
before emitting the code for `br_if`. Prior to this commit, live
locals and registers were not saved every time, causing incorrect
behavior in cases where the calculation of the conditional argument
didn't trigger a spill.
This change introduces the explicit spill after calculating the branch
condition argument to minimize memory traffic in case the conditional is
already in a register.
In the original development of this feature, guided by JS AOT
compilation to Wasm of a microbenchmark heavily focused on IC sites, I
was seeing a ~20% speedup. However, in more recent measurements, on full
programs (e.g., the Octane benchmark suite), the benefit is more like
5%.
Moreover, in #8870, I attempted to switch over to a direct-mapped cache,
to address a current shortcoming of the design, namely that it has a
hard-capped number of callsites it can apply to (50k) to limit impact on
VMContext struct size. With all of the needed checks for correctness,
though, that change results in a 2.5% slowdown relative to no caching at
all, so it was dropped.
In the process of thinking through that, I discovered the current design
on `main` incorrectly handles null funcrefs: it invokes a null code pointer,
rather than loading a field from a null struct pointer. The latter was
specifically designed to cause the necessary Wasm trap in #8159, but I
had missed that the call to a null code pointer would not have the same
effect. As a result, we actually can crash the VM (safely at least, but
still no good vs. a proper Wasm trap!) with the feature enabled. (It's
off by default still.) That could be fixed too, but at this point with
the small benefit on real programs, together with the limitation on
module size for full benefit, I think I'd rather opt for simplicity and
remove the cache entirely.
Thus, this PR removes call-indirect caching. It's not a direct revert
because the original PR refactored the call-indirect generation into
smaller helpers and IMHO it's a bit nicer to keep that. But otherwise
all traces of the setting, the code pre-scan during compilation, the
special conditions tracked on tables, and the codegen changes are gone.
The epoch interruption implementation caches the current deadline in a
register, and avoids reloading that cache until the cached deadline has
passed.
However, the first epoch check happens immediately after the cache has
been populated on function entry, so there's never a reason to reload
the cache at that point. It only needs to be reloaded in loops. So this
commit eliminates the double-check on function entry.
When Cranelift optimizations are enabled, the alias analysis pass
correctly detected that this load was redundant, and the egraph pass
optimized away the `icmp` as well. However, since we don't do
conditional constant propagation, the branch couldn't be optimized away.
On x86 this lowered to a redundant `cmp`/`jae` pair of instructions in
every function entry, which this commit eliminates.
To keep track of what code we're generating for epoch interruptions,
I've also added disas tests with a trivial infinite loop.
This commit removes the `simm32` extractor from lowerings as it's not as
useful as it was when it was first introduced. Nowadays an `Imm64` needs
to be interpreted with the type known as well to understand whether bits
being masked off is significant or not. The old `simm32` extractor only
took an `Imm64`, meaning it was unable to do this and wouldn't match
negative numbers: the high 32 bits of an `Imm64` were always zero, so when
`simm32` took the `i64` value from the `Imm64` and tried to convert it to
an `i32`, the conversion failed for values that should have been negative.
This commit replaces `simm32`, and uses of it, with a new extractor
`i32_from_iconst`. This matches the preexisting `i64_from_iconst` and is
able to take the type of the value into account and produce a correctly
sign-extended value.
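Roughly (an illustrative sketch, not the real ISLE extractor), recovering a sign-extended `i32` from the masked `Imm64` bits looks like:

```rust
/// Recover a sign-extended `i32` from an `iconst`'s zero-extended `Imm64`
/// bits, given the controlling type's bit width; returns `None` (no match)
/// if the value doesn't fit in an `i32`.
fn i32_from_iconst_bits(bits: u64, ty_bits: u32) -> Option<i32> {
    // Sign-extend from the controlling type's width up to 64 bits...
    let shift = 64 - ty_bits;
    let value = ((bits << shift) as i64) >> shift;
    // ...then only match if the result actually fits in 32 bits.
    i32::try_from(value).ok()
}

// A 32-bit constant stored as 0x0000_0000_ffff_ffff now matches as -1 instead
// of failing to convert:
// assert_eq!(i32_from_iconst_bits(0xffff_ffff, 32), Some(-1));
```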
cc #8706
* Add tests for patterns I'm about to optimize
* x64: Optimize vector compare-and-branch
This commit implements lowering optimizations for the `vall_true` and
`vany_true` CLIF instructions when combined with `brif`. This is in the
same manner as `icmp` and `fcmp` combined with `brif` where the result
of the comparison is never materialized into a general purpose register
which helps lower register pressure and remove some instructions.
* x64: Optimize `vconst` with an all-ones pattern
This has a single-instruction lowering which doesn't load from memory so
it's probably cheaper than loading all-ones from memory.
* Wasmtime: Implement the custom-page-sizes proposal
This commit adds support for the custom-page-sizes proposal to Wasmtime:
https://github.com/WebAssembly/custom-page-sizes
I've migrated, fixed some bugs within, and extended the `*.wast` tests for this
proposal from the `wasm-tools` repository. I intend to upstream them into the
proposal shortly.
There is a new `wasmtime::Config::wasm_custom_page_sizes_proposal` method to
enable or disable the proposal. It is disabled by default.
Our fuzzing config has been updated to turn this feature on/off as dictated by
the arbitrary input given to us from the fuzzer.
Additionally, there were getting to be so many constructors for
`wasmtime::MemoryType` that I added a builder rather than add yet another
constructor.
In general, we store the `log2(page_size)` rather than the page size
directly. This helps cut down on invalid states and properties we need to
assert.
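As a rough sketch of why the `log2` representation is convenient (a hypothetical helper, not the real Wasmtime types): the page size is a power of two by construction, and page/byte conversions become shifts.

```rust
/// Hypothetical sketch: a memory's page size stored as its base-2 logarithm.
struct MemoryShape {
    page_size_log2: u8, // 0 for 1-byte pages, 16 for the default 64 KiB pages
}

impl MemoryShape {
    fn page_size(&self) -> u64 {
        1u64 << self.page_size_log2
    }

    /// Pages to bytes, with overflow checked rather than silently wrapping.
    fn pages_to_bytes(&self, pages: u64) -> Option<u64> {
        pages.checked_mul(self.page_size())
    }

    /// Bytes to whole pages: a right shift instead of a division.
    fn bytes_to_pages(&self, bytes: u64) -> u64 {
        bytes >> self.page_size_log2
    }
}
```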
I've also intentionally written this code such that supporting any power of two
page size (rather than just the exact values `1` and `65536` that are currently
valid) will essentially just involve updating `wasmparser`'s validation and
removing some debug asserts in Wasmtime.
* Update error string expectation
* Remove debug logging
* Use a right shift instead of a division
* fix error message expectation again
* remove page size from VMMemoryDefinition
* fix size of VMMemoryDefinition again
* Only dynamically check for `-1` sentinel for 1-byte page sizes
* Import functions that are used a few times
* Better handle overflows when rounding up to the host page size
Propagate errors instead of returning a value that is not actually a rounded up
version of the input.
Delay rounding up various config sizes until runtime instead of eagerly doing it
at config time (which isn't even guaranteed to work, so we already had to have a
backup plan to round up at runtime, since we might be cross-compiling wasm or
not have the runtime feature enabled).
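The rounding itself is the usual power-of-two round-up, just with overflow made explicit; a minimal sketch (not the exact Wasmtime helper):

```rust
/// Round `size` up to a multiple of `page_size` (assumed to be a power of
/// two), returning an error on overflow instead of a wrapped-around value.
fn round_up_to_host_page_size(size: usize, page_size: usize) -> Result<usize, String> {
    debug_assert!(page_size.is_power_of_two());
    size.checked_add(page_size - 1)
        .map(|s| s & !(page_size - 1))
        .ok_or_else(|| format!("{size:#x} cannot be rounded up to the host page size"))
}
```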
* Fix some anyhow and nostd errors
* Add missing rounding up to host page size at runtime
* Add validate feature to wasmparser dep
* Add some new rounding in a few places, due to no longer rounding in config methods
* Avoid actually trying to allocate the whole address space in the `massive_64_bit_still_limited` test
The point of the test is to ensure that we hit the limiter, so just cancel the
allocation from the limiter, and otherwise avoid MIRI attempting to allocate a
bunch of memory after we hit the limiter.
* prtest:full
* Revert "Avoid actually trying to allocate the whole address space in the `massive_64_bit_still_limited` test"
This reverts commit ccfa34a78dd3d53e49a6158ca03077d42ce8bcd7.
* miri: don't attempt to allocate more than 4GiB of memory
It seems that rather than returning a null pointer from `std::alloc::alloc`,
miri will sometimes choose to simply crash the whole program.
* remove duplicate prelude import after rebasing
* Fix tail calls being turned on by default
Logic in `Config` to conditionally enable tail calls wasn't handling the
case where the configured compiler strategy was `Strategy::Auto`, meaning
that by default tail calls weren't actually enabled. This commit
refactors the handling of `Strategy` so that `Strategy::Auto` is never
stored in `CompilerConfig`, and code testing against it only ever sees
either Cranelift or Winch.
* Update disas tests
* Update Winch tests to use Winch
* Remove the native ABI calling convention from Wasmtime
This commit proposes removing the "native abi" calling convention used
in Wasmtime. For background this ABI dates back to the origins of
Wasmtime. Originally Wasmtime only had `Func::call` and eventually I
added `TypedFunc` with `TypedFunc::call` and `Func::wrap` for a faster
path. At the time given the state of trampolines it was easiest to call
WebAssembly code directly without any trampolines using the native ABI
that wasm used at the time. This is the original source of the native
ABI and it's persisted over time under the assumption that it's faster
than the array ABI due to keeping arguments in registers rather than
spilling them to the stack.
Over time, however, this design decision of using the native ABI has not
aged well. Trampolines have changed quite a lot in the meantime and it's
no longer possible for the host to call wasm without a trampoline, for
example. Compilations nowadays maintain both native and array
trampolines for wasm functions in addition to host functions. There's a
large split between `Func::new` and `Func::wrap`. Overall, there's quite
a lot of weight that we're pulling for the design decision of using the
native ABI.
Functionally this hasn't ever really been the end of the world.
Trampolines aren't a known issue in terms of performance or code size.
There's no known faster way to invoke WebAssembly from the host (or
vice-versa). One major downside of this design, however, is that
`Func::new` requires Cranelift as a backend to exist. This is due to the
fact that it needs to synthesize various entries in the matrix of ABIs
we have that aren't available at any other time. While this is itself
not the worst of issues it means that the C API cannot be built without
a compiler because the C API does not have access to `Func::wrap`.
Overall I'd like to reevaluate given where Wasmtime is today whether it
makes sense to keep the native ABI trampolines. Sure they're supposed to
be fast, but are they really that much faster than the array-call ABI as
an alternative? This commit is intended to measure this.
This commit removes the native ABI calling convention entirely. For
example `VMFuncRef` is now one pointer smaller. All of `TypedFunc` now
uses `*mut ValRaw` for loads/stores rather than dealing with ABI
business. The benchmarks with this PR are:
* `sync/no-hook/core - host-to-wasm - typed - nop` - 5% faster
* `sync/no-hook/core - host-to-wasm - typed - nop-params-and-results` - 10% slower
* `sync/no-hook/core - wasm-to-host - typed - nop` - no change
* `sync/no-hook/core - wasm-to-host - typed - nop-params-and-results` - 7% faster
These numbers are a bit surprising: I would have expected no change in
both "nop" benchmarks and a slowdown in both params-and-results
benchmarks. Regardless, it is apparent that this is
not a major change in terms of performance given Wasmtime's current
state. In general my hunch is that there are more expensive sources of
overhead than reads/writes from the stack when dealing with wasm values
(e.g. trap handling, store management, etc).
Overall this commit feels like a large simplification of what we
currently do in `TypedFunc`:
* The number of ABIs that Wasmtime deals with is reduced by one. ABIs
are pretty much always tricky and having fewer moving parts should
help improve the understandability of the system.
* All of the `WasmTy` trait methods and `TypedFunc` infrastructure is
simplified. Traits now work with simple `load`/`store` methods rather
than various other flavors of conversion.
* The multi-return-value handling of the native ABI is all gone now
which gave rise to significant complexity within Wasmtime's Cranelift
translation layer in addition to the `TypedFunc` backing traits.
* This aligns components and core wasm: components already always use
  the array ABI, and core wasm will now also always use the array ABI
  when communicating with the host.
I'll note that this still leaves one major ABI "complexity" in that
native functions do not have a wasm ABI function pointer until they're
"attached" to a `Store` with a `Module`. That's required to
avoid needing Cranelift for creating host functions and that property is
still true today. This is a bit simpler to understand though now that
`Func::new` and `Func::wrap` are treated uniformly rather than one being
special-cased.
* Fix miri unsafety
prtest:full
* Use WASM function names in compiled objects
Instead of generating symbol names in the format
"wasm[$MODULE_ID]::function[$FUNCTION_INDEX]", generate (if possible)
something more readable, such as "wasm[$MODULE_ID]::$FUNCTION_NAME".
This helps when debugging or profiling the generated code.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Ensure symbol names are cleaned up and have function indexes
Filter symbol names to include only characters that are usually used
for function names, and that might be produced by name mangling.
Replace everything else with a question mark (and all repeated question
marks by a single one), and then truncate to a length of 96 characters.
This should be enough not only to avoid passing user-controlled strings
to tools such as "perf" and "objdump", but also to make it easier to
disambiguate symbols that might have the same name but different
indices.
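In rough, illustrative Rust (not the exact implementation, and the precise allowed character set may differ), the cleanup amounts to:

```rust
/// Keep only characters that plausibly appear in (possibly mangled) function
/// names, collapse runs of anything else into a single '?', and cap the
/// result at 96 bytes.
fn clean_symbol_name(name: &str) -> String {
    let mut cleaned = String::new();
    let mut last_was_replacement = false;
    for c in name.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            cleaned.push(c);
            last_was_replacement = false;
        } else if !last_was_replacement {
            cleaned.push('?');
            last_was_replacement = true;
        }
    }
    // All pushed characters are ASCII, so truncating at a byte length is safe.
    cleaned.truncate(96);
    cleaned
}
```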
* Make symbol cleaning slightly more efficient
* Update symbol names to be closer to what tests expect
* Ensure only alphanumeric ASCII characters are allowed in a symbol name
* Ensure sliced symbol name is within its bounds
* Update test expectations after adding function name to symbol name
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* wasmtime: Make table lazy-init configurable
Lazy initialization of tables has trade-offs that we haven't explored in
a while. Making it configurable makes it easier to test the effects of
these trade-offs on a variety of WebAssembly programs, and allows
embedders to decide whether the trade-offs are worthwhile for their use
cases.
* Review comments
As per [this comment], the call-indirect caching options should really
be in the `-O` option namespace (for optimizations), not `-W` (for
semantically-visible Wasm standards options); the original PR simply put
them in the wrong place, and this PR moves them.
[this comment]: https://github.com/bytecodealliance/wasmtime/pull/8509#discussion_r1589594152
* Wasmtime: add one-entry call-indirect caching.
In WebAssembly, an indirect call is somewhat slow, because of the
indirection required by CFI (control-flow integrity) sandboxing. In
particular, a "function pointer" in most source languages compiled to
Wasm is represented by an index into a table of funcrefs. The
`call_indirect` instruction then has to do the following steps to invoke
a function pointer:
- Load the funcref table's base and length values from the vmctx.
- Bounds-check the invoked index against the actual table size; trap if
out-of-bounds.
- Spectre mitigation (cmove) on that bounds-check.
- Load the `vmfuncref` from the table given base and index.
- For lazy table init, check if this is a non-initialized funcref
pointer, and initialize the entry.
- Load the signature from the funcref struct and compare it against the
`call_indirect`'s expected signature; trap if wrong.
- Load the actual code pointer for the callee's Wasm-ABI entry point.
- Load the callee vmctx (which may be different for a cross-module
call).
- Put that vmctx in arg 0, our vmctx in arg 1, and invoke the loaded
code pointer with an indirect call instruction.
Compare and contrast to the process involved in invoking a native
function pointer:
- Invoke the code pointer with an indirect call instruction.
This overhead buys us something -- it is part of the SFI sandbox
boundary -- but it is very repetitive and unnecessary work in *most*
cases when indirect function calls are performed repeatedly (such as
within an inner loop).
This PR introduces the idea of *caching*: if we know that the result of
all the above checks won't change, then if we use the same index as "the
last time" (for some definition), we can skip straight to the "invoke
the code pointer" step, with a cached code pointer from that last time.
Concretely, it introduces a two-word struct inlined into the vmctx for
each `call_indirect` instruction in the module (up to a limit):
- The last invoked index;
- The code pointer that index corresponded to.
When compiling the module, we check whether the table could possibly be
mutated at a given index after it has been read: for example by
instructions like `table.set`, or because the whole table is exported
and thus writable from the outside. We also check whether index 0 is a
non-null funcref. If neither of these things is true, then we know we
can cache an index-to-code-pointer mapping, and we know we can use
index 0 as a sentinel for "no cached value".
We then make use of the struct for each indirect call site and generate
code to check if the index matches; if so, call cached pointer; if not,
load the vmfuncref, check the signature, check that the callee vmctx is
the same as caller (intra-module case), and stash the code pointer and
index away (fill the cache), then make the call.
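In rough Rust pseudocode (type and field names here are illustrative, not the actual runtime definitions), the per-call-site state and the generated check look like:

```rust
/// Illustrative two-word cache, one per `call_indirect` site, stored inline
/// in the vmctx. A zeroed cache reads as index 0, which is why caching is
/// only enabled when slot 0 is known not to hold a callable funcref.
#[repr(C)]
struct CallIndirectCache {
    last_index: usize,   // last invoked table index
    code_ptr: *const u8, // code pointer that index resolved to
}

/// Roughly what the generated code does at each cached call site.
fn call_indirect_at_cached_site(cache: &mut CallIndirectCache, index: usize) {
    if index == cache.last_index {
        // Hit: skip the bounds check, lazy-init check, signature check, and
        // callee-vmctx check; call `cache.code_ptr` directly.
    } else {
        // Miss: run the full sequence (bounds check, load the funcref,
        // signature check, same-instance check), then fill the cache:
        //   cache.last_index = index;
        //   cache.code_ptr = resolved code pointer;
        // and finally make the call.
    }
}
```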
On an in-development branch of SpiderMonkey-in-Wasm with ICs (using
indirect calls), this is about a 20% speedup; I haven't yet measured on
other benchmarks. This might come with an instantiation-time slowdown
due to a larger vmctx (but we could use madvise to zero it if needed).
This feature is off by default right now.
* Addressed review feedback.
* Added some more comments.
* Allow unused VMCallIndirectCache struct (defined for parity with other bits but not needed in actual runtime).
* Add a limit to the number of call-indirect cache slots.
* Fix merge conflict: handle ConstOp element offset.
* Review feedback.
Before this, Cranelift ABI code would emit a stack-load instruction for
every stack argument and add all register arguments to the `args`
pseudo-instruction, whether those arguments were used or not.
However, we already know which arguments are used at that point because
we need the analysis for load-sinking, so it's easy to filter the unused
arguments out.
This avoids generating loads that are immediately dead, which is good
for the generated code. It also slightly reduces the size of the
register allocation problem, which is a small win in compile time.
This also changes which registers RA2 chooses in some cases because it
no longer considers unused defs from the `args` pseudo-instruction.
There was an existing method named `arg_is_needed_in_body` which sounded
like it should be the right place to implement this. However, that
method was only used for Baldrdash integration and has been a stub since
that integration was removed in #4571. Also it didn't have access to the
`value_ir_uses` map needed here. But the place where that method was
called does have access to that map and was perfect for this.
Also, don't emit a `dummy_use` pseudo-instruction for the vmctx if it's
otherwise unused everywhere: we now drop it from the `args` instruction
in that case, and RA2 would then complain that it's used without being
defined.
Furthermore, don't emit debug info specially for the vmctx parameter,
because it's already emitted for all block parameters including vmctx.
Thanks to @elliottt for doing the initial investigation of this change
with me, and to @cfallin for helping me track down the `dummy_use` false
dependency.
When we compute the amount of space that we need in a stack frame for
the stack limit check, we were only counting spill-slots and explicit
stack-slots. However, we need to account for all uses of the stack which
occur before the next stack limit check. That includes clobbers and any
stack arguments we want to pass to callees.
The maximum amount that we could have missed by is essentially bounded
by the number of arguments which could be passed to a function. In
Wasmtime, that is limited by `MAX_WASM_FUNCTION_PARAMS` in
`wasmparser::limits`, which is set to 1,000, and the largest arguments
are 16-byte vectors, so this could undercount by about 16kB.
This is not a security issue according to Wasmtime's security policy
(https://docs.wasmtime.dev/security-what-is-considered-a-security-vulnerability.html)
because it's the embedder's responsibility to ensure that the stack
where Wasmtime is running has enough extra space on top of the
configured `max_wasm_stack` size, and getting within 16kB of the host
stack size is too small to be safe even with this fixed.
However, this was definitely not the intended behavior when stack limit
checks or stack probes are enabled, and anyone with non-default
configurations or non-Wasmtime uses of Cranelift should evaluate whether
this bug impacts their use case.
(For reference: When Wasmtime is used in async mode or on Linux, the
default stack size is 1.5MB larger than the default WebAssembly stack
limit, so such configurations are typically safe regardless. On the
other hand, on macOS the default non-async stack size for threads other
than the main thread is the same size as the default for
`max_wasm_stack`, so that is too small with or without this bug fix.)
### The `GcRuntime` and `GcCompiler` Traits
This commit factors out the details of the garbage collector away from the rest
of the runtime and the compiler. It does this by introducing two new traits,
very similar to a subset of [those proposed in the Wasm GC RFC], although not
all equivalent functionality has been added yet because Wasmtime doesn't
support, for example, GC structs:
[those proposed in the Wasm GC RFC]: https://github.com/bytecodealliance/rfcs/blob/main/accepted/wasm-gc.md#defining-the-pluggable-gc-interface
1. The `GcRuntime` trait: This trait defines how to create new GC heaps, run
collections within them, and execute the various GC barriers the collector
requires.
Rather than monomorphize all of Wasmtime on this trait, we use it
as a dynamic trait object. This does imply some virtual call overhead and
missing some inlining (and resulting post-inlining) optimization
opportunities. However, it is *much* less disruptive to the existing embedder
API, results in a cleaner embedder API anyways, and we don't believe that VM
runtime/embedder code is on the hot path for working with the GC at this time
anyways (that would be the actual Wasm code, which has inlined GC barriers
and direct calls and all of that). In the future, once we have optimized
enough of the GC that such code is ever hot, we have options we can
investigate at that time to avoid these dynamic virtual calls, like only
enabling one single collector at build time and then creating a static type
alias like `type TheOneGcImpl = ...;` based on the compile time
configuration, and using this type alias in the runtime rather than a dynamic
trait object.
The `GcRuntime` trait additionally defines a method to reset a GC heap, for
use by the pooling allocator. This allows reuse of GC heaps across different
stores. This integration is very rudimentary at the moment, and is missing
all kinds of configuration knobs that we should have before deploying Wasm GC
in production. This commit is large enough as it is already! Ideally, in the
future, I'd like to make it so that GC heaps receive their memory region,
rather than allocate/reserve it themselves, and let each slot in the pooling
allocator's memory pool be *either* a linear memory or a GC heap. This would
unask various capacity planning questions such as "what percent of memory
capacity should we dedicate to linear memories vs GC heaps?". It also seems
like basically all the same configuration knobs we have for linear memories
apply equally to GC heaps (see also the "Indexed Heaps" section below).
2. The `GcCompiler` trait: This trait defines how to emit CLIF that implements
GC barriers for various operations on GC-managed references. The Rust code
calls into this trait dynamically via a trait object, but since it is
customizing the CLIF that is generated for Wasm code, the Wasm code itself is
not making dynamic, indirect calls for GC barriers. The `GcCompiler`
implementation can inline the parts of GC barrier that it believes should be
inline, and leave out-of-line calls to rare slow paths.
All that said, there is still only a single implementation of each of these
traits: the existing deferred reference-counting (DRC) collector. So there is a
bunch of code motion in this commit as the DRC collector was further isolated
from the rest of the runtime and moved to its own submodule. That said, this was
not *purely* code motion (see "Indexed Heaps" below) so it is worth not simply
skipping over the DRC collector's code in review.
### Indexed Heaps
This commit does bake in a couple assumptions that must be shared across all
collector implementations, such as a shared `VMGcHeader` that all objects
allocated within a GC heap must begin with, but the most notable and
far-reaching of these assumptions is that all collectors will use "indexed
heaps".
What we are calling indexed heaps are basically the three following invariants:
1. All GC heaps will be a single contiguous region of memory, and all GC objects
will be allocated within this region of memory. The collector may ask the
system allocator for additional memory, e.g. to maintain its free lists, but
GC objects themselves will never be allocated via `malloc`.
2. A pointer to a GC-managed object (i.e. a `VMGcRef`) is a 32-bit offset into
the GC heap's contiguous region of memory. We never hold raw pointers to GC
objects (although, of course, we have to compute them and use them
temporarily when actually accessing objects). This means that deref'ing GC
pointers is equivalent to deref'ing linear memory pointers: we need to add a
base and we also check that the GC pointer/index is within the bounds of the
GC heap. Furthermore, compressing 64-bit pointers into 32 bits is a fairly
common technique among high-performance GC
implementations[^compressed-oops][^v8-ptr-compression] so we are in good
company.
3. Anything stored inside the GC heap is untrusted. Even each GC reference that
is an element of an `(array (ref any))` is untrusted, and bounds checked on
access. This means that, for example, we do not store the raw pointer to an
`externref`'s host object inside the GC heap. Instead an `externref` now
stores an ID that can be used to index into a side table in the store that
holds the actual `Box<dyn Any>` host object, and accessing that side table is
always checked.
[^compressed-oops]: See ["Compressed OOPs" in
OpenJDK.](https://wiki.openjdk.org/display/HotSpot/CompressedOops)
[^v8-ptr-compression]: See [V8's pointer
compression](https://v8.dev/blog/pointer-compression).
The good news with regards to all the bounds checking that this scheme implies
is that we can use all the same virtual memory tricks that linear memories use
to omit explicit bounds checks. Additionally, (2) means that the sizes of GC
objects are that much smaller (and therefore that much more cache friendly)
because they are only holding onto 32-bit, rather than 64-bit, references to
other GC objects. (We can, in the future, support GC heaps up to 16GiB in size
without losing 32-bit GC pointers by taking advantage of `VMGcHeader` alignment
and storing aligned indices rather than byte indices, while still leaving the
bottom bit available for tagging as an `i31ref` discriminant. Should we ever
need to support even larger GC heap capacities, we could go to full 64-bit
references, but we would need explicit bounds checks.)
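To make invariant (2) concrete, here is a much-simplified sketch (hypothetical types, not the real `VMGcRef` code, and ignoring the virtual-memory tricks that can elide the explicit check):

```rust
/// A GC "pointer" is a 32-bit offset into one contiguous heap region and is
/// dereferenced like a linear-memory access: add a base, check the bounds.
struct GcHeap {
    base: *mut u8,
    len: u32,
}

impl GcHeap {
    /// Turn a 32-bit GC reference plus an access size into a raw pointer, or
    /// `None` if the access would fall outside the heap region.
    fn deref(&self, gc_ref: u32, access_size: u32) -> Option<*mut u8> {
        let end = gc_ref.checked_add(access_size)?;
        if end <= self.len {
            Some(unsafe { self.base.add(gc_ref as usize) })
        } else {
            None
        }
    }
}
```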
The biggest benefit of indexed heaps is that, because we are (explicitly or
implicitly) bounds checking GC heap accesses, and because we are not otherwise
trusting any data from inside the GC heap, we greatly reduce how badly things
can go wrong in the face of collector bugs and GC heap corruption. We are
essentially sandboxing the GC heap region, the same way that linear memory is a
sandbox. GC bugs could lead to the guest program accessing the wrong GC object,
or getting garbage data from within the GC heap. But only garbage data from
within the GC heap, never outside it. The worst that could happen would be if we
decided not to zero out GC heaps between reuse across stores (which is a valid
trade off to make, since zeroing a GC heap is a defense-in-depth technique
similar to zeroing a Wasm stack and not semantically visible in the absence of
GC bugs) and then a GC bug would allow the current Wasm guest to read old GC
data from the old Wasm guest that previously used this GC heap. But again, it
could never access host data.
Taken altogether, this allows for collector implementations that are nearly free
from `unsafe` code, and unsafety can otherwise be targeted and limited in scope,
such as interactions with JIT code. Most importantly, we do not have to maintain
critical invariants across the whole system -- invariants which can't be nicely
encapsulated or abstracted -- to preserve memory safety. Such holistic
invariants that refuse encapsulation are otherwise generally a huge safety
problem with GC implementations.
### `VMGcRef` is *NOT* `Clone` or `Copy` Anymore
`VMGcRef` used to be `Clone` and `Copy`. It is not anymore. The motivation here
was to be sure that I was actually calling GC barriers at all the correct
places. I couldn't be sure before. Now, you can still explicitly copy a raw GC
reference without running GC barriers if you need to and understand why that's
okay (aka you are implementing the collector), but that is something you have to
opt into explicitly by calling `unchecked_copy`. The default now is that you
can't just copy the reference and must instead call an explicit `clone` method
(not *the* `Clone` trait, because we need to pass in the GC heap context to run
the GC barriers), which makes it hard to forget to do that accidentally. This resulted in
a pretty big amount of churn, but I am wayyyyyy more confident that the correct
GC barriers are called at the correct times now than I was before.
### `i31ref`
I started this commit by trying to add `i31ref` support. And it grew into the
whole traits interface because I found that I needed to abstract GC barriers
into helpers anyways to avoid running them for `i31ref`s, so I figured that I
might as well add the whole traits interface. In comparison, `i31ref` support is
much easier and smaller than that other part! But it was also difficult to pull
apart from this commit, sorry about that!
---------------------
Overall, I know this is a very large commit. I am super happy to have some
synchronous meetings to walk through this all, give an overview of the
architecture, answer questions directly, etc... to make review easier!
prtest:full
* Switch Winch tests to ATT syntax
* Update all test expectations
* Move all winch tests to `disas` folder
* Add `test = "winch"` to `disas`
* Add `test = "winch"` to all winch test files
* Stub out bits to get AArch64 Winch tests working
* Update expectations for all aarch64 winch tests
* Update flags in Winch tests
Use CLI syntax as that's what `flags` was repurposed as in the new test
suite.
* Update all test expectations for x64 winch
* Omit more offsets by default
* Delete now-dead code
* Update an error message
* Update non-winch test expectations
* Disassemble `*.cwasm` for `compile` disas tests
This commit changes how the `compile` mode of the `disas` test suite
works. Previously this would use `--emit-clif` and run the Cranelift
pipeline for each individual function and use the custom VCode-based
disassembly for instruction output. This commit instead uses the raw
binary coming out of Wasmtime. The ELF file itself is parsed and is
disassembled in a manner similar to Winch tests.
The goal of this commit is somewhat twofold:
* Lay the groundwork to migrate all Winch-based filetests to
`tests/disas`.
* Test the raw output from Cranelift/Wasmtime which includes
optimizations like branch chomping in the `MachBuffer`.
This commit doesn't itself move the Winch tests yet, that's left for a
future commit.
* Update all test expectations for new output
* Fix PR-based CI when too many files are changed
This reverts commit 6c5184809d,
"Cranelift: resolve value aliases when printing CLIF functions (#8214)",
then achieves the same effect a different way.
In discussion on #8223, we decided to handle this a different way. So
I've introduced a method on DataFlowGraph which eliminates all value
aliases, and we can call it when necessary. If that function is not
called then the CLIF printer should print value aliases the same way it
did before #8214.
In this PR, I've specifically called it in disas tests. The changes to
write.rs and frontend.rs are from the revert, while the changes to
disas.rs are new.
In these tests, value aliases are just clutter that distracts from
understanding what code will actually be generated. They may change due
to changes in legalization, but that doesn't signal anything about
whatever the test was intended to check.
Because the new helper deletes aliases after it's done resolving them,
those aliases now disappear from the test expectations. Aside from that,
the test expectations are unchanged compared to what #8214 did.
* Exit through Cranelift-generated trampolines for builtins
This commit changes how builtin functions in Wasmtime (think
`memory.grow`) are implemented. These functions are required to exit
through some manner of trampoline to handle runtime requirements for
backtracing right now. Currently this is done via inline assembly for
each architecture (or external assembly for s390x). This is a bit
unfortunate as it's a lot of hand-coding and making sure everything is
right, and it's not easy to update as it's multiple platforms to update.
The change in this commit is to instead use Cranelift-generated
trampolines for this purpose instead. The path for invoking a builtin
function now looks like:
* Wasm code calls a statically known symbol for each builtin.
* The statically known symbol will perform exit trampoline duties (e.g.
pc/fp/etc) and then load a function pointer to the host
implementation.
* The host implementation is invoked and then proceeds as usual.
The main new piece of this PR is that all wasm modules and functions
are compiled in parallel, but one output of this compilation phase is
the set of builtin functions required. These sets are then unioned
together and anything required is generated just afterwards. That means
only one trampoline per builtin is generated per module.
This work is inspired by #8135 and my own personal desire to have as
much about our ABI details flowing through Cranelift as we can. This in
theory makes it more flexible to deal with future improvements to our
ABI.
prtest:full
* Fix some build issues
* Update winch test expectations
* Update Winch to use new builtin shims.
This commit refactors the Winch compiler to use the new trampolines for
all Wasmtime builtins created in the previous commits. This required a
fair bit of refactoring to handle plumbing through a new kind of
relocation and function call.
Winch's `FuncEnv` now contains a `PrimaryMap` from `UserExternalNameRef`
to `UserExternalName`. This is because there's now more than one kind of
name beyond just wasm function relocations, so the raw index space of
`UserExternalNameRef` is no longer applicable. This required threading
`FuncEnv` to more locations along with some refactorings to ensure that
lifetimes work out ok.
The `CompiledFunction` no longer stores a trait object describing how to map
name refs to names and now directly has a `PrimaryMap`. This also means
that Winch's return value from its `TargetIsa` is a `CompiledFunction`,
as opposed to the previous just-a-`MachBuffer`, so it can also package up
all the relocation information. This ends up having `winch-codegen`
depend on `wasmtime-cranelift-shared` as a new dependency.
* Review feedback
* PCC: x64: insertlane instructions read only scalar-sized values.
Also fix `clamp_range` on greater-than-64-bit values: no range fact is
possible in this case (propagate `Option` a bit deeper to represent
this).
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67538.
* Rebase to latest main with leaf-function changes and update test expectations.
Currently, even when the `wmemcheck` Cargo feature is disabled, the
various related libcalls are still compiled into `wasmtime-runtime`.
Additionally their signatures are translated when lowering functions,
although the signatures are never used. This commit adds `#[cfg]`
annotations to compile these all out when they're not enabled.
Applying this change, however, uncovered a subtle bug in our libcalls.
Libcalls are numbered in-order as-listed in the macro ignoring `#[cfg]`,
but they're assigned a runtime slot in a `VMBuiltinFunctionsArray`
structure which does respect `#[cfg]`. This meant, for example, that if
`gc` was enabled and `wmemcheck` was disabled, as is the default for our
tests, then there was a hole in the numbering where libcall numbers were
mismatched at runtime and compile time.
To fix this I've first added a const assertion that the runtime-number
of libcalls equals the build-time number of libcalls. I then updated the
macro a bit to plumb the `#[cfg]` differently and now libcalls are
unconditionally defined regardless of cfgs but the implementation is
`std::process::abort()` if it's compiled out.
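As a tiny, hypothetical illustration of the hazard and the guard (none of these names are the real ones):

```rust
// The compile-time side numbers every libcall, including ones whose Cargo
// feature is disabled:
const COMPILE_TIME_LIBCALL_COUNT: usize = 3; // e.g. memory_grow, gc_alloc, check_malloc

fn memory_grow() { /* real implementation */ }
fn gc_alloc() { /* real implementation */ }
fn check_malloc() {
    // With `wmemcheck` compiled out the slot still exists, keeping the
    // numbering dense, but actually reaching it is a bug, so abort.
    std::process::abort();
}

// The runtime table must have exactly one slot per compile-time number;
// previously a disabled feature removed a slot here and shifted later indices.
const RUNTIME_LIBCALLS: [fn(); 3] = [memory_grow, gc_alloc, check_malloc];

// Catch any mismatch between the two numberings at build time.
const _: () = assert!(RUNTIME_LIBCALLS.len() == COMPILE_TIME_LIBCALL_COUNT);
```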
This ended up having a large-ish impact on the `disas` test suite. Lots
of functions have fewer signatures translated because wmemcheck, even
when disabled, was translating a few signatures. This also had some
assembly changes, too, because I believe functions are considered leaves
based on whether they declare a signature or not, so declaring an unused
signature was preventing all wasm functions from being considered leaves.
* cranelift: Optimize select_spectre_guard, carefully
This commit makes two changes to our treatment of
`select_spectre_guard`.
First, stop annotating this instruction as having any side effects. We
only care that if its value result is used, then it's computed without
branching on the condition input. We don't otherwise care when the value
is computed, or if it's computed at all.
Second, introduce some carefully selected ISLE egraph rewrites for this
instruction. These particular rewrites are those where we can statically
determine which SSA value will be the result of the instruction. Since
there is no actual choice involved, there's no way to accidentally
introduce speculation on the condition input.
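The flavor of those rewrites, written as plain Rust rather than the actual ISLE rules (the exact rule set may differ), is:

```rust
type Value = u32; // stand-in for a CLIF SSA value id

/// Cases where `select_spectre_guard(cond, if_true, if_false)` has a
/// statically known result, so rewriting it cannot introduce a branch (or
/// speculation) on `cond`.
fn simplify_select_spectre_guard(
    known_cond: Option<u64>, // Some(..) if the condition is a known constant
    if_true: Value,
    if_false: Value,
) -> Option<Value> {
    match known_cond {
        Some(0) => Some(if_false),
        Some(_) => Some(if_true),
        None if if_true == if_false => Some(if_true),
        None => None,
    }
}
```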
* Add filetests
* PCC: support imported memories as well.
Exposed by a fuzzbug
(https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67429); rather
than exclude from fuzzing, it seemed easier to just implement. We need
to define a new memory type to describe the memory definition struct
pointed to by vmctx, and set up points-to facts appropriately.
* Review feedback: abstract out a bit of the pointer-and-memtype handling logic.
Currently, every access to a table element does a bounds-check with a
conditional branch to a block that explicitly traps.
Instead, when Spectre mitigations are enabled, let's change the address
computation to return a null pointer for out-of-bounds accesses, and
then allow the subsequent load or store to trap.
This is less code in that case since we can reuse instructions we needed
anyway.
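In rough pseudocode (Rust standing in for the generated CLIF; illustrative only), the address computation becomes:

```rust
/// Out-of-bounds indices yield a null address; the following load or store is
/// then allowed to trap on it instead of branching to an explicit trap block.
fn table_elem_addr(base: usize, table_len: u32, index: u32, elem_size: usize) -> usize {
    let addr = base.wrapping_add((index as usize).wrapping_mul(elem_size));
    // In generated code this selection is a `select_spectre_guard` (a
    // conditional move, not a branch), so the address isn't speculated past
    // the bounds check.
    if index < table_len { addr } else { 0 }
}
```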
We return the MemFlags that the memory access should use, in addition to
the address it should access. That way, when Spectre mitigations are
disabled, we don't record trap metadata on memory access instructions
that can't actually trap because they're preceded by a `trapnz`-based
bounds check.
In addition, when the table has constant size and the element index is a
constant and mid-end optimization is enabled, this allows the
bounds-check to be constant folded away. Later, #8139 will let us
optimize away the select_spectre_guard instruction in this case too.
Once we also implement #8160, `tests/disas/typed-funcrefs.wat` should be
almost as fast as native indirect function calls.
* Enhance `typed-funcrefs.wast` test with more cases
Have the same function with slightly different variations to compare
codegen between the possible strategies.
* Skip type checks on tables that don't need it
This commit implements an optimization to skip type checks in
`call_indirect` for tables that don't require it. With the
function-references proposal it's possible to have tables of a single
type of function as opposed to today's default `funcref` which is a
heterogeneous set of functions. In this situation it's possible that a
`call_indirect`'s type tag matches the type tag of a
`table`-of-typed-`funcref`-values, meaning that it's impossible for the
type check to fail.
The type check of a function pointer in `call_indirect` is refactored
here to take the table's type into account. Various things are shuffled
around to ensure that the right traps still show up in the right places
but the important part is that, when possible, the type check is omitted
entirely.
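The decision being made can be sketched in plain Rust (much simplified; the real logic lives in the CLIF generator and also handles null funcrefs and trap placement):

```rust
/// When does a `call_indirect` still need a runtime signature check?
enum TableElemType {
    /// Classic heterogeneous `funcref` table: any function type may appear.
    AnyFuncref,
    /// Typed function table from the function-references proposal, holding
    /// only functions of one concrete type (identified here by an index).
    TypedFuncref { type_index: u32 },
}

fn needs_signature_check(table: &TableElemType, expected_type_index: u32) -> bool {
    match table {
        // Heterogeneous table: the callee's type tag must be checked.
        TableElemType::AnyFuncref => true,
        // Homogeneous typed table: if the element type is exactly the type
        // the call expects, the check can never fail, so it is omitted.
        TableElemType::TypedFuncref { type_index } => *type_index != expected_type_index,
    }
}
```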
* Update crates/cranelift/src/func_environ.rs
Co-authored-by: Jamey Sharp <jamey@minilop.net>
---------
Co-authored-by: Jamey Sharp <jamey@minilop.net>
This commit uses the support from #8162 to skip null function pointer
checks when performing an indirect call. Instead of an explicit check
the segfault from accessing the null function pointer is caught and
annotated with the appropriate trap.
Closes #5291
This is based on discussion in #8158: as noted in #8160, if we use a
nullable typed funcref table instead (and given that we know we'll
initialize a particular slot before use on the application side, so we
won't actually call a null ref), and if we have a null-ref default
value, we should be able to avoid the lazy table-init mechanism
entirely.
(Ignore the part where this module doesn't actually have any update
logic that would set non-null refs anywhere; it's a compile-test, not a
runtest!)
Once #8159 is merged and #8160 is implemented, we should see zero
branches in this test.
* Wasm tests: add typed-funcref test showing example of desirable optimizations.
In order to have fast IC (inline cache) chains in AOT-compiled dynamic
language Wasms, it would be great if we could make the "call to a typed
funcref at a constant table index" pattern fast.
This use-case was discussed at the most recent Wasmtime biweekly and
@jameysharp is working on some optimizations; the intent of this PR is
to provide a concrete test-case whose blessed output we can see improve
over time.
In particular, the following opts are still desirable:
- With the use of non-nullable typed funcrefs, there shouldn't be a null
check (there currently is, as noted by a comment in the code due to
lack of type information at the right spot).
- With the use of a constant table size and a constant index to the
`table.get`, we should be able to load from the table without a
bounds-check or any Spectre masking.
Other further optimizations for this pattern might be possible if we
rearrange the table and function-reference data structures, and the
lazy-initialization scheme thereof, but the above should be agnostic to
that.
* Add comments to clarify typed funcrefs usage.
* Move remaining `*.wat` tests out of cranelift-wasm/wasmtests
Move these up to Wasmtime's misc testsuite to get translated and
instantiated by Wasmtime.
Note that the max-function-index-in-name-section test was removed here
as that's tested by the support added in #3509.
* Remove cranelift-wasm test for name section
That we respect the name section is pretty thoroughly tested elsewhere
in Wasmtime; for example, many of the trap tests assert that the name of
the function comes from the text format.
* Move reachability tests out of cranelift-wasm
Instead add them to the disassembly test suite to ensure we don't
generate dead code. Additionally this has a lot of coverage via fuzzing
too.
* Move more tests out of cranelift-wasm
Move them into `tests/disas` so we can easily see the CLIF.
* Turn most cranelift-wasm wasmtests into disas tests
These tests mostly verify compiler behavior for specific WebAssembly
constructs or small combinations.
Currently they're only tested to make sure that cranelift-wasm doesn't
crash or report an error. But they'd be even more useful if we tracked
what CLIF we actually produce from Wasmtime.
So I'm moving them to the new disas test framework and adding basic CLIF
test expectations for them.
I've excluded the .wat files which were produced by Emscripten or Rust,
because they're very large and their compiled CLIF is not very
interesting to look at.
* Add test expectations
* Support hex numbers in CLI arguments
This enables `-O static-memory-maximum-size=0x10`, for example.
Previously this was a parse error.
* Discover `disas` tests in subdirectories
* Deny unknown fields when parsing test configs
Make sure we don't accidentally have other configuration. Useful when
porting tests over.
* Update `make-load-store-tests.sh` to new syntax
Pass various flags and such.
* Move all `*.wat` tests from `cranelift/filetests` into `tests/disas`
* Migrate all load/store tests
* Migrate all other `*.wat` tests to `tests/disas`
* Use opt-level=0 with table instructions
* Add an `asm` test suite for Wasmtime
This commit adds a suite of tests at `tests/asm/*.wat` which is intended
to replace the tests in `cranelift/filetests/filetests/wasm`. Tests are
configured differently than before, using Wasmtime CLI flags rather than
a custom TOML-based configuration scheme. Otherwise, though, the same
shape of tests is supported.
This commit migrates a small handful of tests as a showcase and bulk
migration is left for a follow-up.
* Organize the asm.rs test with methods/functions
* Switch back to TOML for config parsing
* Disable disassembly tests on miri
Takes a bit too long in cranelift