This adds support for a newly defined tail-call ABI for s390x
as well as supporting tail calls themselves.
We now use the tail-call ABI by default for Wasmtime, and
enable tail calls by default.
This also allows getting rid of a number of special cases
and test exclusions for s390x.
Fixes: https://github.com/bytecodealliance/wasmtime/issues/6530
* Cranelift: Stop sign-extending `Imm64` immediates in builder methods
Originally, we sign-extended `Imm64` immediates from their controlling type
var's width to `i64` in the builder methods for `BinaryImm64` and a couple other
instruction formats: https://github.com/bytecodealliance/wasmtime/pull/1687
At some point, we decided that the unused bits in the `Imm64` when the
controlling type var's width is less than 64 should be zero. And we started
asserting that in various places, for example:
87817f38a1/cranelift/codegen/src/machinst/lower.rs (L1237-L1243)
However, we never removed or updated the sign-extension code in the builder
methods. This commit switches the builder methods over to masking the `Imm64` to
just the valid bits, effectively doing a zero extend from the controlling type
var's width instead of a sign extend.
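As a rough illustration (standalone Rust, not the actual Cranelift helper), the masking that the builder methods now perform looks like this:

```rust
/// Keep only the low `ty_bits` of a 64-bit immediate, i.e. zero-extend from
/// the controlling type's width rather than sign-extending as before.
fn mask_imm64_to_width(imm: i64, ty_bits: u32) -> i64 {
    debug_assert!(ty_bits > 0 && ty_bits <= 64);
    if ty_bits == 64 {
        imm
    } else {
        let mask = (1u64 << ty_bits) - 1;
        (imm as u64 & mask) as i64
    }
}

// For example, an `iconst.i32` of -1 is now stored as 0x0000_0000_ffff_ffff
// rather than 0xffff_ffff_ffff_ffff:
// assert_eq!(mask_imm64_to_width(-1, 32), 0xffff_ffff);
```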
To avoid too much churn in tests, I made `display_inst` print the
sign-extended-from-the-controlling-type-variable's-width value of `Imm64`
immediates. I also made the `Display` implementation for `Imm64` always print in
decimal form with a `-` for negative numbers, rather than the hex it would
previously print for large negative numbers, so that `iconst.i32 0x8000_0000`
doesn't display as `iconst.i32 0xffff_ffff_8000_0000`, which is otherwise pretty
confusing.
* Rename condition
* return new immediates instead of mutating
* Update some test expectations
* Add v128.const support to Winch
* Remove next_vr and vector_reg_for methods
* Adjust alignment and slot size for v128
* Forgot to update disas tests
* Update unit tests
* Use 8 byte stack slot sizes
* Fix broken unit tests, add tests for vecs, and use ty size for vecs
Fixes: https://github.com/bytecodealliance/wasmtime/issues/8848
Similar to all the control instructions, any state must be explicitly
saved before emitting the code for `br_if`.
This commit ensures that live locals and registers are explicitly saved
before emitting the code for `br_if`. Prior to this commit, live
locals and registers were not saved every time, causing incorrect
behavior in cases where the calculation of the conditional argument
didn't trigger a spill.
This change introduces the explicit spill after calculating the branch
condition argument to minimize memory traffic in case the conditional is
already in a register.
In the original development of this feature, guided by JS AOT
compilation to Wasm of a microbenchmark heavily focused on IC sites, I
was seeing a ~20% speedup. However, in more recent measurements, on full
programs (e.g., the Octane benchmark suite), the benefit is more like
5%.
Moreover, in #8870, I attempted to switch over to a direct-mapped cache,
to address a current shortcoming of the design, namely that it has a
hard-capped number of callsites it can apply to (50k) to limit impact on
VMContext struct size. With all of the needed checks for correctness,
though, that change results in a 2.5% slowdown relative to no caching at
all, so it was dropped.
In the process of thinking through that, I discovered the current design
on `main` incorrectly handles null funcrefs: it invokes a null code pointer,
rather than loading a field from a null struct pointer. The latter was
specifically designed to cause the necessary Wasm trap in #8159, but I
had missed that the call to a null code pointer would not have the same
effect. As a result, we actually can crash the VM (safely at least, but
still no good vs. a proper Wasm trap!) with the feature enabled. (It's
off by default still.) That could be fixed too, but at this point with
the small benefit on real programs, together with the limitation on
module size for full benefit, I think I'd rather opt for simplicity and
remove the cache entirely.
Thus, this PR removes call-indirect caching. It's not a direct revert
because the original PR refactored the call-indirect generation into
smaller helpers and IMHO it's a bit nicer to keep that. But otherwise
all traces of the setting, the code pre-scan during compilation, the
special conditions tracked on tables, and the codegen changes are gone.
The epoch interruption implementation caches the current deadline in a
register, and avoids reloading that cache until the cached deadline has
passed.
However, the first epoch check happens immediately after the cache has
been populated on function entry, so there's never a reason to reload
the cache at that point. It only needs to be reloaded in loops. So this
commit eliminates the double-check on function entry.
When Cranelift optimizations are enabled, the alias analysis pass
correctly detected that this load was redundant, and the egraph pass
optimized away the `icmp` as well. However, since we don't do
conditional constant propagation, the branch couldn't be optimized away.
On x86 this lowered to a redundant `cmp`/`jae` pair of instructions in
every function entry, which this commit eliminates.
To keep track of what code we're generating for epoch interruptions,
I've also added disas tests with a trivial infinite loop.
This commit removes the `simm32` extractor from lowerings as it's not as
useful as it was when it was first introduced. Nowadays an `Imm64` needs
to be interpreted with the type known as well to understand whether bits
being masked off is significant or not. The old `simm32` extractor only
took an `Imm64`, meaning it was unable to do this and wouldn't match
negative numbers: the high 32 bits of an `Imm64` were always zero, so when
`simm32` took the `i64` value from the `Imm64` and tried to convert it to
an `i32`, the conversion failed for values that should have been negative.
This commit replaces `simm32`, and uses of it, with a new extractor
`i32_from_iconst`. This matches the preexisting `i64_from_iconst` and is
able to take the type of the value into account and produce a correctly
sign-extended value.
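Roughly (an illustrative sketch, not the real ISLE extractor), recovering a sign-extended `i32` from the masked `Imm64` bits looks like:

```rust
/// Recover a sign-extended `i32` from an `iconst`'s zero-extended `Imm64`
/// bits, given the controlling type's bit width; returns `None` (no match)
/// if the value doesn't fit in an `i32`.
fn i32_from_iconst_bits(bits: u64, ty_bits: u32) -> Option<i32> {
    // Sign-extend from the controlling type's width up to 64 bits...
    let shift = 64 - ty_bits;
    let value = ((bits << shift) as i64) >> shift;
    // ...then only match if the result actually fits in 32 bits.
    i32::try_from(value).ok()
}

// A 32-bit constant stored as 0x0000_0000_ffff_ffff now matches as -1 instead
// of failing to convert:
// assert_eq!(i32_from_iconst_bits(0xffff_ffff, 32), Some(-1));
```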
cc #8706
* Add tests for patterns I'm about to optimize
* x64: Optimize vector compare-and-branch
This commit implements lowering optimizations for the `vall_true` and
`vany_true` CLIF instructions when combined with `brif`. This is in the
same manner as `icmp` and `fcmp` combined with `brif` where the result
of the comparison is never materialized into a general purpose register
which helps lower register pressure and remove some instructions.
* x64: Optimize `vconst` with an all-ones pattern
This has a single-instruction lowering which doesn't load from memory so
it's probably cheaper than loading all-ones from memory.
* Wasmtime: Implement the custom-page-sizes proposal
This commit adds support for the custom-page-sizes proposal to Wasmtime:
https://github.com/WebAssembly/custom-page-sizes
I've migrated, fixed some bugs within, and extended the `*.wast` tests for this
proposal from the `wasm-tools` repository. I intend to upstream them into the
proposal shortly.
There is a new `wasmtime::Config::wasm_custom_page_sizes_proposal` method to
enable or disable the proposal. It is disabled by default.
Our fuzzing config has been updated to turn this feature on/off as dictated by
the arbitrary input given to us from the fuzzer.
Additionally, there were getting to be so many constructors for
`wasmtime::MemoryType` that I added a builder rather than add yet another
constructor.
In general, we store the `log2(page_size)` rather than the page size
directly. This helps cut down on invalid states and properties we need to
assert.
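As a rough sketch of why the `log2` representation is convenient (a hypothetical helper, not the real Wasmtime types): the page size is a power of two by construction, and page/byte conversions become shifts.

```rust
/// Hypothetical sketch: a memory's page size stored as its base-2 logarithm.
struct MemoryShape {
    page_size_log2: u8, // 0 for 1-byte pages, 16 for the default 64 KiB pages
}

impl MemoryShape {
    fn page_size(&self) -> u64 {
        1u64 << self.page_size_log2
    }

    /// Pages to bytes, with overflow checked rather than silently wrapping.
    fn pages_to_bytes(&self, pages: u64) -> Option<u64> {
        pages.checked_mul(self.page_size())
    }

    /// Bytes to whole pages: a right shift instead of a division.
    fn bytes_to_pages(&self, bytes: u64) -> u64 {
        bytes >> self.page_size_log2
    }
}
```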
I've also intentionally written this code such that supporting any power of two
page size (rather than just the exact values `1` and `65536` that are currently
valid) will essentially just involve updating `wasmparser`'s validation and
removing some debug asserts in Wasmtime.
* Update error string expectation
* Remove debug logging
* Use a right shift instead of a division
* fix error message expectation again
* remove page size from VMMemoryDefinition
* fix size of VMMemoryDefinition again
* Only dynamically check for `-1` sentinel for 1-byte page sizes
* Import functions that are used a few times
* Better handle overflows when rounding up to the host page size
Propagate errors instead of returning a value that is not actually a rounded up
version of the input.
Delay rounding up various config sizes until runtime instead of eagerly doing it
at config time (which isn't even guaranteed to work, so we already had to have a
backup plan to round up at runtime, since we might be cross-compiling wasm or
not have the runtime feature enabled).
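The rounding itself is the usual power-of-two round-up, just with overflow made explicit; a minimal sketch (not the exact Wasmtime helper):

```rust
/// Round `size` up to a multiple of `page_size` (assumed to be a power of
/// two), returning an error on overflow instead of a wrapped-around value.
fn round_up_to_host_page_size(size: usize, page_size: usize) -> Result<usize, String> {
    debug_assert!(page_size.is_power_of_two());
    size.checked_add(page_size - 1)
        .map(|s| s & !(page_size - 1))
        .ok_or_else(|| format!("{size:#x} cannot be rounded up to the host page size"))
}
```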
* Fix some anyhow and nostd errors
* Add missing rounding up to host page size at runtime
* Add validate feature to wasmparser dep
* Add some new rounding in a few places, due to no longer rounding in config methods
* Avoid actually trying to allocate the whole address space in the `massive_64_bit_still_limited` test
The point of the test is to ensure that we hit the limiter, so just cancel the
allocation from the limiter, and otherwise avoid MIRI attempting to allocate a
bunch of memory after we hit the limiter.
* prtest:full
* Revert "Avoid actually trying to allocate the whole address space in the `massive_64_bit_still_limited` test"
This reverts commit ccfa34a78dd3d53e49a6158ca03077d42ce8bcd7.
* miri: don't attempt to allocate more than 4GiB of memory
It seems that rather than returning a null pointer from `std::alloc::alloc`,
miri will sometimes choose to simply crash the whole program.
* remove duplicate prelude import after rebasing
* Fix tail calls being turned on by default
Logic in `Config` to conditionally enable tail calls wasn't handling the
case where the configured compiler strategy was `Strategy::Auto`, meaning
that by default tail calls weren't actually enabled. This commit
refactors the handling of `Strategy` so that `Strategy::Auto` is never
stored in `CompilerConfig`, and code testing against it only ever sees
either Cranelift or Winch.
* Update disas tests
* Update Winch tests to use Winch
* Remove the native ABI calling convention from Wasmtime
This commit proposes removing the "native abi" calling convention used
in Wasmtime. For background this ABI dates back to the origins of
Wasmtime. Originally Wasmtime only had `Func::call` and eventually I
added `TypedFunc` with `TypedFunc::call` and `Func::wrap` for a faster
path. At the time given the state of trampolines it was easiest to call
WebAssembly code directly without any trampolines using the native ABI
that wasm used at the time. This is the original source of the native
ABI and it's persisted over time under the assumption that it's faster
than the array ABI due to keeping arguments in registers rather than
spilling them to the stack.
Over time, however, this design decision of using the native ABI has not
aged well. Trampolines have changed quite a lot in the meantime and it's
no longer possible for the host to call wasm without a trampoline, for
example. Compilations nowadays maintain both native and array
trampolines for wasm functions in addition to host functions. There's a
large split between `Func::new` and `Func::wrap`. Overall, there's quite
a lot of weight that we're pulling for the design decision of using the
native ABI.
Functionally this hasn't ever really been the end of the world.
Trampolines aren't a known issue in terms of performance or code size.
There's no known faster way to invoke WebAssembly from the host (or
vice-versa). One major downside of this design, however, is that
`Func::new` requires Cranelift as a backend to exist. This is due to the
fact that it needs to synthesize various entries in the matrix of ABIs
we have that aren't available at any other time. While this is itself
not the worst of issues it means that the C API cannot be built without
a compiler because the C API does not have access to `Func::wrap`.
Overall I'd like to reevaluate given where Wasmtime is today whether it
makes sense to keep the native ABI trampolines. Sure they're supposed to
be fast, but are they really that much faster than the array-call ABI as
an alternative? This commit is intended to measure this.
This commit removes the native ABI calling convention entirely. For
example `VMFuncRef` is now one pointer smaller. All of `TypedFunc` now
uses `*mut ValRaw` for loads/stores rather than dealing with ABI
business. The benchmarks with this PR are:
* `sync/no-hook/core - host-to-wasm - typed - nop` - 5% faster
* `sync/no-hook/core - host-to-wasm - typed - nop-params-and-results` - 10% slower
* `sync/no-hook/core - wasm-to-host - typed - nop` - no change
* `sync/no-hook/core - wasm-to-host - typed - nop-params-and-results` - 7% faster
These numbers are a bit surprising: I would have expected no change in
both "nop" benchmarks and a slowdown in both params-and-results
benchmarks. Regardless, it is apparent that this is
not a major change in terms of performance given Wasmtime's current
state. In general my hunch is that there are more expensive sources of
overhead than reads/writes from the stack when dealing with wasm values
(e.g. trap handling, store management, etc).
Overall this commit feels like a large simplification of what we
currently do in `TypedFunc`:
* The number of ABIs that Wasmtime deals with is reduced by one. ABIs
are pretty much always tricky and having fewer moving parts should
help improve the understandability of the system.
* All of the `WasmTy` trait methods and `TypedFunc` infrastructure is
simplified. Traits now work with simple `load`/`store` methods rather
than various other flavors of conversion.
* The multi-return-value handling of the native ABI is all gone now
which gave rise to significant complexity within Wasmtime's Cranelift
translation layer in addition to the `TypedFunc` backing traits.
* This aligns components and core wasm: components already always use
  the array ABI, and core wasm will now also always use the array ABI
  when communicating with the host.
I'll note that this still leaves one major ABI "complexity" in that
native functions do not have a wasm ABI function pointer until they're
"attached" to a `Store` with a `Module`. That's required to
avoid needing Cranelift for creating host functions and that property is
still true today. This is a bit simpler to understand though now that
`Func::new` and `Func::wrap` are treated uniformly rather than one being
special-cased.
* Fix miri unsafety
prtest:full
* Use WASM function names in compiled objects
Instead of generating symbol names in the format
"wasm[$MODULE_ID]::function[$FUNCTION_INDEX]", generate (if possible)
something more readable, such as "wasm[$MODULE_ID]::$FUNCTION_NAME".
This helps when debugging or profiling the generated code.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* Ensure symbol names are cleaned up and have function indexes
Filter symbol names to include only characters that are usually used
for function names, and that might be produced by name mangling.
Replace everything else with a question mark (and all repeated question
marks by a single one), and then truncate to a length of 96 characters.
This should be enough not only to avoid passing user-controlled strings
to tools such as "perf" and "objdump", but also to make it easier to
disambiguate symbols that might have the same name but different
indices.
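In rough, illustrative Rust (not the exact implementation, and the precise allowed character set may differ), the cleanup amounts to:

```rust
/// Keep only characters that plausibly appear in (possibly mangled) function
/// names, collapse runs of anything else into a single '?', and cap the
/// result at 96 bytes.
fn clean_symbol_name(name: &str) -> String {
    let mut cleaned = String::new();
    let mut last_was_replacement = false;
    for c in name.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            cleaned.push(c);
            last_was_replacement = false;
        } else if !last_was_replacement {
            cleaned.push('?');
            last_was_replacement = true;
        }
    }
    // All pushed characters are ASCII, so truncating at a byte length is safe.
    cleaned.truncate(96);
    cleaned
}
```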
* Make symbol cleaning slightly more efficient
* Update symbol names to be closer to what tests expect
* Ensure only alphanumeric ASCII characters are allowed in a symbol name
* Ensure sliced symbol name is within its bounds
* Update test expectations after adding function name to symbol name
---------
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
* wasmtime: Make table lazy-init configurable
Lazy initialization of tables has trade-offs that we haven't explored in
a while. Making it configurable makes it easier to test the effects of
these trade-offs on a variety of WebAssembly programs, and allows
embedders to decide whether the trade-offs are worthwhile for their use
cases.
* Review comments
As per [this comment], the call-indirect caching options should really
be in the `-O` option namespace (for optimizations), not `-W` (for
semantically-visible Wasm standards options); the original PR simply put
them in the wrong place, and this PR moves them.
[this comment]: https://github.com/bytecodealliance/wasmtime/pull/8509#discussion_r1589594152
* Wasmtime: add one-entry call-indirect caching.
In WebAssembly, an indirect call is somewhat slow, because of the
indirection required by CFI (control-flow integrity) sandboxing. In
particular, a "function pointer" in most source languages compiled to
Wasm is represented by an index into a table of funcrefs. The
`call_indirect` instruction then has to do the following steps to invoke
a function pointer:
- Load the funcref table's base and length values from the vmctx.
- Bounds-check the invoked index against the actual table size; trap if
out-of-bounds.
- Spectre mitigation (cmove) on that bounds-check.
- Load the `vmfuncref` from the table given base and index.
- For lazy table init, check if this is a non-initialized funcref
pointer, and initialize the entry.
- Load the signature from the funcref struct and compare it against the
`call_indirect`'s expected signature; trap if wrong.
- Load the actual code pointer for the callee's Wasm-ABI entry point.
- Load the callee vmctx (which may be different for a cross-module
call).
- Put that vmctx in arg 0, our vmctx in arg 1, and invoke the loaded
code pointer with an indirect call instruction.
Compare and contrast to the process involved in invoking a native
function pointer:
- Invoke the code pointer with an indirect call instruction.
This overhead buys us something -- it is part of the SFI sandbox
boundary -- but it is very repetitive and unnecessary work in *most*
cases when indirect function calls are performed repeatedly (such as
within an inner loop).
This PR introduces the idea of *caching*: if we know that the result of
all the above checks won't change, then if we use the same index as "the
last time" (for some definition), we can skip straight to the "invoke
the code pointer" step, with a cached code pointer from that last time.
Concretely, it introduces a two-word struct inlined into the vmctx for
each `call_indirect` instruction in the module (up to a limit):
- The last invoked index;
- The code pointer that index corresponded to.
When compiling the module, we check whether the table could possibly be
mutated at a given index after it has been read: for example by
instructions like `table.set`, or because the whole table is exported
and thus writable from the outside. We also check whether index 0 is a
non-null funcref. If neither of these things is true, then we know we
can cache an index-to-code-pointer mapping, and we know we can use
index 0 as a sentinel for "no cached value".
We then make use of the struct for each indirect call site and generate
code to check if the index matches; if so, call cached pointer; if not,
load the vmfuncref, check the signature, check that the callee vmctx is
the same as caller (intra-module case), and stash the code pointer and
index away (fill the cache), then make the call.
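In rough Rust pseudocode (type and field names here are illustrative, not the actual runtime definitions), the per-call-site state and the generated check look like:

```rust
/// Illustrative two-word cache, one per `call_indirect` site, stored inline
/// in the vmctx. A zeroed cache reads as index 0, which is why caching is
/// only enabled when slot 0 is known not to hold a callable funcref.
#[repr(C)]
struct CallIndirectCache {
    last_index: usize,   // last invoked table index
    code_ptr: *const u8, // code pointer that index resolved to
}

/// Roughly what the generated code does at each cached call site.
fn call_indirect_at_cached_site(cache: &mut CallIndirectCache, index: usize) {
    if index == cache.last_index {
        // Hit: skip the bounds check, lazy-init check, signature check, and
        // callee-vmctx check; call `cache.code_ptr` directly.
    } else {
        // Miss: run the full sequence (bounds check, load the funcref,
        // signature check, same-instance check), then fill the cache:
        //   cache.last_index = index;
        //   cache.code_ptr = resolved code pointer;
        // and finally make the call.
    }
}
```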
On an in-development branch of SpiderMonkey-in-Wasm with ICs (using
indirect calls), this is about a 20% speedup; I haven't yet measured on
other benchmarks. This might come with an instantiation-time slowdown
due to a larger vmctx (but we could use madvise to zero it if needed).
This feature is off by default right now.
* Addressed review feedback.
* Added some more comments.
* Allow unused VMCallIndirectCache struct (defined for parity with other bits but not needed in actual runtime).
* Add a limit to the number of call-indirect cache slots.
* Fix merge conflict: handle ConstOp element offset.
* Review feedback.
Before this, Cranelift ABI code would emit a stack-load instruction for
every stack argument and add all register arguments to the `args`
pseudo-instruction, whether those arguments were used or not.
However, we already know which arguments are used at that point because
we need the analysis for load-sinking, so it's easy to filter the unused
arguments out.
This avoids generating loads that are immediately dead, which is good
for the generated code. It also slightly reduces the size of the
register allocation problem, which is a small win in compile time.
This also changes which registers RA2 chooses in some cases because it
no longer considers unused defs from the `args` pseudo-instruction.
There was an existing method named `arg_is_needed_in_body` which sounded
like it should be the right place to implement this. However, that
method was only used for Baldrdash integration and has been a stub since
that integration was removed in #4571. Also it didn't have access to the
`value_ir_uses` map needed here. But the place where that method was
called does have access to that map and was perfect for this.
Also, don't emit a `dummy_use` pseudo-instruction for the vmctx if it's
otherwise unused everywhere: we now drop it from the `args` instruction
in that case, and RA2 would then complain that it's used without being
defined.
Furthermore, don't emit debug info specially for the vmctx parameter,
because it's already emitted for all block parameters including vmctx.
Thanks to @elliottt for doing the initial investigation of this change
with me, and to @cfallin for helping me track down the `dummy_use` false
dependency.
When we compute the amount of space that we need in a stack frame for
the stack limit check, we were only counting spill-slots and explicit
stack-slots. However, we need to account for all uses of the stack which
occur before the next stack limit check. That includes clobbers and any
stack arguments we want to pass to callees.
The maximum amount that we could have missed by is essentially bounded
by the number of arguments which could be passed to a function. In
Wasmtime, that is limited by `MAX_WASM_FUNCTION_PARAMS` in
`wasmparser::limits`, which is set to 1,000, and the largest arguments
are 16-byte vectors, so this could undercount by about 16kB.
This is not a security issue according to Wasmtime's security policy
(https://docs.wasmtime.dev/security-what-is-considered-a-security-vulnerability.html)
because it's the embedder's responsibility to ensure that the stack
where Wasmtime is running has enough extra space on top of the
configured `max_wasm_stack` size, and getting within 16kB of the host
stack size is too small to be safe even with this fixed.
However, this was definitely not the intended behavior when stack limit
checks or stack probes are enabled, and anyone with non-default
configurations or non-Wasmtime uses of Cranelift should evaluate whether
this bug impacts their use case.
(For reference: When Wasmtime is used in async mode or on Linux, the
default stack size is 1.5MB larger than the default WebAssembly stack
limit, so such configurations are typically safe regardless. On the
other hand, on macOS the default non-async stack size for threads other
than the main thread is the same size as the default for
`max_wasm_stack`, so that is too small with or without this bug fix.)
### The `GcRuntime` and `GcCompiler` Traits
This commit factors out the details of the garbage collector away from the rest
of the runtime and the compiler. It does this by introducing two new traits,
very similar to a subset of [those proposed in the Wasm GC RFC], although not
all equivalent functionality has been added yet because Wasmtime doesn't
support, for example, GC structs:
[those proposed in the Wasm GC RFC]: https://github.com/bytecodealliance/rfcs/blob/main/accepted/wasm-gc.md#defining-the-pluggable-gc-interface
1. The `GcRuntime` trait: This trait defines how to create new GC heaps, run
collections within them, and execute the various GC barriers the collector
requires.
Rather than monomorphize all of Wasmtime on this trait, we use it
as a dynamic trait object. This does imply some virtual call overhead and
missing some inlining (and resulting post-inlining) optimization
opportunities. However, it is *much* less disruptive to the existing embedder
API, results in a cleaner embedder API anyways, and we don't believe that VM
runtime/embedder code is on the hot path for working with the GC at this time
anyways (that would be the actual Wasm code, which has inlined GC barriers
and direct calls and all of that). In the future, once we have optimized
enough of the GC that such code is ever hot, we have options we can
investigate at that time to avoid these dynamic virtual calls, like only
enabling one single collector at build time and then creating a static type
alias like `type TheOneGcImpl = ...;` based on the compile time
configuration, and using this type alias in the runtime rather than a dynamic
trait object.
The `GcRuntime` trait additionally defines a method to reset a GC heap, for
use by the pooling allocator. This allows reuse of GC heaps across different
stores. This integration is very rudimentary at the moment, and is missing
all kinds of configuration knobs that we should have before deploying Wasm GC
in production. This commit is large enough as it is already! Ideally, in the
future, I'd like to make it so that GC heaps receive their memory region,
rather than allocate/reserve it themselves, and let each slot in the pooling
allocator's memory pool be *either* a linear memory or a GC heap. This would
unask various capacity planning questions such as "what percent of memory
capacity should we dedicate to linear memories vs GC heaps?". It also seems
like basically all the same configuration knobs we have for linear memories
apply equally to GC heaps (see also the "Indexed Heaps" section below).
2. The `GcCompiler` trait: This trait defines how to emit CLIF that implements
GC barriers for various operations on GC-managed references. The Rust code
calls into this trait dynamically via a trait object, but since it is
customizing the CLIF that is generated for Wasm code, the Wasm code itself is
not making dynamic, indirect calls for GC barriers. The `GcCompiler`
implementation can inline the parts of GC barrier that it believes should be
inline, and leave out-of-line calls to rare slow paths.
All that said, there is still only a single implementation of each of these
traits: the existing deferred reference-counting (DRC) collector. So there is a
bunch of code motion in this commit as the DRC collector was further isolated
from the rest of the runtime and moved to its own submodule. That said, this was
not *purely* code motion (see "Indexed Heaps" below) so it is worth not simply
skipping over the DRC collector's code in review.
### Indexed Heaps
This commit does bake in a couple assumptions that must be shared across all
collector implementations, such as a shared `VMGcHeader` that all objects
allocated within a GC heap must begin with, but the most notable and
far-reaching of these assumptions is that all collectors will use "indexed
heaps".
What we are calling indexed heaps are basically the three following invariants:
1. All GC heaps will be a single contiguous region of memory, and all GC objects
will be allocated within this region of memory. The collector may ask the
system allocator for additional memory, e.g. to maintain its free lists, but
GC objects themselves will never be allocated via `malloc`.
2. A pointer to a GC-managed object (i.e. a `VMGcRef`) is a 32-bit offset into
the GC heap's contiguous region of memory. We never hold raw pointers to GC
objects (although, of course, we have to compute them and use them
temporarily when actually accessing objects). This means that deref'ing GC
pointers is equivalent to deref'ing linear memory pointers: we need to add a
base and we also check that the GC pointer/index is within the bounds of the
GC heap. Furthermore, compressing 64-bit pointers into 32 bits is a fairly
common technique among high-performance GC
implementations[^compressed-oops][^v8-ptr-compression] so we are in good
company.
3. Anything stored inside the GC heap is untrusted. Even each GC reference that
is an element of an `(array (ref any))` is untrusted, and bounds checked on
access. This means that, for example, we do not store the raw pointer to an
`externref`'s host object inside the GC heap. Instead an `externref` now
stores an ID that can be used to index into a side table in the store that
holds the actual `Box<dyn Any>` host object, and accessing that side table is
always checked.
[^compressed-oops]: See ["Compressed OOPs" in
OpenJDK.](https://wiki.openjdk.org/display/HotSpot/CompressedOops)
[^v8-ptr-compression]: See [V8's pointer
compression](https://v8.dev/blog/pointer-compression).
The good news with regards to all the bounds checking that this scheme implies
is that we can use all the same virtual memory tricks that linear memories use
to omit explicit bounds checks. Additionally, (2) means that the sizes of GC
objects are that much smaller (and therefore that much more cache friendly)
because they are only holding onto 32-bit, rather than 64-bit, references to
other GC objects. (We can, in the future, support GC heaps up to 16GiB in size
without losing 32-bit GC pointers by taking advantage of `VMGcHeader` alignment
and storing aligned indices rather than byte indices, while still leaving the
bottom bit available for tagging as an `i31ref` discriminant. Should we ever
need to support even larger GC heap capacities, we could go to full 64-bit
references, but we would need explicit bounds checks.)
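To make invariant (2) concrete, here is a much-simplified sketch (hypothetical types, not the real `VMGcRef` code, and ignoring the virtual-memory tricks that can elide the explicit check):

```rust
/// A GC "pointer" is a 32-bit offset into one contiguous heap region and is
/// dereferenced like a linear-memory access: add a base, check the bounds.
struct GcHeap {
    base: *mut u8,
    len: u32,
}

impl GcHeap {
    /// Turn a 32-bit GC reference plus an access size into a raw pointer, or
    /// `None` if the access would fall outside the heap region.
    fn deref(&self, gc_ref: u32, access_size: u32) -> Option<*mut u8> {
        let end = gc_ref.checked_add(access_size)?;
        if end <= self.len {
            Some(unsafe { self.base.add(gc_ref as usize) })
        } else {
            None
        }
    }
}
```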
The biggest benefit of indexed heaps is that, because we are (explicitly or
implicitly) bounds checking GC heap accesses, and because we are not otherwise
trusting any data from inside the GC heap, we greatly reduce how badly things
can go wrong in the face of collector bugs and GC heap corruption. We are
essentially sandboxing the GC heap region, the same way that linear memory is a
sandbox. GC bugs could lead to the guest program accessing the wrong GC object,
or getting garbage data from within the GC heap. But only garbage data from
within the GC heap, never outside it. The worst that could happen would be if we
decided not to zero out GC heaps between reuse across stores (which is a valid
trade off to make, since zeroing a GC heap is a defense-in-depth technique
similar to zeroing a Wasm stack and not semantically visible in the absence of
GC bugs) and then a GC bug would allow the current Wasm guest to read old GC
data from the old Wasm guest that previously used this GC heap. But again, it
could never access host data.
Taken altogether, this allows for collector implementations that are nearly free
from `unsafe` code, and unsafety can otherwise be targeted and limited in scope,
such as interactions with JIT code. Most importantly, we do not have to maintain
critical invariants across the whole system -- invariants which can't be nicely
encapsulated or abstracted -- to preserve memory safety. Such holistic
invariants that refuse encapsulation are otherwise generally a huge safety
problem with GC implementations.
### `VMGcRef` is *NOT* `Clone` or `Copy` Anymore
`VMGcRef` used to be `Clone` and `Copy`. It is not anymore. The motivation here
was to be sure that I was actually calling GC barriers at all the correct
places. I couldn't be sure before. Now, you can still explicitly copy a raw GC
reference without running GC barriers if you need to and understand why that's
okay (aka you are implementing the collector), but that is something you have to
opt into explicitly by calling `unchecked_copy`. The default now is that you
can't just copy the reference and must instead call an explicit `clone` method
(not *the* `Clone` trait, because we need to pass in the GC heap context to run
the GC barriers), which makes it hard to forget to do that accidentally. This resulted in
a pretty big amount of churn, but I am wayyyyyy more confident that the correct
GC barriers are called at the correct times now than I was before.
### `i31ref`
I started this commit by trying to add `i31ref` support. And it grew into the
whole traits interface because I found that I needed to abstract GC barriers
into helpers anyways to avoid running them for `i31ref`s, so I figured that I
might as well add the whole traits interface. In comparison, `i31ref` support is
much easier and smaller than that other part! But it was also difficult to pull
apart from this commit, sorry about that!
---------------------
Overall, I know this is a very large commit. I am super happy to have some
synchronous meetings to walk through this all, give an overview of the
architecture, answer questions directly, etc... to make review easier!
prtest:full
* Switch Winch tests to ATT syntax
* Update all test expectations
* Move all winch tests to `disas` folder
* Add `test = "winch"` to `disas`
* Add `test = "winch"` to all winch test files
* Stub out bits to get AArch64 Winch tests working
* Update expectations for all aarch64 winch tests
* Update flags in Winch tests
Use CLI syntax as that's what `flags` was repurposed as in the new test
suite.
* Update all test expectations for x64 winch
* Omit more offsets by default
* Delete now-dead code
* Update an error message
* Update non-winch test expectations
* Disassemble `*.cwasm` for `compile` disas tests
This commit changes how the `compile` mode of the `disas` test suite
works. Previously this would use `--emit-clif` and run the Cranelift
pipeline for each individual function and use the custom VCode-based
disassembly for instruction output. This commit instead uses the raw
binary coming out of Wasmtime. The ELF file itself is parsed and is
disassembled in a manner similar to Winch tests.
The goal of this commit is somewhat twofold:
* Lay the groundwork to migrate all Winch-based filetests to
`tests/disas`.
* Test the raw output from Cranelift/Wasmtime which includes
optimizations like branch chomping in the `MachBuffer`.
This commit doesn't itself move the Winch tests yet, that's left for a
future commit.
* Update all test expectations for new output
* Fix PR-based CI when too many files are changed
This reverts commit 6c5184809d,
"Cranelift: resolve value aliases when printing CLIF functions (#8214)",
then achieves the same effect a different way.
In discussion on #8223, we decided to handle this a different way. So
I've introduced a method on DataFlowGraph which eliminates all value
aliases, and we can call it when necessary. If that function is not
called then the CLIF printer should print value aliases the same way it
did before #8214.
In this PR, I've specifically called it in disas tests. The changes to
write.rs and frontend.rs are from the revert, while the changes to
disas.rs are new.
In these tests, value aliases are just clutter that distracts from
understanding what code will actually be generated. They may change due
to changes in legalization, but that doesn't signal anything about
whatever the test was intended to check.
Because the new helper deletes aliases after it's done resolving them,
those aliases now disappear from the test expectations. Aside from that,
the test expectations are unchanged compared to what #8214 did.
* Exit through Cranelift-generated trampolines for builtins
This commit changes how builtin functions in Wasmtime (think
`memory.grow`) are implemented. These functions are required to exit
through some manner of trampoline to handle runtime requirements for
backtracing right now. Currently this is done via inline assembly for
each architecture (or external assembly for s390x). This is a bit
unfortunate as it's a lot of hand-coding and making sure everything is
right, and it's not easy to update as it's multiple platforms to update.
The change in this commit is to instead use Cranelift-generated
trampolines for this purpose instead. The path for invoking a builtin
function now looks like:
* Wasm code calls a statically known symbol for each builtin.
* The statically known symbol will perform exit trampoline duties (e.g.
pc/fp/etc) and then load a function pointer to the host
implementation.
* The host implementation is invoked and then proceeds as usual.
The main new piece of this PR is that all wasm modules and functions
are compiled in parallel, but one output of this compilation phase is
the set of builtin functions required. These sets are then unioned
together and anything required is generated just afterwards. That means
only one trampoline per builtin is generated per module.
This work is inspired by #8135 and my own personal desire to have as
much about our ABI details flowing through Cranelift as we can. This in
theory makes it more flexible to deal with future improvements to our
ABI.
prtest:full
* Fix some build issues
* Update winch test expectations
* Update Winch to use new builtin shims.
This commit refactors the Winch compiler to use the new trampolines for
all Wasmtime builtins created in the previous commits. This required a
fair bit of refactoring to handle plumbing through a new kind of
relocation and function call.
Winch's `FuncEnv` now contains a `PrimaryMap` from `UserExternalNameRef`
to `UserExternalName`. This is because there's now more than one kind of
name beyond just wasm function relocations, so the raw index space of
`UserExternalNameRef` is no longer applicable. This required threading
`FuncEnv` to more locations along with some refactorings to ensure that
lifetimes work out ok.
The `CompiledFunction` no longer stores a trait object describing how to map
name refs to names and now directly has a `PrimaryMap`. This also means
that Winch's return value from its `TargetIsa` is a `CompiledFunction`,
as opposed to the previous just-a-`MachBuffer`, so it can also package up
all the relocation information. This ends up having `winch-codegen`
depend on `wasmtime-cranelift-shared` as a new dependency.
* Review feedback
* PCC: x64: insertlane instructions read only scalar-sized values.
Also fix `clamp_range` on greater-than-64-bit values: no range fact is
possible in this case (propagate `Option` a bit deeper to represent
this).
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67538.
* Rebase to latest main with leaf-function changes and update test expectations.
Currently, even when the `wmemcheck` Cargo feature is disabled, the
various related libcalls are still compiled into `wasmtime-runtime`.
Additionally their signatures are translated when lowering functions,
although the signatures are never used. This commit adds `#[cfg]`
annotations to compile these all out when they're not enabled.
Applying this change, however, uncovered a subtle bug in our libcalls.
Libcalls are numbered in-order as-listed in the macro ignoring `#[cfg]`,
but they're assigned a runtime slot in a `VMBuiltinFunctionsArray`
structure which does respect `#[cfg]`. This meant, for example, that if
`gc` was enabled and `wmemcheck` was disabled, as is the default for our
tests, then there was a hole in the numbering where libcall numbers were
mismatched at runtime and compile time.
To fix this I've first added a const assertion that the runtime-number
of libcalls equals the build-time number of libcalls. I then updated the
macro a bit to plumb the `#[cfg]` differently and now libcalls are
unconditionally defined regardless of cfgs but the implementation is
`std::process::abort()` if it's compiled out.
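As a tiny, hypothetical illustration of the hazard and the guard (none of these names are the real ones):

```rust
// The compile-time side numbers every libcall, including ones whose Cargo
// feature is disabled:
const COMPILE_TIME_LIBCALL_COUNT: usize = 3; // e.g. memory_grow, gc_alloc, check_malloc

fn memory_grow() { /* real implementation */ }
fn gc_alloc() { /* real implementation */ }
fn check_malloc() {
    // With `wmemcheck` compiled out the slot still exists, keeping the
    // numbering dense, but actually reaching it is a bug, so abort.
    std::process::abort();
}

// The runtime table must have exactly one slot per compile-time number;
// previously a disabled feature removed a slot here and shifted later indices.
const RUNTIME_LIBCALLS: [fn(); 3] = [memory_grow, gc_alloc, check_malloc];

// Catch any mismatch between the two numberings at build time.
const _: () = assert!(RUNTIME_LIBCALLS.len() == COMPILE_TIME_LIBCALL_COUNT);
```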
This ended up having a large-ish impact on the `disas` test suite. Lots
of functions have fewer signatures translated because wmemcheck, even
when disabled, was translating a few signatures. This also had some
assembly changes, too, because I believe functions are considered leaves
based on whether they declare a signature or not, so declaring an unused
signature was preventing all wasm functions from being considered leaves.
* cranelift: Optimize select_spectre_guard, carefully
This commit makes two changes to our treatment of
`select_spectre_guard`.
First, stop annotating this instruction as having any side effects. We
only care that if its value result is used, then it's computed without
branching on the condition input. We don't otherwise care when the value
is computed, or if it's computed at all.
Second, introduce some carefully selected ISLE egraph rewrites for this
instruction. These particular rewrites are those where we can statically
determine which SSA value will be the result of the instruction. Since
there is no actual choice involved, there's no way to accidentally
introduce speculation on the condition input.
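The flavor of those rewrites, written as plain Rust rather than the actual ISLE rules (the exact rule set may differ), is:

```rust
type Value = u32; // stand-in for a CLIF SSA value id

/// Cases where `select_spectre_guard(cond, if_true, if_false)` has a
/// statically known result, so rewriting it cannot introduce a branch (or
/// speculation) on `cond`.
fn simplify_select_spectre_guard(
    known_cond: Option<u64>, // Some(..) if the condition is a known constant
    if_true: Value,
    if_false: Value,
) -> Option<Value> {
    match known_cond {
        Some(0) => Some(if_false),
        Some(_) => Some(if_true),
        None if if_true == if_false => Some(if_true),
        None => None,
    }
}
```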
* Add filetests
* PCC: support imported memories as well.
Exposed by a fuzzbug
(https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=67429); rather
than exclude from fuzzing, it seemed easier to just implement. We need
to define a new memory type to describe the memory definition struct
pointed to by vmctx, and set up points-to facts appropriately.
* Review feedback: abstract out a bit of the pointer-and-memtype handling logic.
Currently, every access to a table element does a bounds-check with a
conditional branch to a block that explicitly traps.
Instead, when Spectre mitigations are enabled, let's change the address
computation to return a null pointer for out-of-bounds accesses, and
then allow the subsequent load or store to trap.
This is less code in that case since we can reuse instructions we needed
anyway.
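In rough pseudocode (Rust standing in for the generated CLIF; illustrative only), the address computation becomes:

```rust
/// Out-of-bounds indices yield a null address; the following load or store is
/// then allowed to trap on it instead of branching to an explicit trap block.
fn table_elem_addr(base: usize, table_len: u32, index: u32, elem_size: usize) -> usize {
    let addr = base.wrapping_add((index as usize).wrapping_mul(elem_size));
    // In generated code this selection is a `select_spectre_guard` (a
    // conditional move, not a branch), so the address isn't speculated past
    // the bounds check.
    if index < table_len { addr } else { 0 }
}
```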
We return the MemFlags that the memory access should use, in addition to
the address it should access. That way, when Spectre mitigations are
disabled, we don't record trap metadata on memory access instructions
that can't actually trap because they're preceded by a `trapnz`-based
bounds check.
In addition, when the table has constant size and the element index is a
constant and mid-end optimization is enabled, this allows the
bounds-check to be constant folded away. Later, #8139 will let us
optimize away the select_spectre_guard instruction in this case too.
Once we also implement #8160, `tests/disas/typed-funcrefs.wat` should be
almost as fast as native indirect function calls.
* Enhance `typed-funcrefs.wast` test with more cases
Have the same function with slightly different variations to compare
codegen between the possible strategies.
* Skip type checks on tables that don't need it
This commit implements an optimization to skip type checks in
`call_indirect` for tables that don't require it. With the
function-references proposal it's possible to have tables of a single
type of function as opposed to today's default `funcref` which is a
heterogeneous set of functions. In this situation it's possible that a
`call_indirect`'s type tag matches the type tag of a
`table`-of-typed-`funcref`-values, meaning that it's impossible for the
type check to fail.
The type check of a function pointer in `call_indirect` is refactored
here to take the table's type into account. Various things are shuffled
around to ensure that the right traps still show up in the right places
but the important part is that, when possible, the type check is omitted
entirely.
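The decision being made can be sketched in plain Rust (much simplified; the real logic lives in the CLIF generator and also handles null funcrefs and trap placement):

```rust
/// When does a `call_indirect` still need a runtime signature check?
enum TableElemType {
    /// Classic heterogeneous `funcref` table: any function type may appear.
    AnyFuncref,
    /// Typed function table from the function-references proposal, holding
    /// only functions of one concrete type (identified here by an index).
    TypedFuncref { type_index: u32 },
}

fn needs_signature_check(table: &TableElemType, expected_type_index: u32) -> bool {
    match table {
        // Heterogeneous table: the callee's type tag must be checked.
        TableElemType::AnyFuncref => true,
        // Homogeneous typed table: if the element type is exactly the type
        // the call expects, the check can never fail, so it is omitted.
        TableElemType::TypedFuncref { type_index } => *type_index != expected_type_index,
    }
}
```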
* Update crates/cranelift/src/func_environ.rs
Co-authored-by: Jamey Sharp <jamey@minilop.net>
---------
Co-authored-by: Jamey Sharp <jamey@minilop.net>
This commit uses the support from #8162 to skip null function pointer
checks when performing an indirect call. Instead of an explicit check
the segfault from accessing the null function pointer is caught and
annotated with the appropriate trap.
Closes #5291
This is based on discussion in #8158: as noted in #8160, if we use a
nullable typed funcref table instead (and given that we know we'll
initialize a particular slot before use on the application side, so we
won't actually call a null ref), and if we have a null-ref default
value, we should be able to avoid the lazy table-init mechanism
entirely.
(Ignore the part where this module doesn't actually have any update
logic that would set non-null refs anywhere; it's a compile-test, not a
runtest!)
Once #8159 is merged and #8160 is implemented, we should see zero
branches in this test.
* Wasm tests: add typed-funcref test showing example of desirable optimizations.
In order to have fast IC (inline cache) chains in AOT-compiled dynamic
language Wasms, it would be great if we could make the "call to a typed
funcref at a constant table index" pattern fast.
This use-case was discussed at the most recent Wasmtime biweekly and
@jameysharp is working on some optimizations; the intent of this PR is
to provide a concrete test-case whose blessed output we can see improve
over time.
In particular, the following opts are still desirable:
- With the use of non-nullable typed funcrefs, there shouldn't be a null
check (there currently is, as noted by a comment in the code due to
lack of type information at the right spot).
- With the use of a constant table size and a constant index to the
`table.get`, we should be able to load from the table without a
bounds-check or any Spectre masking.
Other further optimizations for this pattern might be possible if we
rearrange the table and function-reference data structures, and the
lazy-initialization scheme thereof, but the above should be agnostic to
that.
* Add comments to clarify typed funcrefs usage.
* Move remaining `*.wat` tests out of cranelift-wasm/wasmtests
Move these up to Wasmtime's misc testsuite to get translated and
instantiated by Wasmtime.
Note that the max-function-index-in-name-section test was removed here
as that's tested by the support added in #3509.
* Remove cranelift-wasm test for name section
That we respect the name section is pretty thoroughly tested elsewhere
in Wasmtime; for example, many of the trap tests assert that the name of
the function comes from the text format.
* Move reachability tests out of cranelift-wasm
Instead add them to the disassembly test suite to ensure we don't
generate dead code. Additionally this has a lot of coverage via fuzzing
too.
* Move more tests out of cranelift-wasm
Move them into `tests/disas` so we can easily see the CLIF.
* Turn most cranelift-wasm wasmtests into disas tests
These tests mostly verify compiler behavior for specific WebAssembly
constructs or small combinations.
Currently they're only tested to make sure that cranelift-wasm doesn't
crash or report an error. But they'd be even more useful if we tracked
what CLIF we actually produce from Wasmtime.
So I'm moving them to the new disas test framework and adding basic CLIF
test expectations for them.
I've excluded the .wat files which were produced by Emscripten or Rust,
because they're very large and their compiled CLIF is not very
interesting to look at.
* Add test expectations
* Support hex numbers in CLI arguments
This enables `-O static-memory-maximum-size=0x10`, for example.
Previously this was a parse error.
* Discover `disas` tests in subdirectories
* Deny unknown fields when parsing test configs
Make sure we don't accidentally have other configuration. Useful when
porting tests over.
* Update `make-load-store-tests.sh` to new syntax
Pass various flags and such.
* Move all `*.wat` tests from `cranelift/filetests` into `tests/disas`
* Migrate all load/store tests
* Migrate all other `*.wat` tests to `tests/disas`
* Use opt-level=0 with table instructions
* Add an `asm` test suite for Wasmtime
This commit adds a suite of tests at `tests/asm/*.wat` which is intended
to replace the tests in `cranelift/filetests/filetests/wasm`. Tests are
configured differently than before, using Wasmtime CLI flags rather than
a custom TOML-based configuration scheme. Otherwise, though, the same
shape of tests is supported.
This commit migrates a small handful of tests as a showcase and bulk
migration is left for a follow-up.
* Organize the asm.rs test with methods/functions
* Switch back to TOML for config parsing
* Disable disassembly tests on miri
Takes a bit too long in cranelift