This ensures that memories, even with zero contents, still have the
necessary virtual mappings as required by the code generator to report
out-of-bounds reads/writes.
* Fix instruction size test for Rust 1.65.0 (#5188)
Looks like Rust generously shrank our `enum` in 1.65.0, so update the
test assertion to pass CI.
* Use an alternate doxygen download link (#5150)
* Use an alternate doxygen download link
Looks like doxygen.nl is down otherwise.
* Update link
* Fix push tag workflow (#5082)
This commit fixes the `push-tag.yml` workflow to work with the new
`Cargo.toml` manifest since workspace inheritance was added. This
additionally fixes some warnings coming up on CI about our usage of
deprecated features on github actions.
* Reduce warnings on CI from GitHub Actions (#5083)
* Upgrade our github actions to "node16"
Each github actions run has a lot of warnings about using node12 so this
upgrades our repository to using node16. I'm hoping no other changes are
needed and I suspect other actions we're using are on node12 and will
need further updates, but this should help pin down what's remaining.
* Update `actions/checkout` workflow to `v3`
* Update to `actions/cache@v3`
* Update to `actions/upload-artifact@v3`
* Drop usage of `actions-rs/toolchain`
* Update to `actions/setup-python@v4`
* Update mdbook version
* Add `package-lock.json` for `github-release` action (#5091)
A local github action we have has been broken for about a month now
meaning that the `dev` tag isn't getting updated or getting new
releases. This appears to be due to the publication of new versions of
these dependencies which are running into issues using one another. I
think I've figured out versions that work and have added a
`package-lock.json` to ensure we keep using the same versions.
* More fixes for publish action (#5110)
Looks like #5091 wasn't enough and some of the APIs needed updating with
changes made in the meantime. I've updated the action here and
additionally made a separate change where the release isn't continually
created and deleted but instead left alone and only the tag is updated.
This should work for the `dev` release and avoids deleting/recreating on
each PR, sending out notifications for new releases.
* Add missing `Win32_Foundation` feature
This is necessary for the `wasmtime-runtime` crate to compile on Windows.
* Add a note for the 2.0.1 release
* Remove rayon dependency of cranelift-isle (#5101)
Using rayon adds a lot of dependencies to Cranelift. The total
unparallelized time the code that uses rayon takes is less than half a
second and it runs at compile time, so there is pretty much no benefit
to parallelizing it.
* Add a note about rayon removal
Co-authored-by: Christopher Serr <christopher.serr@gmail.com>
Co-authored-by: bjorn3 <17426603+bjorn3@users.noreply.github.com>
* Fix StructReturn handling: properly mark the clobber, and offset actual rets. (#5023)
* Fix StructReturn handling: properly mark the clobber, and offset actual rets.
The legalization of `StructReturn` was causing issues in the new
call-handling code: the `StructReturn` ret was included in the `SigData` as
if it were an actual CLIF-level return value, but it is not.
Prior to using regalloc constraints for return values, we
unconditionally included rax (or the architecture's usual return
register) as a def, so it would be properly handled as "clobbered" by
the regalloc. With the new scheme, we include defs on the call only for
CLIF-level outputs. Callees with `StructReturn` args were thus not known
to clobber the return-value register, and values might be corrupted.
This PR updates the code to include a `StructReturn` ret as a clobber
rather than a returned value in the relevant spots. I observed it
causing saves/restores of rax in some CLIF that @bjorn3 provided me, but
I was having difficulty minimizing this into a test-case that I would be
comfortable including as a precise-output case (including the whole
thing verbatim would lock down a bunch of other irrelevant details and
cause test-update noise later). If we can find a more minimized example
I'm happy to include it as a filetest.
Fixes#5018.
* Disable wasi-nn CI tests due to breakage (404'ing package repository).
In #5023 we are seeing a failing CI job (see [1]); after four attempted
restarts, it 404's each time when trying to download OpenVino from the
Intel apt mirrors.
This PR temporarily removes the wasi-nn CI job from our CI configuration
so that we have green CI and can merge other work.
[1] https://github.com/bytecodealliance/wasmtime/actions/runs/3200861896/jobs/5228903240
Besides the standard traits (Copy, Clone, PartialEq and Eq), we also mark
the trait as non-exhaustive so that we can add errors in the future
without breaking API.
Signed-off-by: Nathaniel McCallum <nathaniel@profian.com>
Signed-off-by: Nathaniel McCallum <nathaniel@profian.com>
A minor update of a few other crates drops `semver` and `rustc_version`
from `Cargo.lock`. I've audited the deltas in versions for the other
crates here as well and they all look good.
* Elide redundant sentinel values
The `undef_variables` lists were a binding from Variable to Value, but
the Values were always equal to a suffix of the block's parameters. So
instead of storing another copy, we can just get them back from the
block parameters.
According to DHAT, this decreases total memory allocated and number of
bytes written, and increases number of bytes read and instructions
retired, but all by small fractions of a percent. According to
hyperfine, main is "1.00 ± 0.01 times faster".
* Use entity_impl for cranelift_frontend::Variable
Instead of hand-coding essentially the same thing.
* Keep undefined variables in a ListPool
According to DHAT, this improves every measure of performance
(instructions retired, total memory allocated, max heap size, bytes
read, and bytes written), although by fractions of a percent. According
to hyperfine the difference is nearly zero, but on Spidermonkey this
branch is "1.01 ± 0.00 times faster" than main.
* Elide redundant block IDs
In a list of predecessors, we previously kept both the jump instruction
that points to the current block, and the block where that instruction
resides. But we can look up the block from the instruction as long as we
have access to the current Layout, which we do everywhere that it was
necessary. So don't store the block, just store the instruction.
* Keep predecessor definitions in a ListPool
* Make append_jump_argument independent of self
This makes it easier to reason about borrow-checking issues.
* Reuse `results` instead of re-doing variable lookup
This eliminates three array lookups per predecessor by hanging on to the
results of earlier steps a little longer. This only works now because I
previously removed the need to borrow all of `self`, which otherwise
prevented keeping a borrow of self.results alive.
I had experimented with using `Vec::split_off` to copy the relevant
chunk of results to a temporary heap allocation, but the extra
allocation and copy was measurably slower. So it's important that this
is just a borrow.
* Cache single-predecessor block ID when sealing
Of the code in cranelift_frontend, `use_var` is the second-hottest path,
sitting close behind the `build` function that's used when inserting
every new instruction. This makes sense given that the operands of a new
instruction usually need to be looked up immediately before building the
instruction.
So making the single-predecessor loops in `find_var` and `use_var_local`
do fewer memory accesses and execute fewer instructions turns out to
have a measurable effect. It's still only a small fraction of a percent
overall since cranelift-frontend is only a few percent of total runtime.
This patch keeps a block ID in the SSABlockData, which is None unless
both the block is sealed and it has exactly one predecessor. Doing so
avoids two array lookups on each iteration of the two loops.
According to DHAT, compared with main, at this point this PR uses 0.3%
less memory at max heap, reads 0.6% fewer bytes, and writes 0.2% fewer
bytes.
According to Hyperfine, this PR is "1.01 ± 0.01 times faster" than main
when compiling Spidermonkey. On the other hand, Sightglass says main is
1.01x faster than this PR on the same benchmark by CPU cycles. In short,
actual effects are too small to measure reliably.
Minor thing I noticed from #4990 but I stylistically prefer to keep the
`mod foo;` definitions canonicalized to one location to emphasize how
multiple targets can use the same definition.
* Make wasmtime build for windows-aarch64
* Add check for win arm64 build.
* Fix checks for winarm64 key in workflows.
* Add target in windows arm64 build.
* Add tracking issue for Windows ARM64 trap handling
* cranelift-wasm: Assume block is reachable
In handling the WebAssembly "end" operator, cranelift-wasm had logic to
skip generating a jump instruction if the block was both unreachable and
"pristine", meaning no instructions had been added.
However, `translate_operator` checks first that `state.reachable` is
true, so this logic only runs when cranelift-wasm believes that the
current block _is_ reachable. Therefore the condition should always be
true, whether the block is pristine or not.
I've left a debug_assert in case `state.reachable` ever doesn't agree
with `builder.is_unreachable()`, but the assert doesn't fail in any of
the tests. We'll see if fuzzing finds something.
Anyway, outside of cranelift-frontend, this eliminates the only use of
`is_pristine()`, and there were no uses of `is_filled()`. So I've made
both of those private. They're now only used in a nearby debug assert.
* cranelift-frontend: Clarify pristine/filled states
There was a comment here saying "A filled block cannot be pristine."
Given that the intent was for those two states to be mutually exclusive,
I've replaced the two booleans with a three-state enum.
I also replaced all reads of these two flags with method calls. In all
but one case these are only checked in debug assertions, so I don't even
care whether they get inlined. They're easier to read, and this will
make it easier to replace their implementations, which I hope to do
soon.
Finally, I replaced all assignments to either flag with an appropriate
assignment of the corresponding enum state. Keep in mind this
correspondence between the new enum and the old flags:
- Empty: pristine true, filled false
- Partial: pristine false, filled false
- Filled: pristine false, filled true
Every existing update to these flags could only move to a later state.
(For example, Partial couldn't go back to Empty.) In the old flags that
meant that pristine could only go from true to false, and filled could
only go from false to true.
`fill_current_block` was a weird case because at first glance it looks
like it could allow both pristine and filled to be true at the same
time. However, it's only called from `FuncInstBuilder::build`, which
calls `ensure_inserted_block` before doing anything else, and _that_
cleared the pristine flag.
Similarly, `handle_ssa_side_effects` looks like it could allow both
pristine and filled to be true for anything in `split_blocks_created`.
However, those blocks are created by SSABuilder, so their BlockData is
not initialized by `create_block`, and instead uses BlockData::default.
The `Default` implementation here previously set both flags false, while
`create_block` would instead set pristine to true. So these split blocks
were correctly set to the Filled state, and after this patch they are
still set correctly.
* cranelift-frontend: Separate SSA and user block params
Previously there was a `user_param_count` field in BlockData, used
purely to debug-assert that no user parameters are added to a block
after `use_var` adds SSA parameters.
Instead, this patch enforces a strict phase separation between the
period after a block is created when user parameters can be added to it,
and the period when `use_var` may be called and instructions may be
added.
I'm assuming that calls to `use_var` are _always_ followed by inserting
one or more instructions into the block. (If you don't want to insert an
instruction, why do you need to know where instructions in this block
would get variable definitions from?) This patch has no visible effect
for callers which follow that rule.
However, it was previously legal to call `use_var`, then append a block
parameter before adding instructions, so long as `use_var` didn't
actually need to add a block parameter. That could only happen if the
current block is sealed and has exactly one predecessor. So anyone who
was counting on this behavior was playing a dangerous game anyway.
* cranelift-frontend: Defer initializing block data
Every reference to the func_ctx.status SecondaryMap will automatically
create the appropriate entries on-demand, with the sole exception of
`finalize`. In that function, debug assertions use SecondaryMap::keys to
find out which blocks need to be checked.
However, those assertions always succeed for blocks which never had any
instructions added. So it's okay to skip them for blocks which aren't
touched after `create_block`.
Resolve overlap in the RiscV64 backend by adding priorities to rules. Additionally, one test updated as a result of this work, as a peephole optimization for addition with immediates fires now.
* Cleanups to cranelift-frontend SSA construction
* Encode sealed/undef_variables relationship in type
A block can't have any undef_variables if it is sealed. It's useful to
make that fact explicit in the types so that any time either value is
used, it's clear that we should think about the other one too.
In addition, encoding this fact in an enum type lets Rust apply an
optimization that reduces the size of SSABlockData by 8 bytes, making it
fit in a 64-byte cache line. I haven't taken the extra step of making
SSABlockData be 64-byte aligned because 1) it doesn't seem to have a
performance impact and b) doing so makes other structures quite a bit
bigger.
* Simplify finish_predecessors_lookup
Using Vec::drain is more concise than a combination of
iter().rev().take() followed by Vec::truncate. And in this case it
doesn't matter what order we examine the results in, because we just
want to know if they're all equal, so we might as well iterate forward
instead of in reverse.
There's no need for the ZeroOneOrMore enum. Instead, there are only two
cases: either we have a single value to use for the variable (possibly
synthesized as a constant zero), or we need to add a block parameter in
every predecessor.
Pre-filtering the results iterator to eliminate the sentinel makes it
easy to identify how many distinct definitions this variable has.
iter.next() indicates if there are any definitions at all, and then
iter.all() is a clear way to express that we want to know if the
remaining definitions are the same as the first one.
* Simplify append_jump_argument
* Avoid assigning default() into SecondaryMap
This eliminates some redundant reads and writes.
* cranelift-frontend: Construct with default()
This eliminates a bunch of boilerplate in favor of a built in `derive`
macro.
Also I'm deleting an import that had the comment "FIXME: Remove in
edition2021", which we've been using everywhere since April.
* Fix tests
In the common case where there is a chain of sealed blocks that each
have exactly one predecessor, we can keep track of any sub-sequence of
those blocks in O(1) space. So there's no need to use the state machine
stack to propagate variable definitions back along the chain.
Instead, we can do one loop to find which block to stop at, then either
get the variable definition from that block or introduce a block
parameter there, and finally do one more loop to update variable
definitions in all the intervening blocks.
The existing implementation already had to do a graph traversal to
propagate variable definitions correctly, so this doesn't visit any more
blocks than before. However, this change also makes it possible to
integrate cycle detection with the graph traversal. That eliminates the
need for the in_predecessor_cycle flags, and any possibility of spiky
performance profiles in maintaining those flags.
As far as performance goes, this is all pretty much a wash: Changes to
CPU time and CPU cycles are within noise, according to hyperfine and
Sightglass/perf. But it's a substantially simpler implementation, with
fewer invisible interactions between functions.
Resolve overlap in the ISLE prelude and the x64 inst module by introducing new types that allow better sharing of extractor resuls, or falling back on priorities.
* Update spec test repo
Our submodule was accidentally reverted to an older commit as part
of #4271 and while it could be updated to as it was before I went ahead
and updated it to `main`.
* Update ignore directives and test multi-memory
* Update riscv ignores
At control-flow join points, cranelift-frontend's SSA builder currently
checks to see if only one definition of a variable reaches the current
block. If so, it can eliminate the corresponding block parameter and use
the original def directly. It implements this by turning the block
parameter into an alias for the original value.
However, it didn't resolve aliases during this check, except after it
had already determined that there was only one definition.
Resolving aliases first instead allows it to detect that more block
parameters are redundant. And as more block parameters get converted to
aliases, later blocks can see common definitions from further away, so
this has a compounding effect.
This also merges a special case, where there's exactly one unique
non-sentinel definition but it's actually an alias for the sentinel,
into the general case where all definitions are from the sentinel. As a
result there's only one case that has to introduce a definition of the
variable to zero.
According to `valgrind --tool=dhat`, this is a significant memory
savings. On the pulldown-cmark benchmark from Sightglass:
- 15.3% (1.9MiB) less memory allocated at maximum heap
- 4.1% (6.7MiB) less memory allocated in total
- 9.8% (57MiB) fewer bytes read
- 12.6% (36MiB) fewer bytes written
- 5.4% fewer instructions retired
- 1.04x faster by instructions retired (per Sightglass/perf)
- 1.03x to 1.04x faster by CPU cycles (per Sightglass/perf)
- 1.03 ± 0.01 times faster by CPU time (per hyperfine)
- 1.04x faster by cache accesses (per Sightglass/perf)
On the bz2 benchmark:
- 1.06x faster by instructions retired (per Sightglass/perf)
- 1.05x faster by CPU cycles (per Sightglass/perf)
- 1.04 ± 0.01 times faster by CPU time (per hyperfine)
- 1.02x to 1.03x faster by cache accesses (per Sightglass/perf)
Even on the largest benchmark in Sightglass (spidermonkey.wasm), this is
a measurable improvement:
- 1.03x faster by instructions retired (per Sightglass/perf)
- 1.02x faster by CPU cycles (per Sightglass/perf)
- 1.02 ± 0.00 times faster by CPU time (per hyperfine)
There was no significant difference in cache misses for any benchmark,
according to Sightglass/perf.
* Update wasm-tools dependencies
This update brings in a number of features such as:
* The component model binary format and AST has been slightly adjusted
in a few locations. Names are dropped from parameters/results now in
the internal representation since they were not used anyway. At this
time the ability to bind a multi-return function has not been exposed.
* The `wasmparser` validator pass will now share allocations with prior
functions, providing what's probably a very minor speedup for Wasmtime
itself.
* The text format for many component-related tests now requires named
parameters.
* Some new relaxed-simd instructions are updated to be ignored.
I hope to have a follow-up to expose the multi-return ability to the
embedding API of components.
* Update audit information for new crates
This commit updates the `MIN_STACK_SIZE` constant for Unix platforms
when allocating a sigaltstack from 16k to 64k. The signal handler
captures a wasm `Backtrace` which involves memory allocations and it was
recently discovered that, at least in debug mode, jemalloc can take up
to 16k of stack space for an allocation. To allow running the
sigaltstack size is increased here.
This historically was used to guard against recursive faults but
later refactorings have made this variable somewhat obsolete. The code
that it still protects is not the "meat" of trap handling. Instead the
`jmp_buf_if_trap` is changed to be more like "take" so once a "take"
succeeds it won't be able to recursively call any more "meat".
Overall this shouldn't affect anything, it's just a small internal
cleanup.
* Leverage Cargo's workspace inheritance feature
This commit is an attempt to reduce the complexity of the Cargo
manifests in this repository with Cargo's workspace-inheritance feature
becoming stable in Rust 1.64.0. This feature allows specifying fields in
the root workspace `Cargo.toml` which are then reused throughout the
workspace. For example this PR shares definitions such as:
* All of the Wasmtime-family of crates now use `version.workspace =
true` to have a single location which defines the version number.
* All crates use `edition.workspace = true` to have one default edition
for the entire workspace.
* Common dependencies are listed in `[workspace.dependencies]` to avoid
typing the same version number in a lot of different places (e.g. the
`wasmparser = "0.89.0"` is now in just one spot.
Currently the workspace-inheritance feature doesn't allow having two
different versions to inherit, so all of the Cranelift-family of crates
still manually specify their version. The inter-crate dependencies,
however, are shared amongst the root workspace.
This feature can be seen as a method of "preprocessing" of sorts for
Cargo manifests. This will help us develop Wasmtime but shouldn't have
any actual impact on the published artifacts -- everything's dependency
lists are still the same.
* Fix wasi-crypto tests
* Port branches to ISLE (AArch64)
Ported the existing implementations of the following opcodes for AArch64
to ISLE:
- `Brz`
- `Brnz`
- `Brif`
- `Brff`
- `BrIcmp`
- `Jump`
- `BrTable`
Copyright (c) 2022 Arm Limited
* Remove dead code
Copyright (c) 2022 Arm Limited
This fixes a compile-time error introduced in #4207. The `?` operator
doesn't work inside `Option::map` because it tries to return from the
inner closure, not the outer function.
Apparently our CI doesn't build wasmtime-bench-api so it didn't catch
this issue.
We weren't using the "union" cargo feature for the smallvec crate, which
reduces the size of a SmallVec by one machine word. This feature
requires Rust 1.49 but we already require much newer versions.
When using Wasmtime to compile pulldown-cmark from Sightglass, this
saves a decent amount of memory allocations and writes. According to
`valgrind --tool=dhat`:
- 6.2MiB (3.69%) less memory allocated over the program's lifetime
- 0.5MiB (4.13%) less memory allocated at maximum heap size
- 5.5MiB (1.88%) fewer bytes written to
- 0.44% fewer instructions executed
Sightglass reports a statistically significant runtime improvement too:
compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm
Δ = 24379323.60 ± 20051394.04 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.13x faster than main-be690a468.so!
[227506364 355007998.78 423280514] main-be690a468.so
[227686018 330628675.18 406025344] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm
Δ = 360151622.56 ± 278294316.90 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.01x to 1.07x faster than main-be690a468.so!
[8709162212 8911001926.44 9535111576] main-be690a468.so
[5058015392 8550850303.88 9282148438] shrink-abiarg-0406da67c.so
compilation :: cycles :: benchmarks/bz2/benchmark.wasm
Δ = 6936570.28 ± 6897696.38 (confidence = 99%)
shrink-abiarg-0406da67c.so is 1.00x to 1.08x faster than main-be690a468.so!
[155810934 175260571.20 234737344] main-be690a468.so
[119128240 168324000.92 257451074] shrink-abiarg-0406da67c.so
`tracing` crate is already used within the codebase, this change allows
developers to benefit from that functionality when running and debugging
tests
Signed-off-by: Roman Volosatovs <rvolosatovs@riseup.net>
Signed-off-by: Roman Volosatovs <rvolosatovs@riseup.net>