This commit adds LICENSE files to all **published** crates that do
not already have one (most crates do).
Providing the license files is a requirement of the Apache 2.0 License.
I had no idea this was still in the repository, much less building!
There are much better ways to use wasmtime in Rust nowadays, such as
the `wasmtime` crate!
Moves the slow path, which resizes the vector, out of line. The actual
indexing is also done in the out-of-line path, which avoids the need for
a second bounds check in the fast path after a potential resize.
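A minimal sketch of the pattern (function and type names are
illustrative, not the actual wasmtime code):

```rust
/// Fast path: a single bounds check; both growing and the subsequent
/// indexing live out of line.
fn get_or_grow(vec: &mut Vec<u32>, index: usize) -> &mut u32 {
    if index < vec.len() {
        &mut vec[index]
    } else {
        grow_then_index(vec, index)
    }
}

/// Slow path, kept out of the caller's instruction stream: resize, then
/// index here so the fast path needs no second bounds check.
#[cold]
#[inline(never)]
fn grow_then_index(vec: &mut Vec<u32>, index: usize) -> &mut u32 {
    vec.resize(index + 1, 0);
    &mut vec[index]
}
```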
Apparently PowerShell's `Compress-Archive` produces zip files with
backslashes in filenames, which some Unix extraction tools cannot
handle. For example, in [this failure][build], macOS's built-in unzip
creates filenames with backslashes in them rather than subdirectories.
[build]: https://github.com/bytecodealliance/wasmtime-go/runs/2680596219?check_suite_focus=true
This commit attempts to slim down our CI (more from #2933) by no longer
testing in both debug and release mode. I can't actually recall a
concrete issue that testing both modes has turned up on CI itself, and
otherwise we're spending quite a lot of time building all of the
dev-dependencies in release mode when testing.
Additionally it removes testing for nightly/beta channels of Rust. One
of the main benefits of this, staying on top of breakage, is already
moot because we pin to a nightly anyway. We have a few nightly
references elsewhere in CI (fuzzing/docs) so we can largely rely on that
(and upstream testing with rust-lang/rust). We in general shouldn't need
to do nightly/beta testing on all builds.
The release builders, however, were actually the only place where MinGW
and AArch64 were tested. This means that the old nightly/beta builders
are now replaced with AArch64 and MinGW builders. Overall, the changes
made to CI here are:
* Upgrade to QEMU 6.0.0. I thought this would make aarch64 emulation
faster, but it didn't. Seems good to stay up to date though.
* Replace nightly/beta testing in debug mode with MinGW and AArch64 testing.
* Use `-g0` for C compilation on MinGW because otherwise `gcc` as used
on CI generates an ICE (!!)
* Exclude `wasi-crypto` from testing. We already exclude
`wasmtime-wasi-crypto` and it was an accident we were testing the
`wasi-crypto` crate (which isn't even part of this workspace).
* Remove the step testing DWARF with the old backend, which nowadays
  didn't actually do that.
* Remove testing on release builders, making them purely tasked with
  release builds and nothing else.
* Rename `QEMU_VERSION` to `QEMU_BUILD_VERSION` so qemu doesn't just
immediately exit after printing its version.
Timing-wise the release builds are ~20-30 minutes faster, depending on
the platform. This is not really because of testing time but rather
because we have a huge dependency tree when `dev-dependencies` are
considered (criterion, tokio, proptest, ...).
MinGW tests are pretty fast since we don't run examples (we're not too
interested in running examples there, just in having windows/mac/linux
coverage).
AArch64 tests are run with optimizations enabled because unoptimized
tests take ~45 minutes to finish while optimized tests take ~20 minutes.
The build is naturally much faster in debug mode, but apparently under
QEMU emulation the debug-mode binaries are *extremely* slow compared to
the release binaries, which means the extra time spent compiling tests
in release mode is more than made up for by the faster emulation time.
Closes #2938
When AVX512VL and AVX512BITALG are available, Wasm SIMD's `popcnt`
instruction can be lowered to a single x64 instruction, `VPOPCNTB`,
instead of 8+ instructions.
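A rough sketch of the lowering decision; `has_avx512vl`/`has_avx512bitalg`
mirror Cranelift's x64 ISA flags, while the surrounding types here are
purely illustrative:

```rust
// Purely illustrative: the real lowering emits machine instructions, but
// the decision structure looks like this.
struct IsaFlags {
    has_avx512vl: bool,
    has_avx512bitalg: bool,
}

enum PopcntLowering {
    /// Single instruction: VPOPCNTB counts set bits per byte lane directly.
    Vpopcntb,
    /// Fallback: an 8+ instruction SSE sequence of shifts, masks, and adds.
    SseSequence,
}

fn lower_popcnt(flags: &IsaFlags) -> PopcntLowering {
    if flags.has_avx512vl && flags.has_avx512bitalg {
        PopcntLowering::Vpopcntb
    } else {
        PopcntLowering::SseSequence
    }
}
```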
This commit removes the publish step in GitHub Actions, instead
folding all functionality into the release build steps. This avoids
having a separately scheduled job after all the release build jobs which
ends up getting delayed for quite a long time given the current
scheduling algorithm.
This involves refactoring the tarball assembly scripts as well as the
GitHub asset upload script. Tarball assembly now manages everything
internally and does platform-specific bits where necessary. The upload
script is restructured to run in parallel (in theory) and hopefully
catches various errors while trying not to stomp over everyone else's
work. The main trickiness here is handling `dev`, which is less critical
for correctness than the tags themselves.
As a small build-wise tweak, the QEMU build for cross-compiled builders
is now cached instead of unconditionally rebuilt each time, shaving a
minute or two off build time.
We used to allow at most one instantiation per compilation, but there is no
fundamental reason why that should be the case. Allowing multiple instantiations
per compilation allows us to, for example, benchmark repeated instantiation
within Wasmtime's pooling allocator.
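With that restriction lifted, the expensive compilation can be amortized
across many instantiations. A minimal sketch against the `wasmtime`
embedding API (compile once, instantiate many times, each with a fresh
store):

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn bench_instantiation(wat: &str, iters: usize) -> anyhow::Result<()> {
    let engine = Engine::default();
    // Compile once...
    let module = Module::new(&engine, wat)?;
    for _ in 0..iters {
        // ...then instantiate repeatedly, using a fresh store each time so
        // no previous instance needs to be kept alive.
        let mut store = Store::new(&engine, ());
        let _instance = Instance::new(&mut store, &module, &[])?;
    }
    Ok(())
}
```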
This additionally switches to using host functions for WASI and for
`bench_{start,end}` rather than defining them on the linker; this way we
can use a new store for every instantiation and don't need to keep other
instances alive when instantiating new instances.
Finally, we switch all timing to be done through callback functions, rather than
having the bench API caller implicitly start/end timers around bench API
calls. This allows us to more precisely measure phases and exclude things like
file I/O performed when creating a WASI context.
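The callback-driven timing might be shaped roughly like this (all names
here are hypothetical, not the actual bench API):

```rust
// Hypothetical shape (illustrative names only): the harness invokes
// embedder-provided callbacks around each phase, so setup work such as
// WASI context creation stays outside the measured region.
pub enum Phase {
    Compilation,
    Instantiation,
    Execution,
}

pub struct BenchCallbacks {
    pub start: Box<dyn FnMut(Phase)>,
    pub end: Box<dyn FnMut(Phase)>,
}
```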
Also move the gh-pages push step from the `publish` phase to just this
single doc builder.
The motivation for this is to eventually remove the `publish` step since
it interacts badly with GitHub's scheduling of actions. This is
hopefully the first step towards that by removing the doc publish part
of the phase.
Instead of inheriting stdio, pass in explicit file paths that are opened for
reading (stdin) or writing (stderr/stdout). This will allow sightglass to assert
that benchmarks produce the expected output.
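A sketch of the resulting setup using wasi-common's cap-std sync
implementation (exact builder signatures vary across wasi-common
versions, and the file paths here are illustrative):

```rust
use std::fs;
use wasi_cap_std_sync::{file::File, WasiCtxBuilder};

fn build_ctx() -> anyhow::Result<wasi_common::WasiCtx> {
    // Open explicit files instead of inheriting the host's stdio...
    let stdin = fs::File::open("bench-input.txt")?;
    let stdout = fs::File::create("bench-stdout.log")?;
    let stderr = fs::File::create("bench-stderr.log")?;
    // ...and hand them to the guest as its stdio streams so the output
    // can be inspected and asserted on afterwards.
    Ok(WasiCtxBuilder::new()
        .stdin(Box::new(File::from_cap_std(cap_std::fs::File::from_std(stdin))))
        .stdout(Box::new(File::from_cap_std(cap_std::fs::File::from_std(stdout))))
        .stderr(Box::new(File::from_cap_std(cap_std::fs::File::from_std(stderr))))
        .build())
}
```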
* First remove `fail-fast: false` annotations to fail faster. If desired
  this could always be added back in an on-off fashion for PRs.
* Next use the new `concurrency` feature to try to cancel previous
builds, ideally meaning that if a branch is pushed to multiple times
it only runs CI once.
This is sometimes useful when performing analyses on the generated
machine code: for example, some kinds of code verifiers will want to do
a control-flow analysis, and it is much easier to do this if one does
not have to recover the CFG from the machine code (doing so requires
heavyweight analysis when indirect branches are involved). If one trusts
the control-flow lowering and only needs to verify other properties of
the code, this can be very useful.
Looks like GitHub Actions takes 10m+ to upload the documentation and
nearly 10 minutes to download it. I suspect this has to do with the
creation of thousands of files, and using `tar` here is likely much
faster. Let's test it out!
* wasmtime-wasi: re-exporting this WasiCtxBuilder was shadowing the right one
wasi-common's WasiCtxBuilder is really only useful for wasi_cap_std_sync
and wasi_tokio to implement their own builders on top of.
This re-export of wasi-common's is (1) not useful and (2) shadows the
re-export of the right one in `sync::*`.
* wasi-common: eliminate WasiCtxBuilder, make the builder methods on WasiCtx instead
* delete wasi-common::WasiCtxBuilder altogether; just put those methods
directly on `&mut WasiCtx` (see the sketch after this list).
As a bonus, the sync and tokio WasiCtxBuilder::build functions
are no longer fallible!
* bench fixes
* more test fixes
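A sketch of the resulting shape; `push_arg`/`push_env` are wasi-common
`WasiCtx` methods, though exact signatures may differ by version:

```rust
use wasi_cap_std_sync::WasiCtxBuilder;

fn configure() -> anyhow::Result<wasi_common::WasiCtx> {
    // `build()` is now infallible; configuration happens directly on
    // `&mut WasiCtx` via methods like `push_arg`/`push_env`.
    let mut ctx = WasiCtxBuilder::new().build();
    ctx.push_arg("demo.wasm")?;
    ctx.push_env("RUST_LOG", "debug")?;
    Ok(ctx)
}
```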
Fixes a few issues that have been cropping up:
* Update `rustup` on Windows to latest to skip over the 1.24.1 installed
on GitHub Actions which can fail to install.
* Remove the no-longer-needed `define-llvm-env` action
* Install generic llvm/lldb packages instead of specific ones whose
  versions may migrate over time.
When AVX512VL and AVX512F are available, use a single instruction
(`VCVTUDQ2PS`) instead of a 9-instruction sequence. This optimization is
a port from the legacy x86 backend.
Previously, the x64 backend's ABI code would generate a sign-extending
load when loading a less-than-64-bit integer from a spillslot. This is
incorrect: e.g., for an i32 with the sign bit set (an unsigned value of
0x80000000 or above), this would result in all of the high bits being
set.
This interacts poorly with another optimization. Normally, the invariant
is that the high bits of a register holding a value of a certain type,
beyond that type's bits, are undefined. However, as an optimization, we
recognize and use the fact that on x86-64, 32-bit instructions zero the
upper 32 bits. This allows us to elide a 32-to-64-bit zero-extend op
(turning it into just a move, which can then sometimes disappear
entirely due to register coalescing).
If a spill and reload happen between the production of a 32-bit value
from an instruction known to zero the upper bits and its use, then we
will rely on zero upper bits that might actually be set by a
sign-extend. This will result in incorrect execution.
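Concretely, a tiny Rust illustration of the bit patterns involved:

```rust
fn main() {
    // A 32-bit value with the sign bit set:
    let v: u32 = 0x8000_0000;
    // 32-bit x86-64 instructions zero the upper 32 bits of the destination,
    // so the elided zero-extend assumes the full register holds:
    assert_eq!(v as u64, 0x0000_0000_8000_0000);
    // A sign-extending reload from the spillslot instead materializes:
    assert_eq!(v as i32 as i64 as u64, 0xffff_ffff_8000_0000);
}
```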
As a fix, we stick to a simple invariant: we always spill and reload a
full 64 bits when handling integer registers on x64. This ensures that
no bits are mangled.
This change implements `vselect` using SSE4.1's `BLENDVPS`, `BLENDVPD`,
and `PBLENDVB`. `vselect` is a lane-selecting instruction that is used
by
[simple_preopt.rs](fa1faf5d22/cranelift/codegen/src/simple_preopt.rs (L947-L999))
to lower `bitselect` to a single x86 instruction when the condition mask
is known to be boolean (all 1s or 0s, e.g., from a conversion). This is
better than `bitselect` in general, which lowers to 4-5 instructions.
The old backend had the `vselect` lowering; this simply introduces it to
the new backend.
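A scalar Rust sketch of the two semantics, showing why the
single-instruction lowering is only valid for boolean masks:

```rust
// `bitselect` chooses per *bit* of the condition mask:
fn bitselect(c: u64, x: u64, y: u64) -> u64 {
    (c & x) | (!c & y)
}

// `vselect` (like the BLENDV-style instructions) chooses per *lane*; a
// single lane is modeled here, with its mask either all-ones or all-zeros:
fn vselect_lane(c: u64, x: u64, y: u64) -> u64 {
    if c != 0 { x } else { y }
}

fn main() {
    // For boolean masks the two agree, so the cheaper lowering is sound:
    assert_eq!(bitselect(u64::MAX, 1, 2), vselect_lane(u64::MAX, 1, 2));
    assert_eq!(bitselect(0, 1, 2), vselect_lane(0, 1, 2));
}
```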
Previously the inclusion of the `criterion` crate had brought in a
transitive dependency on `cast`, which used old versions of several
libraries. Now that https://github.com/japaric/cast.rs/pull/26 is merged
and a new version published, we can update `cast` and remove the
cargo-deny rules for the duplicated, older versions.
Since `uadd_sat`, `sadd_sat`, `usub_sat`, and `ssub_sat` are now only
available to vector types, this removes the lowering code for the
scalar versions of these instructions in the arm32 and aarch64 backends.
This adds the machinery to encode the VPMULLQ instruction which is
available in AVX512VL and AVX512DQ. When these feature sets are
available, we use this instruction instead of a lengthy 12-instruction
sequence.
Since the lowering of `imul` complicated the other ALU operations it
was matched with, and since future commits will alter the multiplication
lowering further, this change moves the `imul` lowering to its own match
block.