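//! Benchmarks for Wasmtime module instantiation (see #3775).
//!
//! Both the "sequential" and "parallel" benchmarks measure the time to
//! instantiate a single module per iteration, so their numbers are directly
//! comparable. An `InstancePre<T>` is used rather than a `Linker<T>` to
//! front-load type-checking and avoid repeating it on each instantiation,
//! which is currently the fastest path to instantiate. The parallel
//! benchmark spawns background threads that instantiate in a loop for the
//! entire measurement, guaranteeing that N units of work are happening at
//! once rather than relying on a work-stealing thread pool to keep threads
//! busy.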
use anyhow::Result;
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use once_cell::unsync::Lazy;
use std::path::Path;
use std::process::Command;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering::SeqCst};
use std::sync::Arc;
use std::thread;
use wasmtime::*;
use wasmtime_wasi::{sync::WasiCtxBuilder, WasiCtx};
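/// Creates a new `Store` with a default WASI context.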
fn store(engine: &Engine) -> Store<WasiCtx> {
let wasi = WasiCtxBuilder::new().build();
Store::new(engine, wasi)
}
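/// Instantiates the pre-instantiated module in a fresh store; this is the
/// unit of work each benchmark iteration measures.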
fn instantiate(pre: &InstancePre<WasiCtx>, engine: &Engine) -> Result<()> {
let mut store = store(engine);
let _instance = pre.instantiate(&mut store)?;
Ok(())
}
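/// Maps an allocation strategy to the name used for its benchmark.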
fn benchmark_name(strategy: &InstanceAllocationStrategy) -> &'static str {
match strategy {
InstanceAllocationStrategy::OnDemand => "default",
InstanceAllocationStrategy::Pooling { .. } => "pooling",
}
}
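/// Benchmarks single-threaded instantiation time for each allocation
/// strategy.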
fn bench_sequential(c: &mut Criterion, path: &Path) {
let mut group = c.benchmark_group("sequential");
for strategy in strategies() {
let id = BenchmarkId::new(
benchmark_name(&strategy),
path.file_name().unwrap().to_str().unwrap(),
);
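        // Engine creation and module compilation are deferred with `Lazy` so
        // that a module is only compiled if this benchmark is actually
        // selected to run.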
let state = Lazy::new(|| {
let mut config = Config::default();
config.allocation_strategy(strategy.clone());
let engine = Engine::new(&config).expect("failed to create engine");
let module = Module::from_file(&engine, path).unwrap_or_else(|e| {
panic!("failed to load benchmark `{}`: {:?}", path.display(), e)
});
let mut linker = Linker::new(&engine);
wasmtime_wasi::add_to_linker(&mut linker, |cx| cx).unwrap();
let pre = linker
.instantiate_pre(&mut store(&engine), &module)
.expect("failed to pre-instantiate");
(engine, pre)
});
group.bench_function(id, |b| {
let (engine, pre) = &*state;
b.iter(|| {
instantiate(&pre, &engine).expect("failed to instantiate module");
});
});
}
group.finish();
}
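/// Benchmarks instantiation time while other threads are simultaneously
/// instantiating the same module, at varying levels of parallelism.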
fn bench_parallel(c: &mut Criterion, path: &Path) {
let mut group = c.benchmark_group("parallel");
for strategy in strategies() {
let state = Lazy::new(|| {
let mut config = Config::default();
config.allocation_strategy(strategy.clone());
let engine = Engine::new(&config).expect("failed to create engine");
let module =
Module::from_file(&engine, path).expect("failed to load WASI example module");
let mut linker = Linker::new(&engine);
wasmtime_wasi::add_to_linker(&mut linker, |cx| cx).unwrap();
let pre = Arc::new(
linker
.instantiate_pre(&mut store(&engine), &module)
.expect("failed to pre-instantiate"),
);
(engine, pre)
});
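        // Measure contention at increasing thread counts, up to the number of
        // physical cores (capped at 16).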
for threads in 1..=num_cpus::get_physical().min(16) {
let name = format!(
"{}: with {} thread{}",
path.file_name().unwrap().to_str().unwrap(),
threads,
if threads == 1 { "" } else { "s" }
);
let id = BenchmarkId::new(benchmark_name(&strategy), name);
group.bench_function(id, |b| {
let (engine, pre) = &*state;
// Spin up N-1 threads doing background instantiations to
// simulate concurrent instantiations.
let done = Arc::new(AtomicBool::new(false));
let count = Arc::new(AtomicUsize::new(0));
let workers = (0..threads - 1)
.map(|_| {
let pre = pre.clone();
let done = done.clone();
let engine = engine.clone();
let count = count.clone();
thread::spawn(move || {
count.fetch_add(1, SeqCst);
while !done.load(SeqCst) {
instantiate(&pre, &engine).unwrap();
}
})
})
.collect::<Vec<_>>();
// Wait for our workers to all get started and have
// instantiated their first module, at which point they'll
// all be spinning.
while count.load(SeqCst) != threads - 1 {
thread::yield_now();
}
// Now that our background work is configured we can
// benchmark the amount of time it takes to instantiate this
// module.
b.iter(|| {
instantiate(&pre, &engine).expect("failed to instantiate module");
});
// Shut down this benchmark iteration by signalling to
// worker threads they should exit and then wait for them to
// have reached the exit point.
done.store(true, SeqCst);
for t in workers {
t.join().unwrap();
}
});
}
}
group.finish();
}
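/// Benchmarks deserializing a previously serialized module from disk.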
fn bench_deserialize_module(c: &mut Criterion, path: &Path) {
let mut group = c.benchmark_group("deserialize");
let name = path.file_name().unwrap().to_str().unwrap();
let tmpfile = tempfile::NamedTempFile::new().unwrap();
let state = Lazy::new(|| {
let engine = Engine::default();
let module = Module::from_file(&engine, path).expect("failed to load WASI example module");
std::fs::write(tmpfile.path(), module.serialize().unwrap()).unwrap();
(engine, tmpfile.path())
});
group.bench_function(BenchmarkId::new("deserialize", name), |b| {
let (engine, path) = &*state;
b.iter(|| unsafe {
Module::deserialize_file(&engine, path).unwrap();
});
});
group.finish();
}
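/// Compiles the `example-wasi-wasm` crate to `wasm32-wasi` and copies the
/// resulting module into the benchmark directory.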
fn build_wasi_example() {
println!("Building WASI example module...");
if !Command::new("cargo")
.args(&[
"build",
"--release",
"-p",
"example-wasi-wasm",
"--target",
"wasm32-wasi",
])
.spawn()
.expect("failed to run cargo to build WASI example")
.wait()
.expect("failed to wait for cargo to build")
.success()
{
panic!("failed to build WASI example for target `wasm32-wasi`");
}
std::fs::copy(
"target/wasm32-wasi/release/wasi.wasm",
"benches/instantiation/wasi.wasm",
)
.expect("failed to copy WASI example module");
}
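/// Runs all benchmarks over every module found in `benches/instantiation`.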
fn bench_instantiation(c: &mut Criterion) {
build_wasi_example();
for file in std::fs::read_dir("benches/instantiation").unwrap() {
let path = file.unwrap().path();
bench_sequential(c, &path);
bench_parallel(c, &path);
bench_deserialize_module(c, &path);
}
}
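/// Returns each allocation strategy to benchmark: on-demand allocation and
/// the pooling allocator.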
fn strategies() -> impl Iterator<Item = InstanceAllocationStrategy> {
[
InstanceAllocationStrategy::OnDemand,
InstanceAllocationStrategy::Pooling {
strategy: Default::default(),
instance_limits: InstanceLimits {
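                // Allow a generous maximum memory so that large modules
                // (e.g. rustpython.wasm) fit within the pool.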
memory_pages: 10_000,
..Default::default()
},
},
]
.into_iter()
}
criterion_group!(benches, bench_instantiation);
criterion_main!(benches);