Stabilize bench_black_box
This PR stabilize `feature(bench_black_box)`.
```rust
pub fn black_box<T>(dummy: T) -> T;
```
The FCP was completed in https://github.com/rust-lang/rust/issues/64102.
`@rustbot` label +T-libs-api -T-libs
On later stages, the feature is already stable.
Result of running:
rg -l "feature.let_else" compiler/ src/librustdoc/ library/ | xargs sed -s -i "s#\\[feature.let_else#\\[cfg_attr\\(bootstrap, feature\\(let_else\\)#"
Fix cloning from a BitSet with a different domain size
The previous implementation incorrectly assumed that the
number of words in a bit set is equal to the domain size.
The new implementation delegates to `Vec::clone_from` which
is specialized for `Copy` elements.
Fixes#99006.
The previous implementation incorrectly assumed that the
number of words in a bit set is equal to the domain size.
The new implementation delegates to `Vec::clone_from` which
is specialized for `Copy` elements.
Remove dereferencing of Box from codegen
Through #94043, #94414, #94873, and #95328, I've been fixing issues caused by Box being treated like a pointer when it is not a pointer. However, these PRs just introduced special cases for Box. This PR removes those special cases and instead transforms a deref of Box into a deref of the pointer it contains.
Hopefully, this is the end of the Box<T, A> ICEs.
`BitSet` related perf improvements
This commit makes two changes:
1. Changes `MaybeLiveLocals` to use `ChunkedBitSet`
2. Overrides the `fold` method for the iterator for `ChunkedBitSet`
I have local benchmarks verifying that each of these changes individually yield significant perf improvements to #96451 . I'm hoping this will be true outside of that context too. If that is not the case, I'll try to gate things on where they help as needed
r? `@nnethercote` who I believe was working on closely related things, cc `@tmiasko` because of the destprop pr
optimize `superset` method of `IntervalSet`
Given that intervals in the `IntervalSet` are sorted and strictly separated( it means the `end` of the previous interval will not be equal to the `start` of the next interval), we can reduce the complexity of the `superset` method from O(NMlogN) to O(2N) (N is the number of intervals and M is the length of each interval)
There are two impls of the `Encoder` trait: `opaque::Encoder` and
`opaque::FileEncoder`. The former encodes into memory and is infallible, the
latter writes to file and is fallible.
Currently, standard `Result`/`?`/`unwrap` error handling is used, but this is a
bit verbose and has non-trivial cost, which is annoying given how rare failures
are (especially in the infallible `opaque::Encoder` case).
This commit changes how `Encoder` fallibility is handled. All the `emit_*`
methods are now infallible. `opaque::Encoder` requires no great changes for
this. `opaque::FileEncoder` now implements a delayed error handling strategy.
If a failure occurs, it records this via the `res` field, and all subsequent
encoding operations are skipped if `res` indicates an error has occurred. Once
encoding is complete, the new `finish` method is called, which returns a
`Result`. In other words, there is now a single `Result`-producing method
instead of many of them.
This has very little effect on how any file errors are reported if
`opaque::FileEncoder` has any failures.
Much of this commit is boring mechanical changes, removing `Result` return
values and `?` or `unwrap` from expressions. The more interesting parts are as
follows.
- serialize.rs: The `Encoder` trait gains an `Ok` associated type. The
`into_inner` method is changed into `finish`, which returns
`Result<Vec<u8>, !>`.
- opaque.rs: The `FileEncoder` adopts the delayed error handling
strategy. Its `Ok` type is a `usize`, returning the number of bytes
written, replacing previous uses of `FileEncoder::position`.
- Various methods that take an encoder now consume it, rather than being
passed a mutable reference, e.g. `serialize_query_result_cache`.
Cache more queries on disk
One of the principles of incremental compilation is to allow saving results on disk to avoid recomputing them.
This PR investigates persisting a lot of queries whose result are to be saved into metadata.
Some of the queries are cheap reads from HIR, but we may also want to get rid of these reads for incremental lowering.
The `macro_rules!` implementation was becomng excessively complicated,
and difficult to modify. The new proc macro implementation should make
it much easier to add new features (e.g. skipping certain `#[derive]`s)
This reduces peak memory usage significantly for some programs with very
large functions, such as:
- `keccak`, `unicode_normalization`, and `match-stress-enum`, from
the `rustc-perf` benchmark suite;
- `http-0.2.6` from crates.io.
The new type is used in the analyses where the bitsets can get huge
(e.g. 10s of thousands of bits): `MaybeInitializedPlaces`,
`MaybeUninitializedPlaces`, and `EverInitializedPlaces`.
Some refactoring was required in `rustc_mir_dataflow`. All existing
analysis domains are either `BitSet` or a trivial wrapper around
`BitSet`, and access in a few places is done via `Borrow<BitSet>` or
`BorrowMut<BitSet>`. Now that some of these domains are `ClusterBitSet`,
that no longer works. So this commit replaces the `Borrow`/`BorrowMut`
usage with a new trait `BitSetExt` containing the needed bitset
operations. The impls just forward these to the underlying bitset type.
This required fiddling with trait bounds in a few places.
The commit also:
- Moves `static_assert_size` from `rustc_data_structures` to
`rustc_index` so it can be used in the latter; the former now
re-exports it so existing users are unaffected.
- Factors out some common "clear excess bits in the final word"
functionality in `bit_set.rs`.
- Uses `fill` in a few places instead of loops.
`Decoder` has two impls:
- opaque: this impl is already partly infallible, i.e. in some places it
currently panics on failure (e.g. if the input is too short, or on a
bad `Result` discriminant), and in some places it returns an error
(e.g. on a bad `Option` discriminant). The number of places where
either happens is surprisingly small, just because the binary
representation has very little redundancy and a lot of input reading
can occur even on malformed data.
- json: this impl is fully fallible, but it's only used (a) for the
`.rlink` file production, and there's a `FIXME` comment suggesting it
should change to a binary format, and (b) in a few tests in
non-fundamental ways. Indeed #85993 is open to remove it entirely.
And the top-level places in the compiler that call into decoding just
abort on error anyway. So the fallibility is providing little value, and
getting rid of it leads to some non-trivial performance improvements.
Much of this commit is pretty boring and mechanical. Some notes about
a few interesting parts:
- The commit removes `Decoder::{Error,error}`.
- `InternIteratorElement::intern_with`: the impl for `T` now has the same
optimization for small counts that the impl for `Result<T, E>` has,
because it's now much hotter.
- Decodable impls for SmallVec, LinkedList, VecDeque now all use
`collect`, which is nice; the one for `Vec` uses unsafe code, because
that gave better perf on some benchmarks.
This is a compact, fast storage for variable-sized sets, typically consisting of
larger ranges. It is less efficient than a bitset if ranges are both small and
the domain size is small, but will still perform acceptably. With enormous
domain sizes and large ranges, the interval set performs much better, as it can
be much more densely packed in memory than the uncompressed bit set alternative.
Optimize live point computation
This refactors the live-point computation to lower per-MIR-instruction costs by operating on a largely per-block level. This doesn't fundamentally change the number of operations necessary, but it greatly improves the practical performance by aggregating bit manipulation into ranges rather than single-bit; this scales much better with larger blocks.
On the benchmark provided in #90445, with 100,000 array elements, walltime for a check build is improved from 143 seconds to 15.
I consider the tiny losses here acceptable given the many small wins on real world benchmarks and large wins on stress tests. The new code scales much better, but on some subset of inputs the slightly higher constant overheads decrease performance somewhat. Overall though, this is expected to be a big win for pathological cases (as illustrated by the test case motivating this work) and largely not material for non-pathological cases. I consider the new code somewhat easier to follow, too.
This is just replicating the previous algorithm, but taking advantage of the
bitset structures to optimize into tighter and better optimized loops.
Particularly advantageous on enormous MIR blocks, which are relatively rare in
practice.
The PR had some unforseen perf regressions that are not as easy to find.
Revert the PR for now.
This reverts commit 6ae8912a3e, reversing
changes made to 86d6d2b738.
Fix inherent impl overlap check.
The current implementation of the overlap check was slightly buggy, and unified the wrong connected component in the `ids.len() <= 1` case. This became visible in another PR which changed the iteration order of items.
r? ``@matthewjasper`` since you reviewed the other PR.