Optimize empty provenance range checks.
Currently it collects the pointers in the range and then checks whether the result is empty; this can be done faster by combining the two steps.
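As an illustration of the idea, using a plain `BTreeMap` as a stand-in for the interpreter's provenance map (the real type and method names differ):
```rust
use std::collections::BTreeMap;

// Hypothetical stand-in: provenance entries keyed by offset within an allocation.
fn range_has_provenance(map: &BTreeMap<u64, u32>, start: u64, end: u64) -> bool {
    // Before: materialize the pointers in the range, then check emptiness.
    //     let ptrs: Vec<_> = map.range(start..end).collect();
    //     !ptrs.is_empty()
    // After: ask directly whether the range contains anything; this stops at
    // the first entry instead of collecting them all.
    map.range(start..end).next().is_some()
}
```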
r? `@oli-obk`
Fix UB in ThinVec::flat_map_in_place
`thin_vec.as_ptr()` goes through the `Deref` impl of `ThinVec`, which will not allow access to any memory because we called `set_len(0)` first.
Found in the process of investigating https://github.com/rust-lang/rust/issues/135870.
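A distilled sketch of the hazard, using `Vec` and an explicit slice deref to mimic how `ThinVec::as_ptr` goes through `Deref` (illustrative only, not the actual `flat_map_in_place` code):
```rust
unsafe fn bad_pattern(v: &mut Vec<u32>) {
    let old_len = v.len();
    v.set_len(0);
    // This pointer is derived from a zero-length `&[u32]`, so the aliasing
    // model grants it no access to the old elements...
    let ptr = (&v[..]).as_ptr();
    for i in 0..old_len {
        // ...which makes this read UB even though the memory is still allocated.
        let _ = ptr.add(i).read();
    }
    // The fix is to take the raw pointer before shrinking the length (or via
    // a path that doesn't go through `Deref`).
}
```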
Change interners to start preallocated with an increased capacity
Inspired by https://github.com/rust-lang/rust/issues/137005.
Added a `with_capacity` function to `InternedSet`. Changed the `CtxtInterners` to start with `InternedSets` preallocated with a capacity.
This *does* increase memory usage very slightly (by ~1 MB at the start), although that increase quickly disappears for larger crates (since they require such capacity anyway).
A local perf run indicates this improves compile times for small crates (like `ripgrep`), without a negative effect on larger ones.
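Roughly the shape of the change, with a plain `HashSet` standing in for the real sharded interner table and an illustrative capacity:
```rust
use std::collections::HashSet;
use std::hash::Hash;

struct InternedSet<T> {
    set: HashSet<T>, // the real InternedSet wraps a sharded hash table
}

impl<T: Eq + Hash> InternedSet<T> {
    fn with_capacity(capacity: usize) -> Self {
        InternedSet { set: HashSet::with_capacity(capacity) }
    }
}

// CtxtInterners would then build each set with a starting capacity, e.g.
// InternedSet::with_capacity(16 * 1024) -- the exact numbers here are made up.
```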
Allow `IndexSlice` to be indexed by ranges.
This comes with some annoyances, as the index type can no longer be inferred from indexing expressions. The biggest offender for this is `IndexVec::from_fn_n(|idx| ..., n)`, where the index type won't be inferred from the call site or any index expressions inside the closure.
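For illustration (this fragment uses rustc-internal types, and `FieldIdx` is just an example index type), the closure parameter now needs an explicit annotation when nothing else pins down the index type:
```rust
// Before this change, `field`'s type could be inferred from index expressions
// inside the closure; now it has to be spelled out.
let offsets = IndexVec::from_fn_n(|field: FieldIdx| u64::from(field.as_u32()), num_fields);
```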
My main use case for this is mapping a `Place` to `Range<Idx>` for value tracking where the range represents all the values the place contains.
This adds panicking Hash impls for several resolver types that don't
actually satisfy the guarantees required by `Interned`. It's not obvious
to me that rustc_resolve actually upholds those guarantees, but fixing
that seems pretty hard (the structures have at minimum some interior
mutability, so they're not really recursively hashable in place...).
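The impls all have the same panicking shape; here is an illustrative one with a made-up stand-in type:
```rust
use std::hash::{Hash, Hasher};

// Stand-in for one of the resolver types in question.
struct ResolverData {
    // ...fields, some with interior mutability...
}

impl Hash for ResolverData {
    fn hash<H: Hasher>(&self, _: &mut H) {
        // `Interned` needs a Hash impl consistent with the interned value's
        // identity; this type can't promise that, so hashing it is a bug.
        unreachable!("ResolverData should never be hashed")
    }
}
```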
tree-wide: parallel: Fully removed all `Lrc`, replaced with `Arc`
This is a continuation of https://github.com/rust-lang/rust/pull/132282.
I'm pretty sure I did everything right. In particular, I searched all occurrences of `Lrc` in submodules and made sure that they don't need replacement.
There are other possibilities, though.
We can define `enum Lrc<T> { Rc(Rc<T>), Arc(Arc<T>) }`. Or we can make `Lrc` a union and on every clone we can read from special thread-local variable. Or we can add a generic parameter to `Lrc` and, yes, this parameter will be everywhere across all codebase.
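For concreteness, the enum alternative would look roughly like this (a sketch, not what this PR does):
```rust
use std::ops::Deref;
use std::rc::Rc;
use std::sync::Arc;

// Runtime-dispatched smart pointer: either an Rc or an Arc.
enum Lrc<T> {
    Rc(Rc<T>),
    Arc(Arc<T>),
}

impl<T> Deref for Lrc<T> {
    type Target = T;
    fn deref(&self) -> &T {
        match self {
            Lrc::Rc(rc) => rc,
            Lrc::Arc(arc) => arc,
        }
    }
}
```
The obvious costs are a branch on every access, and such an `Lrc` can never be `Send`/`Sync` while an `Rc` variant is possible.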
So, if you think we should take some alternative approach, don't merge this PR. But if the decision is to stick with `Arc`, then please merge.
cc "Parallel Rustc Front-end" ( https://github.com/rust-lang/rust/issues/113349 )
r? SparrowLii
`@rustbot` label WG-compiler-parallel
Upgrade elsa to the newest version.
This was locked to 1.7.1 because of an error in the elsa release process that has since been fixed. Upgrading has the advantage that the elsa code runs properly in miri, at least with tree borrows.
This was spawned from https://github.com/rust-lang/rust/issues/135870#issuecomment-2612470540
When the word "cache" appears in the context of the query system, it often
isn't obvious whether that is referring to the in-memory query cache or the
on-disk incremental cache.
For these types, we can assure the reader that they are for in-memory caching.
It's a function that prints numbers with underscores inserted for
readability (e.g. "1_234_567"), used by `-Zmeta-stats` and
`-Zinput-stats`. It's the only thing in `rustc_middle::util::common`,
which is a bizarre location for it.
This commit:
- moves it to `rustc_data_structures`, a more logical crate for it;
- puts it in a module `thousands`, like the similar crates.io crate;
- renames it `format_with_underscores`, which is a clearer name;
- rewrites it to be more concise;
- slightly improves the testing.
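For reference, a minimal sketch of such a function (not the exact rustc implementation):
```rust
fn format_with_underscores(n: usize) -> String {
    let digits = n.to_string();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, c) in digits.chars().enumerate() {
        // Insert a separator every three digits, counting from the right.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push('_');
        }
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(format_with_underscores(1_234_567), "1_234_567");
    assert_eq!(format_with_underscores(42), "42");
}
```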
Flip the `rustc-rayon`/`indexmap` dependency order
[`rustc-rayon v0.5.1`](https://github.com/rust-lang/rustc-rayon/pull/14) added `indexmap` implementations that will allow `indexmap` to drop its own "internal-only" implementations.
(This is separate from `indexmap`'s implementation for normal `rayon`.)
This assumes that the set of valid node IDs is exactly `0..num_nodes`.
In practice, we have a lot of graph-algorithm code that already assumes that
nodes are densely numbered, by using `num_nodes` to allocate per-node indexed
data structures.
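Illustrative pattern (not taken from the diff): with node IDs guaranteed to be exactly `0..num_nodes`, per-node state can be a flat `Vec` indexed by ID, with no hashing or sparse maps involved:
```rust
fn reachable_from(start: usize, num_nodes: usize, successors: &[Vec<usize>]) -> Vec<bool> {
    let mut seen = vec![false; num_nodes]; // one slot per node ID
    let mut stack = vec![start];
    seen[start] = true;
    while let Some(node) = stack.pop() {
        for &succ in &successors[node] {
            if !seen[succ] {
                seen[succ] = true;
                stack.push(succ);
            }
        }
    }
    seen
}
```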
[cfg_match] Adjust syntax
A year has passed since the creation of #115585 and the feature, as expected, is not moving forward. Let's change that.
This PR proposes changing the arm's syntax from `cfg(SOME_CONDITION) => { ... }` to `SOME_CONDITION => { ... }`.
```rust
cfg_match! {
    unix => {
        fn foo() { /* unix specific functionality */ }
    }
    target_pointer_width = "32" => {
        fn foo() { /* non-unix, 32-bit functionality */ }
    }
    _ => {
        fn foo() { /* fallback implementation */ }
    }
}
```
Why? Because after several manual migrations in https://github.com/rust-lang/rust/pull/116342 it became clear, at least to me, that `cfg` prefixes are unnecessary, verbose and redundant.
Again, everything is just a proposal to move things forward. If the shown syntax isn't ideal, feel free to close this PR or suggest other alternatives.
Improve VecCache under parallel frontend
This replaces the single Vec allocation with a series of progressively larger buckets. With the cfg for parallel enabled but with -Zthreads=1, this looks like a slight regression in i-count and cycle counts (~1%).
With the parallel frontend at -Zthreads=4, this is an improvement (-5% wall-time, from 5.788 to 5.4688 seconds on libcore) over our current Lock-based approach, likely due to reduced bouncing of the cache line holding the lock. At -Zthreads=32 it's a huge improvement (-46%: 8.829 -> 4.7319 seconds).
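To illustrate the shape of the idea, here is a simplified sketch using only `std` (names, bucket sizes, and synchronization via `OnceLock` are all illustrative; the real VecCache is lower-level): each bucket is twice the size of the previous one and is allocated lazily, so existing entries never move and lookups need no global lock.
```rust
use std::sync::OnceLock;

struct BucketCache<T> {
    buckets: Box<[OnceLock<Box<[OnceLock<T>]>>]>,
}

const FIRST_BUCKET: usize = 32; // illustrative initial bucket size

impl<T> BucketCache<T> {
    fn new() -> Self {
        // Enough buckets to cover any index that fits in a usize.
        let n = usize::BITS as usize - FIRST_BUCKET.trailing_zeros() as usize;
        BucketCache { buckets: (0..n).map(|_| OnceLock::new()).collect() }
    }

    // Map a flat index to (bucket, offset within bucket, bucket capacity);
    // bucket `b` holds `FIRST_BUCKET << b` entries.
    fn locate(index: usize) -> (usize, usize, usize) {
        let b = (index / FIRST_BUCKET + 1).ilog2() as usize;
        let start = FIRST_BUCKET * ((1 << b) - 1);
        (b, index - start, FIRST_BUCKET << b)
    }

    fn get(&self, index: usize) -> Option<&T> {
        let (b, off, _) = Self::locate(index);
        self.buckets[b].get()?.get(off)?.get()
    }

    fn insert(&self, index: usize, value: T) {
        let (b, off, cap) = Self::locate(index);
        let bucket = self.buckets[b]
            .get_or_init(|| (0..cap).map(|_| OnceLock::new()).collect());
        let _ = bucket[off].set(value); // first writer wins; later writes are no-ops
    }
}
```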
try-job: i686-gnu-nopt
try-job: dist-x86_64-linux
Since its inception a long time ago, the parallel compiler and its cfgs
have been a maintenance burden. This was a necessary evil to allow
iteration without degrading performance because of synchronization
overhead.
But that time is over. Thanks to the amazing work by the parallel
working group (and the dyn sync crimes), the parallel compiler has
been fast enough to be shipped by default in nightly for quite a while
now.
Stable and beta have still been on the serial compiler, because they
can't use `-Zthreads` anyway.
But this is quite suboptimal:
- the maintenance burden still sucks
- we're not testing the serial compiler in nightly
For these reasons, it's time to end it. The serial compiler has
served us well in the years since it was split from the parallel one,
but it's over now.
Let the knight slay one head of the two-headed dragon!
Dominator-order information is only needed for coverage graphs, and is easy
enough to collect by just traversing the graph again.
This avoids wasted work when computing graph dominators for any other purpose.
stabilize Strict Provenance and Exposed Provenance APIs
Given that [RFC 3559](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html) has been accepted, t-lang has approved the concept of provenance to exist in the language. So I think it's time that we stabilize the strict provenance and exposed provenance APIs, and discuss provenance explicitly in the docs:
```rust
// core::ptr
pub const fn without_provenance<T>(addr: usize) -> *const T;
pub const fn dangling<T>() -> *const T;
pub const fn without_provenance_mut<T>(addr: usize) -> *mut T;
pub const fn dangling_mut<T>() -> *mut T;
pub fn with_exposed_provenance<T>(addr: usize) -> *const T;
pub fn with_exposed_provenance_mut<T>(addr: usize) -> *mut T;
impl<T: ?Sized> *const T {
    pub fn addr(self) -> usize;
    pub fn expose_provenance(self) -> usize;
    pub fn with_addr(self, addr: usize) -> Self;
    pub fn map_addr(self, f: impl FnOnce(usize) -> usize) -> Self;
}

impl<T: ?Sized> *mut T {
    pub fn addr(self) -> usize;
    pub fn expose_provenance(self) -> usize;
    pub fn with_addr(self, addr: usize) -> Self;
    pub fn map_addr(self, f: impl FnOnce(usize) -> usize) -> Self;
}

impl<T: ?Sized> NonNull<T> {
    pub fn addr(self) -> NonZero<usize>;
    pub fn with_addr(self, addr: NonZero<usize>) -> Self;
    pub fn map_addr(self, f: impl FnOnce(NonZero<usize>) -> NonZero<usize>) -> Self;
}
```
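A small usage sketch (not part of the PR) showing how the `map_addr`/`addr` pair is used in practice: a tag is stored in a pointer's alignment bits while the pointer's provenance is carried along explicitly.
```rust
use std::mem::align_of;

fn tag<T>(ptr: *mut T, bits: usize) -> *mut T {
    assert!(bits < align_of::<T>()); // the tag must fit in the alignment bits
    ptr.map_addr(|addr| addr | bits)
}

fn untag<T>(ptr: *mut T) -> (*mut T, usize) {
    let mask = align_of::<T>() - 1;
    (ptr.map_addr(|addr| addr & !mask), ptr.addr() & mask)
}

fn main() {
    let mut x = 0u64;
    let tagged = tag(&mut x as *mut u64, 1);
    let (p, t) = untag(tagged);
    assert_eq!(t, 1);
    unsafe { *p = 5 }; // provenance was preserved, so this store is fine
    assert_eq!(x, 5);
}
```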
I also did a pass over the docs to adjust them, because this is no longer an "experiment". The `ptr` docs now discuss the concept of provenance in general, and then they go into the two families of APIs for dealing with provenance: Strict Provenance and Exposed Provenance. I removed the discussion of how pointers also have an associated "address space" -- that is not actually tracked in the pointer value, it is tracked in the type, so IMO it just distracts from the core point of provenance. I also adjusted the docs for `with_exposed_provenance` to make it clear that we cannot guarantee much about this function; it's all best-effort.
There are two unstable lints associated with the strict_provenance feature gate; I moved them to a new [strict_provenance_lints](https://github.com/rust-lang/rust/issues/130351) feature since I didn't want this PR to have an even bigger FCP. ;)
`@rust-lang/opsem` Would be great to get some feedback on the docs here. :)
Nominating for `@rust-lang/libs-api`.
Part of https://github.com/rust-lang/rust/issues/95228.
[FCP comment](https://github.com/rust-lang/rust/pull/130350#issuecomment-2395114536)
Use `ThinVec` for PredicateObligation storage
~~I noticed while profiling clippy on a project that a large amount of time is being spent allocating `Vec`s for `PredicateObligation`, and the `Vec`s are often quite small. This is an attempt to optimise this by using SmallVec to avoid heap allocations for these common small Vecs.~~
This PR turns all the `Vec<PredicateObligation>` into a single type alias (while avoiding referring to `Vec` around it), then swaps the alias over to `ThinVec<PredicateObligation>` and fixes the fallout. It also contains an implementation of `ThinVec::extract_if`, copied from `Vec::extract_if` and currently being upstreamed to https://github.com/Gankra/thin-vec/pull/66.
This leads to a small (0.2-0.7%) performance gain in the latest perf run.
Like #130865 did for the standard library, we can use `&raw` in the
compiler now that stage0 supports it. Also like that PR, I did
not make any doc or test changes at this time.
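Illustrative before/after (not the actual diff):
```rust
fn main() {
    let mut place = 0u32;

    // Before: macro-based raw borrow.
    let old_style = std::ptr::addr_of_mut!(place);
    // After: the `&raw` syntax, now usable since stage0 supports it.
    let new_style = &raw mut place;

    assert_eq!(old_style, new_style);
}
```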
Use more slice patterns inside the compiler
Nothing super noteworthy. Just replacing the common 'fragile' pattern of "length check followed by indexing or unwrap" with slice patterns for legibility and 'robustness'.
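An illustrative rewrite in the spirit of the change (not taken from the PR):
```rust
fn first_two(args: &[String]) -> Option<(&String, &String)> {
    // Before: length check followed by indexing.
    //     if args.len() >= 2 { Some((&args[0], &args[1])) } else { None }

    // After: the slice pattern states the required shape directly.
    match args {
        [a, b, ..] => Some((a, b)),
        _ => None,
    }
}
```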
r? ghost
Optimize SipHash by reordering compress instructions
This PR optimizes hashing by changing the order of instructions in the sip.rs `compress` macro so the CPU can parallelize it better. The new order is taken directly from Fig 2.1 in [the SipHash paper](https://eprint.iacr.org/2012/351.pdf) (but with the xors moved, which makes it a little faster). I attempted to optimize it some more after this, but I think this might be the optimal instruction order. Note that this shouldn't change the behavior of hashing at all; only statements that don't depend on each other were reordered.
It appears that the current order hasn't changed since its [original implementation from 2012](fada46c421 (diff-b751133c229259d7099bbbc7835324e5504b91ab1aded9464f0c48cd22e5e420R35)), which doesn't look like it was written with data dependencies in mind.
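For illustration, here is one SipHash round written as a plain function in an interleaved order of the kind described (a sketch of the idea, not the exact `compress!` macro diff): within each half of the round, the (`v0`, `v1`) and (`v2`, `v3`) chains are independent, so the CPU can execute them in parallel.
```rust
#[inline]
fn sip_round(v0: &mut u64, v1: &mut u64, v2: &mut u64, v3: &mut u64) {
    // First half: the two add/rotate/xor chains don't depend on each other.
    *v0 = v0.wrapping_add(*v1);
    *v2 = v2.wrapping_add(*v3);
    *v1 = v1.rotate_left(13);
    *v3 = v3.rotate_left(16);
    *v1 ^= *v0;
    *v3 ^= *v2;
    *v0 = v0.rotate_left(32);
    // Second half: the state words are swapped across chains, again giving
    // two independent sequences.
    *v2 = v2.wrapping_add(*v1);
    *v0 = v0.wrapping_add(*v3);
    *v1 = v1.rotate_left(17);
    *v3 = v3.rotate_left(21);
    *v1 ^= *v2;
    *v3 ^= *v0;
    *v2 = v2.rotate_left(32);
}
```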
Running `./x bench library/core --stage 0 --test-args hash` before and after this change shows the following results:
Before:
```
benchmarks:
hash::sip::bench_bytes_4 7.20/iter +/- 0.70
hash::sip::bench_bytes_7 9.01/iter +/- 0.35
hash::sip::bench_bytes_8 8.12/iter +/- 0.10
hash::sip::bench_bytes_a_16 10.07/iter +/- 0.44
hash::sip::bench_bytes_b_32 13.46/iter +/- 0.71
hash::sip::bench_bytes_c_128 37.75/iter +/- 0.48
hash::sip::bench_long_str 121.18/iter +/- 3.01
hash::sip::bench_str_of_8_bytes 11.20/iter +/- 0.25
hash::sip::bench_str_over_8_bytes 11.20/iter +/- 0.26
hash::sip::bench_str_under_8_bytes 9.89/iter +/- 0.59
hash::sip::bench_u32 9.57/iter +/- 0.44
hash::sip::bench_u32_keyed 6.97/iter +/- 0.10
hash::sip::bench_u64 8.63/iter +/- 0.07
```
After:
```
benchmarks:
hash::sip::bench_bytes_4 6.64/iter +/- 0.14
hash::sip::bench_bytes_7 8.19/iter +/- 0.07
hash::sip::bench_bytes_8 8.59/iter +/- 0.68
hash::sip::bench_bytes_a_16 9.73/iter +/- 0.49
hash::sip::bench_bytes_b_32 12.70/iter +/- 0.06
hash::sip::bench_bytes_c_128 32.38/iter +/- 0.20
hash::sip::bench_long_str 102.99/iter +/- 0.82
hash::sip::bench_str_of_8_bytes 10.71/iter +/- 0.21
hash::sip::bench_str_over_8_bytes 11.73/iter +/- 0.17
hash::sip::bench_str_under_8_bytes 10.33/iter +/- 0.41
hash::sip::bench_u32 10.41/iter +/- 0.29
hash::sip::bench_u32_keyed 9.50/iter +/- 0.30
hash::sip::bench_u64 8.44/iter +/- 1.09
```
I ran this on my computer so there's some noise, but you can tell at least `bench_long_str` is significantly faster (~18%).
Also, I noticed the same compress function from the library is used in the compiler as well, so I took the liberty of copying this change there too.
Thanks `@semisol` for porting SipHash for another project which led me to notice this issue in Rust, and for helping investigate. <3