Improve VecCache under parallel frontend
This replaces the single Vec allocation with a series of progressively larger buckets. With the cfg for parallel enabled but with -Zthreads=1, this looks like a slight regression in i-count and cycle counts (~1%).
With the parallel frontend at -Zthreads=4, this is an improvement (-5% wall-time from 5.788 to 5.4688 on libcore) over our current Lock-based approach, likely due to reducing the bouncing of the cache line holding the lock. At -Zthreads=32 it's a huge improvement (-46%: 8.829 -> 4.7319 seconds).
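For intuition, the bucketed layout is roughly the classic "segmented array" idea; a minimal sketch (types, sizes, and names are illustrative, not the actual `VecCache` implementation) could look like this:

```rust
use std::sync::OnceLock;

/// Append-only cache sketch: instead of one growable Vec (which has to be
/// locked while it reallocates), entries live in progressively larger buckets
/// that are each allocated at most once and never move, so lookups need no lock.
struct BucketedCache<T> {
    // Bucket `b` holds 2^b slots; 32 buckets cover indices up to ~4 billion.
    buckets: [OnceLock<Box<[OnceLock<T>]>>; 32],
}

impl<T> BucketedCache<T> {
    fn new() -> Self {
        Self { buckets: std::array::from_fn(|_| OnceLock::new()) }
    }

    /// Map a flat index to (bucket, offset within that bucket).
    fn locate(index: usize) -> (usize, usize) {
        let bucket = (index + 1).ilog2() as usize;
        (bucket, index + 1 - (1 << bucket))
    }

    fn get(&self, index: usize) -> Option<&T> {
        let (b, off) = Self::locate(index);
        self.buckets[b].get()?.get(off)?.get()
    }

    fn insert(&self, index: usize, value: T) {
        let (b, off) = Self::locate(index);
        let bucket = self.buckets[b]
            .get_or_init(|| (0..(1usize << b)).map(|_| OnceLock::new()).collect());
        // Concurrent inserts of the same index keep whichever value lands first.
        let _ = bucket[off].set(value);
    }
}
```

Because no bucket is ever reallocated or moved, readers never have to take a lock, which is what pays off at higher thread counts.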
try-job: i686-gnu-nopt
try-job: dist-x86_64-linux
Since its inception a long time ago, the parallel compiler and its cfgs
have been a maintenance burden. This was a necessary evil to allow
iteration while not degrading performance due to synchronization
overhead.
But that time is over. Thanks to the amazing work by the parallel
working group (and the dyn sync crimes), the parallel compiler has been
fast enough to ship by default in nightly for quite a while now.
Stable and beta have still been on the serial compiler, because they
can't use `-Zthreads` anyway.
But this is quite suboptimal:
- the maintenance burden still sucks
- we're not testing the serial compiler in nightly
For these reasons, it's time to end it. The serial compiler has
served us well in the years since it was split from the parallel one,
but it's over now.
Let the knight slay one head of the two-headed dragon!
Dominator-order information is only needed for coverage graphs, and is easy
enough to collect by just traversing the graph again.
This avoids wasted work when computing graph dominators for any other purpose.
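As an illustration of "traversing the graph again" (a simplified sketch over an adjacency-list CFG, not the actual coverage-graph code), a reverse-post-order ranking can be recomputed with one DFS:

```rust
/// Collect a reverse-post-order ranking of basic blocks by walking the
/// successor graph, instead of asking the dominator computation to record
/// visitation order as a side effect.
fn reverse_post_order(successors: &[Vec<usize>], start: usize) -> Vec<usize> {
    let n = successors.len();
    let mut visited = vec![false; n];
    let mut post_order = Vec::with_capacity(n);

    // Iterative DFS; each stack frame is (node, index of next successor to try).
    let mut stack = vec![(start, 0usize)];
    visited[start] = true;
    while let Some(frame) = stack.last_mut() {
        let node = frame.0;
        if let Some(&succ) = successors[node].get(frame.1) {
            frame.1 += 1;
            if !visited[succ] {
                visited[succ] = true;
                stack.push((succ, 0));
            }
        } else {
            post_order.push(node);
            stack.pop();
        }
    }

    // rank[b] = position of block b in reverse post-order; dominators always
    // come before the blocks they dominate in this order.
    let mut rank = vec![usize::MAX; n];
    for (i, &node) in post_order.iter().rev().enumerate() {
        rank[node] = i;
    }
    rank
}
```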
stabilize Strict Provenance and Exposed Provenance APIs
Given that [RFC 3559](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html) has been accepted, t-lang has approved the concept of provenance to exist in the language. So I think it's time that we stabilize the strict provenance and exposed provenance APIs, and discuss provenance explicitly in the docs:
```rust
// core::ptr
pub const fn without_provenance<T>(addr: usize) -> *const T;
pub const fn dangling<T>() -> *const T;
pub const fn without_provenance_mut<T>(addr: usize) -> *mut T;
pub const fn dangling_mut<T>() -> *mut T;
pub fn with_exposed_provenance<T>(addr: usize) -> *const T;
pub fn with_exposed_provenance_mut<T>(addr: usize) -> *mut T;
impl<T: ?Sized> *const T {
    pub fn addr(self) -> usize;
    pub fn expose_provenance(self) -> usize;
    pub fn with_addr(self, addr: usize) -> Self;
    pub fn map_addr(self, f: impl FnOnce(usize) -> usize) -> Self;
}
impl<T: ?Sized> *mut T {
    pub fn addr(self) -> usize;
    pub fn expose_provenance(self) -> usize;
    pub fn with_addr(self, addr: usize) -> Self;
    pub fn map_addr(self, f: impl FnOnce(usize) -> usize) -> Self;
}
impl<T: ?Sized> NonNull<T> {
    pub fn addr(self) -> NonZero<usize>;
    pub fn with_addr(self, addr: NonZero<usize>) -> Self;
    pub fn map_addr(self, f: impl FnOnce(NonZero<usize>) -> NonZero<usize>) -> Self;
}
```
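For illustration (an example put together here, not text from the PR), the two families are used roughly like this:

```rust
use core::ptr;

fn main() {
    let x = 42u32;
    let p: *const u32 = &x;

    // Strict Provenance: derive a new pointer from an existing one, keeping
    // its provenance while replacing (or transforming) the address.
    let tagged = p.map_addr(|a| a | 0b1);          // stash a tag bit
    let untagged = tagged.map_addr(|a| a & !0b1);  // restore the address
    assert_eq!(unsafe { *untagged }, 42);

    // Exposed Provenance: the escape hatch for pointer <-> integer round trips.
    let addr: usize = p.expose_provenance();
    let q: *const u32 = ptr::with_exposed_provenance(addr);
    assert_eq!(unsafe { *q }, 42);

    // A pointer that carries an address but no provenance at all.
    let dangling: *const u32 = ptr::without_provenance(0x1000);
    let _ = dangling; // must not be dereferenced
}
```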
I also did a pass over the docs to adjust them, because this is no longer an "experiment". The `ptr` docs now discuss the concept of provenance in general, and then go into the two families of APIs for dealing with provenance: Strict Provenance and Exposed Provenance. I removed the discussion of how pointers also have an associated "address space" -- that is not actually tracked in the pointer value, it is tracked in the type, so IMO it just distracts from the core point of provenance. I also adjusted the docs for `with_exposed_provenance` to make it clear that we cannot guarantee much about this function; it's all best-effort.
There are two unstable lints associated with the strict_provenance feature gate; I moved them to a new [strict_provenance_lints](https://github.com/rust-lang/rust/issues/130351) feature since I didn't want this PR to have an even bigger FCP. ;)
`@rust-lang/opsem` Would be great to get some feedback on the docs here. :)
Nominating for `@rust-lang/libs-api`.
Part of https://github.com/rust-lang/rust/issues/95228.
[FCP comment](https://github.com/rust-lang/rust/pull/130350#issuecomment-2395114536)
Use `ThinVec` for PredicateObligation storage
~~I noticed while profiling clippy on a project that a large amount of time is being spent allocating `Vec`s for `PredicateObligation`, and the `Vec`s are often quite small. This is an attempt to optimise this by using SmallVec to avoid heap allocations for these common small Vecs.~~
This PR turns all the `Vec<PredicateObligation>` into a single type alias while avoiding referring to `Vec` around it, then swaps the type over to `ThinVec<PredicateObligation>` and fixes the fallout. This also contains an implementation of `ThinVec::extract_if`, copied from `Vec::extract_if` and currently being upstreamed to https://github.com/Gankra/thin-vec/pull/66.
This leads to a small (0.2-0.7%) performance gain in the latest perf run.
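The relevant property of `ThinVec` is its memory layout; a quick sketch using the `thin-vec` crate (illustrative, not code from this PR):

```rust
use std::mem::size_of;
use thin_vec::ThinVec;

// A ThinVec stores its length and capacity in the heap allocation, so the
// inline handle is a single pointer. Types that embed many (usually empty)
// obligation lists therefore shrink considerably.
fn main() {
    assert_eq!(size_of::<ThinVec<u64>>(), size_of::<usize>());
    assert_eq!(size_of::<Vec<u64>>(), 3 * size_of::<usize>());

    // An empty ThinVec points at a shared static header, so it does not allocate.
    let empty: ThinVec<u64> = ThinVec::new();
    assert!(empty.is_empty());
}
```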
Like #130865 did for the standard library, we can use `&raw` in the
compiler now that stage0 supports it. Also like that PR, I did
not make any doc or test changes at this time.
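The rewrite itself is mechanical; a hypothetical before/after (not an actual hunk from the compiler):

```rust
// `&raw const` / `&raw mut` expressions replace the `ptr::addr_of!` /
// `ptr::addr_of_mut!` macros when creating a raw pointer without first
// creating a reference.
use std::ptr;

#[repr(packed)]
struct Packed {
    field: u32,
}

fn main() {
    let mut p = Packed { field: 7 };

    // Before: macro-based.
    let old_style: *const u32 = ptr::addr_of!(p.field);
    // After: built-in syntax, usable now that the stage0 compiler supports it.
    let new_style: *const u32 = &raw const p.field;
    let new_style_mut: *mut u32 = &raw mut p.field;

    unsafe {
        assert_eq!(old_style.read_unaligned(), new_style.read_unaligned());
        new_style_mut.write_unaligned(8);
        assert_eq!((&raw const p.field).read_unaligned(), 8);
    }
}
```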
Use more slice patterns inside the compiler
Nothing super noteworthy. Just replacing the common 'fragile' pattern of "length check followed by indexing or unwrap" with slice patterns for legibility and 'robustness'.
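Something like the following (an illustrative example, not an actual diff from the compiler) captures the kind of rewrite involved:

```rust
fn describe(args: &[String]) -> String {
    // Before: a fragile pairing of a length check with `[0]`/`[1]` indexing:
    // if args.len() == 2 { format!("{} -> {}", args[0], args[1]) } else { ... }

    // After: slice patterns bind the elements directly, and the compiler
    // checks that the match is exhaustive.
    match args {
        [] => String::from("no arguments"),
        [only] => format!("single argument: {only}"),
        [first, .., last] => format!("{first} ... {last}"),
    }
}
```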
r? ghost
Optimize SipHash by reordering compress instructions
This PR optimizes hashing by changing the order of instructions in the sip.rs `compress` macro so the CPU can parallelize it better. The new order is taken directly from Fig 2.1 in [the SipHash paper](https://eprint.iacr.org/2012/351.pdf) (but with the xors moved, which makes it a little faster). I attempted to optimize it some more after this, but I think this might be the optimal instruction order. Note that this shouldn't change the behavior of hashing at all; only statements that don't depend on each other were reordered.
It appears that the current order hasn't changed since its [original implementation from 2012](fada46c421 (diff-b751133c229259d7099bbbc7835324e5504b91ab1aded9464f0c48cd22e5e420R35)), which doesn't look like it was written with data dependencies in mind.
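Concretely, SipRound updates two mostly independent pairs, `(v0, v1)` and `(v2, v3)`, so interleaving the two chains gives the CPU independent instructions to issue back-to-back while the result stays bit-for-bit identical. A sketch of that shape (written as a plain function rather than the actual `compress!` macro, and not necessarily the exact order that landed):

```rust
fn sip_round(v0: &mut u64, v1: &mut u64, v2: &mut u64, v3: &mut u64) {
    // The (v0, v1) chain and the (v2, v3) chain are interleaved so they can
    // execute in parallel on an out-of-order core.
    *v0 = v0.wrapping_add(*v1);
    *v2 = v2.wrapping_add(*v3);
    *v1 = v1.rotate_left(13);
    *v3 = v3.rotate_left(16);
    *v1 ^= *v0;
    *v3 ^= *v2;

    *v0 = v0.rotate_left(32);
    *v2 = v2.wrapping_add(*v1);
    *v0 = v0.wrapping_add(*v3);
    *v1 = v1.rotate_left(17);
    *v3 = v3.rotate_left(21);
    *v1 ^= *v2;
    *v3 ^= *v0;
    *v2 = v2.rotate_left(32);
}
```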
Running `./x bench library/core --stage 0 --test-args hash` before and after this change shows the following results:
Before:
```
benchmarks:
hash::sip::bench_bytes_4 7.20/iter +/- 0.70
hash::sip::bench_bytes_7 9.01/iter +/- 0.35
hash::sip::bench_bytes_8 8.12/iter +/- 0.10
hash::sip::bench_bytes_a_16 10.07/iter +/- 0.44
hash::sip::bench_bytes_b_32 13.46/iter +/- 0.71
hash::sip::bench_bytes_c_128 37.75/iter +/- 0.48
hash::sip::bench_long_str 121.18/iter +/- 3.01
hash::sip::bench_str_of_8_bytes 11.20/iter +/- 0.25
hash::sip::bench_str_over_8_bytes 11.20/iter +/- 0.26
hash::sip::bench_str_under_8_bytes 9.89/iter +/- 0.59
hash::sip::bench_u32 9.57/iter +/- 0.44
hash::sip::bench_u32_keyed 6.97/iter +/- 0.10
hash::sip::bench_u64 8.63/iter +/- 0.07
```
After:
```
benchmarks:
hash::sip::bench_bytes_4 6.64/iter +/- 0.14
hash::sip::bench_bytes_7 8.19/iter +/- 0.07
hash::sip::bench_bytes_8 8.59/iter +/- 0.68
hash::sip::bench_bytes_a_16 9.73/iter +/- 0.49
hash::sip::bench_bytes_b_32 12.70/iter +/- 0.06
hash::sip::bench_bytes_c_128 32.38/iter +/- 0.20
hash::sip::bench_long_str 102.99/iter +/- 0.82
hash::sip::bench_str_of_8_bytes 10.71/iter +/- 0.21
hash::sip::bench_str_over_8_bytes 11.73/iter +/- 0.17
hash::sip::bench_str_under_8_bytes 10.33/iter +/- 0.41
hash::sip::bench_u32 10.41/iter +/- 0.29
hash::sip::bench_u32_keyed 9.50/iter +/- 0.30
hash::sip::bench_u64 8.44/iter +/- 1.09
```
I ran this on my computer so there's some noise, but you can tell at least `bench_long_str` is significantly faster (~18%).
Also, I noticed that the compiler has its own copy of the same compress function, so I took the liberty of copy-pasting this change there as well.
Thanks `@semisol` for porting SipHash for another project which led me to notice this issue in Rust, and for helping investigate. <3
Instead of keeping a list of architectures which have native support
for 64-bit atomics, just use #[cfg(target_has_atomic = "64")] and its
inverted counterpart to determine whether we need to use portable
AtomicU64 on the target architecture.
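A sketch of the pattern (the fallback crate and the helper function are illustrative assumptions, not the exact `rustc_data_structures` code):

```rust
// On targets with native 64-bit atomics, use the standard type...
#[cfg(target_has_atomic = "64")]
use std::sync::atomic::AtomicU64;
// ...and on targets without them (e.g. 32-bit SPARC), fall back to a portable
// implementation such as the `portable-atomic` crate (assumed here for illustration).
#[cfg(not(target_has_atomic = "64"))]
use portable_atomic::AtomicU64;

use std::sync::atomic::Ordering;

// Hypothetical helper: compiles unchanged on both kinds of targets.
pub fn next_id(counter: &AtomicU64) -> u64 {
    counter.fetch_add(1, Ordering::Relaxed)
}
```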
Fixes for 32-bit SPARC on Linux
This PR fixes a number of issues which previously prevented `rustc` from being built
successfully for 32-bit SPARC using the `sparc-unknown-linux-gnu` triplet.
In particular, it adds linking against `libatomic` where necessary, uses portable `AtomicU64`
for `rustc_data_structures`, rewrites the spec for `sparc_unknown_linux_gnu` to use
`TargetOptions`, and replaces the previously used `-mv8plus` with the more portable
`-mcpu=v9 -m32`.
To make `rustc` build successfully, support for 32-bit SPARC also needs to be added to the
`object` and `nix` crates; I will be sending out those changes separately.
r? nagisa
This is possible now that inline const blocks are stable; the idea was
even mentioned as an alternative when `uninit_array()` was added:
<https://github.com/rust-lang/rust/pull/65580#issuecomment-544200681>
> if it’s stabilized soon enough maybe it’s not worth having a
> standard library method that will be replaceable with
> `let buffer = [MaybeUninit::<T>::uninit(); $N];`
Const array repetition and inline const blocks are now stable (in the
next release), so that circumstance has come to pass, and we no longer
have reason to want `uninit_array()` other than convenience. Therefore,
let's evaluate the inconvenience of doing without `uninit_array()` in
the standard library, before potentially deleting it entirely.
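The replacement pattern looks roughly like this (illustrative; the inline const block is what makes the array-repeat expression legal even though `MaybeUninit<T>` is not `Copy` for arbitrary `T`):

```rust
use std::mem::MaybeUninit;

// Before (unstable helper):
// let buf: [MaybeUninit<u8>; 16] = MaybeUninit::uninit_array();
// After (stable array repetition with an inline const block):
fn make_buffer<T, const N: usize>() -> [MaybeUninit<T>; N] {
    [const { MaybeUninit::uninit() }; N]
}

fn main() {
    let mut buf: [MaybeUninit<u8>; 16] = make_buffer();
    buf[0].write(42);
    // SAFETY: element 0 was just initialized above.
    assert_eq!(unsafe { buf[0].assume_init() }, 42);
}
```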
Added an associated `const THIS_IMPLEMENTATION_HAS_BEEN_TRIPLE_CHECKED`
to the `StableOrd` trait to ensure that implementors carefully consider
whether the trait's contract is upheld, as incorrect implementations can
cause miscompilations.
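The mechanism is just an associated const with no default, so every impl has to spell it out; a simplified sketch of the pattern (not the exact rustc definition):

```rust
// Sketch: a marker const with no default value forces each `impl` to contain
// an explicit acknowledgement of the trait's contract.
trait StableOrd: Ord {
    /// Must only be defined by implementors who have verified that the `Ord`
    /// implementation is stable across compilation sessions and targets.
    const THIS_IMPLEMENTATION_HAS_BEEN_TRIPLE_CHECKED: ();
}

// u32's ordering is purely value-based, so it is stable everywhere.
impl StableOrd for u32 {
    const THIS_IMPLEMENTATION_HAS_BEEN_TRIPLE_CHECKED: () = ();
}
```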