Slightly optimize hash map stable hashing
I was profiling some of the `rustc-perf` benchmarks locally and noticed that quite some time is spent inside the stable hash of hashmaps. I tried to use a `SmallVec` instead of a `Vec` there, which helped very slightly.
Then I tried to remove the sorting, which was a bottleneck, and replaced it with insertion into a binary heap. Locally, it yielded nice improvements in instruction counts and RSS in several benchmarks for incremental builds. The implementation could probably be much nicer and possibly extended to other stable hashes, but first I wanted to test the perf impact properly.
Can I ask someone to do a perf run? Thank you!
This largely avoids remapping from and to the 'real' indices, with the exception
of predecessor lookup and the final merge back, and is conceptually better.
As the paper indicates, the unprocessed vertices in the DFS tree and processed
vertices are disjoint, and we can use them in the same space, tracking only the index
of the split.
This replaces the previous implementation with the simple variant of
Lengauer-Tarjan, which performs better in the general case. Performance on the
keccak benchmark is about equivalent between the two, but we don't see
regressions (and indeed see improvements) on other benchmarks, even on a
partially optimized implementation.
The implementation here follows that of the pseudocode in "Linear-Time
Algorithms for Dominators and Related Problems" thesis by Loukas Georgiadis. The
next few commits will optimize the implementation as suggested in the thesis.
Several related works are cited in the comments within the implementation, as
well.
Implement the simple Lengauer-Tarjan algorithm
This replaces the previous implementation (from #34169), which has not been
optimized since, with the simple variant of Lengauer-Tarjan which performs
better in the general case. A previous attempt -- not kept in commit history --
attempted a replacement with a bitset-based implementation, but this led to
regressions on perf.rust-lang.org benchmarks and equivalent wins for the keccak
benchmark, so was rejected.
The implementation here follows that of the pseudocode in "Linear-Time
Algorithms for Dominators and Related Problems" thesis by Loukas Georgiadis. The
next few commits will optimize the implementation as suggested in the thesis.
Several related works are cited in the comments within the implementation, as
well.
On the keccak benchmark, we were previously spending 15% of our cycles computing
the NCA / intersect function; this function is quite expensive, especially on
modern CPUs, as it chases pointers on every iteration in a tight loop. With this
commit, we spend ~0.05% of our time in dominator computation.
There's a conversation in the tracking issue about possibly unaccepting `in_band_lifetimes`, but it's used heavily in the compiler, and thus there'd need to be a bunch of PRs like this if that were to happen.
So here's one to see how much of an impact it has.
(Oh, and I removed `nll` while I was here too, since it didn't seem needed. Let me know if I should put that back.)
Implement write() method for Box<MaybeUninit<T>>
This adds method similar to `MaybeUninit::write` main difference being
it returns owned `Box`. This can be used to elide copy from stack
safely, however it's not currently tested that the optimization actually
occurs.
Analogous methods are not provided for `Rc` and `Arc` as those need to
handle the possibility of sharing. Some version of them may be added in
the future.
This was discussed in #63291 which this change extends.
This adds method similar to `MaybeUninit::write` main difference being
it returns owned `Box`. This can be used to elide copy from stack
safely, however it's not currently tested that the optimization actually
occurs.
Analogous methods are not provided for `Rc` and `Arc` as those need to
handle the possibility of sharing. Some version of them may be added in
the future.
This was discussed in #63291 which this change extends.
Revert "Add rustc lint, warning when iterating over hashmaps"
Fixes perf regressions introduced in https://github.com/rust-lang/rust/pull/90235 by temporarily reverting the relevant PR.
Particularly for ctfe-stress-4, the hashing of byte slices as part of the
MIR Allocation is quite hot. Previously, we were falling back on byte-by-byte
copying of the slice into the SipHash buffer (64 bytes long) before hashing a 64
byte chunk, and then doing that again and again.
This should hopefully be an improvement for that code.
Stabilize `const_panic`
Closes#51999
FCP completed in #89006
```@rustbot``` label +A-const-eval +A-const-fn +T-lang
cc ```@oli-obk``` for review (not `r?`'ing as not on lang team)