Refactor the code in the `convert_while_ascii` helper function to make
it more suitable for auto-vectorization and also process the full ascii
prefix of the string. The generic case conversion logic will only be
invoked starting from the first non-ascii character.
The runtime on microbenchmarks with ascii-only inputs improves between
1.5x for short and 4x for long inputs on x86_64 and aarch64.
The new implementation also encapsulates all unsafe inside the
`convert_while_ascii` function.
Fixes#123712
detects redundant imports that can be eliminated.
for #117772 :
In order to facilitate review and modification, split the checking code and
removing redundant imports code into two PR.
Implement more methods for `vec_deque::IntoIter`
This implements a couple `Iterator` methods on `vec_deque::IntoIter` (`(try_)fold`, `(try_)rfold` `advance_(back_)by`, `next_chunk`, `count` and `last`) to allow these to be more efficient than their default implementations, also allowing many other `Iterator` methods that use these under the hood to take advantage of these manual implementations. `vec::IntoIter` has similar implementations for many of these methods. This PR does not yet implement `TrustedRandomAccess` and friends, as I'm not very familiar with the required safety guarantees.
r? `@the8472` (since you also took over my last PR)
```
test vec::bench_next_chunk ... bench: 696 ns/iter (+/- 22)
x86_64v1, pr
test vec::bench_next_chunk ... bench: 309 ns/iter (+/- 4)
znver2, default
test vec::bench_next_chunk ... bench: 17,272 ns/iter (+/- 117)
znver2, pr
test vec::bench_next_chunk ... bench: 211 ns/iter (+/- 3)
```
The znver2 default impl seems to be slow due to inlining decisions. It goes through `core::array::iter_next_chunk`
which has a deeper call tree.
These debug assertions are all implemented only at runtime using
`const_eval_select`, and in the error path they execute
`intrinsics::abort` instead of being a normal debug assertion to
minimize the impact of these assertions on code size, when enabled.
Of all these changes, the bounds checks for unchecked indexing are
expected to be most impactful (case in point, they found a problem in
rustc).
Add `into_iter().filter().collect()` as a comparison point since it was reported to be faster than `retain`.
Remove clone inside benchmark loop to reduce allocator noise.
The unsoundness is not in Peekable per se, it rather is due to the
interaction between Peekable being able to hold an extra item
and vec::IntoIter's clone implementation shortening the allocation.
An alternative solution would be to change IntoIter's clone implementation
to keep enough spare capacity available.
Many of the Vec benchmarks assert what values should be produced by the
benchmarked code. In some cases, these asserts dominate the runtime of
the benchmarks they are in, causing the benchmarks to understate the
impact of an optimization or regression.
Avoid useless sift_down when std::collections::binary_heap::PeekMut is never mutably dereferenced
If `deref_mut` is never called then it's not possible for the element to be mutated without internal mutability, meaning there's no need to call `sift_down`.
This could be a little improvement in cases where you want to mutate the biggest element of the heap only if it satisfies a certain predicate that needs only read access to the element.