Specialize `StepBy<Range<{integer}>>`
OLD
iter::bench_range_step_by_fold_u16 700.00ns/iter +/- 10.00ns
iter::bench_range_step_by_fold_usize 519.00ns/iter +/- 6.00ns
iter::bench_range_step_by_loop_u32 555.00ns/iter +/- 7.00ns
iter::bench_range_step_by_sum_reducible 37.00ns/iter +/- 0.00ns
NEW
iter::bench_range_step_by_fold_u16 49.00ns/iter +/- 0.00ns
iter::bench_range_step_by_fold_usize 194.00ns/iter +/- 1.00ns
iter::bench_range_step_by_loop_u32 98.00ns/iter +/- 0.00ns
iter::bench_range_step_by_sum_reducible 1.00ns/iter +/- 0.00ns
NEW + `-Ctarget-cpu=x86-64-v3`
iter::bench_range_step_by_fold_u16 22.00ns/iter +/- 0.00ns
iter::bench_range_step_by_fold_usize 80.00ns/iter +/- 1.00ns
iter::bench_range_step_by_loop_u32 41.00ns/iter +/- 0.00ns
iter::bench_range_step_by_sum_reducible 1.00ns/iter +/- 0.00ns
I have only optimized for the wall time of these methods; I haven't tested whether this eliminates bounds checks when indexing into slices via patterns like `(0..slice.len()).step_by(16)`.
For ranges over integer types narrower than `usize`, we determine the number of items `StepBy` would yield and store that count in `range.end` instead of the actual end. This significantly simplifies the calculation of the loop induction variable, especially in cases where `StepBy::step` (a `usize`) could overflow the range's item type.
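A minimal standalone sketch of the idea for a `u16` range (the function name and shape are illustrative, not the actual `core` specialization):
```rust
// Hypothetical sketch: instead of keeping the real `end`, precompute how many items
// the iterator will yield and drive the loop with a plain 0..n induction variable.
fn step_by_fold_u16<B>(
    start: u16,
    end: u16,
    step: usize, // the full stride for this sketch, i.e. `step_by(16)` means `step == 16`
    init: B,
    mut f: impl FnMut(B, u16) -> B,
) -> B {
    let span = (end as usize).saturating_sub(start as usize);
    // number of items yielded: ceil(span / step)
    let n = if span == 0 { 0 } else { (span - 1) / step + 1 };

    let mut acc = init;
    for i in 0..n {
        // `start + i * step` stays below `end`, so the cast back to u16 cannot overflow
        acc = f(acc, start + (i * step) as u16);
    }
    acc
}

fn main() {
    let direct: u32 = (0u16..1000).step_by(16).fold(0, |a, x| a + x as u32);
    let sketch = step_by_fold_u16(0, 1000, 16, 0u32, |a, x| a + x as u32);
    assert_eq!(direct, sketch);
}
```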
Ignore `core`, `alloc` and `test` tests that require unwinding on `-C panic=abort`
Some of the tests for `core` and `alloc` require unwinding because they use `catch_unwind`. These tests fail when the test suite is run with `-C panic=abort` (in my case via a target without unwinding support, plus `-Z panic-abort-tests`), even though they should just be ignored, as they don't indicate an actual failure.
This PR marks all of these tests with this attribute:
```rust
#[cfg_attr(not(panic = "unwind"), ignore = "test requires unwinding support")]
```
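Applied to a test that relies on unwinding, this looks roughly as follows (the test body here is illustrative, not one of the actual affected tests):
```rust
#[test]
#[cfg_attr(not(panic = "unwind"), ignore = "test requires unwinding support")]
fn catches_panic() {
    // Under `-C panic=abort` this test is skipped instead of aborting the test runner.
    let result = std::panic::catch_unwind(|| {
        panic!("boom");
    });
    assert!(result.is_err());
}
```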
I'm not aware of a way to test this on rust-lang/rust's CI, as we don't test any target with `-C panic=abort`, but I tested this locally on a Ferrocene target and it does indeed make the test suite pass.
* ensuring that `offset_of!(Self, ...)` works if and only if it is used inside an impl block (see the sketch after this list)
* ensuring that the output type is `usize` and doesn't coerce; this can be changed in the future, but if it is, it should be a conscious decision
* improving the privacy checking test
* ensuring that generics don't let you escape the unsized check
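A rough illustration of the first two points, assuming a toolchain where `offset_of!` is available (this is not one of the actual UI tests):
```rust
use std::mem::offset_of;

struct Foo {
    a: u8,
    b: u32,
}

impl Foo {
    fn offset_of_b() -> usize {
        // `Self` resolves here because we're inside an impl block;
        // outside of one, `offset_of!(Self, ...)` is an error.
        offset_of!(Self, b)
    }
}

fn main() {
    // The macro's output is a `usize` and doesn't coerce to other integer types.
    let off: usize = offset_of!(Foo, b);
    assert_eq!(off, Foo::offset_of_b());
}
```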
Add midpoint function for all integers and floating-point numbers
This pull request adds the `midpoint` function to `{u,i}{8,16,32,64,128,size}`, `NonZeroU{8,16,32,64,size}` and `f{32,64}`.
This new function is analogous to the [C++ midpoint](https://en.cppreference.com/w/cpp/numeric/midpoint) function and basically computes `(a + b) / 2`, rounding towards negative infinity in the case of integers. Put simply: `midpoint(a, b)` is `(a + b) >> 1` as if it were performed in a sufficiently large signed integral type.
Note that unlike the C++ function, this pull request does not implement this function on pointers (`*const T` or `*mut T`). This could be implemented in a future pull request if desired.
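As an illustration of the semantics described above (assuming a toolchain where the method is available, with a signature of the form `fn midpoint(self, rhs: Self) -> Self`):
```rust
fn main() {
    assert_eq!(0u8.midpoint(7), 3);        // 3.5 rounds towards negative infinity
    assert_eq!(u8::MAX.midpoint(0), 127);  // no intermediate overflow near the type's limits
    assert_eq!((-3i8).midpoint(4), 0);     // 0.5 rounds towards negative infinity
    assert_eq!(1.0f32.midpoint(4.0), 2.5); // exact when representable
}
```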
### Implementation
For `f32` and `f64` the implementation is based on the `libcxx` [one](18ab892ff7/libcxx/include/__numeric/midpoint.h (L65-L77)). I originally tried many different approaches, but all of them either failed or left me with a poor imitation of the `libcxx` version. Note that `libstdc++` has a very similar one, and the Microsoft STL implementation is also basically the same as `libcxx`'s. Unfortunately, it doesn't seem like a better way exists.
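Roughly, that approach looks like the following sketch (translated to Rust for `f32`; the actual code in this PR may differ in detail):
```rust
fn midpoint_f32(a: f32, b: f32) -> f32 {
    const LO: f32 = f32::MIN_POSITIVE * 2.0; // below this, halving loses precision
    const HI: f32 = f32::MAX / 2.0;          // above this, `a + b` could overflow
    let (abs_a, abs_b) = (a.abs(), b.abs());
    if abs_a <= HI && abs_b <= HI {
        (a + b) / 2.0 // common case: the sum cannot overflow
    } else if abs_a < LO {
        a + b / 2.0   // not safe to halve `a`
    } else if abs_b < LO {
        a / 2.0 + b   // not safe to halve `b`
    } else {
        a / 2.0 + b / 2.0 // both are large: halve each first
    }
}

fn main() {
    assert_eq!(midpoint_f32(1.0, 4.0), 2.5);
    assert_eq!(midpoint_f32(f32::MAX, f32::MAX), f32::MAX); // no overflow to infinity
}
```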
For unsigned integers I created the `midpoint_impl!` macro, which has two branches (a sketch of both follows this list):
- The first one takes only `$SelfT` and is used when there is no unsigned integer type with at least twice as many bits. The code simply uses the formula `a + (b - a) / 2`, with the arguments ordered and the signs chosen so that the rounding comes out right.
- The second branch is used when a `$WideT` (with at least twice as many bits as `$SelfT`) is provided. Using a wider type means no overflow can occur, which greatly improves the codegen (no branches and fewer instructions).
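A sketch of the two strategies, with `u128` standing in for the no-wider-type case and `u32`/`u64` for the widening case (names are illustrative, not the actual macro output):
```rust
// Branch 1: no wider unsigned type exists, so avoid overflow with `a + (b - a) / 2`,
// ordering the operands so the result rounds down like `(a + b) >> 1` would.
fn midpoint_narrow(a: u128, b: u128) -> u128 {
    if a <= b { a + (b - a) / 2 } else { b + (a - b) / 2 }
}

// Branch 2: a wider type exists (u64 for u32), so the sum simply cannot overflow.
fn midpoint_wide(a: u32, b: u32) -> u32 {
    ((a as u64 + b as u64) / 2) as u32
}

fn main() {
    assert_eq!(midpoint_narrow(3, 8), 5);
    assert_eq!(midpoint_wide(u32::MAX, 0), u32::MAX / 2);
}
```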
For signed integers the code forwards to the unsigned version of `midpoint` by mapping the signed values onto the unsigned range (e.g. `i8`'s `[-128, 127]` onto `[0, 255]`) and back; a sketch of this mapping follows below.
I originally created a version that worked directly on the signed numbers, but the code was ugly and hard to understand. Despite this mapping "overhead", the codegen is better than my most optimized version that operated on the signed integers directly.
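One way to realize that mapping, sketched for `i8` (the actual macro-generated code may differ):
```rust
fn midpoint_i8(a: i8, b: i8) -> i8 {
    // Flipping the sign bit maps [-128, 127] onto [0, 255] while preserving order.
    let ua = (a as u8) ^ 0x80;
    let ub = (b as u8) ^ 0x80;
    // Unsigned midpoint, rounding down (same formula as the narrow unsigned branch).
    let um = if ua <= ub { ua + (ub - ua) / 2 } else { ub + (ua - ub) / 2 };
    // Map the result back to the signed range.
    (um ^ 0x80) as i8
}

fn main() {
    assert_eq!(midpoint_i8(-3, 4), 0);
    assert_eq!(midpoint_i8(-128, 127), -1); // -0.5 rounds towards negative infinity
}
```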
~~Note that in the case of unsigned numbers I tried to be smart and used `#[cfg(target_pointer_width = "64")]` to determine whether using the wide version was better, by looking at the assembly on godbolt. This was applied to `u32`, `u64` and `usize` and doesn't change the behavior, only the generated assembly.~~
Spelling library
Split per https://github.com/rust-lang/rust/pull/110392
I can squash once people are happy with the changes; it's really uncommon for a large set of changes to be perfectly acceptable without at least some revisions.
I probably won't have time to respond until tomorrow or the next day
Negating a non-zero integer currently requires unpacking to a
primitive and re-wrapping. Since negation of non-zero signed
integers always produces a non-zero result, it is safe to
implement `Neg` for `NonZeroI{N}`.
The new `impl` is marked as stable because trait implementations
for two stable types can't be marked unstable.
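A sketch of the idea for one width, using a newtype wrapper since a user crate can't implement `Neg` for `NonZeroI32` directly (the real change is macro-generated for all `NonZeroI{N}` types):
```rust
use std::num::NonZeroI32;
use std::ops::Neg;

#[derive(Clone, Copy, Debug, PartialEq)]
struct MyNonZeroI32(NonZeroI32);

impl Neg for MyNonZeroI32 {
    type Output = Self;
    fn neg(self) -> Self {
        // Negating a non-zero value can never produce zero, so re-wrapping always
        // succeeds (ignoring the i32::MIN overflow edge case for this sketch).
        MyNonZeroI32(NonZeroI32::new(-self.0.get()).expect("negation of non-zero is non-zero"))
    }
}

fn main() {
    let x = MyNonZeroI32(NonZeroI32::new(5).unwrap());
    assert_eq!((-x).0.get(), -5);
}
```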
A successful advance is now signalled by returning `0`, and other values now represent the remaining number of steps that couldn't be advanced, as opposed to the number of steps that were advanced during a partial `advance_by`. This simplifies adapters a bit, replacing some `match`/`if` with arithmetic. Whether this is beneficial overall depends on whether `advance_by` is mostly used as a building block for other iterator methods and adapters, or whether we also see uses by users where a `Result` might be more useful.
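As a rough illustration of the convention (using a simplified free function rather than the real trait method signature):
```rust
// Returns 0 on full success, otherwise how many of the `n` steps could not be taken.
fn advance_by(iter: &mut impl Iterator, n: usize) -> usize {
    for taken in 0..n {
        if iter.next().is_none() {
            return n - taken; // ran out of items with `n - taken` steps remaining
        }
    }
    0
}

fn main() {
    let mut it = 0..3;
    assert_eq!(advance_by(&mut it, 2), 0); // advanced all 2 steps
    assert_eq!(advance_by(&mut it, 5), 4); // only 1 of the 5 requested steps was possible
}
```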
Improve the `array::map` codegen
The `map` method on arrays [is documented as sometimes performing poorly](https://doc.rust-lang.org/std/primitive.array.html#note-on-performance-and-stack-usage), and after [a question on URLO](https://users.rust-lang.org/t/try-trait-residual-o-trait-and-try-collect-into-array/88510?u=scottmcm) prompted me to take another look at the core [`try_collect_into_array`](7c46fb2111/library/core/src/array/mod.rs (L865-L912)) function, I had some ideas that ended up working better than I'd expected.
There are three main ideas in here, split over three commits:
1. Don't use `array::IntoIter` when we can avoid it, since it doesn't seem to get SRoA'd, meaning that every step writes things like loop counters onto the stack unnecessarily
2. Don't return arrays in `Result`s unnecessarily, as that doesn't seem to optimize away even with `unwrap_unchecked` (perhaps because it needs to get moved into a new LLVM type to account for the discriminant)
3. Don't distract LLVM with all the `Option` dances when we know for sure we have enough items (like in `map` and `zip`). This one is a larger commit, as doing it meant adding a new `pub(crate)` trait, but hopefully those changes are still straightforward.
(No libs-api changes; everything should be completely implementation-detail-internal.)
It's still not completely fixed -- I think it still needs pcwalton's `memcpy` optimizations (#103830) to get further -- but this goes much better than before. And the remaining `memcpy`s are just `transmute`-equivalent (`[T; N] -> ManuallyDrop<[T; N]>` and `[MaybeUninit<T>; N] -> [T; N]`), so hopefully those will be easier to remove with LLVM16 than the previous subobject copies 🤞
r? `@thomcc`
As a simple example, take this function:
```rust
pub fn long_integer_map(x: [u32; 64]) -> [u32; 64] {
x.map(|x| 13 * x + 7)
}
```
On nightly <https://rust.godbolt.org/z/xK7548TGj> it takes `sub rsp, 808`
```llvm
start:
%array.i.i.i.i = alloca [64 x i32], align 4
%_3.sroa.5.i.i.i = alloca [65 x i32], align 4
%_5.i = alloca %"core::iter::adapters::map::Map<core::array::iter::IntoIter<u32, 64>, [closure@/app/example.rs:2:11: 2:14]>", align 8
```
(and yes, that's a 6**5**-element array `alloca` despite 6**4**-element input and output)
But with this PR it's only `sub rsp, 520`
```llvm
start:
%array.i.i.i.i.i.i = alloca [64 x i32], align 4
%array1.i.i.i = alloca %"core::mem::manually_drop::ManuallyDrop<[u32; 64]>", align 4
```
Similarly, the loop it emits on nightly is scalar-only and horrifying
```nasm
.LBB0_1:
mov esi, 64
mov edi, 0
cmp rdx, 64
je .LBB0_3
lea rsi, [rdx + 1]
mov qword ptr [rsp + 784], rsi
mov r8d, dword ptr [rsp + 4*rdx + 528]
mov edi, 1
lea edx, [r8 + 2*r8]
lea r8d, [r8 + 4*rdx]
add r8d, 7
.LBB0_3:
test edi, edi
je .LBB0_11
mov dword ptr [rsp + 4*rcx + 272], r8d
cmp rsi, 64
jne .LBB0_6
xor r8d, r8d
mov edx, 64
test r8d, r8d
jne .LBB0_8
jmp .LBB0_11
.LBB0_6:
lea rdx, [rsi + 1]
mov qword ptr [rsp + 784], rdx
mov edi, dword ptr [rsp + 4*rsi + 528]
mov r8d, 1
lea esi, [rdi + 2*rdi]
lea edi, [rdi + 4*rsi]
add edi, 7
test r8d, r8d
je .LBB0_11
.LBB0_8:
mov dword ptr [rsp + 4*rcx + 276], edi
add rcx, 2
cmp rcx, 64
jne .LBB0_1
```
whereas with this PR it's unrolled and vectorized
```nasm
vpmulld ymm1, ymm0, ymmword ptr [rsp + 64]
vpaddd ymm1, ymm1, ymm2
vmovdqu ymmword ptr [rsp + 328], ymm1
vpmulld ymm1, ymm0, ymmword ptr [rsp + 96]
vpaddd ymm1, ymm1, ymm2
vmovdqu ymmword ptr [rsp + 360], ymm1
```
(though sadly still stack-to-stack)
Avoid mixing accesses of ptrs derived from a mutable ref and parent ptrs
`@Vanille-N` is working on a successor for Stacked Borrows. It will mostly accept strictly more code than Stacked Borrows did, with one exception: the following pattern no longer works.
```rust
let mut root = 6u8;
let mref = &mut root;
let ptr = mref as *mut u8;
*ptr = 0; // Write
assert_eq!(root, 0); // Parent Read
*ptr = 0; // Attempted Write
```
This worked in Stacked Borrows kind of by accident: when doing the "parent read", under SB we Disable `mref`, but the raw ptrs derived from it remain usable. The fact that we can still use the "children" of a reference that is no longer usable is quite nasty and leads to some undesirable effects (in particular it is the major blocker for resolving https://github.com/rust-lang/unsafe-code-guidelines/issues/257). So in Tree Borrows we no longer do that; instead, reading from `root` makes `mref` and all its children read-only.
Due to other improvements in Tree Borrows, the entire Miri test suite still passes with this new behavior, and even the entire libcore and liballoc test suites, except for the 2 cases this PR fixes. Both involve code where the programmer wrote `&mut` but then used pointers derived from that reference in ways that alias with the parent pointer, which arguably violates uniqueness. They are fixed by properly using raw pointers throughout; a sketch of the resulting pattern follows.
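For the snippet above, one way to express the same accesses without going through a `&mut` is to take the raw pointer directly from the owner (this is a sketch of the pattern, not necessarily the exact change made in the affected tests):
```rust
fn main() {
    let mut root = 6u8;
    // The raw pointer is derived from `root` itself, so there is no intermediate
    // `&mut` whose children would become read-only when `root` is read.
    let ptr = std::ptr::addr_of_mut!(root);
    unsafe { *ptr = 0 };  // Write
    assert_eq!(root, 0);  // Parent Read
    unsafe { *ptr = 0 };  // Write: no `&mut` was invalidated, so this is still allowed
}
```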