Auto merge of #127226 - mat-1:optimize-siphash-round, r=nnethercote

Optimize SipHash by reordering compress instructions

This PR optimizes hashing by changing the order of instructions in the sip.rs `compress` macro so the CPU can parallelize it better. The new order is taken directly from Fig 2.1 in [the SipHash paper](https://eprint.iacr.org/2012/351.pdf) (but with the xors moved which makes it a little faster). I attempted to optimize it some more after this, but I think this might be the optimal instruction order.

Note that this shouldn't change the behavior of hashing at all, only statements that don't depend on each other were reordered.

It appears like the current order hasn't changed since its [original implementation from 2012](fada46c421 (diff-b751133c229259d7099bbbc7835324e5504b91ab1aded9464f0c48cd22e5e420R35)) which doesn't look like it was written with data dependencies in mind.

Running `./x bench library/core --stage 0 --test-args hash` before and after this change shows the following results:

Before:
```
benchmarks:
    hash::sip::bench_bytes_4                 7.20/iter +/- 0.70
    hash::sip::bench_bytes_7                 9.01/iter +/- 0.35
    hash::sip::bench_bytes_8                 8.12/iter +/- 0.10
    hash::sip::bench_bytes_a_16             10.07/iter +/- 0.44
    hash::sip::bench_bytes_b_32             13.46/iter +/- 0.71
    hash::sip::bench_bytes_c_128            37.75/iter +/- 0.48
    hash::sip::bench_long_str              121.18/iter +/- 3.01
    hash::sip::bench_str_of_8_bytes         11.20/iter +/- 0.25
    hash::sip::bench_str_over_8_bytes       11.20/iter +/- 0.26
    hash::sip::bench_str_under_8_bytes       9.89/iter +/- 0.59
    hash::sip::bench_u32                     9.57/iter +/- 0.44
    hash::sip::bench_u32_keyed               6.97/iter +/- 0.10
    hash::sip::bench_u64                     8.63/iter +/- 0.07
```

After:
```
benchmarks:
    hash::sip::bench_bytes_4                 6.64/iter +/- 0.14
    hash::sip::bench_bytes_7                 8.19/iter +/- 0.07
    hash::sip::bench_bytes_8                 8.59/iter +/- 0.68
    hash::sip::bench_bytes_a_16              9.73/iter +/- 0.49
    hash::sip::bench_bytes_b_32             12.70/iter +/- 0.06
    hash::sip::bench_bytes_c_128            32.38/iter +/- 0.20
    hash::sip::bench_long_str              102.99/iter +/- 0.82
    hash::sip::bench_str_of_8_bytes         10.71/iter +/- 0.21
    hash::sip::bench_str_over_8_bytes       11.73/iter +/- 0.17
    hash::sip::bench_str_under_8_bytes      10.33/iter +/- 0.41
    hash::sip::bench_u32                    10.41/iter +/- 0.29
    hash::sip::bench_u32_keyed               9.50/iter +/- 0.30
    hash::sip::bench_u64                     8.44/iter +/- 1.09
```

I ran this on my computer so there's some noise, but you can tell at least `bench_long_str` is significantly faster (~18%).

Also, I noticed the same compress function from the library is used in the compiler as well, so I took the liberty of copy-pasting this change to there as well.

Thanks `@semisol` for porting SipHash for another project which led me to notice this issue in Rust, and for helping investigate. <3
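For reference, the ~18% figure for `bench_long_str` appears to correspond to the ratio of the before and after times (a throughput gain), while the same numbers read as a reduction in time per iteration come out closer to 15%. The sketch below only redoes that arithmetic on two rows copied from the benchmark output; the helper names are hypothetical and not part of the PR.

```rust
/// Percentage reduction in time per iteration (hypothetical helper).
fn time_reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Percentage increase in iterations per unit time (hypothetical helper).
fn throughput_gain_pct(before: f64, after: f64) -> f64 {
    (before / after - 1.0) * 100.0
}

fn main() {
    // (name, before, after), copied from the benchmark output in this message.
    let rows = [
        ("bench_long_str", 121.18, 102.99),
        ("bench_bytes_c_128", 37.75, 32.38),
    ];
    for (name, before, after) in rows {
        println!(
            "{}: {:.1}% less time per iteration, {:.1}% more throughput",
            name,
            time_reduction_pct(before, after),
            throughput_gain_pct(before, after)
        );
    }
}
```

Both views describe the same measurement; which one to quote is a matter of convention.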
2024-11-21 22:34:05 +00:00 · 2024-07-04 04:03:45 +00:00 · 2024-07-04 04:03:45 +00:00 · f6fa358a18
commit f6fa358a18
parent 66b4f0021b 16fc41cedc
2 changed files with 12 additions and 10 deletions
--- a/compiler/rustc_data_structures/src/sip128.rs
+++ b/compiler/rustc_data_structures/src/sip128.rs
@@ -70,18 +70,19 @@ macro_rules! compress {
    ($state:expr) => {{ compress!($state.v0, $state.v1, $state.v2, $state.v3) }};
    ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
        $v0 = $v0.wrapping_add($v1);
+        $v2 = $v2.wrapping_add($v3);
        $v1 = $v1.rotate_left(13);
        $v1 ^= $v0;
-        $v0 = $v0.rotate_left(32);
-        $v2 = $v2.wrapping_add($v3);
        $v3 = $v3.rotate_left(16);
        $v3 ^= $v2;
-        $v0 = $v0.wrapping_add($v3);
-        $v3 = $v3.rotate_left(21);
-        $v3 ^= $v0;
+        $v0 = $v0.rotate_left(32);
+
        $v2 = $v2.wrapping_add($v1);
+        $v0 = $v0.wrapping_add($v3);
        $v1 = $v1.rotate_left(17);
        $v1 ^= $v2;
+        $v3 = $v3.rotate_left(21);
+        $v3 ^= $v0;
        $v2 = $v2.rotate_left(32);
    }};
 }
--- a/library/core/src/hash/sip.rs
+++ b/library/core/src/hash/sip.rs
@@ -76,18 +76,19 @@ macro_rules! compress {
    ($state:expr) => {{ compress!($state.v0, $state.v1, $state.v2, $state.v3) }};
    ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
        $v0 = $v0.wrapping_add($v1);
+        $v2 = $v2.wrapping_add($v3);
        $v1 = $v1.rotate_left(13);
        $v1 ^= $v0;
-        $v0 = $v0.rotate_left(32);
-        $v2 = $v2.wrapping_add($v3);
        $v3 = $v3.rotate_left(16);
        $v3 ^= $v2;
-        $v0 = $v0.wrapping_add($v3);
-        $v3 = $v3.rotate_left(21);
-        $v3 ^= $v0;
+        $v0 = $v0.rotate_left(32);
+
        $v2 = $v2.wrapping_add($v1);
+        $v0 = $v0.wrapping_add($v3);
        $v1 = $v1.rotate_left(17);
        $v1 ^= $v2;
+        $v3 = $v3.rotate_left(21);
+        $v3 ^= $v0;
        $v2 = $v2.rotate_left(32);
    }};
 }
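
The claim that the reordering does not change hashing behavior can be spot-checked outside the tree. The following is a minimal standalone sketch, not part of the PR: the two macros copy the statement order from the hunks above (context plus `-` lines for the old order, context plus `+` lines for the new one), while the names `compress_old`, `compress_new`, `round_old`, `round_new`, `xorshift64`, and `orderings_agree` are hypothetical, as is the xorshift generator used only to produce varied input states.

```rust
// Statement order before the change (context and `-` lines of the diff).
macro_rules! compress_old {
    ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
        $v0 = $v0.wrapping_add($v1);
        $v1 = $v1.rotate_left(13);
        $v1 ^= $v0;
        $v0 = $v0.rotate_left(32);
        $v2 = $v2.wrapping_add($v3);
        $v3 = $v3.rotate_left(16);
        $v3 ^= $v2;
        $v0 = $v0.wrapping_add($v3);
        $v3 = $v3.rotate_left(21);
        $v3 ^= $v0;
        $v2 = $v2.wrapping_add($v1);
        $v1 = $v1.rotate_left(17);
        $v1 ^= $v2;
        $v2 = $v2.rotate_left(32);
    }};
}

// Statement order after the change (context and `+` lines of the diff).
// The v0/v1 and v2/v3 pairs now interleave, so adjacent statements are independent.
macro_rules! compress_new {
    ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
        $v0 = $v0.wrapping_add($v1);
        $v2 = $v2.wrapping_add($v3);
        $v1 = $v1.rotate_left(13);
        $v1 ^= $v0;
        $v3 = $v3.rotate_left(16);
        $v3 ^= $v2;
        $v0 = $v0.rotate_left(32);

        $v2 = $v2.wrapping_add($v1);
        $v0 = $v0.wrapping_add($v3);
        $v1 = $v1.rotate_left(17);
        $v1 ^= $v2;
        $v3 = $v3.rotate_left(21);
        $v3 ^= $v0;
        $v2 = $v2.rotate_left(32);
    }};
}

/// Applies one round in the old order to a copy of the state.
fn round_old(mut s: [u64; 4]) -> [u64; 4] {
    compress_old!(s[0], s[1], s[2], s[3]);
    s
}

/// Applies one round in the new order to a copy of the state.
fn round_new(mut s: [u64; 4]) -> [u64; 4] {
    compress_new!(s[0], s[1], s[2], s[3]);
    s
}

/// Hypothetical xorshift64 step, used only to generate varied test states.
fn xorshift64(x: &mut u64) -> u64 {
    *x ^= *x << 13;
    *x ^= *x >> 7;
    *x ^= *x << 17;
    *x
}

/// Returns true if both orderings agree on `n` pseudo-random states.
fn orderings_agree(n: usize) -> bool {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    (0..n).all(|_| {
        let s = [
            xorshift64(&mut seed),
            xorshift64(&mut seed),
            xorshift64(&mut seed),
            xorshift64(&mut seed),
        ];
        round_old(s) == round_new(s)
    })
}

fn main() {
    assert!(orderings_agree(10_000));
    println!("old and new statement orders agree on all sampled states");
}
```

A random spot check like this is not a proof, but it is consistent with reading the diff by hand: each moved statement only crosses statements that neither read nor write the variable it assigns.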