rust/compiler/rustc_data_structures/src
bors 8c4fc9d9a4 Auto merge of #94598 - scottmcm:prefix-free-hasher-methods, r=Amanieu
Add a dedicated length-prefixing method to `Hasher`

This accomplishes two main goals:
- Make it clear who is responsible for prefix-freedom, including how they should do it
- Make it feasible for a `Hasher` that *doesn't* care about Hash-DoS resistance to get better performance by not hashing lengths

This does not change rustc-hash, since that lives in an external crate, but it could potentially use the new method in the future.

Fixes #94026

r? rust-lang/libs

---

The core of this change is the following two new methods on `Hasher`:

```rust
pub trait Hasher {
    /// Writes a length prefix into this hasher, as part of being prefix-free.
    ///
    /// If you're implementing [`Hash`] for a custom collection, call this before
    /// writing its contents to this `Hasher`.  That way
    /// `(collection![1, 2, 3], collection![4, 5])` and
    /// `(collection![1, 2], collection![3, 4, 5])` will provide different
    /// sequences of values to the `Hasher`.
    ///
    /// The `impl<T> Hash for [T]` includes a call to this method, so if you're
    /// hashing a slice (or array or vector) via its `Hash::hash` method,
    /// you should **not** call this yourself.
    ///
    /// This method is only for providing domain separation.  If you want to
    /// hash a `usize` that represents part of the *data*, then it's important
    /// that you pass it to [`Hasher::write_usize`] instead of to this method.
    ///
    /// # Examples
    ///
    /// ```
    /// #![feature(hasher_prefixfree_extras)]
    /// # // Stubs to make the `impl` below pass the compiler
    /// # struct MyCollection<T>(Option<T>);
    /// # impl<T> MyCollection<T> {
    /// #     fn len(&self) -> usize { todo!() }
    /// # }
    /// # impl<'a, T> IntoIterator for &'a MyCollection<T> {
    /// #     type Item = T;
    /// #     type IntoIter = std::iter::Empty<T>;
    /// #     fn into_iter(self) -> Self::IntoIter { todo!() }
    /// # }
    ///
    /// use std::hash::{Hash, Hasher};
    /// impl<T: Hash> Hash for MyCollection<T> {
    ///     fn hash<H: Hasher>(&self, state: &mut H) {
    ///         state.write_length_prefix(self.len());
    ///         for elt in self {
    ///             elt.hash(state);
    ///         }
    ///     }
    /// }
    /// ```
    ///
    /// # Note to Implementers
    ///
    /// If you've decided that your `Hasher` is willing to be susceptible to
    /// Hash-DoS attacks, then you might consider skipping hashing some or all
    /// of the `len` provided in the name of increased performance.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_length_prefix(&mut self, len: usize) {
        self.write_usize(len);
    }

    /// Writes a single `str` into this hasher.
    ///
    /// If you're implementing [`Hash`], you generally do not need to call this,
    /// as the `impl Hash for str` does, so you can just use that.
    ///
    /// This includes the domain separator for prefix-freedom, so you should
    /// **not** call `Self::write_length_prefix` before calling this.
    ///
    /// # Note to Implementers
    ///
    /// The default implementation of this method includes a call to
    /// [`Self::write_length_prefix`], so if your implementation of `Hasher`
    /// doesn't care about prefix-freedom and you've thus overridden
    /// that method to do nothing, there's no need to override this one.
    ///
    /// This method is available to be overridden separately from the others
    /// as `str` being UTF-8 means that it never contains `0xFF` bytes, which
    /// can be used to provide prefix-freedom cheaper than hashing a length.
    ///
    /// For example, if your `Hasher` works byte-by-byte (perhaps by accumulating
    /// them into a buffer), then you can hash the bytes of the `str` followed
    /// by a single `0xFF` byte.
    ///
    /// If your `Hasher` works in chunks, you can also do this by being careful
    /// about how you pad partial chunks.  If the chunks are padded with `0x00`
    /// bytes then just hashing an extra `0xFF` byte doesn't necessarily
    /// provide prefix-freedom, as `"ab"` and `"ab\u{0}"` would likely hash
    /// the same sequence of chunks.  But if you pad with `0xFF` bytes instead,
    /// ensuring at least one padding byte, then it can often provide
    /// prefix-freedom cheaper than hashing the length would.
    #[inline]
    #[unstable(feature = "hasher_prefixfree_extras", issue = "88888888")]
    fn write_str(&mut self, s: &str) {
        self.write_length_prefix(s.len());
        self.write(s.as_bytes());
    }
}
```

The `Hash` implementations for slices and containers are updated to call `write_length_prefix` instead of `write_usize`.
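To see why the length prefix matters, here is a minimal sketch that runs on stable Rust. `RecordingHasher`, `recorded`, and `elements_only` are hypothetical helpers made up for this illustration and are not part of the PR. The sketch only uses stable APIs and relies on the default `write_length_prefix` forwarding to `write_usize`.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical hasher that records every byte it is fed instead of mixing it.
#[derive(Default)]
struct RecordingHasher {
    bytes: Vec<u8>,
}

impl Hasher for RecordingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0 // irrelevant here; only the recorded byte stream is inspected
    }
}

/// Byte stream produced by the value's real `Hash` impl (length prefixes included).
fn recorded<T: Hash>(value: &T) -> Vec<u8> {
    let mut hasher = RecordingHasher::default();
    value.hash(&mut hasher);
    hasher.bytes
}

/// Byte stream produced when only the elements are written, with no prefix.
fn elements_only(parts: &[&[u8]]) -> Vec<u8> {
    let mut hasher = RecordingHasher::default();
    for part in parts {
        for byte in *part {
            hasher.write_u8(*byte);
        }
    }
    hasher.bytes
}
```

Passing `(vec![1u8, 2, 3], vec![4u8, 5])` and `(vec![1u8, 2], vec![3u8, 4, 5])` to `recorded` yields two different byte streams, whereas `elements_only` on the same two groupings yields identical streams. The prefix exists to rule out exactly that ambiguity.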

`write_str` defaults to using `write_length_prefix` because, as was pointed out in the issue, the `write_u8(0xFF)` approach is insufficient for hashers that work in chunks: those would hash `"a\u{0}"` and `"a"` to the same thing.  `SipHash`, however, works byte-wise (an internal buffer accumulates bytes until a full chunk is available), so it overrides `write_str` to keep using the add-non-UTF-8-byte approach.
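The chunking problem can be sketched without any real hasher. The functions below are hypothetical and only model the sequence of 8-byte words a chunk-at-a-time hasher would consume. They assume each `write` call is chunked on its own and that a partial final chunk is zero-padded. That is an assumption made for this illustration, not a description of any particular crate.

```rust
/// Splits one `write` call into little-endian 8-byte words, zero-padding the
/// final partial chunk (assumed chunking scheme, for illustration only).
fn zero_padded_chunks(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect()
}

/// Models `write(s.as_bytes())` followed by `write_u8(0xFF)`.
fn str_then_ff(s: &str) -> Vec<u64> {
    let mut words = zero_padded_chunks(s.as_bytes());
    words.extend(zero_padded_chunks(&[0xFF]));
    words
}

/// Models the default `write_str`: a length prefix, then the bytes.
fn len_then_str(s: &str) -> Vec<u64> {
    let mut words = vec![s.len() as u64];
    words.extend(zero_padded_chunks(s.as_bytes()));
    words
}

/// Models padding with `0xFF` instead, always adding at least one padding byte.
fn ff_padded(s: &str) -> Vec<u64> {
    let mut bytes = s.as_bytes().to_vec();
    bytes.push(0xFF);
    while bytes.len() % 8 != 0 {
        bytes.push(0xFF);
    }
    // Every chunk is full at this point, so no zero padding is introduced.
    zero_padded_chunks(&bytes)
}
```

Under these assumptions `str_then_ff("a")` and `str_then_ff("a\u{0}")` produce the same words, because the trailing NUL is indistinguishable from padding. `len_then_str` and `ff_padded` both tell the two strings apart.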

---

Compatibility:

Because the default implementation of `write_length_prefix` calls `write_usize`, the changed hash implementation for slices will do the same thing the old one did on existing `Hasher`s.
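
The compatibility claim can be checked from the outside with a hasher that overrides only stable methods. In this rough sketch, `EventLog`, `Event`, and `slice_events` are hypothetical names introduced for the illustration. It assumes nothing beyond the default `write_length_prefix` forwarding to `write_usize`, as described above.

```rust
use std::hash::{Hash, Hasher};

/// What the hypothetical hasher below saw, in call order.
#[derive(Debug, PartialEq)]
enum Event {
    Usize(usize),
    Bytes(Vec<u8>),
}

/// Hypothetical pre-existing `Hasher` that knows nothing about the new methods.
#[derive(Default)]
struct EventLog {
    events: Vec<Event>,
}

impl Hasher for EventLog {
    fn write(&mut self, bytes: &[u8]) {
        self.events.push(Event::Bytes(bytes.to_vec()));
    }

    fn write_usize(&mut self, n: usize) {
        self.events.push(Event::Usize(n));
    }

    fn finish(&self) -> u64 {
        0 // unused; only the call log matters
    }
}

/// Hashes a byte slice through its standard `Hash` impl and returns the call log.
fn slice_events(data: &[u8]) -> Vec<Event> {
    let mut log = EventLog::default();
    data.hash(&mut log);
    log.events
}
```

For a three-element slice, the first logged event is `Event::Usize(3)`: the length still reaches an existing hasher through `write_usize`, as it did before the change.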
2022-05-06 09:43:57 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| base_n | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| binary_search_util | Adopt let else in more places | 2022-02-19 17:27:43 +01:00 |
| fingerprint | Make Fingerprint::combine_commutative associative | 2022-01-03 19:07:29 +01:00 |
| flock | separate flock implementations into separate modules | 2022-04-14 18:30:53 -04:00 |
| graph | Avoid exhausting stack space in dominator compression | 2022-02-23 16:07:56 -05:00 |
| intern | Rename PtrKey as Interned and improve it. | 2022-02-15 15:50:29 +11:00 |
| obligation_forest | obligation forest docs | 2022-02-21 12:00:26 +01:00 |
| owning_ref | Also fix “a OwningRef | 2021-08-24 02:28:38 +02:00 |
| sip128 | SipHasher128: improve constant names and add more comments | 2020-10-11 23:48:35 -07:00 |
| small_c_str | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| small_str | Add SmallStr | 2022-03-04 16:57:34 +01:00 |
| snapshot_map | Call the method fork instead of clone and add proper comments | 2022-02-14 12:57:20 -03:00 |
| sorted_map | Remove invalid #[cfg(tests)] in index_map | 2022-03-04 11:34:50 +01:00 |
| sso | compiler: fix some typos | 2022-03-01 20:02:47 +08:00 |
| stable_hasher | Fix isize optimization in StableHasher for big-endian architectures | 2022-02-03 11:47:41 +01:00 |
| tagged_ptr | Small performance tweaks | 2021-12-12 12:35:01 +08:00 |
| thin_vec | eplace usages of vec![].into_iter with [].into_iter | 2022-01-09 14:09:25 +11:00 |
| tiny_list | Move some test-only code to test files | 2021-03-17 10:31:30 -04:00 |
| transitive_relation | Spellchecking some comments | 2022-03-30 01:39:38 -04:00 |
| vec_map | eplace usages of vec![].into_iter with [].into_iter | 2022-01-09 14:09:25 +11:00 |
| atomic_ref.rs | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| base_n.rs | Apply clippy suggestions | 2021-10-10 15:38:19 +02:00 |
| captures.rs | Remove #[allow(unused_lifetimes)] which is now unnecessary | 2021-06-17 08:56:54 +09:00 |
| fingerprint.rs | Provide copy-free access to raw Decoder bytes | 2022-02-22 18:11:59 -05:00 |
| flock.rs | separate flock implementations into separate modules | 2022-04-14 18:30:53 -04:00 |
| frozen.rs | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| functor.rs | Make IdFunctor::try_map_id panic-safe | 2021-12-07 11:11:23 +00:00 |
| fx.rs | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| intern.rs | Document and rename the new wrapper type | 2022-04-07 13:01:48 +00:00 |
| jobserver.rs | datastructures: replace lazy_static by SyncLazy from std | 2020-09-01 22:06:47 +01:00 |
| lib.rs | Auto merge of #94598 - scottmcm:prefix-free-hasher-methods, r=Amanieu | 2022-05-06 09:43:57 +00:00 |
| macros.rs | Introduce ChunkedBitSet and use it for some dataflow analyses. | 2022-02-23 10:18:49 +11:00 |
| map_in_place.rs | Add debug assertions to some unsafe functions | 2022-03-29 11:05:24 -04:00 |
| memmap.rs | Add safety comment to StableAddress impl for Mmap | 2021-04-03 14:51:05 +02:00 |
| profiling.rs | add generic_activity_with_arg_recorder to the self-profiler | 2022-04-07 15:47:20 +02:00 |
| sharded.rs | Move Sharded maps into each QueryCache impl | 2022-02-20 12:10:46 -05:00 |
| sip128.rs | Add a dedicated length-prefixing method to Hasher | 2022-05-06 00:03:38 -07:00 |
| small_c_str.rs | Inline SmallCStr::deref | 2022-03-04 16:57:34 +01:00 |
| small_str.rs | Add SmallStr | 2022-03-04 16:57:34 +01:00 |
| sorted_map.rs | Use SortedMap in HIR. | 2021-10-21 23:08:57 +02:00 |
| stable_hasher.rs | Add a dedicated length-prefixing method to Hasher | 2022-05-06 00:03:38 -07:00 |
| stable_map.rs | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| stable_set.rs | mv compiler to compiler/ | 2020-08-30 18:45:07 +03:00 |
| stack.rs | Allow inlining of ensure_sufficient_stack() | 2022-02-12 11:30:04 +01:00 |
| steal.rs | more clippy fixes | 2021-11-07 16:59:05 +01:00 |
| svh.rs | Make Decodable and Decoder infallible. | 2022-01-22 10:38:31 +11:00 |
| sync.rs | Fix typos “a”→“an” | 2021-08-22 15:35:11 +02:00 |
| tagged_ptr.rs | Miscellaneous inlining improvements | 2021-06-02 08:49:58 +02:00 |
| temp_dir.rs | Capitalize safety comments | 2020-09-08 22:37:18 -04:00 |
| thin_vec.rs | Rustdoc: use ThinVec for GenericArgs bindings | 2022-01-01 11:29:14 +01:00 |
| tiny_list.rs | Apply clippy suggestions | 2021-10-10 15:38:19 +02:00 |
| transitive_relation.rs | add #[rustc_pass_by_value] to more types | 2022-03-08 15:39:52 +01:00 |
| unhash.rs | Avoid rehashing Fingerprint as a map key | 2020-09-01 18:27:02 -07:00 |
| vec_linked_list.rs | Stop enabling in_band_lifetimes in rustc_data_structures | 2021-12-05 20:17:35 -08:00 |
| vec_map.rs | Fix some fallout around type alias impl trait in associated types | 2022-04-06 12:56:22 +00:00 |
| work_queue.rs | Remove (lots of) dead code | 2021-03-27 22:16:33 -04:00 |