Break out bucket 0 (containing `idx < 4096`) as an early return, which
simplifies the remainder of the function and allows optimizing the
`checked_ilog2` call, since it can no longer return `None`.
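
Roughly, the refactored mapping looks like the sketch below. The `SlotIndex`
field names and the exact bucket math here are illustrative assumptions
rather than the shipped code; the point is that once indices below 4096 take
the early return, every remaining index is nonzero, so a plain `ilog2`
suffices.

```rust
/// Illustrative sketch: map a u32 index to its bucket, that bucket's
/// capacity, and the offset within the bucket. Bucket 0 holds indices
/// 0..4096; each later bucket covers one power-of-two range of indices.
struct SlotIndex {
    bucket_idx: usize,
    entries: usize,
    index_in_bucket: usize,
}

impl SlotIndex {
    #[inline]
    fn from_index(idx: u32) -> Self {
        // Early return for bucket 0: every index below 4096 lands here.
        if idx < 4096 {
            return SlotIndex { bucket_idx: 0, entries: 1 << 12, index_in_bucket: idx as usize };
        }
        // idx >= 4096 implies idx != 0, so `ilog2` is well-defined and the
        // `checked_ilog2` / `None` branch disappears.
        let log = idx.ilog2() as usize;
        SlotIndex {
            bucket_idx: log - 11,
            entries: 1 << log,
            index_in_bucket: idx as usize - (1 << log),
        }
    }
}
```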
This reduces the runtime of `vec_cache::tests::slot_index_exhaustive`
(which calls `SlotIndex::from_index` for every `u32`, twice) from ~15.5s
to ~13.3s.

When the word "cache" appears in the context of the query system, it often
isn't obvious whether it refers to the in-memory query cache or the on-disk
incremental cache. For these types, we can assure the reader that they are
for in-memory caching.

This replaces the single Vec allocation with a series of progressively
larger buckets. With the cfg for parallel enabled but with -Zthreads=1,
this looks like a slight regression in i-count and cycle counts (<0.1%).
With the parallel frontend at -Zthreads=4, this is an improvement over our
current Lock-based approach (-5% wall-time on libcore, from 5.788 to 5.4688
seconds), likely due to reducing the bouncing of the cache line holding the
lock. At -Zthreads=32 it's a huge improvement (-46%: 8.829 -> 4.7319
seconds).
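
For context, here is a minimal sketch of the bucketed layout, assuming the
same index-to-bucket mapping sketched above and using `OnceLock` slots in
place of whatever per-slot representation the real cache uses; the
`BucketedCache` type and its methods are hypothetical names for
illustration. Each bucket sits behind an `AtomicPtr`, is allocated lazily,
and is never reallocated or moved, so lookups and inserts only touch atomics
instead of contending on a single lock.

```rust
use std::ptr;
use std::sync::OnceLock;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical sketch of a bucketed, append-only cache keyed by u32.
/// Bucket 0 holds 4096 slots and each later bucket doubles in size, so 21
/// buckets cover every u32 index. (Drop is omitted for brevity, so this
/// sketch leaks its buckets.)
pub struct BucketedCache<V> {
    buckets: [AtomicPtr<OnceLock<V>>; 21],
}

/// Same index -> (bucket, capacity, offset-in-bucket) mapping as above.
fn slot_index(idx: u32) -> (usize, usize, usize) {
    if idx < 4096 {
        return (0, 1 << 12, idx as usize);
    }
    let log = idx.ilog2() as usize; // idx >= 4096, so this cannot fail
    (log - 11, 1 << log, idx as usize - (1 << log))
}

impl<V> BucketedCache<V> {
    pub fn new() -> Self {
        BucketedCache { buckets: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())) }
    }

    /// Load the bucket's base pointer, lazily allocating it on first use.
    fn bucket_or_alloc(&self, bucket: usize, capacity: usize) -> *mut OnceLock<V> {
        let slot = &self.buckets[bucket];
        let loaded = slot.load(Ordering::Acquire);
        if !loaded.is_null() {
            return loaded;
        }
        // Allocate empty slots; if another thread wins the race, free ours.
        let fresh: Box<[OnceLock<V>]> = (0..capacity).map(|_| OnceLock::new()).collect();
        let fresh = Box::into_raw(fresh) as *mut OnceLock<V>;
        match slot.compare_exchange(ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => fresh,
            Err(existing) => {
                unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(fresh, capacity))) };
                existing
            }
        }
    }

    pub fn insert(&self, idx: u32, value: V) {
        let (bucket, capacity, offset) = slot_index(idx);
        let base = self.bucket_or_alloc(bucket, capacity);
        // Publish the value; a racing insert for the same index simply loses.
        let _ = unsafe { (*base.add(offset)).set(value) };
    }

    pub fn get(&self, idx: u32) -> Option<&V> {
        let (bucket, _capacity, offset) = slot_index(idx);
        let base = self.buckets[bucket].load(Ordering::Acquire);
        if base.is_null() {
            return None; // bucket never allocated, so nothing is cached there
        }
        unsafe { (*base.add(offset)).get() }
    }
}
```

Because a bucket never moves once published, a reader that has loaded a
bucket pointer can index into it freely; any remaining contention is
per-slot, which is what lets higher -Zthreads counts scale instead of
bouncing a single lock's cache line.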