//! Edit distances.
//!
//! The [edit distance] is a metric for measuring the difference between two strings.
//!
//! [edit distance]: https://en.wikipedia.org/wiki/Edit_distance

// The current implementation is the restricted Damerau-Levenshtein algorithm. It is restricted
// because it does not permit modifying characters that have already been transposed. The specific
// algorithm should not matter to the caller of the methods, which is why it is not noted in the
// documentation.

Move lev_distance to rustc_ast, make non-generic
rustc_ast currently has a few dependencies on rustc_lexer. Ideally, an AST
would not have any dependency on its lexer, to minimize unnecessary
design-time dependencies. Breaking this dependency would also have practical
benefits, since modifying rustc_lexer would not trigger a rebuild of rustc_ast.
This commit does not remove the rustc_ast --> rustc_lexer dependency,
but it does remove one of the sources of this dependency, which is the
code that handles fuzzy matching between symbol names for making suggestions
in diagnostics. Since that code depends only on Symbol, it is easy to move
it to rustc_span. It might even be best to move it to a separate crate,
since other tools such as Cargo use the same algorithm and simply
contain a duplicate of the code.
This changes the signature of find_best_match_for_name so that it is no
longer generic over its input. I checked the optimized binaries, and this
function was duplicated at nearly every call site, because most call sites
used short-lived iterator chains, generic over Map and such. But there's
no good reason for a function like this to be generic, since all it does
is immediately convert the generic input (the Iterator impl) to a concrete
Vec<Symbol>. This has all of the costs of generics (duplicated method bodies)
with no benefit.
Changing find_best_match_for_name to be non-generic removed about 10KB of
code from the optimized binary. I know it's a drop in the bucket, but we have
to start reducing binary size, and beginning to tame overuse of generics
is part of that.
2020-11-12 19:24:10 +00:00
use crate::symbol::Symbol;
use std::{cmp, mem};

std: Stabilize the std::str module
This commit starts out by consolidating all `str` extension traits into one
`StrExt` trait to be included in the prelude. This means that
`UnicodeStrPrelude`, `StrPrelude`, and `StrAllocating` have all been merged into
one `StrExt` exported by the standard library. Some functionality is currently
duplicated with the `StrExt` present in libcore.
This commit also currently avoids any methods which require any form of pattern
to operate. These functions will be stabilized via a separate RFC.
Next, stability of methods and structures are as follows:
Stable
* from_utf8_unchecked
* CowString - after moving to std::string
* StrExt::as_bytes
* StrExt::as_ptr
* StrExt::bytes/Bytes - also made a struct instead of a typedef
* StrExt::char_indices/CharIndices - CharOffsets was renamed
* StrExt::chars/Chars
* StrExt::is_empty
* StrExt::len
* StrExt::lines/Lines
* StrExt::lines_any/LinesAny
* StrExt::slice_unchecked
* StrExt::trim
* StrExt::trim_left
* StrExt::trim_right
* StrExt::words/Words - also made a struct instead of a typedef
Unstable
* from_utf8 - the error type was changed to a `Result`, but the error type has
yet to prove itself
* from_c_str - this function will be handled by the c_str RFC
* FromStr - this trait will have an associated error type eventually
* StrExt::escape_default - needs iterators at least, unsure if it should make
the cut
* StrExt::escape_unicode - needs iterators at least, unsure if it should make
the cut
* StrExt::slice_chars - this function has yet to prove itself
* StrExt::slice_shift_char - awaiting conventions about slicing and shifting
* StrExt::graphemes/Graphemes - this functionality may only be in libunicode
* StrExt::grapheme_indices/GraphemeIndices - this functionality may only be in
libunicode
* StrExt::width - this functionality may only be in libunicode
* StrExt::utf16_units - this functionality may only be in libunicode
* StrExt::nfd_chars - this functionality may only be in libunicode
* StrExt::nfkd_chars - this functionality may only be in libunicode
* StrExt::nfc_chars - this functionality may only be in libunicode
* StrExt::nfkc_chars - this functionality may only be in libunicode
* StrExt::is_char_boundary - naming is uncertain with container conventions
* StrExt::char_range_at - naming is uncertain with container conventions
* StrExt::char_range_at_reverse - naming is uncertain with container conventions
* StrExt::char_at - naming is uncertain with container conventions
* StrExt::char_at_reverse - naming is uncertain with container conventions
* StrVector::concat - this functionality may be replaced with iterators, but
it's not certain at this time
* StrVector::connect - as with concat, may be deprecated in favor of iterators
Deprecated
* StrAllocating and UnicodeStrPrelude have been merged into StrExt
* eq_slice - compiler implementation detail
* from_str - use the inherent parse() method
* is_utf8 - call from_utf8 instead
* replace - call the method instead
* truncate_utf16_at_nul - this is an implementation detail of windows and does
not need to be exposed.
* utf8_char_width - moved to libunicode
* utf16_items - moved to libunicode
* is_utf16 - moved to libunicode
* Utf16Items - moved to libunicode
* Utf16Item - moved to libunicode
* Utf16Encoder - moved to libunicode
* AnyLines - renamed to LinesAny and made a struct
* SendStr - use CowString<'static> instead
* str::raw - all functionality is deprecated
* StrExt::into_string - call to_string() instead
* StrExt::repeat - use iterators instead
* StrExt::char_len - use .chars().count() instead
* StrExt::is_alphanumeric - use .chars().all(..)
* StrExt::is_whitespace - use .chars().all(..)
Pending deprecation -- while slicing syntax is being worked out, these methods
are all #[unstable]
* Str - while currently used for generic programming, this trait will be
replaced with one of [], deref coercions, or a generic conversion trait.
* StrExt::slice - use slicing syntax instead
* StrExt::slice_to - use slicing syntax instead
* StrExt::slice_from - use slicing syntax instead
* StrExt::lev_distance - deprecated with no replacement
Awaiting stabilization due to patterns and/or matching
* StrExt::contains
* StrExt::contains_char
* StrExt::split
* StrExt::splitn
* StrExt::split_terminator
* StrExt::rsplitn
* StrExt::match_indices
* StrExt::split_str
* StrExt::starts_with
* StrExt::ends_with
* StrExt::trim_chars
* StrExt::trim_left_chars
* StrExt::trim_right_chars
* StrExt::find
* StrExt::rfind
* StrExt::find_str
* StrExt::subslice_offset
2014-12-10 17:02:31 +00:00

#[cfg(test)]
mod tests;

/// Finds the [edit distance] between two strings.
///
/// Returns `None` if the distance exceeds the limit.
///
/// [edit distance]: https://en.wikipedia.org/wiki/Edit_distance
pub fn edit_distance(a: &str, b: &str, limit: usize) -> Option<usize> {
    let mut a = &a.chars().collect::<Vec<_>>()[..];
    let mut b = &b.chars().collect::<Vec<_>>()[..];

    // Ensure that `b` is the shorter string, minimizing memory use.
    if a.len() < b.len() {
        mem::swap(&mut a, &mut b);
    }

    let min_dist = a.len() - b.len();
    // If we know the limit will be exceeded, we can return early.
    if min_dist > limit {
        return None;
    }

    // Strip common prefix.
    while let Some(((b_char, b_rest), (a_char, a_rest))) = b.split_first().zip(a.split_first())
        && a_char == b_char
    {
        a = a_rest;
        b = b_rest;
    }
    // Strip common suffix.
    while let Some(((b_char, b_rest), (a_char, a_rest))) = b.split_last().zip(a.split_last())
        && a_char == b_char
    {
        a = a_rest;
        b = b_rest;
    }

    // If either string is empty, the distance is the length of the other.
    // We know that `b` is the shorter string, so we don't need to check `a`.
    if b.len() == 0 {
        return Some(min_dist);
    }

    let mut prev_prev = vec![usize::MAX; b.len() + 1];
    let mut prev = (0..=b.len()).collect::<Vec<_>>();
    let mut current = vec![0; b.len() + 1];

    // row by row
    for i in 1..=a.len() {
        current[0] = i;
        let a_idx = i - 1;

        // column by column
        for j in 1..=b.len() {
            let b_idx = j - 1;

            // There is no cost to substitute a character with itself.
            let substitution_cost = if a[a_idx] == b[b_idx] { 0 } else { 1 };

            current[j] = cmp::min(
                // deletion
                prev[j] + 1,
                cmp::min(
                    // insertion
                    current[j - 1] + 1,
                    // substitution
                    prev[j - 1] + substitution_cost,
                ),
            );

            if (i > 1) && (j > 1) && (a[a_idx] == b[b_idx - 1]) && (a[a_idx - 1] == b[b_idx]) {
                // transposition
                current[j] = cmp::min(current[j], prev_prev[j - 2] + 1);
            }
        }

        // Rotate the buffers, reusing the memory.
        [prev_prev, prev, current] = [prev, current, prev_prev];
    }

    // `prev` because we already rotated the buffers.
    let distance = prev[b.len()];
    (distance <= limit).then_some(distance)
}
|
|
|
|
|
2022-03-14 21:07:19 +00:00
|
|
|
/// Provides a word similarity score between two words that accounts for substrings being more
|
2023-02-19 04:03:56 +00:00
|
|
|
/// meaningful than a typical edit distance. The lower the score, the closer the match. 0 is an
|
|
|
|
/// identical match.
|
2022-03-14 21:07:19 +00:00
|
|
|
///
|
2023-02-19 04:03:56 +00:00
|
|
|
/// Uses the edit distance between the two strings and removes the cost of the length difference.
|
|
|
|
/// If this is 0 then it is either a substring match or a full word match; in the substring match
|
|
|
|
/// case we detect this and return `1`. To prevent finding meaningless substrings, e.g. "in" in
|
|
|
|
/// "shrink", we only perform this subtraction of length difference if one of the words is not
|
|
|
|
/// greater than twice the length of the other. For cases where the words are close in size but not
|
|
|
|
/// an exact substring, the cost of the length difference is discounted by half.
|
2022-03-14 21:07:19 +00:00
|
|
|
///
|
|
|
|
/// Returns `None` if the distance exceeds the limit.
|
2023-02-19 04:03:56 +00:00
|
|
|
pub fn edit_distance_with_substrings(a: &str, b: &str, limit: usize) -> Option<usize> {
|
2022-03-14 21:07:19 +00:00
|
|
|
let n = a.chars().count();
|
|
|
|
let m = b.chars().count();
|
|
|
|
|
|
|
|
// Check whether one is less than half the length of the other. If this is true then there is a
|
|
|
|
// big difference in length.
|
|
|
|
let big_len_diff = (n * 2) < m || (m * 2) < n;
|
|
|
|
let len_diff = if n < m { m - n } else { n - m };
|
2023-02-19 04:03:56 +00:00
|
|
|
let distance = edit_distance(a, b, limit + len_diff)?;
|
2022-03-14 21:07:19 +00:00
|
|
|
|
|
|
|
// This is the crux: subtracting length difference means exact substring matches will now be 0
|
2023-02-19 04:03:56 +00:00
|
|
|
let score = distance - len_diff;
|
2022-03-14 21:07:19 +00:00
|
|
|
|
|
|
|
// If the score is 0 but the words have different lengths then it's a substring match not a full
|
|
|
|
// word match
|
|
|
|
let score = if score == 0 && len_diff > 0 && !big_len_diff {
|
|
|
|
1 // Exact substring match, but not a total word match so return non-zero
|
|
|
|
} else if !big_len_diff {
|
|
|
|
// Not a big difference in length, discount cost of length difference
|
|
|
|
score + (len_diff + 1) / 2
|
|
|
|
} else {
|
|
|
|
// A big difference in length, add back the difference in length to the score
|
|
|
|
score + len_diff
|
|
|
|
};
|
|
|
|
|
|
|
|
(score <= limit).then_some(score)
|
|
|
|
}
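The scoring rules above can be tried in isolation. What follows is a minimal sketch under stated assumptions, not the compiler's code: `levenshtein` and `substring_score` are hypothetical names, and a plain Levenshtein distance stands in for the restricted Damerau-Levenshtein `edit_distance`, so a transposition costs two edits here instead of one. The early cutoff at `limit + len_diff` is also omitted; only the final `score <= limit` check is kept, which accepts and rejects the same inputs.

```rust
/// Plain Levenshtein distance over `char`s. This is an assumption made for the
/// sketch; the real `edit_distance` also treats an adjacent transposition as one edit.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // `prev[j]` is the distance between the processed prefix of `a` and `b[..j]`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            cur.push(substitution.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Hypothetical stand-in for `edit_distance_with_substrings`.
fn substring_score(a: &str, b: &str, limit: usize) -> Option<usize> {
    let n = a.chars().count();
    let m = b.chars().count();
    // True when one word is less than half the length of the other.
    let big_len_diff = n * 2 < m || m * 2 < n;
    let len_diff = n.abs_diff(m);
    // The distance is never smaller than the length difference, so this cannot underflow.
    let score = levenshtein(a, b) - len_diff;
    let score = if score == 0 && len_diff > 0 && !big_len_diff {
        1 // exact substring, but not a full word match
    } else if !big_len_diff {
        score + (len_diff + 1) / 2 // charge half of the length difference, rounded up
    } else {
        score + len_diff // lengths too different: charge the full difference again
    };
    (score <= limit).then_some(score)
}
```

With these definitions, `substring_score("capture", "force_capture", 3)` yields `Some(1)` because one word is a substring of the other, while `substring_score("in", "shrink", 3)` yields `None` because the length difference is too large for the discount to apply.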
|
|
|
|
|
|
|
|
/// Finds the best match for a given word among the given candidates, where substrings are meaningful.
|
|
|
|
///
|
2023-02-19 04:03:56 +00:00
|
|
|
/// A version of [`find_best_match_for_name`] that uses [`edit_distance_with_substrings`] as the
|
|
|
|
/// score for word similarity. This takes an optional distance limit which defaults to one-third of
|
|
|
|
/// the given word.
|
2022-03-14 21:07:19 +00:00
|
|
|
///
|
2023-02-19 04:03:56 +00:00
|
|
|
/// We use case-insensitive comparison to improve accuracy in the edge case where the lookup and a
/// candidate differ only in letter case.
|
2022-03-14 21:07:19 +00:00
|
|
|
pub fn find_best_match_for_name_with_substrings(
|
|
|
|
candidates: &[Symbol],
|
|
|
|
lookup: Symbol,
|
|
|
|
dist: Option<usize>,
|
|
|
|
) -> Option<Symbol> {
|
|
|
|
find_best_match_for_name_impl(true, candidates, lookup, dist)
|
|
|
|
}
|
|
|
|
|
2020-12-24 07:01:03 +00:00
|
|
|
/// Finds the best match for a given word among the given candidates.
|
2018-09-11 06:31:47 +00:00
|
|
|
///
|
2015-12-14 17:06:31 +00:00
|
|
|
/// As a loose rule to avoid the obviously incorrect suggestions, it takes
|
|
|
|
/// an optional limit for the maximum allowable edit distance, which defaults
|
2017-11-30 21:39:47 +00:00
|
|
|
/// to one-third of the given word.
|
2018-09-11 06:31:47 +00:00
|
|
|
///
|
2023-02-19 04:03:56 +00:00
|
|
|
/// We use case-insensitive comparison to improve accuracy in the edge case where the lookup and a
/// candidate differ only in letter case.
|
2020-11-12 19:24:10 +00:00
|
|
|
pub fn find_best_match_for_name(
|
2022-01-20 00:00:00 +00:00
|
|
|
candidates: &[Symbol],
|
2020-07-08 10:03:37 +00:00
|
|
|
lookup: Symbol,
|
2016-11-16 10:52:37 +00:00
|
|
|
dist: Option<usize>,
|
2022-03-14 21:07:19 +00:00
|
|
|
) -> Option<Symbol> {
|
|
|
|
find_best_match_for_name_impl(false, candidates, lookup, dist)
|
|
|
|
}
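The default limit and the case-insensitive shortcut described above can be sketched on plain string slices. This is an illustration under assumptions, not the compiler's implementation: `best_match` and `levenshtein` are hypothetical names, `&str` replaces `Symbol`, plain Levenshtein replaces the restricted Damerau-Levenshtein distance, and the sorted-word fallback is left out.

```rust
/// Plain Levenshtein distance over `char`s (an assumption for this sketch).
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            cur.push(substitution.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Hypothetical stand-in for `find_best_match_for_name` on `&str` candidates.
fn best_match<'a>(candidates: &[&'a str], lookup: &str, dist: Option<usize>) -> Option<&'a str> {
    // 1. An exact case-insensitive match wins immediately.
    let lookup_upper = lookup.to_uppercase();
    if let Some(c) = candidates.iter().find(|c| c.to_uppercase() == lookup_upper) {
        return Some(*c);
    }

    // 2. Otherwise accept the closest candidate within the limit, which defaults
    //    to one third of the lookup length (counted in chars) and is at least 1.
    let lookup_len = lookup.chars().count();
    let mut limit = dist.unwrap_or_else(|| lookup_len.max(3) / 3);
    let mut best = None;
    for &c in candidates {
        let d = levenshtein(lookup, c);
        if d <= limit {
            if d == 0 {
                return Some(c);
            }
            // Later candidates must be strictly closer to replace this one.
            limit = d - 1;
            best = Some(c);
        }
    }
    best
}
```

Because the limit shrinks to `d - 1` after each hit, the first of several equally close candidates is the one returned: for the lookup `printn`, both `println` and `print` are one edit away, and the one listed first wins.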
|
|
|
|
|
2024-01-10 16:24:46 +00:00
|
|
|
/// Finds the best match for multiple words.
|
|
|
|
///
|
|
|
|
/// This function is intended for use when the desired match would never be
|
|
|
|
/// returned due to a substring in `lookup` which is superfluous.
|
|
|
|
///
|
|
|
|
/// For example, when looking for the closest lint name to `clippy::missing_docs`,
|
|
|
|
/// we would find `clippy::erasing_op`, despite `missing_docs` existing and being a better suggestion.
|
|
|
|
/// `missing_docs` would have a larger edit distance because it does not contain the `clippy` tool prefix.
|
|
|
|
/// In order to find `missing_docs`, this function takes multiple lookup strings, computes the best match
|
|
|
|
/// for each and returns the match which had the lowest edit distance. In our example, `clippy::missing_docs` and
|
|
|
|
/// `missing_docs` would be `lookups`, enabling `missing_docs` to be the best match, as desired.
|
|
|
|
pub fn find_best_match_for_names(
|
|
|
|
candidates: &[Symbol],
|
|
|
|
lookups: &[Symbol],
|
|
|
|
dist: Option<usize>,
|
|
|
|
) -> Option<Symbol> {
|
|
|
|
lookups
|
|
|
|
.iter()
|
|
|
|
.map(|s| (s, find_best_match_for_name_impl(false, candidates, *s, dist)))
|
|
|
|
.filter_map(|(s, r)| r.map(|r| (s, r)))
|
|
|
|
.min_by(|(s1, r1), (s2, r2)| {
|
|
|
|
let d1 = edit_distance(s1.as_str(), r1.as_str(), usize::MAX).unwrap();
|
|
|
|
let d2 = edit_distance(s2.as_str(), r2.as_str(), usize::MAX).unwrap();
|
|
|
|
d1.cmp(&d2)
|
|
|
|
})
|
|
|
|
.map(|(_, r)| r)
|
|
|
|
}
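The multi-lookup strategy can be illustrated with a small self-contained sketch. It rests on several assumptions: `nearest` and `best_match_for_names` are hypothetical names, `&str` replaces `Symbol`, plain Levenshtein replaces the restricted Damerau-Levenshtein distance, the limit is always the default of one third of the lookup length, and each match carries its distance along instead of recomputing it inside the comparison as the function above does.

```rust
/// Plain Levenshtein distance over `char`s (an assumption for this sketch).
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            cur.push(substitution.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b_chars.len()]
}

/// Closest candidate for one lookup, together with its distance, if any candidate
/// is within one third of the lookup length (at least 1).
fn nearest<'a>(candidates: &[&'a str], lookup: &str) -> Option<(usize, &'a str)> {
    let limit = lookup.chars().count().max(3) / 3;
    candidates
        .iter()
        .map(|&c| (levenshtein(lookup, c), c))
        .filter(|&(d, _)| d <= limit)
        .min_by_key(|&(d, _)| d)
}

/// Tries every lookup and keeps the match with the smallest distance overall.
fn best_match_for_names<'a>(candidates: &[&'a str], lookups: &[&str]) -> Option<&'a str> {
    lookups
        .iter()
        .filter_map(|&lookup| nearest(candidates, lookup))
        .min_by_key(|&(d, _)| d)
        .map(|(_, c)| c)
}
```

In this sketch the prefixed lookup `clippy::missing_doc` finds nothing within its limit, but the unprefixed lookup `missing_doc` is one edit away from `missing_docs`, so passing both lookups returns `missing_docs`.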
|
|
|
|
|
2022-03-14 21:07:19 +00:00
|
|
|
#[cold]
|
|
|
|
fn find_best_match_for_name_impl(
|
|
|
|
use_substring_score: bool,
|
|
|
|
candidates: &[Symbol],
|
2023-03-20 14:48:26 +00:00
|
|
|
lookup_symbol: Symbol,
|
2022-03-14 21:07:19 +00:00
|
|
|
dist: Option<usize>,
|
2020-11-12 19:24:10 +00:00
|
|
|
) -> Option<Symbol> {
|
2023-03-20 14:48:26 +00:00
|
|
|
let lookup = lookup_symbol.as_str();
|
2022-01-17 00:00:00 +00:00
|
|
|
let lookup_uppercase = lookup.to_uppercase();
|
2017-11-30 21:39:47 +00:00
|
|
|
|
2021-10-16 19:51:22 +00:00
|
|
|
// Priority of matches:
|
|
|
|
// 1. Exact case insensitive match
|
2023-02-19 04:03:56 +00:00
|
|
|
// 2. Edit distance match
|
2021-10-16 19:51:22 +00:00
|
|
|
// 3. Sorted word match
|
2022-01-20 00:00:00 +00:00
|
|
|
if let Some(c) = candidates.iter().find(|c| c.as_str().to_uppercase() == lookup_uppercase) {
|
|
|
|
return Some(*c);
|
2021-10-16 19:51:22 +00:00
|
|
|
}
|
2022-01-20 00:00:00 +00:00
|
|
|
|
2023-11-27 17:55:32 +00:00
|
|
|
// `fn edit_distance()` uses `chars()` to calculate edit distance, so we must
|
|
|
|
// also use `chars()` (and not `str::len()`) to calculate length here.
|
|
|
|
let lookup_len = lookup.chars().count();
|
|
|
|
|
|
|
|
let mut dist = dist.unwrap_or_else(|| cmp::max(lookup_len, 3) / 3);
|
2022-01-20 00:00:00 +00:00
|
|
|
let mut best = None;
|
2023-03-26 04:03:25 +00:00
|
|
|
// Stores the candidates with the same distance; currently only used with `use_substring_score`.
|
2023-03-20 14:48:26 +00:00
|
|
|
let mut next_candidates = vec![];
|
2022-01-20 00:00:00 +00:00
|
|
|
for c in candidates {
|
2022-03-14 21:07:19 +00:00
|
|
|
match if use_substring_score {
|
2023-02-19 04:03:56 +00:00
|
|
|
edit_distance_with_substrings(lookup, c.as_str(), dist)
|
2022-03-14 21:07:19 +00:00
|
|
|
} else {
|
2023-02-19 04:03:56 +00:00
|
|
|
edit_distance(lookup, c.as_str(), dist)
|
2022-03-14 21:07:19 +00:00
|
|
|
} {
|
2022-01-20 00:00:00 +00:00
|
|
|
Some(0) => return Some(*c),
|
|
|
|
Some(d) => {
|
2023-03-20 14:48:26 +00:00
|
|
|
if use_substring_score {
|
2023-03-26 04:03:25 +00:00
|
|
|
if d < dist {
|
|
|
|
dist = d;
|
|
|
|
next_candidates.clear();
|
|
|
|
} else {
|
|
|
|
// `d == dist` here, we need to store the candidates with the same distance
|
|
|
|
// so we won't decrease the distance in the next loop.
|
|
|
|
}
|
2023-03-20 14:48:26 +00:00
|
|
|
next_candidates.push(*c);
|
|
|
|
} else {
|
|
|
|
dist = d - 1;
|
|
|
|
}
|
2023-03-26 04:03:25 +00:00
|
|
|
best = Some(*c);
|
2022-01-20 00:00:00 +00:00
|
|
|
}
|
|
|
|
None => {}
|
|
|
|
}
|
2017-11-30 21:39:47 +00:00
|
|
|
}
|
2023-03-20 14:48:26 +00:00
|
|
|
|
2023-04-05 22:51:49 +00:00
|
|
|
// We have a tie among several candidates, try to select the best among them ignoring substrings.
|
2023-04-10 20:02:52 +00:00
|
|
|
// For example, given the candidates `force_capture` and `capture` and the user input `forced_capture`,
|
2023-04-05 22:51:49 +00:00
|
|
|
// we select `force_capture` with an extra round of edit distance calculation.
|
2023-03-20 14:48:26 +00:00
|
|
|
if next_candidates.len() > 1 {
|
2023-03-26 04:03:25 +00:00
|
|
|
debug_assert!(use_substring_score);
|
2023-03-20 14:48:26 +00:00
|
|
|
best = find_best_match_for_name_impl(
|
|
|
|
false,
|
|
|
|
&next_candidates,
|
|
|
|
lookup_symbol,
|
|
|
|
Some(lookup.len()),
|
|
|
|
);
|
|
|
|
}
|
2022-01-20 00:00:00 +00:00
|
|
|
if best.is_some() {
|
|
|
|
return best;
|
|
|
|
}
|
|
|
|
|
|
|
|
find_match_by_sorted_words(candidates, lookup)
|
2015-11-27 16:52:29 +00:00
|
|
|
}
|
2020-01-03 23:53:03 +00:00
|
|
|
|
2020-11-12 19:24:10 +00:00
|
|
|
fn find_match_by_sorted_words(iter_names: &[Symbol], lookup: &str) -> Option<Symbol> {
|
[rustc_span][perf] Hoist lookup sorted by words out of the loop.
@lqd commented on https://github.com/rust-lang/rust/pull/114351 asking
if `sort_by_words(lookup)` is computed repeatedly. I was assuming that
rustc should have no difficulties to hoist it automatically outside of
the loop to avoid repeated pure computation, but according to
https://godbolt.org/z/frs8Kj1rq it seems like I was wrong:
original version seems to have 2 calls per loop iteration
```
.LBB16_3:
mov rbx, qword ptr [r13]
mov r14, qword ptr [r13 + 8]
lea rdi, [rsp + 40]
mov rsi, rbx
mov rdx, r14
call example::sort_by_words
lea rdi, [rsp + 64]
mov rsi, qword ptr [rsp + 8]
mov rdx, qword ptr [rsp + 16]
call example::sort_by_words
mov rdi, qword ptr [rsp + 40]
mov rdx, qword ptr [rsp + 56]
mov rsi, qword ptr [rsp + 64]
cmp rdx, qword ptr [rsp + 80]
mov qword ptr [rsp + 32], rdi
mov qword ptr [rsp + 24], rsi
jne .LBB16_5
call qword ptr [rip + bcmp@GOTPCREL]
test eax, eax
sete al
mov dword ptr [rsp + 4], eax
mov rsi, qword ptr [rsp + 72]
test rsi, rsi
jne .LBB16_8
jmp .LBB16_9
```
but the manually hoisted version just 1:
```
.LBB16_3:
mov r13, qword ptr [r15]
mov r14, qword ptr [r15 + 8]
lea rdi, [rsp + 64]
mov rsi, r13
mov rdx, r14
call example::sort_by_words
mov rdi, qword ptr [rsp + 64]
mov rdx, qword ptr [rsp + 16]
cmp qword ptr [rsp + 80], rdx
mov qword ptr [rsp + 32], rdi
jne .LBB16_5
mov rsi, qword ptr [rsp + 8]
call qword ptr [rip + bcmp@GOTPCREL]
test eax, eax
sete bpl
mov rsi, qword ptr [rsp + 72]
test rsi, rsi
jne .LBB16_8
jmp .LBB16_9
```
This code is probably not very hot, but there is no reason to leave
such a low hanging fruit.
2023-08-03 03:51:16 +00:00
|
|
|
let lookup_sorted_by_words = sort_by_words(lookup);
|
2020-01-03 23:53:03 +00:00
|
|
|
iter_names.iter().fold(None, |result, candidate| {
|
2023-08-03 03:51:16 +00:00
|
|
|
if sort_by_words(candidate.as_str()) == lookup_sorted_by_words {
|
2020-11-12 19:24:10 +00:00
|
|
|
Some(*candidate)
|
2020-01-03 23:53:03 +00:00
|
|
|
} else {
|
|
|
|
result
|
|
|
|
}
|
|
|
|
})
|
|
|
|
}
|
|
|
|
|
2023-08-01 23:57:43 +00:00
|
|
|
fn sort_by_words(name: &str) -> Vec<&str> {
|
2020-01-03 23:53:03 +00:00
|
|
|
let mut split_words: Vec<&str> = name.split('_').collect();
|
2020-12-24 07:01:03 +00:00
|
|
|
// We are sorting primitive &strs and can use unstable sort here.
|
2020-09-09 22:03:58 +00:00
|
|
|
split_words.sort_unstable();
|
2023-08-01 23:57:43 +00:00
|
|
|
split_words
|
2020-01-03 23:53:03 +00:00
|
|
|
}
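The sorted-word fallback can be exercised on its own. The sketch below assumes `&str` in place of `Symbol` and uses `rfind` instead of `fold`; both return the last candidate whose underscore-separated words equal those of the lookup once sorted.

```rust
/// Splits a name on `_` and sorts the pieces, so that `a_b_c` and `c_b_a` compare equal.
fn sort_by_words(name: &str) -> Vec<&str> {
    let mut split_words: Vec<&str> = name.split('_').collect();
    // Equal `&str`s are indistinguishable, so an unstable sort is sufficient.
    split_words.sort_unstable();
    split_words
}

/// Returns the last candidate made up of the same words as `lookup`, in any order.
fn find_match_by_sorted_words<'a>(names: &[&'a str], lookup: &str) -> Option<&'a str> {
    // Sort the lookup once, outside the search.
    let lookup_sorted_by_words = sort_by_words(lookup);
    names
        .iter()
        .copied()
        .rfind(|candidate| sort_by_words(candidate) == lookup_sorted_by_words)
}
```

For instance, the lookup `c_b_a` matches the candidate `a_b_c`, while `a_c` does not match `a_b` because the word sets differ.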
|