nordic-dev.net/rust - rust

nordic-dev.net/rust

Fork 0

mirror of https://github.com/rust-lang/rust.git synced 2025-01-23 05:03:47 +00:00

Commit Graph

Author	SHA1	Message	Date
Michael Howell	86da4be47f	rustdoc: use a trie for name-based search Preview and profiler results ---------------------------- Here's some quick profiling in Firefox done on the rust compiler docs: - Before: https://share.firefox.dev/3UPm3M8 - After: https://share.firefox.dev/40LXvYb Here's the results for the node.js profiler: - https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly. - https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html [the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/ The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary. The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie. [Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf [Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java [moman]: https://gitlab.com/notriddle/moman-rustdoc User-visible changes -------------------- I don't expect anybody to notice anything, but it does cause two changes: - Substring matches, in the middle of a name, only apply if there's three or more characters in the search query. - Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them. [^1]: technically utf-16 code units	2024-11-13 12:04:46 -07:00
Michael Howell	0ea58e2346	rustdoc-search: count path edits with separate edit limit Since the two are counted separately elsewhere, they should get their own limits, too. The biggest problem with combining them is that paths are loosely checked by not requiring every component to match, which means that if they are short and matched loosely, they can easily find "drunk typist" matches that make no sense, like this old result: std::collections::btree_map::itermut matching slice::itermut maxEditDistance = ("slice::itermut".length) / 3 = 14 / 3 = 4 editDistance("std", "slice") = 4 editDistance("itermut", "itermut") = 0 4 + 0 <= 4 PASS Of course, `slice::itermut` should not match stuff from btreemap. `slice` should not match `std`. The new result counts them separately: maxPathEditDistance = "slice".length / 3 = 5 / 3 = 1 maxEditDistance = "itermut".length / 3 = 7 / 3 = 2 editDistance("std", "slice") = 4 4 <= 1 FAIL Effectively, this makes path queries less "typo-resistant". It's not zero, but it means `vec` won't match the `v1` prelude. Queries without parent paths are unchanged.	2023-12-26 18:46:17 -07:00

Author

SHA1

Message

Date

Michael Howell

86da4be47f

rustdoc: use a trie for name-based search

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly].
Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with
Levenshtein-Automata] and the corresponding implementation code in [moman]
and [Lucene]; the bit-packing representation comes from Lucene, but the
actual matcher is more based on `fsc.py`. As suggested in the paper, a
trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a
side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three
  or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version,
  the limit was w/3, so you could get looser matches for queries with
  9 or more characters[^1] in them.

[^1]: technically utf-16 code units

2024-11-13 12:04:46 -07:00

Michael Howell

0ea58e2346

rustdoc-search: count path edits with separate edit limit

Since the two are counted separately elsewhere, they should get
their own limits, too. The biggest problem with combining them
is that paths are loosely checked by not requiring every component
to match, which means that if they are short and matched loosely,
they can easily find "drunk typist" matches that make no sense,
like this old result:

    std::collections::btree_map::itermut matching slice::itermut
    maxEditDistance = ("slice::itermut".length) / 3 = 14 / 3 = 4
    editDistance("std", "slice") = 4
    editDistance("itermut", "itermut") = 0
        4 + 0 <= 4 PASS

Of course, `slice::itermut` should not match stuff from btreemap.
`slice` should not match `std`.

The new result counts them separately:

    maxPathEditDistance = "slice".length / 3 = 5 / 3 = 1
    maxEditDistance = "itermut".length / 3 = 7 / 3 = 2
    editDistance("std", "slice") = 4
        4 <= 1 FAIL

Effectively, this makes path queries less "typo-resistant".
It's not zero, but it means `vec` won't match the `v1` prelude.

Queries without parent paths are unchanged.

2023-12-26 18:46:17 -07:00

2 Commits