Commit Graph

2 Commits

Author SHA1 Message Date
Michael Howell
86da4be47f rustdoc: use a trie for name-based search
Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly].
Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with
Levenshtein-Automata] and the corresponding implementation code in [moman]
and [Lucene]; the bit-packing representation comes from Lucene, but the
actual matcher is more based on `fsc.py`. As suggested in the paper, a
trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a
side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three
  or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version,
  the limit was w/3, so you could get looser matches for queries with
  9 or more characters[^1] in them.

[^1]: technically utf-16 code units
2024-11-13 12:04:46 -07:00
Michael Howell
0ea58e2346 rustdoc-search: count path edits with separate edit limit
Since the two are counted separately elsewhere, they should get
their own limits, too. The biggest problem with combining them
is that paths are loosely checked by not requiring every component
to match, which means that if they are short and matched loosely,
they can easily find "drunk typist" matches that make no sense,
like this old result:

    std::collections::btree_map::itermut matching slice::itermut
    maxEditDistance = ("slice::itermut".length) / 3 = 14 / 3 = 4
    editDistance("std", "slice") = 4
    editDistance("itermut", "itermut") = 0
        4 + 0 <= 4 PASS

Of course, `slice::itermut` should not match stuff from btreemap.
`slice` should not match `std`.

The new result counts them separately:

    maxPathEditDistance = "slice".length / 3 = 5 / 3 = 1
    maxEditDistance = "itermut".length / 3 = 7 / 3 = 2
    editDistance("std", "slice") = 4
        4 <= 1 FAIL

Effectively, this makes path queries less "typo-resistant".
It's not zero, but it means `vec` won't match the `v1` prelude.

Queries without parent paths are unchanged.
2023-12-26 18:46:17 -07:00