Move token tree related lexer state to a separate struct
Just a types-based refactoring.
We only used a bunch of fields when tokenizing into a token tree, so let's move them out of the base lexer
Identify when a stmt could have been parsed as an expr
There are some expressions that can be parsed as a statement without
a trailing semicolon depending on the context, which can lead to
confusing errors due to the same looking code being accepted in some
places and not others. Identify these cases and suggest enclosing in
parenthesis making the parse non-ambiguous without changing the
accepted grammar.
Fix#54186, cc #54482, fix#59975, fix#47287.
introduce unescape module
A WIP PR to gauge early feedback
Currently, we deal with escape sequences twice: once when we [lex](112f7e9ac5/src/libsyntax/parse/lexer/mod.rs (L928-L1065)) a string, and a second time when we [unescape](112f7e9ac5/src/libsyntax/parse/mod.rs (L313-L366)) literals. Note that we also produce different sets of diagnostics in these two cases.
This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules.
I think this would be a useful cleanup by itself, but I also need this for https://github.com/rust-lang/rust/pull/59706.
In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module
What this PR doesn't have yet are:
* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely)
* [x] performance check
* [x] general cleanup of the new code
Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706
cc @eddyb , @petrochenkov
Currently, we deal with escape sequences twice: once when we lex a
string, and a second time when we unescape literals. This PR aims to
remove this duplication, by introducing a new `unescape` mode as a
single source of truth for character escaping rules
There are some expressions that can be parsed as a statement without
a trailing semicolon depending on the context, which can lead to
confusing errors due to the same looking code being accepted in some
places and not others. Identify these cases and suggest enclosing in
parenthesis making the parse non-ambiguous without changing the
accepted grammar.
Rollup of 24 pull requests
Successful merges:
- #58080 (Add FreeBSD armv6 and armv7 targets)
- #58204 (On return type `impl Trait` for block with no expr point at last semi)
- #58269 (Add librustc and libsyntax to rust-src distribution.)
- #58369 (Make the Entry API of HashMap<K, V> Sync and Send)
- #58861 (Expand where negative supertrait specific error is shown)
- #58877 (Suggest removal of `&` when borrowing macro and appropriate)
- #58883 (Suggest appropriate code for unused field when destructuring pattern)
- #58891 (Remove stray ` in the docs for the FromIterator implementation for Option)
- #58893 (race condition in thread local storage example)
- #58906 (Monomorphize generator field types for debuginfo)
- #58911 (Regression test for #58435.)
- #58912 (Regression test for #58813)
- #58916 (Fix release note problems noticed after merging.)
- #58918 (Regression test added for an async ICE.)
- #58921 (Add an explicit test for issue #50582)
- #58926 (Make the lifetime parameters of tcx consistent.)
- #58931 (Elide invalid method receiver error when it contains TyErr)
- #58940 (Remove JSBackend from config.toml)
- #58950 (Add self to mailmap)
- #58961 (On incorrect cfg literal/identifier, point at the right span)
- #58963 (libstd: implement Error::source for io::Error)
- #58970 (delay_span_bug in wfcheck's ty.lift_to_tcx unwrap)
- #58984 (Teach `-Z treat-err-as-bug` to take a number of errors to emit)
- #59007 (Add a test for invalid const arguments)
Failed merges:
- #58959 (Add release notes for PR #56243)
r? @ghost
`-Z treat-err-as-bug=0` will cause `rustc` to panic after the first
error is reported. `-Z treat-err-as-bug=2` will cause `rustc` to
panic after 3 errors have been reported.
Rename rustc_errors dependency in rust 2018 crates
I think this is a better solution than `use rustc_errors as errors` in `lib.rs` and `use crate::errors` in modules.
Related: rust-lang/cargo#5653
cc #58099
r? @Centril
Simplify `TokenStream` some more
These commits simplify `TokenStream`, remove `ThinTokenStream`, and avoid some clones. The end result is simpler code and a slight perf win on some benchmarks.
r? @petrochenkov
Add non-panicking `maybe_new_parser_from_file` variant
Add (seemingly?) missing `maybe_new_parser_from_file` constructor variant.
Disclaimer: I'm not certain this is the correct approach - just found out we don't have this when working on a Rustfmt PR to catch/prevent more Rust parser panics: https://github.com/rust-lang/rustfmt/pull/3240 and tried to make it work somehow.
Because it's an extra type layer that doesn't really help; in a couple
of places it actively gets in the way, and overall removing it makes the
code nicer. It does, however, move `tokenstream::TokenTree` further away
from the `TokenTree` in `quote.rs`.
More importantly, this change reduces the size of `TokenStream` from 48
bytes to 40 bytes on x86-64, which is enough to slightly reduce
instruction counts on numerous benchmarks, the best by 1.5%.
Note that `open_tt` and `close_tt` have gone from being methods on
`Delimited` to associated methods of `TokenTree`.
53956 panic on include bytes of own file
fix#53956
When using `include_bytes!` on a source file in the project, compiler would panic on subsequent compilations because `expand_include_bytes` would overwrite files in the source_map with no source. This PR changes `expand_include_bytes` to check source_map and use the already existing src, if any.
rustdoc: Replaces fn main search and extern crate search with proper parsing during doctests.
Fixes#21299.
Fixes#33731.
Let me know if there's any additional changes you'd like made!
This commit avoids an allocation when parsing any float and integer
literals that don't involved underscores.
This reduces the number of allocations done for the `tuple-stress`
benchmark by 10%, reducing its instruction count by just under 1%.
syntax: Optimize some literal parsing
Currently in the `wasm-bindgen` project we have a very very large crate that's
procedurally generated, `web-sys`. To generate this crate we parse all of a
browser's WebIDL and we then generate bindings for all of the APIs contained
within.
The resulting Rust file is 18MB large (wow!) and currently takes a very long
time to compile in debug mode. On the nightly compiler a *debug* build takes 90s
for the crate to finish. I was curious what was taking so long and upon
investigating a *massive* portion of the time was spent in the `lit_token`
method of the compiler, primarily formatting strings via `format!`.
Upon some more investigation it looks like the `byte_str_lit` was allocating an
error message once per byte, causing a very large number of allocations to
happen for large literals, of which wasm-bindgen generates quite a few (some are
MB large).
This commit fixes the issue by lazily allocating the error message, only doing
so if the error message is actually needed (which should be never). As a result,
the debug mode compilation time for our `web-sys` crate decreased from 90s to
20s, a very nice improvement! (although we've still got some work to do).
Currently in the `wasm-bindgen` project we have a very very large crate that's
procedurally generated, `web-sys`. To generate this crate we parse all of a
browser's WebIDL and we then generate bindings for all of the APIs contained
within.
The resulting Rust file is 18MB large (wow!) and currently takes a very long
time to compile in debug mode. On the nightly compiler a *debug* build takes 90s
for the crate to finish. I was curious what was taking so long and upon
investigating a *massive* portion of the time was spent in the `lit_token`
method of the compiler, primarily formatting strings via `format!`.
Upon some more investigation it looks like the `byte_str_lit` was allocating an
error message once per byte, causing a very large number of allocations to
happen for large literals, of which wasm-bindgen generates quite a few (some are
MB large).
This commit fixes the issue by lazily allocating the error message, only doing
so if the error message is actually needed (which should be never). As a result,
the debug mode compilation time for our `web-sys` crate decreased from 90s to
20s, a very nice improvement! (although we've still got some work to do).
async/await
This PR implements `async`/`await` syntax for `async fn` in Rust 2015 and `async` closures and `async` blocks in Rust 2018 (tracking issue: https://github.com/rust-lang/rust/issues/50547). Limitations: non-`move` async closures with arguments are currently not supported, nor are `async fn` with multiple different input lifetimes. These limitations are not fundamental and will be removed in the future, however I'd like to go ahead and get this PR merged so we can start experimenting with this in combination with futures 0.3.
Based on https://github.com/rust-lang/rust/pull/51414.
cc @petrochenkov for parsing changes.
r? @eddyb
This is gated on edition 2018 & the `async_await` feature gate.
The parser will accept `async fn` and `async unsafe fn` as fn
items. Along the same lines as `const fn`, only `async unsafe fn`
is permitted, not `unsafe async fn`.The parser will not accept
`async` functions as trait methods.
To do a little code clean up, four fields of the function type
struct have been merged into the new `FnHeader` struct: constness,
asyncness, unsafety, and ABI.
Also, a small bug in HIR printing is fixed: it previously printed
`const unsafe fn` as `unsafe const fn`, which is grammatically
incorrect.
In the common case, the string value in a string literal Token is the
same as the string value in a string literal LitKind. (The exception is
when escapes or \r are involved.) This patch takes advantage of that to
avoid calling str_lit() and re-interning the string in that case. This
speeds up incremental builds for a few of the rustc-benchmarks, the best
by 3%.
`char_lit` uses an allocation in order to ignore '_' chars in \u{...}
literals. This patch changes it to not do that by processing the chars
more directly.
This improves various rustc-perf benchmark measurements by up to 6%,
particularly regex, futures, clap, coercions, hyper, and encoding.