By making it an `EscapeError` instead of a `LitError`. This makes it
like the other errors produced when checking string literals contents,
e.g. for invalid escape sequences or bare CR chars.
NOTE: this means these errors are issued earlier, before expansion,
which changes behaviour. It will be possible to move the check back to
the later point if desired. If that happens, it's likely that all the
string literal contents checks will be delayed together.
One nice thing about this: the old approach had some code in
`report_lit_error` to calculate the span of the nul char from a range.
This code used a hardwired `+2` to account for the `c"` at the start of
a C string literal, but this should have changed to a `+3` for raw C
string literals to account for the `cr"`, which meant that the caret in
`cr"` nul error messages was one short of where it should have been. The
new approach doesn't need any of this and avoids the off-by-one error.
One consequence is that errors returned by
`maybe_new_parser_from_source_str` now must be consumed, so a bunch of
places that previously ignored those errors now cancel them. (Most of
them explicitly dropped the errors before. I guess that was to indicate
"we are explicitly ignoring these", though I'm not 100% sure.)
Two different lifetimes are conflated. This doesn't matter right now,
but needs to be fixed for the next commit to work. And the more
descriptive lifetime names make the code easier to read.
In #119606 I added them and used a `_mv` suffix, but that wasn't great.
A `with_` prefix has three different existing uses.
- Constructors, e.g. `Vec::with_capacity`.
- Wrappers that provide an environment to execute some code, e.g.
`with_session_globals`.
- Consuming chaining methods, e.g. `Span::with_{lo,hi,ctxt}`.
The third case is exactly what we want, so this commit changes
`DiagnosticBuilder::foo_mv` to `DiagnosticBuilder::with_foo`.
Thanks to @compiler-errors for the suggestion.
The existing uses are replaced in one of three ways.
- In a function that also has calls to `emit`, just rearrange the code
so that exactly one of `delay_as_bug` or `emit` is called on every
path.
- In a function returning a `DiagnosticBuilder`, use
`downgrade_to_delayed_bug`. That's good enough because it will get
emitted later anyway.
- In `unclosed_delim_err`, one set of errors is being replaced with
another set, so just cancel the original errors.
This works for most of its call sites. This is nice, because `emit` very
much makes sense as a consuming operation -- indeed,
`DiagnosticBuilderState` exists to ensure no diagnostic is emitted
twice, but it uses runtime checks.
For the small number of call sites where a consuming emit doesn't work,
the commit adds `DiagnosticBuilder::emit_without_consuming`. (This will
be removed in subsequent commits.)
Likewise, `emit_unless` becomes consuming. And `delay_as_bug` becomes
consuming, while `delay_as_bug_without_consuming` is added (which will
also be removed in subsequent commits.)
All this requires significant changes to `DiagnosticBuilder`'s chaining
methods. Currently `DiagnosticBuilder` method chaining uses a
non-consuming `&mut self -> &mut Self` style, which allows chaining to
be used when the chain ends in `emit()`, like so:
```
struct_err(msg).span(span).emit();
```
But it doesn't work when producing a `DiagnosticBuilder` value,
requiring this:
```
let mut err = self.struct_err(msg);
err.span(span);
err
```
This style of chaining won't work with consuming `emit` though. For
that, we need to use to a `self -> Self` style. That also would allow
`DiagnosticBuilder` production to be chained, e.g.:
```
self.struct_err(msg).span(span)
```
However, removing the `&mut self -> &mut Self` style would require that
individual modifications of a `DiagnosticBuilder` go from this:
```
err.span(span);
```
to this:
```
err = err.span(span);
```
There are *many* such places. I have a high tolerance for tedious
refactorings, but even I gave up after a long time trying to convert
them all.
Instead, this commit has it both ways: the existing `&mut self -> Self`
chaining methods are kept, and new `self -> Self` chaining methods are
added, all of which have a `_mv` suffix (short for "move"). Changes to
the existing `forward!` macro lets this happen with very little
additional boilerplate code. I chose to add the suffix to the new
chaining methods rather than the existing ones, because the number of
changes required is much smaller that way.
This doubled chainging is a bit clumsy, but I think it is worthwhile
because it allows a *lot* of good things to subsequently happen. In this
commit, there are many `mut` qualifiers removed in places where
diagnostics are emitted without being modified. In subsequent commits:
- chaining can be used more, making the code more concise;
- more use of chaining also permits the removal of redundant diagnostic
APIs like `struct_err_with_code`, which can be replaced easily with
`struct_err` + `code_mv`;
- `emit_without_diagnostic` can be removed, which simplifies a lot of
machinery, removing the need for `DiagnosticBuilderState`.
`Diagnostic` has 40 methods that return `&mut Self` and could be
considered setters. Four of them have a `set_` prefix. This doesn't seem
necessary for a type that implements the builder pattern. This commit
removes the `set_` prefixes on those four methods.
This lets different error levels share the same return type from
`emit_*`.
- A lot of inconsistencies in the `DiagCtxt` API are removed.
- `Noted` is removed.
- `FatalAbort` is introduced for fatal errors (abort via `raise`),
replacing the `EmissionGuarantee` impl for `!`.
- `Bug` is renamed `BugAbort` (to avoid clashing with `Level::Bug` and
to mirror `FatalAbort`), and modified to work in the new way with bug
errors (abort via panic).
- Various diagnostic creators and emitters updated to the new, better
signatures. Note that `DiagCtxt::bug` no longer needs to call
`panic_any`, because `emit` handles that.
Also shorten the obnoxiously long
`diagnostic_builder_emit_producing_guarantee` name.
It's unclear why this is used here. All entries in the third column of
`UNICODE_ARRAY` are covered by `ASCII_ARRAY`, so if the lookup fails
it's a genuine compiler bug. It was added way back in #29837, for no
clear reason.
This commit changes it to `span_bug`, which is more typical.
The `span` arg is described in a comment as "interior span of the
literal, without quotes", which is incorrect. It's actually the span of
the error part of the literal, corresponding to `range`.
This commit renames `span` and `span_without_quotes` to make things
clearer, and fixes the erroneous comment.
This is an extension of the previous commit. It means the output of
something like this:
```
stringify!(let a: Vec<u32> = vec![];)
```
goes from this:
```
let a: Vec<u32> = vec![] ;
```
With this PR, it now produces this string:
```
let a: Vec<u32> = vec![];
```
`tokenstream::Spacing` appears on all `TokenTree::Token` instances,
both punct and non-punct. Its current usage:
- `Joint` means "can join with the next token *and* that token is a
punct".
- `Alone` means "cannot join with the next token *or* can join with the
next token but that token is not a punct".
The fact that `Alone` is used for two different cases is awkward.
This commit augments `tokenstream::Spacing` with a new variant
`JointHidden`, resulting in:
- `Joint` means "can join with the next token *and* that token is a
punct".
- `JointHidden` means "can join with the next token *and* that token is a
not a punct".
- `Alone` means "cannot join with the next token".
This *drastically* improves the output of `print_tts`. For example,
this:
```
stringify!(let a: Vec<u32> = vec![];)
```
currently produces this string:
```
let a : Vec < u32 > = vec! [] ;
```
With this PR, it now produces this string:
```
let a: Vec<u32> = vec![] ;
```
(The space after the `]` is because `TokenTree::Delimited` currently
doesn't have spacing information. The subsequent commit fixes this.)
The new `print_tts` doesn't replicate original code perfectly. E.g.
multiple space characters will be condensed into a single space
character. But it's much improved.
`print_tts` still produces the old, uglier output for code produced by
proc macros. Because we have to translate the generated code from
`proc_macro::Spacing` to the more expressive `token::Spacing`, which
results in too much `proc_macro::Along` usage and no
`proc_macro::JointHidden` usage. So `space_between` still exists and
is used by `print_tts` in conjunction with the `Spacing` field.
This change will also help with the removal of `Token::Interpolated`.
Currently interpolated tokens are pretty-printed nicely via AST pretty
printing. `Token::Interpolated` removal will mean they get printed with
`print_tts`. Without this change, that would result in much uglier
output for code produced by decl macro expansions. With this change, AST
pretty printing and `print_tts` produce similar results.
The commit also tweaks the comments on `proc_macro::Spacing`. In
particular, it refers to "compound tokens" rather than "multi-char
operators" because lifetimes aren't operators.
Stabilize C string literals
RFC: https://rust-lang.github.io/rfcs/3348-c-str-literal.html
Tracking issue: https://github.com/rust-lang/rust/issues/105723
Documentation PR (reference manual): https://github.com/rust-lang/reference/pull/1423
# Stabilization report
Stabilizes C string and raw C string literals (`c"..."` and `cr#"..."#`), which are expressions of type [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). Both new literals require Rust edition 2021 or later.
```rust
const HELLO: &core::ffi::CStr = c"Hello, world!";
```
C strings may contain any byte other than `NUL` (`b'\x00'`), and their in-memory representation is guaranteed to end with `NUL`.
## Implementation
Originally implemented by PR https://github.com/rust-lang/rust/pull/108801, which was reverted due to unintentional changes to lexer behavior in Rust editions < 2021.
The current implementation landed in PR https://github.com/rust-lang/rust/pull/113476, which restricts C string literals to Rust edition >= 2021.
## Resolutions to open questions from the RFC
* Adding C character literals (`c'.'`) of type `c_char` is not part of this feature.
* Support for `c"..."` literals does not prevent `c'.'` literals from being added in the future.
* C string literals should not be blocked on making `&CStr` a thin pointer.
* It's possible to declare constant expressions of type `&'static CStr` in stable Rust (as of v1.59), so C string literals are not adding additional coupling on the internal representation of `CStr`.
* The unstable `concat_bytes!` macro should not accept `c"..."` literals.
* C strings have two equally valid `&[u8]` representations (with or without terminal `NUL`), so allowing them to be used in `concat_bytes!` would be ambiguous.
* Adding a type to represent C strings containing valid UTF-8 is not part of this feature.
* Support for a hypothetical `&Utf8CStr` may be explored in the future, should such a type be added to Rust.
Reimplement C-str literals
This reverts #113334, cc `@fmease.`
While converting lexer tokens to ast Tokens in `rustc_parse`, we check the edition of the span of the token. If the edition < 2021, we split the token into two, one being the identifier and other being the str literal.
Instead of linking to the old Rust Reference site on static.rust-lang.org,
link to the current website doc.rust-lang.org/stable/reference instead in
diagnostic about incorrect literals.
Error message all end up passing into a function as an `impl
Into<{D,Subd}iagnosticMessage>`. If an error message is creatd as
`&format("...")` that means we allocate a string (in the `format!`
call), then take a reference, and then clone (allocating again) the
reference to produce the `{D,Subd}iagnosticMessage`, which is silly.
This commit removes the leading `&` from a lot of these cases. This
means the original `String` is moved into the
`{D,Subd}iagnosticMessage`, avoiding the double allocations. This
requires changing some function argument types from `&str` to `String`
(when all arguments are `String`) or `impl
Into<{D,Subd}iagnosticMessage>` (when some arguments are `String` and
some are `&str`).
Currently a `{D,Subd}iagnosticMessage` can be created from any type that
impls `Into<String>`. That includes `&str`, `String`, and `Cow<'static,
str>`, which are reasonable. It also includes `&String`, which is pretty
weird, and results in many places making unnecessary allocations for
patterns like this:
```
self.fatal(&format!(...))
```
This creates a string with `format!`, takes a reference, passes the
reference to `fatal`, which does an `into()`, which clones the
reference, doing a second allocation. Two allocations for a single
string, bleh.
This commit changes the `From` impls so that you can only create a
`{D,Subd}iagnosticMessage` from `&str`, `String`, or `Cow<'static,
str>`. This requires changing all the places that currently create one
from a `&String`. Most of these are of the `&format!(...)` form
described above; each one removes an unnecessary static `&`, plus an
allocation when executed. There are also a few places where the existing
use of `&String` was more reasonable; these now just use `clone()` at
the call site.
As well as making the code nicer and more efficient, this is a step
towards possibly using `Cow<'static, str>` in
`{D,Subd}iagnosticMessage::{Str,Eager}`. That would require changing
the `From<&'a str>` impls to `From<&'static str>`, which is doable, but
I'm not yet sure if it's worthwhile.
Note that at the time of this commit, `unic-emoji-char` seems to have
data tables only up to Unicode 5.0, but Unicode is already newer than
this.
A newer emoji such as `🥺` will not be recognized as an emoji
but older emojis such as `🐱` will.
Improve unexpected close and mismatch delimiter hint in TokenTreesReader
Fixes#103882Fixes#68987Fixes#69259
The inner indentation mismatching will be covered by outer block, the new added function `report_error_prone_delim_block` will find out the error prone candidates for reporting.
Instead of `ast::Lit`.
Literal lowering now happens at two different times. Expression literals
are lowered when HIR is crated. Attribute literals are lowered during
parsing.
This commit changes the language very slightly. Some programs that used
to not compile now will compile. This is because some invalid literals
that are removed by `cfg` or attribute macros will no longer trigger
errors. See this comment for more details:
https://github.com/rust-lang/rust/pull/102944#issuecomment-1277476773
It deals with eight cases: ints, floats, and the six quoted types
(char/byte/strings). For ints and floats we have an early return, and
the other six types fall through to the code at the end, which makes the
function hard to read.
This commit rearranges things to avoid the early returns.
There are three kinds of "byte" literals: byte literals, byte string
literals, and raw byte string literals. None are allowed to have
non-ASCII chars in them.
Two `EscapeError` variants exist for when that constraint is violated.
- `NonAsciiCharInByte`: used for byte literals and byte string literals.
- `NonAsciiCharInByteString`: used for raw byte string literals.
As a result, the messages for raw byte string literals use different
wording, without good reason. Also, byte string literals are incorrectly
described as "byte constants" in some error messages.
This commit eliminates `NonAsciiCharInByteString` so the three cases are
handled similarly, and described correctly. The `mode` is enough to
distinguish them.
Note: Some existing error messages mention "byte constants" and some
mention "byte literals". I went with the latter here, because it's a
more correct name, as used by the Reference.
It's passed to numerous places where we just need an `is_byte` bool.
Passing the bool avoids the need for some assertions.
Also rename `is_bytes()` as `is_byte()`, to better match `Mode::Byte`,
`Mode::ByteStr`, and `Mode::RawByteStr`.
It's now only used in one function. Also, the "should we glue the
tokens?" check is only necessary when pushing a `TokenTree::Token`, not
when pushing a `TokenTree::Delimited`.
As part of this, we now do the "should we glue the tokens?" check
immediately, which avoids having look back at the previous token. It
also puts all the logic dealing with token gluing in a single place.
`Cursor` is currently hidden, and the main tokenization path uses
`rustc_lexer::first_token` which involves constructing a new `Cursor`
for every single token, which is weird. Also, `first_token` also can't
handle empty input, so callers have to check for that first.
This commit makes `Cursor` public, so `StringReader` can contain a
`Cursor`, which results in a simpler structure. The commit also changes
`StringReader::advance_token` so it returns an `Option<Token>`,
simplifying the the empty input case.
`TokenTreesReader` wraps a `StringReader`, but the `into_token_trees`
function obscures this. This commit moves to a more straightforward
control flow.
The spacing computation is done in two parts. In the first part
`next_token` and `bump` use `Spacing::Alone` to mean "preceded by
whitespace" and `Spacing::Joint` to mean the opposite. In the second
part `parse_token_tree_other` then adjusts the `spacing` value to mean
the usual thing (i.e. "is the following token joinable punctuation?").
This shift in meaning is very confusing and it took me some time to
understand what was going on.
This commit changes the first part to use a bool, and adds some
comments, which makes things much clearer.
`parse_token_tree` is basically a match with four arms: `Eof`,
`OpenDelim`, `CloseDelim`, and "other". It has two call sites, and at
each call site one of the arms is unreachable. It's also not inlined.
This commit removes `parse_token_tree` by splitting it into four
functions and inlining them. This avoids some repeated conditional
tests and also some non-inlined function calls on the hot path.