rework the README.md for rustc and add other readmes

This takes way longer than I thought it would. =)
2025-02-16 17:03:35 +00:00 · 2017-08-31 14:33:19 -04:00 · 2017-08-31 14:33:19 -04:00 · 44e45d9fea
commit 44e45d9fea
parent 9a00f3cc30
11 changed files with 463 additions and 41 deletions
--- a/src/librustc/README.md
+++ b/src/librustc/README.md
@ -13,49 +13,82 @@ https://github.com/rust-lang/rust/issues

 Your concerns are probably the same as someone else's.

+You may also be interested in the
+[Rust Forge](https://forge.rust-lang.org/), which includes a number of
+interesting bits of information.
+
+Finally, at the end of this file is a GLOSSARY defining a number of
+common (and not necessarily obvious!) names that are used in the Rust
+compiler code. If you see some funky name and you'd like to know what
+it stands for, check there!
+
 The crates of rustc
 ===================

-Rustc consists of a number of crates, including `libsyntax`,
-`librustc`, `librustc_back`, `librustc_trans`, and `librustc_driver`
-(the names and divisions are not set in stone and may change;
-in general, a finer-grained division of crates is preferable):
+Rustc consists of a number of crates, including `syntax`,
+`rustc`, `rustc_back`, `rustc_trans`, `rustc_driver`, and
+many more. The source for each crate can be found in a directory
+like `src/libXXX`, where `XXX` is the crate name.

- [`libsyntax`][libsyntax] contains those things concerned purely with syntax –
-  that is, the AST, parser, pretty-printer, lexer, macro expander, and
-  utilities for traversing ASTs – are in a separate crate called
-  "syntax", whose files are in `./../libsyntax`, where `.` is the
-  current directory (that is, the parent directory of front/, middle/,
-  back/, and so on).
+(NB. The names and divisions of these crates are not set in
+stone and may change over time -- for the time being, we tend towards
+a finer-grained division to help with compilation time, though as
+incremental improves that may change.)

- `librustc` (the current directory) contains the high-level analysis
-  passes, such as the type checker, borrow checker, and so forth.
-  It is the heart of the compiler.
+The dependency structure of these crates is roughly a diamond:

- [`librustc_back`][back] contains some very low-level details that are
-  specific to different LLVM targets and so forth.
-
- [`librustc_trans`][trans] contains the code to convert from Rust IR into LLVM
-  IR, and then from LLVM IR into machine code, as well as the main
-  driver that orchestrates all the other passes and various other bits
-  of miscellany. In general it contains code that runs towards the
-  end of the compilation process.
-
- [`librustc_driver`][driver] invokes the compiler from
-  [`libsyntax`][libsyntax], then the analysis phases from `librustc`, and
-  finally the lowering and codegen passes from [`librustc_trans`][trans].
-
-Roughly speaking the "order" of the three crates is as follows:
-
-              librustc_driver
-                      |
-    +-----------------+-------------------+
-    |                                     |
-    libsyntax -> librustc -> librustc_trans
+```
+                  rustc_driver
+                /      |       \
+              /        |         \
+            /          |           \
+          /            v             \
+rustc_trans    rustc_borrowck   ...  rustc_metadata
+          \            |            /
+            \          |          /
+              \        |        /
+                \      v      /
+                    rustc
+                       |
+                       v
+                    syntax
+                    /    \
+                  /       \
+           syntax_pos  syntax_ext
+```                    


-The compiler process:
-=====================
+The idea is that `rustc_driver`, at the top of this lattice, basically
+defines the overall control-flow of the compiler. It doesn't have much
+"real code", but instead ties together all of the code defined in the
+other crates and defines the overall flow of execution.
+
+At the other extreme, the `rustc` crate defines the common and
+pervasive data structures that all the rest of the compiler uses
+(e.g., how to represent types, traits, and the program itself). It
+also contains some amount of the compiler itself, although that is
+relatively limited.
+
+Finally, all the crates in the bulge in the middle define the bulk of
+the compiler -- they all depend on `rustc`, so that they can make use
+of the various types defined there, and they export public routines
+that `rustc_driver` will invoke as needed (more and more, what these
+crates export are "query definitions", but those are covered later
+on).
+
+Below `rustc` lie various crates that make up the parser and error
+reporting mechanism. For historical reasons, these crates do not have
+the `rustc_` prefix, but they are really just as much an internal part
+of the compiler and not intended to be stable (though they do wind up
+getting used by some crates in the wild; a practice we hope to
+gradually phase out).
+
+Each crate has a `README.md` file that describes, at a high-level,
+what it contains, and tries to give some kind of explanation (some
+better than others).
+
+The compiler process
+====================

 The Rust compiler is comprised of six main compilation phases.

@ -172,3 +205,29 @@ The 3 central data structures:
 [back]: https://github.com/rust-lang/rust/tree/master/src/librustc_back/
 [rustc]: https://github.com/rust-lang/rust/tree/master/src/librustc/
 [driver]: https://github.com/rust-lang/rust/tree/master/src/librustc_driver
+
+Glossary
+========
+
+The compiler uses a number of...idiosyncratic abbreviations and
+things. This glossary attempts to list them and give you a few
+pointers for understanding them better.
+
+- AST -- the **abstract syntax tree** produced by the `syntax` crate; reflects user syntax
+  very closely.
+- cx -- we tend to use "cx" as an abbreviation for context. See also tcx, infcx, etc.
+- HIR -- the **High-level IR**, created by lowering and desugaring the AST. See `librustc/hir`.
+- `'gcx` -- the lifetime of the global arena (see `librustc/ty`).
+- generics -- the set of generic type parameters defined on a type or item
+- infcx -- the inference context (see `librustc/infer`)
+- MIR -- the **Mid-level IR** that is created after type-checking for use by borrowck and trans.
+  Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is
+  found in `src/librustc_mir`.
+- obligation -- something that must be proven by the trait system.
+- sess -- the **compiler session**, which stores global data used throughout compilation
+- substs -- the **substitutions** for a given generic type or item
+  (e.g., the `i32, u32` in `HashMap<i32, u32>`)
+- tcx -- the "typing context", main data structure of the compiler (see `librustc/ty`).
+- trans -- the code to **translate** MIR into LLVM IR.
+- trait reference -- a trait and values for its type parameters (see `librustc/ty`).
+- ty -- the internal representation of a **type** (see `librustc/ty`).
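A note on the `substs` glossary entry above: the idea of substituting values for generic parameters can be sketched in standalone Rust. This is only a toy model. The `Ty` enum, its `Param` and `Adt` variants, and the `subst` function are hypothetical names made up for illustration, and the index-based parameter scheme is an assumption of the sketch, not a description of the code in `librustc/ty`.

```rust
// Toy type representation; a hypothetical stand-in, not rustc's `Ty`.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    I32,
    U32,
    // The N-th generic parameter of the item being instantiated.
    Param(usize),
    // A named type applied to a list of type arguments.
    Adt(&'static str, Vec<Ty>),
}

// Replace every `Param(i)` in `ty` with the i-th entry of `substs`.
fn subst(ty: &Ty, substs: &[Ty]) -> Ty {
    match ty {
        Ty::Param(i) => substs[*i].clone(),
        Ty::Adt(name, args) => {
            Ty::Adt(*name, args.iter().map(|arg| subst(arg, substs)).collect())
        }
        other => other.clone(),
    }
}

fn main() {
    // `HashMap<K, V>`, with K as parameter 0 and V as parameter 1.
    let generic = Ty::Adt("HashMap", vec![Ty::Param(0), Ty::Param(1)]);
    // The substs for `HashMap<i32, u32>` are the slice `[i32, u32]`.
    let substs = [Ty::I32, Ty::U32];
    println!("{:?}", subst(&generic, &substs));
}
```

In this model the substs are just a slice of types, which matches the glossary's example of `i32, u32` in `HashMap<i32, u32>`.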
--- a/src/librustc/hir/README.md
+++ b/src/librustc/hir/README.md
@ -0,0 +1,123 @@
+# Introduction to the HIR
+
+The HIR -- "High-level IR" -- is the primary IR used in most of
+rustc. It is a desugared version of the "abstract syntax tree" (AST)
+that is generated after parsing, macro expansion, and name resolution
+have completed. Many parts of HIR resemble Rust surface syntax quite
+closely, with the exception that some of Rust's expression forms have
+been desugared away (as an example, `for` loops are converted into a
+`loop` and do not appear in the HIR).
+
+This README covers the main concepts of the HIR.
+
+### Out-of-band storage and the `Crate` type
+
+The top-level data-structure in the HIR is the `Crate`, which stores
+the contents of the crate currently being compiled (we only ever
+construct HIR for the current crate). Whereas in the AST the crate
+data structure basically just contains the root module, the HIR
+`Crate` structure contains a number of maps and other things that
+serve to organize the content of the crate for easier access.
+
+For example, the contents of individual items (e.g., modules,
+functions, traits, impls, etc) in the HIR are not immediately
+accessible in the parents. So, for example, if you had a module item `foo`
+containing a function `bar()`:
+
+```
+mod foo {
+  fn bar() { }
+}
+```
+
+Then in the HIR the representation of module `foo` (the `Mod`
+struct) would have only the **`ItemId`** `I` of `bar()`. To get the
+details of the function `bar()`, we would look up `I` in the
+`items` map.
+
+One nice result from this representation is that one can iterate
+over all items in the crate by iterating over the key-value pairs
+in these maps (without the need to trawl through the IR in total).
+There are similar maps for things like trait items and impl items,
+as well as "bodies" (explained below).
+
+The other reason to set up the representation this way is for better
+integration with incremental compilation. This way, if you gain access
+to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
+gain access to the contents of the function `bar()`. Instead, you only
+gain access to the **id** for `bar()`, and you must call some function to
+look up the contents of `bar()` given its id; this gives us a chance to
+observe that you accessed the data for `bar()` and record the
+dependency.
+
+### Identifiers in the HIR
+
+Most of the code that has to deal with things in HIR tends not to
+carry around references into the HIR, but rather to carry around
+*identifier numbers* (or just "ids"). Right now, you will find four
+sorts of identifiers in active use:
+
+- `DefId`, which primarily names "definitions" or top-level items.
+  - You can think of a `DefId` as being shorthand for a very explicit
+    and complete path, like `std::collections::HashMap`. However,
+    these paths are able to name things that are not nameable in
+    normal Rust (e.g., impls), and they also include extra information
+    about the crate (such as its version number, as two versions of
+    the same crate can co-exist).
+  - A `DefId` really consists of two parts, a `CrateNum` (which
+    identifies the crate) and a `DefIndex` (which indexes into a list
+    of items that is maintained per crate).
+- `HirId`, which combines the index of a particular item with an
+  offset within that item.
+  - the key point of a `HirId` is that it is *relative* to some item (which is named
+    via a `DefId`).
+- `BodyId`, which is an absolute identifier that refers to a specific
+  body (definition of a function or constant) in the crate. It is currently
+  effectively a "newtype'd" `NodeId`.
+- `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
+  - While these are still in common use, **they are being slowly phased out**.
+  - Since they are absolute within the crate, adding a new node
+    anywhere in the tree causes the node-ids of all subsequent code in
+    the crate to change. This is terrible for incremental compilation,
+    as you can perhaps imagine.
+
+### HIR Map
+
+Most of the time when you are working with the HIR, you will do so via
+the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
+the `hir::map` module). The HIR map contains a number of methods to
+convert between ids of various kinds and to look up data associated
+with a HIR node.
+
+For example, if you have a `DefId`, and you would like to convert it
+to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
+returns an `Option<NodeId>` -- this will be `None` if the def-id
+refers to something outside of the current crate (since then it has no
+HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
+the definition.
+
+Similarly, you can use `tcx.hir.find(n)` to look up the node for a
+`NodeId`. This returns an `Option<Node<'tcx>>`, where `Node` is an enum
+defined in the map; by matching on this you can find out what sort of
+node the node-id referred to and also get a pointer to the data
+itself. Often, you know what sort of node `n` is -- e.g., if you know
+that `n` must be some HIR expression, you can do
+`tcx.hir.expect_expr(n)`, which will extract and return the
+`&hir::Expr`, panicking if `n` is not in fact an expression.
+
+Finally, you can use the HIR map to find the parents of nodes, via
+calls like `tcx.hir.get_parent_node(n)`.
+
+### HIR Bodies
+
+A **body** represents some kind of executable code, such as the body
+of a function/closure or the definition of a constant. Bodies are
+associated with an **owner**, which is typically some kind of item
+(e.g., a `fn()` or `const`), but could also be a closure expression
+(e.g., `|x, y| x + y`). You can use the HIR map to find the body
+associated with a given def-id (`maybe_body_owned_by()`) or to find
+the owner of a body (`body_owner_def_id()`).
+
+
+
+
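A note on the HIR README above: the out-of-band storage scheme it describes can be sketched in standalone Rust. This is a minimal toy model and not the real HIR. The `ItemId`, `Item`, and `Crate` types below are simplified stand-ins, and the `accessed` log, `build_crate`, and `child_names` are hypothetical names invented here to play the role of dependency tracking.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

// Simplified stand-ins for the HIR types; not the real rustc definitions.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

#[derive(Debug)]
enum Item {
    // A module stores only the ids of its children, never their contents.
    Mod { name: &'static str, item_ids: Vec<ItemId> },
    Fn { name: &'static str },
}

struct Crate {
    // All items live in one flat map, keyed by id.
    items: BTreeMap<ItemId, Item>,
    // Hypothetical log of lookups, standing in for dependency tracking.
    accessed: RefCell<Vec<ItemId>>,
}

impl Crate {
    // Every read of an item's contents goes through this function,
    // so each access can be observed and recorded.
    fn item(&self, id: ItemId) -> &Item {
        self.accessed.borrow_mut().push(id);
        &self.items[&id]
    }
}

// Builds the `mod foo { fn bar() { } }` example from the README.
fn build_crate() -> Crate {
    let mut items = BTreeMap::new();
    items.insert(ItemId(0), Item::Mod { name: "foo", item_ids: vec![ItemId(1)] });
    items.insert(ItemId(1), Item::Fn { name: "bar" });
    Crate { items, accessed: RefCell::new(Vec::new()) }
}

// Resolves the names of a module's children by looking up each id.
fn child_names(krate: &Crate, module: ItemId) -> Vec<&'static str> {
    match krate.item(module) {
        Item::Mod { item_ids, .. } => item_ids
            .iter()
            .map(|&id| match krate.item(id) {
                Item::Mod { name, .. } | Item::Fn { name } => *name,
            })
            .collect(),
        Item::Fn { .. } => Vec::new(),
    }
}

fn main() {
    let krate = build_crate();
    // All items can be visited without walking the tree.
    for (id, item) in &krate.items {
        println!("{:?} => {:?}", id, item);
    }
    println!("children of foo: {:?}", child_names(&krate, ItemId(0)));
    println!("recorded accesses: {:?}", krate.accessed.borrow());
}
```

Because the module holds only an `ItemId`, reading the contents of `bar()` has to go through `Crate::item`, which is the point where an access can be recorded.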
--- a/src/librustc/hir/map/README.md
+++ b/src/librustc/hir/map/README.md
@ -0,0 +1,4 @@
+The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
+HIR and convert between various forms of identifiers. See [the HIR README] for more information.
+
+[the HIR README]: ../README.md
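A note on the HIR map: the lookups described in the HIR README (`as_local_node_id`, `find`, `expect_expr`) can be modelled in standalone Rust roughly as follows. The method names are borrowed from that README, but every type and body here is an assumption made for the sketch (including `LOCAL_CRATE` as a plain constant and the hypothetical `demo_map` helper). None of it is the real `hir::map` code.

```rust
use std::collections::HashMap;

// Toy identifiers; the real definitions live in rustc and differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CrateNum(u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefIndex(u32);
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefId { krate: CrateNum, index: DefIndex }
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct NodeId(u32);

// Assumption of this sketch: crate number 0 is the crate being compiled.
const LOCAL_CRATE: CrateNum = CrateNum(0);

#[derive(Debug, PartialEq)]
struct Item(&'static str);
#[derive(Debug, PartialEq)]
struct Expr(&'static str);

// Matching on this enum tells you what kind of node an id refers to.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Node<'hir> {
    Item(&'hir Item),
    Expr(&'hir Expr),
}

struct Map<'hir> {
    nodes: HashMap<NodeId, Node<'hir>>,
    def_to_node: HashMap<DefIndex, NodeId>,
}

impl<'hir> Map<'hir> {
    // `None` for definitions outside the current crate: they have no HIR node.
    fn as_local_node_id(&self, def_id: DefId) -> Option<NodeId> {
        if def_id.krate != LOCAL_CRATE {
            return None;
        }
        self.def_to_node.get(&def_id.index).copied()
    }

    fn find(&self, id: NodeId) -> Option<Node<'hir>> {
        self.nodes.get(&id).copied()
    }

    // Panics if `id` does not refer to an expression.
    fn expect_expr(&self, id: NodeId) -> &'hir Expr {
        match self.find(id) {
            Some(Node::Expr(expr)) => expr,
            other => panic!("expected an expression, found {:?}", other),
        }
    }
}

// Hypothetical helper that registers one item and one expression.
fn demo_map<'hir>(item: &'hir Item, expr: &'hir Expr) -> Map<'hir> {
    let mut nodes = HashMap::new();
    nodes.insert(NodeId(10), Node::Item(item));
    nodes.insert(NodeId(11), Node::Expr(expr));
    let mut def_to_node = HashMap::new();
    def_to_node.insert(DefIndex(0), NodeId(10));
    Map { nodes, def_to_node }
}

fn main() {
    let item = Item("fn bar");
    let expr = Expr("x + y");
    let map = demo_map(&item, &expr);
    let local = DefId { krate: LOCAL_CRATE, index: DefIndex(0) };
    let foreign = DefId { krate: CrateNum(7), index: DefIndex(0) };
    println!("{:?}", map.as_local_node_id(local));
    println!("{:?}", map.as_local_node_id(foreign));
    println!("{:?}", map.find(NodeId(10)));
    println!("{:?}", map.expect_expr(NodeId(11)));
}
```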
--- a/src/librustc/hir/mod.rs
+++ b/src/librustc/hir/mod.rs
@ -413,6 +413,9 @@ pub struct WhereEqPredicate {

 pub type CrateConfig = HirVec<P<MetaItem>>;

+/// The top-level data structure that stores the entire contents of
+/// the crate currently being compiled.
+///
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
 pub struct Crate {
    pub module: Mod,
@ -927,7 +930,27 @@ pub struct BodyId {
    pub node_id: NodeId,
 }

-/// The body of a function or constant value.
+/// The body of a function, closure, or constant value. In the case of
+/// a function, the body contains not only the function body itself
+/// (which is an expression), but also the argument patterns, since
+/// those are something that the caller doesn't really care about.
+///
+/// Example:
+///
+/// ```rust
+/// fn foo((x, y): (u32, u32)) -> u32 {
+///     x + y
+/// }
+/// ```
+///
+/// Here, the `Body` associated with `foo()` would contain:
+///
+/// - an `arguments` array containing the `(x, y)` pattern
+/// - a `value` containing the `x + y` expression (maybe wrapped in a block)
+/// - `is_generator` would be false
+///
+/// All bodies have an **owner**, which can be accessed via the HIR
+/// map using `body_owner_def_id()`.
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
 pub struct Body {
    pub arguments: HirVec<Arg>,
--- a/src/librustc/lib.rs
+++ b/src/librustc/lib.rs
@ -8,7 +8,28 @@
 // option. This file may not be copied, modified, or distributed
 // except according to those terms.

-//! The Rust compiler.
+//! The "main crate" of the Rust compiler. This crate contains common
+//! type definitions that are used by the other crates in the rustc
+//! "family". Some prominent examples (note that each of these modules
+//! has its own README with further details):
+//!
+//! - **HIR.** The "high-level (H) intermediate representation (IR)" is
+//!   defined in the `hir` module.
+//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
+//!   defined in the `mir` module. This module contains only the
+//!   *definition* of the MIR; the passes that transform and operate
+//!   on MIR are found in the `librustc_mir` crate.
+//! - **Types.** The internal representation of types used in rustc is
+//!   defined in the `ty` module. This includes the **type context**
+//!   (or `tcx`), which is the central context during most of
+//!   compilation, containing the interners and other things.
+//! - **Traits.** Trait resolution is implemented in the `traits` module.
+//! - **Type inference.** The type inference code can be found in the `infer` module;
+//!   this code handles low-level equality and subtyping operations. The
+//!   type check pass in the compiler is found in the `librustc_typeck` crate.
+//!
+//! For a deeper explanation of how the compiler works and is
+//! organized, see the README.md file in this directory.
 //!
 //! # Note
 //!
--- a/src/librustc/ty/README.md
+++ b/src/librustc/ty/README.md
@ -0,0 +1,159 @@
+# Types and the Type Context
+
+The `ty` module defines how the Rust compiler represents types
+internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
+which is the central data structure in the compiler.
+
+## The tcx and how it uses lifetimes
+
+The `tcx` ("typing context") is the central data structure in the
+compiler. It is the context that you use to perform all manner of
+queries. The struct `TyCtxt` defines a reference to this shared context:
+
+```rust
+tcx: TyCtxt<'a, 'gcx, 'tcx>
+//          --  ----  ----
+//          |   |     |
+//          |   |     innermost arena lifetime (if any)
+//          |   "global arena" lifetime
+//          lifetime of this reference
+```
+
+As you can see, the `TyCtxt` type takes three lifetime parameters.
+These lifetimes are perhaps the most complex thing to understand about
+the tcx. During rust compilation, we allocate most of our memory in
+**arenas**, which are basically pools of memory that get freed all at
+once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
+you know that it refers to arena-allocated data (or data that lives as
+long as the arenas, anyhow).
+
+We use two distinct levels of arenas. The outer level is the "global
+arena". This arena lasts for the entire compilation: so anything you
+allocate in there is only freed once compilation is basically over
+(actually, when we shift to executing LLVM).
+
+To reduce peak memory usage, when we do type inference, we also use an
+inner level of arena. These arenas get thrown away once type inference
+is over. This is done because type inference generates a lot of
+"throw-away" types that are not particularly interesting after type
+inference completes, so keeping around those allocations would be
+wasteful.
+
+Often, we wish to write code that explicitly asserts that it is not
+taking place during inference. In that case, there is no "local"
+arena, and all the types that you can access are allocated in the
+global arena.  To express this, the idea is to use the same lifetime
+for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
+confusing, we tend to use the name `'tcx` in such contexts. Here is an
+example:
+
+```rust
+fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
+    //                                        ----  ----
+    //                                        Using the same lifetime here asserts
+    //                                        that the innermost arena accessible through
+    //                                        this reference *is* the global arena.
+}
+```
+
+In contrast, if you want to write code that can be used during type inference, then you
+need to declare distinct `'gcx` and `'tcx` lifetime parameters:
+
+```rust
+fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
+    //                                                ----  ----
+    //                                        Using different lifetimes here means that
+    //                                        the innermost arena *may* be distinct
+    //                                        from the global arena (but doesn't have to be).
+}
+```
+
+### Allocating and working with types
+
+Rust types are represented using the `ty::Ty<'tcx>` type. This is in fact a simple type alias
+for a reference with `'tcx` lifetime:
+
+```rust
+pub type Ty<'tcx> = &'tcx TyS<'tcx>;
+```
+
+The `TyS` struct defines the actual details of how a type is
+represented. The most interesting part of it is the `sty` field, which
+contains an enum that lets us test what sort of type this is. For
+example, it is very common to see code that tests what sort of type you have
+that looks roughly like so:
+
+```rust
+fn test_type<'tcx>(ty: Ty<'tcx>) {
+    match ty.sty {
+        ty::TyArray(elem_ty, len) => { ... }
+        ...
+    }
+}
+```
+
+(Note though that doing such low-level tests on types during inference
+can be risky, as there may be inference variables and other things
+to consider, or sometimes types are not yet known that will become
+known later.)
+
+To allocate a new type, you can use the various `mk_` methods defined
+on the `tcx`. These have names that correspond mostly to the various kinds
+of type variants. For example:
+
+```rust
+let array_ty = tcx.mk_array(elem_ty, len * 2);
+```
+
+These methods all return a `Ty<'tcx>` -- note that the lifetime you
+get back is the lifetime of the innermost arena that this `tcx` has
+access to. In fact, types are always canonicalized and interned (so we
+never allocate exactly the same type twice) and are always allocated
+in the outermost arena where they can be (so, if they do not contain
+any inference variables or other "temporary" types, they will be
+allocated in the global arena). However, the lifetime `'tcx` is always
+a safe approximation, so that is what you get back.
+
+NB. Because types are interned, it is possible to compare them for
+equality efficiently using `==` -- however, this is almost never what
+you want to do unless you happen to be hashing and looking for
+duplicates. This is because often in Rust there are multiple ways to
+represent the same type, particularly once inference is involved. If
+you are going to be testing for type equality, you probably need to
+start looking into the inference code to do it right.
+
+You can also find various common types in the tcx itself by accessing
+`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).
+
+### Beyond types: Other kinds of arena-allocated data structures
+
+In addition to types, there are a number of other arena-allocated data
+structures that you can allocate, and which are found in this
+module. Here are a few examples:
+
+- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
+  specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
+  would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`).
+- `TraitRef`, typically passed by value -- a **trait reference**
+  consists of a reference to a trait along with its various type
+  parameters (including `Self`), like `i32: Display` (here, the def-id
+  would reference the `Display` trait, and the substs would contain
+  `i32`).
+- `Predicate` defines something the trait system has to prove (see `traits` module).
+
+### Import conventions
+
+Although there is no hard and fast rule, the `ty` module tends to be used like so:
+
+```rust
+use ty::{self, Ty, TyCtxt};
+```
+
+In particular, since they are so common, the `Ty` and `TyCtxt` types
+are imported directly. Other types are often referenced with an
+explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
+choose to import a larger or smaller set of names explicitly.
+
+
+
+
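A note on the `ty` README above: its claim that types are interned, so the same type is never allocated twice, can be illustrated with a small standalone sketch. To stay in safe, self-contained Rust this swaps the compiler's arenas and `&'tcx` references for `Rc` and a `HashSet`. The `TyKind` enum and the `Interner` struct are hypothetical, and only the method name `mk_array` is borrowed from the README.

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Toy type representation; rustc's real `TyS` is arena-allocated instead.
#[derive(Debug, PartialEq, Eq, Hash)]
enum TyKind {
    Bool,
    U32,
    Array(Ty, u64),
}

// Stand-in for `Ty<'tcx>`: a shared pointer to an interned type.
type Ty = Rc<TyKind>;

#[derive(Default)]
struct Interner {
    types: HashSet<Ty>,
}

impl Interner {
    // Returns the existing allocation if this exact type was seen before.
    fn intern(&mut self, kind: TyKind) -> Ty {
        if let Some(existing) = self.types.get(&kind) {
            return Rc::clone(existing);
        }
        let ty = Rc::new(kind);
        self.types.insert(Rc::clone(&ty));
        ty
    }

    // Named after the `mk_` methods on the tcx; the body is an assumption.
    fn mk_array(&mut self, elem: &Ty, len: u64) -> Ty {
        self.intern(TyKind::Array(Rc::clone(elem), len))
    }
}

fn main() {
    let mut interner = Interner::default();
    let u32_ty = interner.intern(TyKind::U32);
    let bool_ty = interner.intern(TyKind::Bool);
    let a = interner.mk_array(&u32_ty, 4);
    let b = interner.mk_array(&u32_ty, 4);
    // Both handles point at the same allocation, so comparing them is cheap.
    println!("same allocation: {}", Rc::ptr_eq(&a, &b));
    println!("distinct types interned: {}", interner.types.len());
    println!("{:?} {:?}", bool_ty, a);
}
```

As the README warns, pointer equality in a scheme like this only answers "is this the identical representation?", which is weaker than asking whether two types are equal once inference is involved.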
--- a/src/librustc/ty/context.rs
+++ b/src/librustc/ty/context.rs
@ -793,9 +793,11 @@ impl<'tcx> CommonTypes<'tcx> {
    }
 }

-/// The data structure to keep track of all the information that typechecker
-/// generates so that so that it can be reused and doesn't have to be redone
-/// later on.
+/// The central data structure of the compiler. Keeps track of all the
+/// information that the typechecker generates so that it can be
+/// reused and doesn't have to be redone later on.
+///
+/// See [the README](README.md) for more details.
 #[derive(Copy, Clone)]
 pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
    gcx: &'a GlobalCtxt<'gcx>,
--- a/src/librustc_back/README.md
+++ b/src/librustc_back/README.md
@ -0,0 +1,6 @@
+NB: This crate is part of the Rust compiler. For an overview of the
+compiler as a whole, see
+[the README.md file found in `librustc`](../librustc/README.md).
+
+`librustc_back` contains some very low-level details that are
+specific to different LLVM targets and so forth.
--- a/src/librustc_driver/README.md
+++ b/src/librustc_driver/README.md
@ -0,0 +1,12 @@
+NB: This crate is part of the Rust compiler. For an overview of the
+compiler as a whole, see
+[the README.md file found in `librustc`](../librustc/README.md).
+
+The `driver` crate is effectively the "main" function for the rust
+compiler.  It orchestrates the compilation process and "knits together"
+the code from the other crates within rustc. This crate itself does
+not contain any of the "main logic" of the compiler (though it does
+have some code related to pretty printing or other minor compiler
+options).
+
+
--- a/src/librustc_trans/README.md
+++ b/src/librustc_trans/README.md
@ -1 +1,7 @@
-See [librustc/README.md](../librustc/README.md).
+NB: This crate is part of the Rust compiler. For an overview of the
+compiler as a whole, see
+[the README.md file found in `librustc`](../librustc/README.md).
+
+The `trans` crate contains the code to convert from MIR into LLVM IR,
+and then from LLVM IR into machine code. In general it contains code
+that runs towards the end of the compilation process.
--- a/src/libsyntax/README.md
+++ b/src/libsyntax/README.md
@ -0,0 +1,7 @@
+NB: This crate is part of the Rust compiler. For an overview of the
+compiler as a whole, see
+[the README.md file found in `librustc`](../librustc/README.md).
+
+The `syntax` crate contains those things concerned purely with syntax
+– that is, the AST ("abstract syntax tree"), parser, pretty-printer,
+lexer, macro expander, and utilities for traversing ASTs.