rework the README.md for rustc and add other readmes

This takes way longer than I thought it would. =)
2024-11-01 15:01:51 +00:00 · 2017-08-31 14:33:19 -04:00 · 2017-08-31 14:33:19 -04:00 · 44e45d9fea
commit 44e45d9fea
parent 9a00f3cc30
11 changed files with 463 additions and 41 deletions
--- a/src/librustc/README.md
+++ b/src/librustc/README.md
@ -13,49 +13,82 @@ https://github.com/rust-lang/rust/issues
 Your concerns are probably the same as someone else's.
 You may also be interested in the
 [Rust Forge](https://forge.rust-lang.org/), which includes a number of
 interesting bits of information.
 Finally, at the end of this file is a GLOSSARY defining a number of
 common (and not necessarily obvious!) names that are used in the Rust
 compiler code. If you see some funky name and you'd like to know what
 it stands for, check there!
 The crates of rustc
 ===================
-Rustc consists of a number of crates, including `libsyntax`,
+Rustc consists of a number of crates, including `syntax`,
-`librustc`, `librustc_back`, `librustc_trans`, and `librustc_driver`
+`rustc`, `rustc_back`, `rustc_trans`, `rustc_driver`, and
-(the names and divisions are not set in stone and may change;
+many more. The source for each crate can be found in a directory
-in general, a finer-grained division of crates is preferable):
+like `src/libXXX`, where `XXX` is the crate name.
- [`libsyntax`][libsyntax] contains those things concerned purely with syntax –
+(NB. The names and divisions of these crates are not set in
-  that is, the AST, parser, pretty-printer, lexer, macro expander, and
+stone and may change over time -- for the time being, we tend towards
-  utilities for traversing ASTs – are in a separate crate called
+a finer-grained division to help with compilation time, though as
-  "syntax", whose files are in `./../libsyntax`, where `.` is the
+incremental compilation improves that may change.)
  current directory (that is, the parent directory of front/, middle/,
  back/, and so on).
- `librustc` (the current directory) contains the high-level analysis
+The dependency structure of these crates is roughly a diamond:
  passes, such as the type checker, borrow checker, and so forth.
  It is the heart of the compiler.
- [`librustc_back`][back] contains some very low-level details that are
+```
-  specific to different LLVM targets and so forth.
+                  rustc_driver
-
+                /      |       \
- [`librustc_trans`][trans] contains the code to convert from Rust IR into LLVM
+              /        |         \
-  IR, and then from LLVM IR into machine code, as well as the main
+            /          |           \
-  driver that orchestrates all the other passes and various other bits
+          /            v             \
-  of miscellany. In general it contains code that runs towards the
+rustc_trans    rustc_borrowck   ...  rustc_metadata
-  end of the compilation process.
+          \            |            /
-
+            \          |          /
- [`librustc_driver`][driver] invokes the compiler from
+              \        |        /
-  [`libsyntax`][libsyntax], then the analysis phases from `librustc`, and
+                \      v      /
-  finally the lowering and codegen passes from [`librustc_trans`][trans].
+                    rustc
-
+                       |
-Roughly speaking the "order" of the three crates is as follows:
+                       v
-
+                    syntax
-              librustc_driver
+                    /    \
-                      |
+                  /       \
-    +-----------------+-------------------+
+           syntax_pos  syntax_ext
-    |                                     |
+```                    
    libsyntax -> librustc -> librustc_trans
-The compiler process:
+The idea is that `rustc_driver`, at the top of this lattice, basically
-=====================
+defines the overall control-flow of the compiler. It doesn't have much
 "real code", but instead ties together all of the code defined in the
 other crates and defines the overall flow of execution.
 At the other extreme, the `rustc` crate defines the common and
 pervasive data structures that all the rest of the compiler uses
 (e.g., how to represent types, traits, and the program itself). It
 also contains some amount of the compiler itself, although that is
 relatively limited.
 Finally, all the crates in the bulge in the middle define the bulk of
 the compiler -- they all depend on `rustc`, so that they can make use
 of the various types defined there, and they export public routines
 that `rustc_driver` will invoke as needed (more and more, what these
 crates export are "query definitions", but those are covered later
 on).
 Below `rustc` lie various crates that make up the parser and error
 reporting mechanism. For historical reasons, these crates do not have
 the `rustc_` prefix, but they are really just as much an internal part
 of the compiler and not intended to be stable (though they do wind up
 getting used by some crates in the wild; a practice we hope to
 gradually phase out).
 Each crate has a `README.md` file that describes, at a high level,
 what it contains, and tries to give some kind of explanation (some
 better than others).
 The compiler process
 ====================
 The Rust compiler is comprised of six main compilation phases.
@ -172,3 +205,29 @@ The 3 central data structures:
 [back]: https://github.com/rust-lang/rust/tree/master/src/librustc_back/
 [rustc]: https://github.com/rust-lang/rust/tree/master/src/librustc/
 [driver]: https://github.com/rust-lang/rust/tree/master/src/librustc_driver
 Glossary
 ========
 The compiler uses a number of...idiosyncratic abbreviations and
 things. This glossary attempts to list them and give you a few
 pointers for understanding them better.
 - AST -- the **abstract syntax tree** produced by the `syntax` crate; reflects user syntax
  very closely.
 - cx -- we tend to use "cx" as an abbreviation for context. See also tcx, infcx, etc.
 - HIR -- the **High-level IR**, created by lowering and desugaring the AST. See `librustc/hir`.
 - `'gcx` -- the lifetime of the global arena (see `librustc/ty`).
 - generics -- the set of generic type parameters defined on a type or item
 - infcx -- the inference context (see `librustc/infer`)
 - MIR -- the **Mid-level IR** that is created after type-checking for use by borrowck and trans.
  Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is
  found in `src/librustc_mir`.
 - obligation -- something that must be proven by the trait system.
 - sess -- the **compiler session**, which stores global data used throughout compilation
 - substs -- the **substitutions** for a given generic type or item
  (e.g., the `i32, u32` in `HashMap<i32, u32>`)
 - tcx -- the "typing context", main data structure of the compiler (see `librustc/ty`).
 - trans -- the code to **translate** MIR into LLVM IR.
 - trait reference -- a trait and values for its type parameters (see `librustc/ty`).
 - ty -- the internal representation of a **type** (see `librustc/ty`).
--- a/src/librustc/hir/README.md
+++ b/src/librustc/hir/README.md
@ -0,0 +1,123 @@
 # Introduction to the HIR
 The HIR -- "High-level IR" -- is the primary IR used in most of
 rustc. It is a desugared version of the "abstract syntax tree" (AST)
 that is generated after parsing, macro expansion, and name resolution
 have completed. Many parts of HIR resemble Rust surface syntax quite
 closely, with the exception that some of Rust's expression forms have
 been desugared away (as an example, `for` loops are converted into a
 `loop` and do not appear in the HIR).
 This README covers the main concepts of the HIR.
 ### Out-of-band storage and the `Crate` type
 The top-level data-structure in the HIR is the `Crate`, which stores
 the contents of the crate currently being compiled (we only ever
 construct HIR for the current crate). Whereas in the AST the crate
 data structure basically just contains the root module, the HIR
 `Crate` structure contains a number of maps and other things that
 serve to organize the content of the crate for easier access.
 For example, the contents of individual items (e.g., modules,
 functions, traits, impls, etc) in the HIR are not immediately
 accessible in the parents. So, for example, if you had a module item `foo`
 containing a function `bar()`:
 ```
 mod foo {
  fn bar() { }
 }
 ```
 Then in the HIR the representation of module `foo` (the `Mod`
 struct) would have only the **`ItemId`** `I` of `bar()`. To get the
 details of the function `bar()`, we would look up `I` in the
 `items` map.
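As a rough illustration of this out-of-band layout, here is a self-contained toy model. The names `Crate`, `Item`, and `ItemId` mirror the text, but the fields and the helpers `build_example` and `all_item_names` are assumptions made for this sketch, not the compiler's real definitions.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the compiler's `ItemId`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

/// A toy item: a function, or a module that lists its children by id only.
#[derive(Debug)]
enum Item {
    Fn { name: String },
    Mod { name: String, item_ids: Vec<ItemId> },
}

/// Toy `Crate`: every item lives in one flat map, out of band.
#[derive(Debug, Default)]
struct Crate {
    items: BTreeMap<ItemId, Item>,
}

impl Crate {
    /// Resolve an id to the item it names (panics on an unknown id).
    fn item(&self, id: ItemId) -> &Item {
        &self.items[&id]
    }
}

/// Builds the `mod foo { fn bar() { } }` example from the text and
/// returns the crate together with the id of `foo`.
fn build_example() -> (Crate, ItemId) {
    let foo = ItemId(0);
    let bar = ItemId(1);
    let mut krate = Crate::default();
    krate.items.insert(bar, Item::Fn { name: "bar".to_string() });
    krate.items.insert(
        foo,
        Item::Mod { name: "foo".to_string(), item_ids: vec![bar] },
    );
    (krate, foo)
}

/// Names of every item in the crate, found by walking the flat map
/// instead of recursing through modules.
fn all_item_names(krate: &Crate) -> Vec<&str> {
    krate
        .items
        .values()
        .map(|item| match item {
            Item::Fn { name } => name.as_str(),
            Item::Mod { name, .. } => name.as_str(),
        })
        .collect()
}
```

In this sketch the module `foo` stores only `ItemId(1)`; the contents of `bar()` are reachable only through `Crate::item`.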
 One nice result from this representation is that one can iterate
 over all items in the crate by iterating over the key-value pairs
 in these maps (without the need to trawl through the IR in total).
 There are similar maps for things like trait items and impl items,
 as well as "bodies" (explained below).
 The other reason to set up the representation this way is for better
 integration with incremental compilation. This way, if you gain access
 to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
 gain access to the contents of the function `bar()`. Instead, you only
 gain access to the **id** for `bar()`, and you must call some function to
 look up the contents of `bar()` given its id; this gives us a chance to
 observe that you accessed the data for `bar()` and record the
 dependency.
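The dependency-recording idea can be sketched in isolation. This is a toy under stated assumptions: `TrackedItems`, its `reads` log, and `recorded_reads` are hypothetical names invented for illustration, and the real incremental machinery is considerably more involved.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical stand-in for the compiler's `ItemId`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

/// Toy item map that logs every id whose contents were read.
struct TrackedItems {
    items: BTreeMap<ItemId, String>,
    reads: RefCell<Vec<ItemId>>,
}

impl TrackedItems {
    fn new(items: BTreeMap<ItemId, String>) -> Self {
        TrackedItems { items, reads: RefCell::new(Vec::new()) }
    }

    /// Look up an item's contents. Because every access goes through
    /// this function, it can record the access as a dependency.
    fn item(&self, id: ItemId) -> Option<&str> {
        self.reads.borrow_mut().push(id);
        self.items.get(&id).map(|s| s.as_str())
    }

    /// The ids read so far, in access order.
    fn recorded_reads(&self) -> Vec<ItemId> {
        self.reads.borrow().clone()
    }
}
```

Holding an `ItemId` costs nothing here; only the call to `item` shows up in the log, which is the property the paragraph above relies on.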
 ### Identifiers in the HIR
 Most of the code that has to deal with things in HIR tends not to
 carry around references into the HIR, but rather to carry around
 *identifier numbers* (or just "ids"). Right now, you will find four
 sorts of identifiers in active use:
 - `DefId`, which primarily names "definitions" or top-level items.
  - You can think of a `DefId` as being shorthand for a very explicit
    and complete path, like `std::collections::HashMap`. However,
    these paths are able to name things that are not nameable in
    normal Rust (e.g., impls), and they also include extra information
    about the crate (such as its version number, as two versions of
    the same crate can co-exist).
  - A `DefId` really consists of two parts, a `CrateNum` (which
    identifies the crate) and a `DefIndex` (which indexes into a list
    of items that is maintained per crate).
 - `HirId`, which combines the index of a particular item with an
  offset within that item.
  - The key point of a `HirId` is that it is *relative* to some item (which is named
    via a `DefId`).
 - `BodyId`, which is an absolute identifier that refers to a specific
  body (definition of a function or constant) in the crate. It is currently
  effectively a "newtype'd" `NodeId`.
 - `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
  - While these are still in common use, **they are being slowly phased out**.
  - Since they are absolute within the crate, adding a new node
    anywhere in the tree causes the node-ids of all subsequent code in
    the crate to change. This is terrible for incremental compilation,
    as you can perhaps imagine.
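To make the difference between absolute and item-relative ids concrete, here is a small sketch. The struct layouts and the two numbering functions are simplifying assumptions made for illustration (for instance, the owner is a plain index here rather than a `DefId`); they are not the compiler's definitions.

```rust
/// Hypothetical, simplified absolute id: one counter for the whole crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(u32);

/// Hypothetical, simplified relative id: an owning item plus an offset.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct HirId {
    owner: u32,    // index of the enclosing item
    local_id: u32, // offset within that item
}

/// Number nodes crate-wide. `node_counts[i]` is how many nodes item `i` has.
fn absolute_ids(node_counts: &[u32]) -> Vec<Vec<NodeId>> {
    let mut next = 0;
    node_counts
        .iter()
        .map(|&count| {
            let ids: Vec<NodeId> = (next..next + count).map(NodeId).collect();
            next += count;
            ids
        })
        .collect()
}

/// Number nodes per item, restarting from zero for each owner.
fn relative_ids(node_counts: &[u32]) -> Vec<Vec<HirId>> {
    node_counts
        .iter()
        .enumerate()
        .map(|(owner, &count)| {
            (0..count)
                .map(|local_id| HirId { owner: owner as u32, local_id })
                .collect::<Vec<_>>()
        })
        .collect()
}
```

With node counts `[2, 3]`, growing the first item to three nodes shifts every `NodeId` in the second item by one, while the second item's `HirId`s stay the same.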
 ### HIR Map
 Most of the time when you are working with the HIR, you will do so via
 the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
 the `hir::map` module). The HIR map contains a number of methods to
 convert between ids of various kinds and to look up data associated
 with a HIR node.
 For example, if you have a `DefId`, and you would like to convert it
 to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
 returns an `Option<NodeId>` -- this will be `None` if the def-id
 refers to something outside of the current crate (since then it has no
 HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
 the definition.
 Similarly, you can use `tcx.hir.find(n)` to look up the node for a
 `NodeId`. This returns an `Option<Node<'tcx>>`, where `Node` is an enum
 defined in the map; by matching on this you can find out what sort of
 node the node-id referred to and also get a pointer to the data
 itself. Often, you know what sort of node `n` is -- e.g., if you know
 that `n` must be some HIR expression, you can do
 `tcx.hir.expect_expr(n)`, which will extract and return the
 `&hir::Expr`, panicking if `n` is not in fact an expression.
 Finally, you can use the HIR map to find the parents of nodes, via
 calls like `tcx.hir.get_parent_node(n)`.
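The shape of these lookups can be mimicked with a toy map. Everything below is an assumption made for illustration: `ToyHirMap`, its fields, `LOCAL_CRATE` as crate number `0`, the string payloads in `Node`, and `example_map` are invented, and the toy `get_parent_node` returns an `Option` purely for simplicity. Only the method names are taken from the text.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct DefId { krate: u32, index: u32 }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Toy counterpart of the `Node` enum: what kind of thing an id names.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    Item(String),
    Expr(String),
}

/// Assumed crate number of the crate being compiled.
const LOCAL_CRATE: u32 = 0;

struct ToyHirMap {
    def_to_node: HashMap<DefId, NodeId>,
    nodes: HashMap<NodeId, Node>,
    parents: HashMap<NodeId, NodeId>,
}

impl ToyHirMap {
    /// `None` for definitions from other crates, which have no HIR node here.
    fn as_local_node_id(&self, def_id: DefId) -> Option<NodeId> {
        if def_id.krate != LOCAL_CRATE {
            return None;
        }
        self.def_to_node.get(&def_id).copied()
    }

    /// Look up whatever node the id names, if any.
    fn find(&self, id: NodeId) -> Option<&Node> {
        self.nodes.get(&id)
    }

    /// Panics if `id` does not name an expression.
    fn expect_expr(&self, id: NodeId) -> &str {
        match self.find(id) {
            Some(Node::Expr(text)) => text.as_str(),
            other => panic!("expected expression, found {:?}", other),
        }
    }

    fn get_parent_node(&self, id: NodeId) -> Option<NodeId> {
        self.parents.get(&id).copied()
    }
}

/// One local item (node 0) containing one expression (node 1).
fn example_map() -> ToyHirMap {
    let (item, expr) = (NodeId(0), NodeId(1));
    let mut map = ToyHirMap {
        def_to_node: HashMap::new(),
        nodes: HashMap::new(),
        parents: HashMap::new(),
    };
    map.def_to_node.insert(DefId { krate: LOCAL_CRATE, index: 0 }, item);
    map.nodes.insert(item, Node::Item("fn bar".to_string()));
    map.nodes.insert(expr, Node::Expr("x + y".to_string()));
    map.parents.insert(expr, item);
    map
}
```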
 ### HIR Bodies
 A **body** represents some kind of executable code, such as the body
 of a function/closure or the definition of a constant. Bodies are
 associated with an **owner**, which is typically some kind of item
 (e.g., a `fn()` or `const`), but could also be a closure expression
 (e.g., `|x, y| x + y`). You can use the HIR map to find the body
 associated with a given def-id (`maybe_body_owned_by()`) or to find
 the owner of a body (`body_owner_def_id()`).
--- a/src/librustc/hir/map/README.md
+++ b/src/librustc/hir/map/README.md
@ -0,0 +1,4 @@
 The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
 HIR and convert between various forms of identifiers. See [the HIR README] for more information.
 [the HIR README]: ../README.md
--- a/src/librustc/hir/mod.rs
+++ b/src/librustc/hir/mod.rs
@ -413,6 +413,9 @@ pub struct WhereEqPredicate {
 pub type CrateConfig = HirVec<P<MetaItem>>;
 /// The top-level data structure that stores the entire contents of
 /// the crate currently being compiled.
 ///
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
 pub struct Crate {
    pub module: Mod,
@ -927,7 +930,27 @@ pub struct BodyId {
    pub node_id: NodeId,
 }
-/// The body of a function or constant value.
+/// The body of a function, closure, or constant value. In the case of
 /// a function, the body contains not only the function body itself
 /// (which is an expression), but also the argument patterns, since
 /// those are something that the caller doesn't really care about.
 ///
 /// Example:
 ///
 /// ```rust
 /// fn foo((x, y): (u32, u32)) -> u32 {
 ///     x + y
 /// }
 /// ```
 ///
 /// Here, the `Body` associated with `foo()` would contain:
 ///
 /// - an `arguments` array containing the `(x, y)` pattern
 /// - a `value` containing the `x + y` expression (maybe wrapped in a block)
 /// - `is_generator` would be false
 ///
 /// All bodies have an **owner**, which can be accessed via the HIR
 /// map using `body_owner_def_id()`.
 #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
 pub struct Body {
    pub arguments: HirVec<Arg>,
--- a/src/librustc/lib.rs
+++ b/src/librustc/lib.rs
@ -8,7 +8,28 @@
 // option. This file may not be copied, modified, or distributed
 // except according to those terms.
-//! The Rust compiler.
+//! The "main crate" of the Rust compiler. This crate contains common
 //! type definitions that are used by the other crates in the rustc
 //! "family". Some prominent examples (note that each of these modules
 //! has its own README with further details):
 //!
 //! - **HIR.** The "high-level (H) intermediate representation (IR)" is
 //!   defined in the `hir` module.
 //! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
 //!   defined in the `mir` module. This module contains only the
 //!   *definition* of the MIR; the passes that transform and operate
 //!   on MIR are found in the `librustc_mir` crate.
 //! - **Types.** The internal representation of types used in rustc is
 //!   defined in the `ty` module. This includes the **type context**
 //!   (or `tcx`), which is the central context during most of
 //!   compilation, containing the interners and other things.
 //! - **Traits.** Trait resolution is implemented in the `traits` module.
 //! - **Type inference.** The type inference code can be found in the `infer` module;
 //!   this code handles low-level equality and subtyping operations. The
 //!   type check pass in the compiler is found in the `librustc_typeck` crate.
 //!
 //! For a deeper explanation of how the compiler works and is
 //! organized, see the README.md file in this directory.
 //!
 //! # Note
 //!
--- a/src/librustc/ty/README.md
+++ b/src/librustc/ty/README.md
@ -0,0 +1,159 @@
 # Types and the Type Context
 The `ty` module defines how the Rust compiler represents types
 internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
 which is the central data structure in the compiler.
 ## The tcx and how it uses lifetimes
 The `tcx` ("typing context") is the central data structure in the
 compiler. It is the context that you use to perform all manner of
 queries. The struct `TyCtxt` defines a reference to this shared context:
 ```rust
 tcx: TyCtxt<'a, 'gcx, 'tcx>
 //          --  ----  ----
 //          |   |     |
 //          |   |     innermost arena lifetime (if any)
 //          |   "global arena" lifetime
 //          lifetime of this reference
 ```
 As you can see, the `TyCtxt` type takes three lifetime parameters.
 These lifetimes are perhaps the most complex thing to understand about
 the tcx. During Rust compilation, we allocate most of our memory in
 **arenas**, which are basically pools of memory that get freed all at
 once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
 you know that it refers to arena-allocated data (or data that lives as
 long as the arenas, anyhow).
 We use two distinct levels of arenas. The outer level is the "global
 arena". This arena lasts for the entire compilation: so anything you
 allocate in there is only freed once compilation is basically over
 (actually, when we shift to executing LLVM).
 To reduce peak memory usage, when we do type inference, we also use an
 inner level of arena. These arenas get thrown away once type inference
 is over. This is done because type inference generates a lot of
 "throw-away" types that are not particularly interesting after type
 inference completes, so keeping around those allocations would be
 wasteful.
 Often, we wish to write code that explicitly asserts that it is not
 taking place during inference. In that case, there is no "local"
 arena, and all the types that you can access are allocated in the
 global arena. To express this, the idea is to use the same lifetime
 for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
 confusing, we tend to use the name `'tcx` in such contexts. Here is an
 example:
 ```rust
 fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
    //                                        ----  ----
    //                                        Using the same lifetime here asserts
    //                                        that the innermost arena accessible through
    //                                        this reference *is* the global arena.
 }
 ```
 In contrast, if you want to write code that can be used during type inference, then you
 need to declare distinct `'gcx` and `'tcx` lifetime parameters:
 ```rust
 fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
    //                                                ----  ----
    //                                        Using different lifetimes here means that
    //                                        the innermost arena *may* be distinct
    //                                        from the global arena (but doesn't have to be).
 }
 ```
 ### Allocating and working with types
 Rust types are represented using the `ty::Ty<'tcx>` type. This is in fact a simple type alias
 for a reference with `'tcx` lifetime:
 ```rust
 pub type Ty<'tcx> = &'tcx TyS<'tcx>;
 ```
 The `TyS` struct defines the actual details of how a type is
 represented. The most interesting part of it is the `sty` field, which
 contains an enum that lets us test what sort of type this is. For
 example, it is very common to see code that tests what sort of type you have
 that looks roughly like so:
 ```rust
 fn test_type<'tcx>(ty: Ty<'tcx>) {
    match ty.sty {
        ty::TyArray(elem_ty, len) => { ... }
        ...
    }
 }
 ```
 (Note though that doing such low-level tests on types during inference
 can be risky, as there may be inference variables and other things
 to consider, or sometimes types are not yet known that will become
 known later.)
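For readers who want something they can compile outside the compiler, the pattern above can be reproduced with a toy `TyS`. The variant set below (including `TyInfer` as a stand-in for an unresolved inference variable) and the `describe` function are assumptions made for this sketch; the real enum has many more variants and different payloads.

```rust
/// Toy version of the alias from the text: a type is a reference to a `TyS`.
type Ty<'tcx> = &'tcx TyS<'tcx>;

struct TyS<'tcx> {
    sty: TypeVariants<'tcx>,
}

enum TypeVariants<'tcx> {
    TyBool,
    TyUint,
    TyArray(Ty<'tcx>, usize),
    /// Stand-in for a not-yet-resolved inference variable.
    TyInfer(u32),
}

/// Render a type by matching on its `sty` field.
fn describe<'tcx>(ty: Ty<'tcx>) -> String {
    match ty.sty {
        TypeVariants::TyArray(elem_ty, len) => {
            format!("[{}; {}]", describe(elem_ty), len)
        }
        TypeVariants::TyBool => "bool".to_string(),
        TypeVariants::TyUint => "u32".to_string(),
        // During inference the structure may simply not be known yet.
        TypeVariants::TyInfer(n) => format!("?{}", n),
    }
}
```

The `TyInfer` arm is where the caveat above bites: a match that only looks for `TyArray` would miss a variable that later resolves to an array.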
 To allocate a new type, you can use the various `mk_` methods defined
 on the `tcx`. These have names that correspond mostly to the various kinds
 of type variants. For example:
 ```rust
 let array_ty = tcx.mk_array(elem_ty, len * 2);
 ```
 These methods all return a `Ty<'tcx>` -- note that the lifetime you
 get back is the lifetime of the innermost arena that this `tcx` has
 access to. In fact, types are always canonicalized and interned (so we
 never allocate exactly the same type twice) and are always allocated
 in the outermost arena where they can be (so, if they do not contain
 any inference variables or other "temporary" types, they will be
 allocated in the global arena). However, the lifetime `'tcx` is always
 a safe approximation, so that is what you get back.
 NB. Because types are interned, it is possible to compare them for
 equality efficiently using `==` -- however, this is almost never what
 you want to do unless you happen to be hashing and looking for
 duplicates. This is because often in Rust there are multiple ways to
 represent the same type, particularly once inference is involved. If
 you are going to be testing for type equality, you probably need to
 start looking into the inference code to do it right.
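Interning itself is easy to sketch. The toy below hands out small integer ids instead of arena references, which is an assumption made to keep the example self-contained; `Interner`, `TyId`, and `TyKind` are hypothetical names, and only `mk_array` echoes a method named in the text.

```rust
use std::collections::HashMap;

/// Hypothetical handle to an interned type; equal ids mean equal types.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TyId(usize);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum TyKind {
    U32,
    Array(TyId, u64),
}

#[derive(Default)]
struct Interner {
    kinds: Vec<TyKind>,
    ids: HashMap<TyKind, TyId>,
}

impl Interner {
    /// Return the existing id for `kind`, or allocate a new one.
    fn intern(&mut self, kind: TyKind) -> TyId {
        if let Some(&id) = self.ids.get(&kind) {
            return id;
        }
        let id = TyId(self.kinds.len());
        self.kinds.push(kind.clone());
        self.ids.insert(kind, id);
        id
    }

    /// Build (or reuse) the array type `[elem; len]`.
    fn mk_array(&mut self, elem: TyId, len: u64) -> TyId {
        self.intern(TyKind::Array(elem, len))
    }

    /// Inspect the structure behind an id.
    fn kind(&self, id: TyId) -> &TyKind {
        &self.kinds[id.0]
    }
}
```

Because structurally identical types always receive the same id, comparing two `TyId`s is a single integer comparison. As the text warns, that tells you whether two types are represented identically, not whether they are the same type once inference is taken into account.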
 You can also find various common types in the tcx itself by accessing
 `tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).
 ### Beyond types: Other kinds of arena-allocated data structures
 In addition to types, there are a number of other arena-allocated data
 structures that you can allocate, and which are found in this
 module. Here are a few examples:
 - `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
  specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
  would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`).
 - `TraitRef`, typically passed by value -- a **trait reference**
  consists of a reference to a trait along with its various type
  parameters (including `Self`), like `i32: Display` (here, the def-id
  would reference the `Display` trait, and the substs would contain
  `i32`).
 - `Predicate` defines something the trait system has to prove (see `traits` module).
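A trait reference can be pictured as a def-id plus a slice of substitutions. The sketch below is a toy: the two-variant `Ty` enum, the numeric `DefId`, the `DISPLAY` constant, and the convention that `Self` sits at index 0 of `substs` are all assumptions chosen for illustration.

```rust
/// Hypothetical stand-in for a definition id.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DefId(u32);

/// A deliberately tiny universe of types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Ty {
    I32,
    U32,
}

/// Toy trait reference: a trait plus the types substituted for its
/// parameters, with `Self` assumed to be stored first.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TraitRef<'tcx> {
    def_id: DefId,
    substs: &'tcx [Ty],
}

impl<'tcx> TraitRef<'tcx> {
    /// The type the trait is being applied to.
    fn self_ty(&self) -> Ty {
        self.substs[0]
    }
}

/// Pretend def-id of the `Display` trait (an arbitrary number).
const DISPLAY: DefId = DefId(42);

/// Builds the `i32: Display` example from the text over caller-owned substs.
fn i32_display<'tcx>(substs: &'tcx [Ty]) -> TraitRef<'tcx> {
    TraitRef { def_id: DISPLAY, substs }
}
```

The struct is `Copy` and only borrows its substitutions, which is consistent with the text's note that a `TraitRef` is typically passed by value.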
 ### Import conventions
 Although there is no hard and fast rule, the `ty` module tends to be used like so:
 ```rust
 use ty::{self, Ty, TyCtxt};
 ```
 In particular, since they are so common, the `Ty` and `TyCtxt` types
 are imported directly. Other types are often referenced with an
 explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
 choose to import a larger or smaller set of names explicitly.
--- a/src/librustc/ty/context.rs
+++ b/src/librustc/ty/context.rs
@ -793,9 +793,11 @@ impl<'tcx> CommonTypes<'tcx> {
    }
 }
-/// The data structure to keep track of all the information that typechecker
+/// The central data structure of the compiler. Keeps track of all the
-/// generates so that so that it can be reused and doesn't have to be redone
+/// information that the typechecker generates so that it can be
-/// later on.
+/// reused and doesn't have to be redone later on.
 ///
 /// See [the README](README.md) for more details.
 #[derive(Copy, Clone)]
 pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
    gcx: &'a GlobalCtxt<'gcx>,
--- a/src/librustc_back/README.md
+++ b/src/librustc_back/README.md
@ -0,0 +1,6 @@
 NB: This crate is part of the Rust compiler. For an overview of the
 compiler as a whole, see
 [the README.md file found in `librustc`](../librustc/README.md).
 `librustc_back` contains some very low-level details that are
 specific to different LLVM targets and so forth.
--- a/src/librustc_driver/README.md
+++ b/src/librustc_driver/README.md
@ -0,0 +1,12 @@
 NB: This crate is part of the Rust compiler. For an overview of the
 compiler as a whole, see
 [the README.md file found in `librustc`](../librustc/README.md).
 The `driver` crate is effectively the "main" function for the Rust
 compiler. It orchestrates the compilation process and "knits together"
 the code from the other crates within rustc. This crate itself does
 not contain any of the "main logic" of the compiler (though it does
 have some code related to pretty printing or other minor compiler
 options).
--- a/src/librustc_trans/README.md
+++ b/src/librustc_trans/README.md
@ -1 +1,7 @@
-See [librustc/README.md](../librustc/README.md).
+NB: This crate is part of the Rust compiler. For an overview of the
 compiler as a whole, see
 [the README.md file found in `librustc`](../librustc/README.md).
 The `trans` crate contains the code to convert from MIR into LLVM IR,
 and then from LLVM IR into machine code. In general it contains code
 that runs towards the end of the compilation process.
--- a/src/libsyntax/README.md
+++ b/src/libsyntax/README.md
@ -0,0 +1,7 @@
 NB: This crate is part of the Rust compiler. For an overview of the
 compiler as a whole, see
 [the README.md file found in `librustc`](../librustc/README.md).
 The `syntax` crate contains those things concerned purely with syntax
 – that is, the AST ("abstract syntax tree"), parser, pretty-printer,
 lexer, macro expander, and utilities for traversing ASTs.