Skip to content

抽象语法树

🌐 AST

下一章中的解析器负责将标记(Tokens)转换成抽象语法树(AST)。与源代码文本相比,在 AST 上工作要方便得多。

🌐 The parser in the upcoming chapter is responsible for turning Tokens into an abstract syntax tree (AST). It is much nicer to work on the AST compared to the source text.

所有 JavaScript 工具都在 AST 级别上工作,例如:

🌐 All JavaScript toolings work on the AST level, for example:

  • 一个代码检查工具(例如 ESLint)会检查 AST 是否有错误
  • 格式化工具(例如 prettier)将 AST 打印回 JavaScript 文本
  • 一个压缩器(例如 terser)会转换抽象语法树 (AST)
  • 打包工具会连接不同文件的 AST 之间的所有导入和导出语句

在本章中,我们将使用 Rust 的结构体和枚举来构建 JavaScript 的 AST。

🌐 In this chapter, let's construct a JavaScript AST by using Rust structs and enums.

熟悉 AST

🌐 Getting familiar with the AST

为了让自己熟悉 AST,让我们访问 ASTExplorer 看看它的样子。在顶部面板选择 JavaScript,然后选择 acorn,输入 var a,我们就能看到树状视图和 JSON 视图。

🌐 To get ourselves comfortable with an AST, let's visit ASTExplorer and see what it looks like. On the top panel, select JavaScript, and then acorn, type in var a and we will see a tree view and a JSON view.

json
{
  "type": "Program",
  "start": 0,
  "end": 5,
  "body": [
    {
      "type": "VariableDeclaration",
      "start": 0,
      "end": 5,
      "declarations": [
        {
          "type": "VariableDeclarator",
          "start": 4,
          "end": 5,
          "id": {
            "type": "Identifier",
            "start": 4,
            "end": 5,
            "name": "a"
          },
          "init": null
        }
      ],
      "kind": "var"
    }
  ],
  "sourceType": "script"
}

由于这是一个树,每个对象都是一个具有类型名称的节点(例如 ProgramVariableDeclarationVariableDeclaratorIdentifier)。 startend 是来自源的偏移量。

🌐 Since this is a tree, every object is a node with a type name (e.g. Program, VariableDeclaration, VariableDeclarator, Identifier). start and end are the offsets from the source.

树形结构

🌐 estree

estree 是 JavaScript 的社区标准语法规范,它定义了 所有的 AST 节点,以便不同的工具能够相互兼容。

任何 AST 节点的基本构建块是 Node 类型:

🌐 The basic building block for any AST node is the Node type:

rust
#[derive(Debug, Default, Clone, Copy, Serialize, PartialEq, Eq)]
pub struct Node {
    /// Start offset in source
    pub start: usize,

    /// End offset in source
    pub end: usize,
}

impl Node {
    pub fn new(start: usize, end: usize) -> Self {
        Self { start, end }
    }
}

“var a”的AST定义为

🌐 AST for var a is defined as

rust
pub struct Program {
    pub node: Node,
    pub body: Vec<Statement>,
}

pub enum Statement {
    VariableDeclarationStatement(VariableDeclaration),
}

pub struct VariableDeclaration {
    pub node: Node,
    pub declarations: Vec<VariableDeclarator>,
}

pub struct VariableDeclarator {
    pub node: Node,
    pub id: BindingIdentifier,
    pub init: Option<Expression>,
}

pub struct BindingIdentifier {
    pub node: Node,
    pub name: String,
}

pub enum Expression {
}

Rust 没有继承,所以 Node 被添加到每个结构体中(这被称为“组合优于继承”)。

🌐 Rust does not have inheritance, so Node is added to each struct (this is called "composition over Inheritance").

StatementExpression 是枚举类型,因为它们将会扩展为许多其他节点类型,例如:

rust
pub enum Expression {
    AwaitExpression(AwaitExpression),
    YieldExpression(YieldExpression),
}

pub struct AwaitExpression {
    pub node: Node,
    pub expression: Box<Expression>,
}

pub struct YieldExpression {
    pub node: Node,
    pub expression: Box<Expression>,
}

需要“Box”,因为Rust中不允许自指结构。

🌐 The Box is needed because self-referential structs are not allowed in Rust.

INFO

JavaScript 语法有很多难点,阅读 语法教程 会很有趣。

Rust 优化

🌐 Rust Optimizations

内存分配

🌐 Memory Allocations

我们需要注意堆分配的结构体,比如 VecBox,因为堆分配并不便宜。

🌐 We need to look out for heap-allocated structs such as Vec and Box because heap allocations are not cheap.

看看 swc 的实际实现,我们可以看到一个 AST 可以有很多 BoxVec,并且注意到 StatementExpression 枚举包含十几个枚举变体。

🌐 Take a look at the real world implementation from swc, we can see that an AST can have lots of Boxs and Vecs, and also note that the Statement and Expression enums contain a dozen of enum variants.

记忆竞技场

🌐 Memory Arena

实际上,为 AST 使用全局内存分配器并不是很高效。每个 BoxVec 都是按需分配的,然后分别释放。我们希望做的是预先分配内存,并整体释放它。

🌐 Using the global memory allocator for the AST is actually not really efficient. Every Box and Vec are allocated on demand and then dropped separately. What we would like to do is pre-allocate memory and drop it in wholesale.

bumpalo 根据其文档显示,是我们用例的一个非常好的候选方案:

增量分配是一种快速但有限的分配方法。 我们有一块内存,并在这块内存中维护一个指针。每当我们分配一个对象时, 我们会快速检查在这块内存中是否有足够的容量来分配该对象,然后将指针按对象的大小更新。就是这样!

增量分配的缺点是没有通用的方法来释放单个对象或回收已不再使用对象的内存区域。

这些取舍使增量分配非常适合阶段性分配。也就是说,一组对象将在同一程序阶段分配、使用,然后可以作为一个整体一起释放。

通过使用 bumpalo::collections::Vecbumpalo::boxed::Box,我们的 AST 将会添加生命周期:

🌐 By using bumpalo::collections::Vec and bumpalo::boxed::Box, our AST will have lifetimes added to it:

rust
use bumpalo::collections::Vec;
use bumpalo::boxed::Box;

pub enum Expression<'a> {
    AwaitExpression(Box<'a, AwaitExpression>),
    YieldExpression(Box<'a, YieldExpression>),
}

pub struct AwaitExpression<'a> {
    pub node: Node,
    pub expression: Expression<'a>,
}

pub struct YieldExpression<'a> {
    pub node: Node,
    pub expression: Expression<'a>,
}

INFO

如果我们在这个阶段对处理生命周期不太熟悉,请务必小心。我们的程序即使没有内存区域也能正常运行。

出于简化考虑,以下章节中的代码没有演示内存区域的使用。

🌐 Code in the following chapters does not demonstrate the use of a memory arena for simplicity.

枚举大小

🌐 Enum Size

我们要做的第一个优化是减小枚举的大小。

🌐 The first optimization we are going to make is to reduce the size of the enums.

已知 Rust 枚举的字节大小是其所有变体的并集。例如,以下枚举将占用 56 个字节(1 个字节用于标记,48 个字节用于数据,8 个字节用于对齐)。

🌐 It is known that the byte size of a Rust enum is the union of all its variants. For example, the following enum will take up 56 bytes (1 byte for the tag, 48 bytes for the payload, and 8 bytes for alignment).

rust
enum Name {
    Anonymous, // 0 byte payload
    Nickname(String), // 24 byte payload
    FullName{ first: String, last: String }, // 48 byte payload
}

INFO

这个例子摘自这篇博文

至于 ExpressionStatement 枚举,在我们当前的设置下,它们可能占用超过 200 字节。

🌐 As for the Expression and Statement enums, they can take up to more than 200 bytes with our current setup.

这 200 个字节需要传来传去,或者在每次进行 matches!(expr, Expression::AwaitExpression(_)) 检查时访问,这对性能来说不是很缓存友好。

🌐 These 200 bytes need to be passed around, or accessed every time we do a matches!(expr, Expression::AwaitExpression(_)) check, which is not very cache friendly for performance.

更好的方法是将枚举变体打包,只携带16字节的数据。

🌐 A better approach would be to box the enum variants and only carry 16 bytes around.

rust
pub enum Expression {
    AwaitExpression(Box<AwaitExpression>),
    YieldExpression(Box<YieldExpression>),
}

pub struct AwaitExpression {
    pub node: Node,
    pub expression: Expression,
}

pub struct YieldExpression {
    pub node: Node,
    pub expression: Expression,
}

为了确保枚举在 64 位系统上确实占用 16 字节,我们可以使用 std::mem::size_of

🌐 To make sure the enums are indeed 16 bytes on 64-bit systems, we can use std::mem::size_of.

rust
#[test]
fn no_bloat_enum_sizes() {
    use std::mem::size_of;
    assert_eq!(size_of::<Statement>(), 16);
    assert_eq!(size_of::<Expression>(), 16);
}

“无膨胀枚举大小”的测试用例通常可以在 Rust 编译器源代码中看到,用于确保枚举大小较小。

rust
// https://github.com/rust-lang/rust/blob/9c20b2a8cc7588decb6de25ac6a7912dcef24d65/compiler/rustc_ast/src/ast.rs#L3033-L3042

// Some nodes are used a lot. Make sure they don't unintentionally get bigger.
#[cfg(all(target_arch = "x86_64", target_pointer_width = "64"))]
mod size_asserts {
    use super::*;
    use rustc_data_structures::static_assert_size;
    // These are in alphabetical order, which is easy to maintain.
    static_assert_size!(AssocItem, 160);
    static_assert_size!(AssocItemKind, 72);
    static_assert_size!(Attribute, 32);
    static_assert_size!(Block, 48);

要找到其他大型类型,我们可以运行

🌐 To find other large types, we can run

bash
RUSTFLAGS=-Zprint-type-sizes cargo +nightly build -p name_of_the_crate --release

看看

🌐 and see

print-type-size type: `ast::js::Statement`: 16 bytes, alignment: 8 bytes
print-type-size     discriminant: 8 bytes
print-type-size     variant `BlockStatement`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `BreakStatement`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `ContinueStatement`: 8 bytes
print-type-size         field `.0`: 8 bytes
print-type-size     variant `DebuggerStatement`: 8 bytes
print-type-size         field `.0`: 8 bytes

JSON 序列化

🌐 JSON Serialization

serde 可用于将 AST 序列化为 JSON。需要一些技术来使其兼容 estree。 下面是一些例子:

rust
use serde::Serialize;

#[derive(Debug, Clone, Serialize, PartialEq)]
#[serde(tag = "type")]
#[cfg_attr(feature = "estree", serde(rename = "Identifier"))]
pub struct IdentifierReference {
    #[serde(flatten)]
    pub node: Node,
    pub name: Atom,
}

#[derive(Debug, Clone, Serialize, PartialEq, Hash)]
#[serde(tag = "type")]
#[cfg_attr(feature = "estree", serde(rename = "Identifier"))]
pub struct BindingIdentifier {
    #[serde(flatten)]
    pub node: Node,
    pub name: Atom,
}

#[derive(Debug, Serialize, PartialEq)]
#[serde(untagged)]
pub enum Expression<'a> {
    ...
}
  • serde(tag = "type") 用于将结构体名称设为“类型”字段,即 { "type" : "..." }
  • cfg_attr + serde(rename) 用于将不同的结构体名称重命名为相同的名称,因为 estree 无法区分不同的标识符
  • 枚举上的 serde(untagged) 用于不为枚举创建额外的 JSON 对象