Parser

The parser we are going to construct is called a recursive descent parser: we manually walk down the grammar and build up the AST as we go.

The parser starts simple. It holds the source code, the lexer, and the current token consumed from the lexer.

```rust
pub struct Parser<'a> {
    /// Source Code
    source: &'a str,

    lexer: Lexer<'a>,

    /// Current Token consumed from the lexer
    cur_token: Token,

    /// The end range of the previous token
    prev_token_end: usize,
}

impl<'a> Parser<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            lexer: Lexer::new(source),
            cur_token: Token::default(),
            prev_token_end: 0,
        }
    }

    pub fn parse(&mut self) -> Program<'a> {
        Program {
            node: Node {
                start: 0,
                end: self.source.len(),
            },
            body: vec![],
        }
    }
}
```

Helper Functions

The current token cur_token: Token holds the current token returned from the lexer. We'll make the parser code cleaner by adding some helper functions for navigating and inspecting this token.

```rust
impl<'a> Parser<'a> {
    fn start_node(&self) -> Node {
        let token = self.cur_token();
        Node::new(token.start, 0)
    }

    fn finish_node(&self, node: Node) -> Node {
        Node::new(node.start, self.prev_token_end)
    }

    fn cur_token(&self) -> &Token {
        &self.cur_token
    }

    fn cur_kind(&self) -> Kind {
        self.cur_token.kind
    }

    /// Checks if the current index has token `Kind`
    fn at(&self, kind: Kind) -> bool {
        self.cur_kind() == kind
    }

    /// Advance if we are at `Kind`
    fn bump(&mut self, kind: Kind) {
        if self.at(kind) {
            self.advance();
        }
    }

    /// Advance any token
    fn bump_any(&mut self) {
        self.advance();
    }

    /// Advance and return true if we are at `Kind`, return false otherwise
    fn eat(&mut self, kind: Kind) -> bool {
        if self.at(kind) {
            self.advance();
            return true;
        }
        false
    }

    /// Move to the next token
    fn advance(&mut self) {
        let token = self.lexer.next_token();
        self.prev_token_end = self.cur_token.end;
        self.cur_token = token;
    }
}
```

Parse Functions

The DebuggerStatement is the simplest statement to parse, so let's parse it and return a valid program.

```rust
impl<'a> Parser<'a> {
    pub fn parse(&mut self) -> Program<'a> {
        let stmt = self.parse_debugger_statement();
        let body = vec![stmt];
        Program {
            node: Node {
                start: 0,
                end: self.source.len(),
            },
            body,
        }
    }

    fn parse_debugger_statement(&mut self) -> Statement {
        let node = self.start_node();
        // NOTE: the token returned from the lexer is `Kind::Debugger`, we'll fix this later.
        self.bump_any();
        Statement::DebuggerStatement {
            node: self.finish_node(node),
        }
    }
}
```

All the other parse functions build on these primitive helpers, for example parsing the while statement in swc:

```rust
// https://github.com/swc-project/swc/blob/554b459e26b24202f66c3c58a110b3f26bbd13cd/crates/swc_ecma_parser/src/parser/stmt.rs#L952-L970

fn parse_while_stmt(&mut self) -> PResult<Stmt> {
    let start = cur_pos!(self);

    assert_and_bump!(self, "while");

    expect!(self, '(');
    let test = self.include_in_expr(true).parse_expr()?;
    expect!(self, ')');

    let ctx = Context {
        is_break_allowed: true,
        is_continue_allowed: true,
        ..self.ctx()
    };
    let body = self.with_ctx(ctx).parse_stmt(false).map(Box::new)?;

    let span = span!(self, start);
    Ok(Stmt::While(WhileStmt { span, test, body }))
}
```

Parsing Expressions

The grammar for expressions is deeply nested and recursive, which may cause a stack overflow on long expressions (for example in this TypeScript test).

To avoid deep recursion, we can use a technique called Pratt parsing. An in-depth tutorial, written by the author of Rust-Analyzer, can be found here, and a Rust version can be found in Rome.
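To make the idea concrete, here is a minimal, self-contained sketch of Pratt parsing over a toy grammar of single-digit numbers, `+`, and `*`. The names (`Expr`, `infix_binding_power`) are illustrative, not part of the parser above; a real implementation would operate on lexer tokens rather than characters.

```rust
// A minimal Pratt parser sketch: each infix operator gets a pair of
// binding powers, and a loop (instead of left recursion) folds up
// left-associative chains without growing the stack.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Binary(Box<Expr>, char, Box<Expr>),
}

// Higher binding power binds tighter; (left, right) pairs with
// left < right encode left associativity.
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' => Some((1, 2)),
        '*' => Some((3, 4)),
        _ => None,
    }
}

fn parse_expr(tokens: &[char], pos: &mut usize, min_bp: u8) -> Expr {
    // In this toy grammar, the left-hand side is always a digit.
    let mut lhs = Expr::Num(tokens[*pos].to_digit(10).unwrap() as i64);
    *pos += 1;

    // Keep extending `lhs` while the next operator binds tightly enough.
    while *pos < tokens.len() {
        let op = tokens[*pos];
        let (l_bp, r_bp) = match infix_binding_power(op) {
            Some(bp) => bp,
            None => break,
        };
        if l_bp < min_bp {
            break;
        }
        *pos += 1;
        let rhs = parse_expr(tokens, pos, r_bp);
        lhs = Expr::Binary(Box::new(lhs), op, Box::new(rhs));
    }
    lhs
}

fn main() {
    let tokens: Vec<char> = "1+2*3".chars().collect();
    let mut pos = 0;
    let expr = parse_expr(&tokens, &mut pos, 0);
    // `*` binds tighter than `+`, so this parses as 1 + (2 * 3).
    println!("{expr:?}");
}
```

The key point is the `while` loop: a chain like `1+2+3+...` is folded iteratively, so stack depth grows with nesting of precedence levels rather than with expression length.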

Lists

There are lots of places where we need to parse a list separated by punctuation, for example [a, b, c] or {a, b, c}.

The code for parsing lists is all similar, so we can use the template method pattern, implemented with traits, to avoid duplication.

```rust
// https://github.com/rome/tools/blob/85ddb4b2c622cac9638d5230dcefb6cf571677f8/crates/rome_js_parser/src/parser/parse_lists.rs#L131-L157

fn parse_list(&mut self, p: &mut Parser) -> CompletedMarker {
    let elements = self.start_list(p);
    let mut progress = ParserProgress::default();
    let mut first = true;
    while !p.at(JsSyntaxKind::EOF) && !self.is_at_list_end(p) {
        if first {
            first = false;
        } else {
            self.expect_separator(p);

            if self.allow_trailing_separating_element() && self.is_at_list_end(p) {
                break;
            }
        }

        progress.assert_progressing(p);

        let parsed_element = self.parse_element(p);

        if parsed_element.is_absent() && p.at(self.separating_element_kind()) {
            // a missing element
            continue;
        } else if self.recover(p, parsed_element).is_err() {
            break;
        }
    }
    self.finish_list(p, elements)
}
```

This pattern also prevents infinite loops, specifically through progress.assert_progressing(p);.
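The idea behind the progress guard can be sketched in a few lines, assuming the parser exposes its current token position. The names are illustrative; Rome's real `ParserProgress` tracks more state.

```rust
// A minimal progress guard: if the parser's position has not advanced
// between two loop iterations, the loop would never terminate, so we
// turn the silent hang into a loud panic.
#[derive(Default)]
struct ParserProgress {
    last_pos: Option<usize>,
}

impl ParserProgress {
    /// Panics if the parser did not advance since the last check.
    fn assert_progressing(&mut self, pos: usize) {
        assert!(
            self.last_pos != Some(pos),
            "parser made no progress at position {pos}"
        );
        self.last_pos = Some(pos);
    }
}

fn main() {
    let mut progress = ParserProgress::default();
    progress.assert_progressing(0);
    progress.assert_progressing(1); // fine: the parser advanced
    // progress.assert_progressing(1); // would panic: stuck at position 1
}
```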

Implementation details can then be provided for each list, for example:

```rust
// https://github.com/rome/tools/blob/85ddb4b2c622cac9638d5230dcefb6cf571677f8/crates/rome_js_parser/src/syntax/expr.rs#L1543-L1580

struct ArrayElementsList;

impl ParseSeparatedList for ArrayElementsList {
    fn parse_element(&mut self, p: &mut Parser) -> ParsedSyntax {
        match p.cur() {
            T![...] => parse_spread_element(p, ExpressionContext::default()),
            T![,] => Present(p.start().complete(p, JS_ARRAY_HOLE)),
            _ => parse_assignment_expression_or_higher(p, ExpressionContext::default()),
        }
    }

    fn is_at_list_end(&self, p: &mut Parser) -> bool {
        p.at(T![']'])
    }

    fn recover(&mut self, p: &mut Parser, parsed_element: ParsedSyntax) -> RecoveryResult {
        parsed_element.or_recover(
            p,
            &ParseRecovery::new(
                JS_UNKNOWN_EXPRESSION,
                EXPR_RECOVERY_SET.union(token_set!(T![']'])),
            ),
            js_parse_error::expected_array_element,
        )
    }

    fn list_kind() -> JsSyntaxKind {
        JS_ARRAY_ELEMENT_LIST
    }

    fn separating_element_kind(&mut self) -> JsSyntaxKind {
        T![,]
    }

    fn allow_trailing_separating_element(&self) -> bool {
        true
    }
}
```

Cover Grammar

As detailed in cover grammar, there are times when we need to convert an Expression into a BindingIdentifier. Dynamic languages such as JavaScript can simply rewrite the node type:

```javascript
// https://github.com/acornjs/acorn/blob/11735729c4ebe590e406f952059813f250a4cbd1/acorn/src/lval.js#L11-L26
```

But in Rust, we need to do a struct-to-struct transformation. A nice and clean way to do this is with a trait.

```rust
pub trait CoverGrammar<'a, T>: Sized {
    fn cover(value: T, p: &mut Parser<'a>) -> Result<Self>;
}
```

The trait accepts T as the input type and Self as the output type, so we can define the following:

```rust
impl<'a> CoverGrammar<'a, Expression<'a>> for BindingPattern<'a> {
    fn cover(expr: Expression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        match expr {
            Expression::Identifier(ident) => Self::cover(ident.unbox(), p),
            Expression::ObjectExpression(expr) => Self::cover(expr.unbox(), p),
            Expression::ArrayExpression(expr) => Self::cover(expr.unbox(), p),
            _ => Err(()),
        }
    }
}

impl<'a> CoverGrammar<'a, ObjectExpression<'a>> for BindingPattern<'a> {
    fn cover(obj_expr: ObjectExpression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        ...
        BindingPattern::ObjectPattern(ObjectPattern { .. })
    }
}

impl<'a> CoverGrammar<'a, ArrayExpression<'a>> for BindingPattern<'a> {
    fn cover(expr: ArrayExpression<'a>, p: &mut Parser<'a>) -> Result<Self> {
        ...
        BindingPattern::ArrayPattern(ArrayPattern { .. })
    }
}
```

Then anywhere we need to convert an Expression to a BindingPattern, we call BindingPattern::cover(expression).


TypeScript

So you are done with JavaScript and want the challenge of parsing TypeScript? The bad news is that there is no specification, but the good news is that the TypeScript parser is in a single file 🙃.

JSX vs TSX

For the following code,

```javascript
let foo = <string> bar;
```

it is a syntax error if this is tsx (unterminated JSX), but it is a correct VariableDeclaration with a TSTypeAssertion if this is ts.

Lookahead

In certain places, the parser needs to look ahead and peek more than one token to determine the correct grammar.

TSIndexSignature

For example, to parse a TSIndexSignature, consider the following two cases:

```typescript
type A = { readonly [a: number]: string }
           ^__________________________^ TSIndexSignature

type B = { [a]: string }
           ^_________^ TSPropertySignature
```

For type A, on the first {, we need to peek 5 tokens (readonly, [, a, : and number) to make sure it is a TSIndexSignature and not a TSPropertySignature.

To make this possible and efficient, the lexer needs a buffer for storing multiple tokens.
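A sketch of what such a buffer could look like, assuming a toy lexer that emits one token per character. `peek_nth` lazily fills the buffer and `next_token` drains it before lexing fresh tokens; the `Lexer`/`Token` shapes here are stand-ins, not the real types from the parser above.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
struct Token(char);

struct Lexer {
    chars: Vec<char>,
    pos: usize,
    // Tokens that were lexed for lookahead but not yet consumed.
    buffer: VecDeque<Token>,
}

impl Lexer {
    fn new(source: &str) -> Self {
        Self { chars: source.chars().collect(), pos: 0, buffer: VecDeque::new() }
    }

    // Pull the next token straight from the source (toy: one char = one token).
    fn lex(&mut self) -> Option<Token> {
        let c = *self.chars.get(self.pos)?;
        self.pos += 1;
        Some(Token(c))
    }

    /// Look `n` tokens ahead (0 = the next token) without consuming anything.
    fn peek_nth(&mut self, n: usize) -> Option<&Token> {
        while self.buffer.len() <= n {
            let token = self.lex()?;
            self.buffer.push_back(token);
        }
        self.buffer.get(n)
    }

    /// Consume the next token, draining the lookahead buffer first.
    fn next_token(&mut self) -> Option<Token> {
        self.buffer.pop_front().or_else(|| self.lex())
    }
}

fn main() {
    let mut lexer = Lexer::new("abc");
    // Peek three tokens ahead; none of them are consumed.
    println!("{:?}", lexer.peek_nth(2));
    // Consumption then replays the buffered tokens in order.
    println!("{:?}", lexer.next_token());
}
```

Because peeked tokens are kept, deciding between TSIndexSignature and TSPropertySignature costs each token's lexing only once, no matter how many times the parser peeks.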

Arrow Expressions

As discussed in cover grammar, we need to convert Expressions to BindingPatterns when the => token is found after a SequenceExpression.

But this approach does not work for TypeScript, because each item inside the () can contain TypeScript syntax; there are just too many cases to cover, for example:

```typescript
(<x>a, b as c, d!);
(a?: b = {} as c!) => {};
```

It is recommended to study the TypeScript source code for this specific case. The relevant code is:

```typescript
function tryParseParenthesizedArrowFunctionExpression(
  allowReturnTypeInArrowFunction: boolean,
): Expression | undefined {
  const triState = isParenthesizedArrowFunctionExpression();
  if (triState === Tristate.False) {
    // It's definitely not a parenthesized arrow function expression.
    return undefined;
  }

  // If we definitely have an arrow function, then we can just parse one, not requiring a
  // following => or { token. Otherwise, we *might* have an arrow function.  Try to parse
  // it out, but don't allow any ambiguity, and return 'undefined' if this could be an
  // expression instead.
  return triState === Tristate.True
    ? parseParenthesizedArrowFunctionExpression(
        /*allowAmbiguity*/ true,
        /*allowReturnTypeInArrowFunction*/ true,
      )
    : tryParse(() =>
        parsePossibleParenthesizedArrowFunctionExpression(allowReturnTypeInArrowFunction),
      );
}

//  True        -> We definitely expect a parenthesized arrow function here.
//  False       -> There *cannot* be a parenthesized arrow function here.
//  Unknown     -> There *might* be a parenthesized arrow function here.
//                 Speculatively look ahead to be sure, and rollback if not.
function isParenthesizedArrowFunctionExpression(): Tristate {
  if (
    token() === SyntaxKind.OpenParenToken ||
    token() === SyntaxKind.LessThanToken ||
    token() === SyntaxKind.AsyncKeyword
  ) {
    return lookAhead(isParenthesizedArrowFunctionExpressionWorker);
  }

  if (token() === SyntaxKind.EqualsGreaterThanToken) {
    // ERROR RECOVERY TWEAK:
    // If we see a standalone => try to parse it as an arrow function expression as that's
    // likely what the user intended to write.
    return Tristate.True;
  }
  // Definitely not a parenthesized arrow function.
  return Tristate.False;
}
```

In summary, the TypeScript parser uses a combination of lookahead (fast path) and backtracking to parse arrow functions.
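The backtracking half can be sketched in Rust with a checkpoint/rewind pair, assuming the parser state is cheap to capture. All names here are illustrative; a real parser would also have to roll back accumulated diagnostics and lexer state.

```rust
// A minimal speculative-parsing sketch: take a checkpoint, attempt a
// parse, and rewind to the checkpoint if the attempt fails, so the
// caller can try a different interpretation of the same tokens.
struct Checkpoint {
    pos: usize,
}

struct Parser {
    tokens: Vec<char>,
    pos: usize,
}

impl Parser {
    fn checkpoint(&self) -> Checkpoint {
        Checkpoint { pos: self.pos }
    }

    fn rewind(&mut self, checkpoint: Checkpoint) {
        self.pos = checkpoint.pos;
    }

    /// Try a parse; on failure, rewind as if nothing was consumed.
    fn try_parse<T>(&mut self, f: impl Fn(&mut Self) -> Option<T>) -> Option<T> {
        let checkpoint = self.checkpoint();
        let result = f(self);
        if result.is_none() {
            self.rewind(checkpoint);
        }
        result
    }

    // Consume the next token only if it matches (toy: one char = one token).
    fn eat(&mut self, expected: char) -> Option<char> {
        if self.tokens.get(self.pos) == Some(&expected) {
            self.pos += 1;
            Some(expected)
        } else {
            None
        }
    }
}

fn main() {
    let mut p = Parser { tokens: "(x".chars().collect(), pos: 0 };
    // Speculatively try to parse "()"; "(x" fails, so the position rewinds.
    let result = p.try_parse(|p| {
        p.eat('(')?;
        p.eat(')')
    });
    println!("parsed: {result:?}, position: {}", p.pos);
}
```

This mirrors the shape of TypeScript's `tryParse`: the fast path (`Tristate.True`) skips the checkpoint entirely, and only the ambiguous `Unknown` case pays for speculation and possible rollback.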