
Lexer

Token

The lexer, also known as a tokenizer or scanner, is responsible for transforming source text into tokens. The tokens will later be consumed by the parser, so we don't have to worry about whitespace and comments from the original text.

Let's start simple and transform a single + into a token.

rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Token {
    /// Token Type
    pub kind: Kind,

    /// Start offset in source
    pub start: usize,

    /// End offset in source
    pub end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Eof, // end of file
    Plus,
}

A single + gives us

[
    Token { kind: Kind::Plus, start: 0, end: 1 },
    Token { kind: Kind::Eof,  start: 1, end: 1 }
]

To loop through the string, we can either keep track of an index and pretend we are writing C code, or we can take a look at the string documentation and find ourselves a Chars iterator to work with.

INFO

The Chars iterator abstracts away index tracking and bounds checking, making us feel extra safe.

It gives us an Option<char> when we call chars.next(). But note that a char is not a 0-255 ASCII value; it is a Unicode code point value in the range 0 to 0x10FFFF.
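
To make this concrete, a char can span multiple UTF-8 bytes, which matters when we compute byte offsets later. A quick check, using the ಠ character that appears later in this chapter:

rust
fn main() {
    let s = "ಠ_ಠ";
    // Three code points...
    assert_eq!(s.chars().count(), 3);
    // ...but seven UTF-8 bytes: ಠ encodes as three bytes, _ as one.
    assert_eq!(s.len(), 7);
    assert_eq!(s.chars().next(), Some('ಠ'));
}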

Let's define a starter lexer abstraction

rust
use std::str::Chars;

struct Lexer<'a> {
    /// Source Text
    source: &'a str,

    /// The remaining characters
    chars: Chars<'a>
}

impl<'a> Lexer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            chars: source.chars()
        }
    }
}

INFO

The lifetime 'a here indicates that the iterator holds a reference to somewhere; in this case it references a &'a str

To convert the source text to tokens, just keep calling chars.next() and match on the returned char. The final token will always be Kind::Eof.

rust
impl<'a> Lexer<'a> {
    fn read_next_kind(&mut self) -> Kind {
        while let Some(c) = self.chars.next() {
            match c {
              '+' => return Kind::Plus,
              _ => {}
            }
        }
        Kind::Eof
    }

    fn read_next_token(&mut self) -> Token {
        let start = self.offset();
        let kind = self.read_next_kind();
        let end = self.offset();
        Token { kind, start, end }
    }

    /// Get the length offset from the source text, in UTF-8 bytes
    fn offset(&self) -> usize {
        self.source.len() - self.chars.as_str().len()
    }
}
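
As a sanity check, here is a minimal sketch of driving the lexer (assuming we are in the same module as Lexer; tokenize is a helper name introduced here for illustration). It reproduces the output shown above for a single +:

rust
fn tokenize(source: &str) -> Vec<Token> {
    let mut lexer = Lexer::new(source);
    let mut tokens = Vec::new();
    loop {
        let token = lexer.read_next_token();
        let at_eof = token.kind == Kind::Eof;
        tokens.push(token);
        // Kind::Eof is always the final token, so stop after pushing it.
        if at_eof {
            return tokens;
        }
    }
}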

The .len() and .as_str().len() method calls inside fn offset feel like O(n), so let's dig deeper.

.as_str() returns a pointer to the string slice:

rust
// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/iter.rs#L112-L115

pub fn as_str(&self) -> &'a str {
    // SAFETY: `Chars` is only made from a str, which guarantees the iter is valid UTF-8.
    unsafe { from_utf8_unchecked(self.iter.as_slice()) }
}

A slice is a view into a block of memory, represented as a pointer and a length. The .len() method returns the metadata stored inside the slice:

rust
// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/mod.rs#L157-L159

pub const fn len(&self) -> usize {
    self.as_bytes().len()
}
rust
// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/str/mod.rs#L323-L325

pub const fn as_bytes(&self) -> &[u8] {
    // SAFETY: const sound because we transmute two types with the same layout
    unsafe { mem::transmute(self) }
}
rust
// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/slice/mod.rs#L129-L138

pub const fn len(&self) -> usize {
    // FIXME: Replace with `crate::ptr::metadata(self)` when that is const-stable.
    // As of this writing this causes a "Const-stable functions can only call other
    // const-stable functions" error.

    // SAFETY: Accessing the value from the `PtrRepr` union is safe since *const T
    // and PtrComponents<T> have the same memory layouts. Only std can make this
    // guarantee.
    unsafe { crate::ptr::PtrRepr { const_ptr: self }.components.metadata }
}

All of the above code compiles down to a single data access, so .as_str().len() is actually O(1).

Peek

To tokenize multi-character operators such as ++ or +=, a helper function peek is required:

rust
fn peek(&self) -> Option<char> {
    self.chars.clone().next()
}

We don't want to advance the original chars iterator, so we clone it and advance the clone instead.

INFO

If we dig into the source code, clone is cheap: it just copies the tracking and boundary pointers.

rust
// https://github.com/rust-lang/rust/blob/b998821e4c51c44a9ebee395c91323c374236bbb/library/core/src/slice/iter.rs#L148-L152

impl<T> Clone for Iter<'_, T> {
    fn clone(&self) -> Self {
        Iter { ptr: self.ptr, end: self.end, _marker: self._marker }
    }
}

The difference between peek and chars.next() is that the former always returns the same next char, while the latter moves forward and returns a different char.

To demonstrate, consider the string abc (a runnable sketch follows this list):

  • Repeated peek() calls return Some('a'), Some('a'), Some('a'), …
  • Repeated chars.next() calls return Some('a'), Some('b'), Some('c'), None
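
The same behavior as a sketch, reusing the Lexer defined above (assuming we are in the same module, so the chars field is accessible):

rust
fn main() {
    let mut lexer = Lexer::new("abc");
    assert_eq!(lexer.peek(), Some('a')); // peeking never advances
    assert_eq!(lexer.peek(), Some('a'));
    assert_eq!(lexer.chars.next(), Some('a')); // next() consumes the char
    assert_eq!(lexer.chars.next(), Some('b'));
    assert_eq!(lexer.chars.next(), Some('c'));
    assert_eq!(lexer.chars.next(), None);
}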

Equipped with peek, tokenizing ++ and += is just a matter of nested if statements.
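
In terms of our own lexer, the match arm inside read_next_kind could look like the following sketch. PlusPlus and PlusEq are variant names introduced here for illustration and would need to be added to Kind:

rust
'+' => {
    if self.peek() == Some('+') {
        self.chars.next();
        return Kind::PlusPlus; // ++
    }
    if self.peek() == Some('=') {
        self.chars.next();
        return Kind::PlusEq; // +=
    }
    return Kind::Plus; // +
}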

Here is a real-world implementation from jsparagus:

rust
// https://github.com/mozilla-spidermonkey/jsparagus/blob/master/crates/parser/src/lexer.rs#L1769-L1791

'+' => match self.peek() {
    Some('+') => {
        self.chars.next();
        return self.set_result(
            TerminalId::Increment,
            SourceLocation::new(start, self.offset()),
            TokenValue::None,
        );
    }
    Some('=') => {
        self.chars.next();
        return self.set_result(
            TerminalId::AddAssign,
            SourceLocation::new(start, self.offset()),
            TokenValue::None,
        );
    }
    _ => return self.set_result(
        TerminalId::Plus,
        SourceLocation::new(start, self.offset()),
        TokenValue::None,
    ),
},

The above logic applies to all operators, so let's expand our knowledge of lexing JavaScript.

JavaScript

A lexer written in Rust is rather boring; it feels like writing C code, with long chained if statements that check each char and return the respective token.

The real fun begins when we start lexing for JavaScript.

Let's open up the ECMAScript Language Specification and re-learn JavaScript.

TIP

I still remember the first time I opened the specification: I hid in a little corner and cried in agony, because it felt like reading a foreign language full of jargon. Head over to my guide on reading the specification if anything doesn't make sense.

Comments

Comments have no semantic meaning; they can be skipped if we are writing a runtime, but they need to be taken into consideration if we are writing a linter or a bundler.
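
For a runtime, skipping a single-line comment can be folded straight into read_next_kind's match; a minimal sketch (a real lexer would also handle /* */ blocks and the full set of ECMAScript line terminators):

rust
// The first `/` was already consumed by the surrounding while-let loop.
'/' if self.peek() == Some('/') => {
    // Consume the second `/` and everything up to the end of the line;
    // the surrounding loop then carries on with the next token.
    while let Some(c) = self.chars.next() {
        if c == '\n' {
            break;
        }
    }
}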

Identifiers and Unicode

We mostly code in ASCII, but Chapter 11 ECMAScript Language: Source Text states that source text should be in Unicode, and Chapter 12.6 Names and Keywords states that identifiers are interpreted according to the Default Identifier Syntax given in Unicode Standard Annex #31. In detail:

IdentifierStartChar ::
    UnicodeIDStart

IdentifierPartChar ::
    UnicodeIDContinue

UnicodeIDStart ::
    any Unicode code point with the Unicode property “ID_Start”

UnicodeIDContinue ::
    any Unicode code point with the Unicode property “ID_Continue”

This means that we can write var ಠ_ಠ but not var 🦀, because ಠ has the Unicode property "ID_Start" while 🦀 does not.

INFO

I published the unicode-id-start crate for this exact purpose: unicode_id_start::is_id_start(char) and unicode_id_start::is_id_continue(char) can be called to check the respective Unicode properties.
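
A sketch of how the crate could slot into identifier scanning; read_identifier is a helper name introduced here, and a complete lexer would also need to accept $, _ and Unicode escapes per the spec:

rust
use unicode_id_start::{is_id_continue, is_id_start};

impl<'a> Lexer<'a> {
    /// Read the rest of an identifier whose first char was already consumed.
    fn read_identifier(&mut self, first: char) -> Kind {
        debug_assert!(is_id_start(first));
        // Keep consuming as long as the next char continues the identifier.
        while let Some(c) = self.peek() {
            if !is_id_continue(c) {
                break;
            }
            self.chars.next();
        }
        Kind::Identifier
    }
}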

Keywords

All the keywords such as if, while and for need to be tokenized and interpreted as a whole. They need to be added to the token kind enum so we don't have to make string comparisons in the parser.

rust
pub enum Kind {
    Identifier,
    If,
    While,
    For
}

TIP

undefined is not a keyword, so there is no need to add it here.

Tokenizing a keyword is then just a matter of matching the identifier from above.

rust
fn match_keyword(&self, ident: &str) -> Kind {
    // all keywords are 1 < length <= 10
    if ident.len() == 1 || ident.len() > 10 {
        return Kind::Identifier;
    }
    match ident {
        "if" => Kind::If,
        "while" => Kind::While,
        "for" => Kind::For,
        _ => Kind::Identifier
    }
}

Token Value

We often need to compare identifiers, numbers and strings in later stages of the compiler, for example testing against identifiers inside a linter.

These values are currently plain source text; let's convert them to Rust types so they are easier to work with.

rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Eof, // end of file
    Plus,
    Identifier,
    Number,
    String,
}

// Token can no longer be Copy: TokenValue::String owns a heap-allocated String.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    /// Token Type
    pub kind: Kind,

    /// Start offset in source
    pub start: usize,

    /// End offset in source
    pub end: usize,

    pub value: TokenValue,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TokenValue {
    None,
    Number(f64),
    String(String),
}

When an identifier foo or the string "bar" is tokenized, we get

Token { kind: Kind::Identifier, start: 0, end: 3, value: TokenValue::String("foo") }
Token { kind: Kind::String, start: 0, end: 5, value: TokenValue::String("bar") }

To convert them to Rust strings, call let s = self.source[token.start..token.end].to_string() and save it with token.value = TokenValue::String(s).

When we tokenize the number 1.23, we get a token with Token { start: 0, end: 4 }. To convert it to a Rust f64, we can use the string parse method by calling self.source[token.start..token.end].parse::<f64>() and saving the value in token.value. For binary, octal and integer literals, examples of parsing techniques can be found in jsparagus.
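
Putting that together, a sketch of the number case (assuming the slice is already known to be a valid decimal literal, so error handling is omitted):

rust
// Slice out the lexed text and parse it into a Rust f64.
let s = &self.source[token.start..token.end];
token.value = TokenValue::Number(s.parse::<f64>().unwrap());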

Rust Optimizations

Smaller Tokens

It is tempting to put the token values inside the Kind enum, aiming for simpler and safer code:

rust
pub enum Kind {
    Number(f64),
    String(String),
}

But the byte size of a Rust enum is at least the size of its largest variant. This enum packs a lot of bytes compared to the original enum, which is only 1 byte. The parser will make heavy use of this Kind enum, and dealing with a 1-byte enum is obviously faster than dealing with a multi-byte one.
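
A quick way to verify this on a 64-bit target; ValueKind is a throwaway name introduced here, standing in for the value-carrying version, and Kind mirrors the fieldless enum from earlier:

rust
use std::mem::size_of;

// A fieldless enum needs only its discriminant: one byte.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Eof,
    Plus,
}

// The value-carrying version must be at least as big as its largest
// variant (a String is 24 bytes on a typical 64-bit target).
enum ValueKind {
    Number(f64),
    String(String),
}

fn main() {
    assert_eq!(size_of::<Kind>(), 1);
    assert!(size_of::<ValueKind>() >= size_of::<String>());
    println!("{} vs {}", size_of::<Kind>(), size_of::<ValueKind>());
}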

String Interning

It is not performant to use String in compilers, mainly because:

  • String is a heap-allocated object
  • string comparison is an O(n) operation

String interning solves these problems by storing only one copy of each distinct string value in a cache, with a unique identity per value. Only one heap allocation is made per distinct identifier or string, and string comparison becomes O(1).

There are lots of string interning libraries on crates.io, with different pros and cons.

A sufficient starting point is string-cache; it has an Atom type and a compile-time atom!("string") interface.

With string-cache, TokenValue becomes

rust
#[derive(Debug, Clone, PartialEq)]
pub enum TokenValue {
    None,
    Number(f64),
    String(Atom), 
}

and string comparison becomes matches!(value, TokenValue::String(atom!("string"))).