Pursuit of Performance on Building a JavaScript Compiler
Originally posted on https://rustmagazine.org/issue-3/javascript-compiler/
On Performance
After two years of writing Rust, performance has become an ingrained discipline for me: it boils down to allocating less memory and using fewer CPU cycles.
However, achieving optimal performance is difficult without knowledge of the problem domain or awareness of potential solutions.
In the following sections, I will take you through my journey of performance and optimization. I prefer to learn through a combination of research, trial, and error, so the sections are organized accordingly.
Parsing
Oxc is a standard compiler that includes an abstract syntax tree (AST), a lexer, and a recursive-descent parser.
Abstract Syntax Tree (AST)
The first architectural design decision for a compiler is its AST.
All JavaScript tools work at the AST level, for example:
- A linter (e.g. ESLint) checks the AST for errors
- A formatter (e.g. Prettier) prints the AST back out as JavaScript text
- A minifier (e.g. Terser) transforms the AST
- A bundler connects all the import and export statements between the ASTs of different files
Building these tools is painful if the AST is not user-friendly.
For JavaScript, the most widely used AST specification is estree. My first version of the AST replicated estree:
```rust
pub struct Program {
    pub node: Node,
    pub body: Vec<Statement>,
}

pub enum Statement {
    VariableDeclarationStatement(VariableDeclaration),
}

pub struct VariableDeclaration {
    pub node: Node,
    pub declarations: Vec<VariableDeclarator>,
}
```

Declaring a tree in Rust is relatively straightforward, as it only involves structs and enums.
Memory Allocation
I worked with this version of the AST for a couple of months while writing the parser. One day I decided to profile it, and the profiler showed that the program was spending a lot of time calling drop.
💡 The AST nodes are allocated on the heap via Box or Vec. They are allocated individually, so they are also dropped one by one.

Is there a way to mitigate this?
While working on the parser, I studied some of the other JavaScript parsers written in Rust, mainly ratel and jsparagus.
Both of these parsers declare their AST with a lifetime annotation,
```rust
pub enum Statement<'ast> {
    Expression(ExpressionNode<'ast>),
}
```

and they have an accompanying file called arena.rs.
I did not understand what it did, so I ignored it until I started reading about their use of memory arenas: bumpalo and toolshed.
In summary, a memory arena allocates memory up front in chunks or pages and deallocates it all at once when the arena is dropped. The AST is allocated in the arena, so dropping the AST becomes a fast operation.
Another nice side effect is that the AST is constructed in a specific order and tree traversal follows the same order, resulting in linear memory access during visitation. This access pattern is efficient because nearby memory is read into the CPU cache in pages, giving faster access times.
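The idea can be sketched in stable Rust without bumpalo. The names below are illustrative, not Oxc's actual types: an index-based arena keeps every node in one contiguous Vec, so the whole tree is freed in a single deallocation instead of one drop per boxed node.

```rust
// A tiny index-based arena: all nodes live in one contiguous Vec,
// so dropping the arena frees the whole tree in a single deallocation.
struct Arena {
    nodes: Vec<Node>,
}

#[derive(Debug, PartialEq)]
enum Node {
    Number(f64),
    // Children are indexes into `Arena::nodes`, not owning pointers.
    Add(usize, usize),
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    fn alloc(&mut self, node: Node) -> usize {
        self.nodes.push(node);
        self.nodes.len() - 1
    }

    fn eval(&self, id: usize) -> f64 {
        match &self.nodes[id] {
            Node::Number(n) => *n,
            Node::Add(l, r) => self.eval(*l) + self.eval(*r),
        }
    }
}

fn main() {
    let mut arena = Arena::new();
    let one = arena.alloc(Node::Number(1.0));
    let two = arena.alloc(Node::Number(2.0));
    let sum = arena.alloc(Node::Add(one, two));
    assert_eq!(arena.eval(sum), 3.0);
    // `arena` drops here: one Vec deallocation instead of a drop per node.
}
```

bumpalo achieves the same effect with real references (`&'alloc T`) instead of indexes, at the cost of lifetime annotations everywhere.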
Unfortunately, memory arenas can be challenging for Rust beginners because all data structures and relevant functions need to be parameterized by lifetime annotations. It took me five attempts to allocate the AST inside bumpalo.
Switching the AST to a memory arena resulted in around a 20% performance improvement.
Enum Sizes
Due to the recursive nature of ASTs, we need to define the types in a way that avoids the "recursive without indirection" error:
```
error[E0072]: recursive types `Enum` and `Variant` have infinite size
 --> crates/oxc_linter/src/lib.rs:1:1
  |
1 | enum Enum {
  | ^^^^^^^^^
2 |     Variant(Variant),
  |             ------- recursive without indirection
3 | }
4 | struct Variant {
  | ^^^^^^^^^^^^^^
5 |     field: Enum,
  |            ---- recursive without indirection
  |
help: insert some indirection (e.g., a `Box`, `Rc`, or `&`) to break the cycle
  |
2 ~     Variant(Box<Variant>),
3 | }
4 | struct Variant {
5 ~     field: Box<Enum>,
```

There are two ways to do this: either box the enum variant's payload, or box the struct field.
I found the same question asked in the Rust forum back in 2017: Is there a better way to represent an abstract syntax tree?
Aleksey (matklad) told us to box the enum variants to keep the Expression enum small. But what does this mean?
As it turns out, the memory layout of a Rust enum depends on the sizes of all its variants; its total byte size is determined by the largest variant. For example, the following enum takes up 56 bytes (1 byte for the tag, 48 bytes for the largest payload, and 7 bytes of padding for alignment).
```rust
enum Enum {
    A,                                 // 0 byte payload
    B(String),                         // 24 byte payload
    C { first: String, last: String }, // 48 byte payload
}
```

In a typical JavaScript AST, the Expression enum holds 45 variants and the Statement enum holds 20 variants. Unboxed, they take up more than 200 bytes. Those 200 bytes have to be passed around, and accessed every time we do a matches!(expr, Expression::Variant(_)) check, which is not cache-friendly.
So to make memory access efficient, it is best to box the enum variants.
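The effect is easy to demonstrate with std::mem::size_of. This is a minimal illustration (the types are made up for the example): boxing the large payload shrinks the whole enum down to a tag plus a pointer.

```rust
use std::mem::size_of;

// Unboxed: the enum must be at least as large as its biggest payload.
#[allow(dead_code)]
enum Big {
    Small(u8),
    Large([u8; 48]),
}

// Boxed: the large payload lives behind a pointer, so the enum stays
// small no matter how big the boxed data grows.
#[allow(dead_code)]
enum SmallEnum {
    Small(u8),
    Large(Box<[u8; 48]>),
}

fn main() {
    // 1-byte tag + 48-byte payload (alignment 1) = at least 49 bytes.
    assert!(size_of::<Big>() >= 49);
    // Tag + pointer, padded to pointer alignment on 64-bit targets.
    assert!(size_of::<SmallEnum>() <= 16);
}
```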
The perf-book describes additional ways to find large types.
I also copied its test for keeping enum sizes small.
```rust
#[cfg(all(target_arch = "x86_64", target_pointer_width = "64"))]
#[test]
fn no_bloat_enum_sizes() {
    use std::mem::size_of;
    use crate::ast::*;
    assert_eq!(size_of::<Statement>(), 16);
    assert_eq!(size_of::<Expression>(), 16);
    assert_eq!(size_of::<Declaration>(), 16);
}
```

Boxing the enum variants resulted in around a 10% speed-up.
Span
Occasionally, we may not realize that a smaller memory footprint is possible until we spend some extra time examining our data structures.
In this instance, every leaf of the AST contains a small data structure called a "span", which stores byte offsets into the source text as two usizes.
```rust
pub struct Node {
    pub start: usize,
    pub end: usize,
}
```

It was pointed out to me that I could safely change usize to u32 to reduce peak memory, since a file whose offsets overflow u32 would have to be larger than 4GB.
Changing to u32 improved performance by up to 5% on large files.
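Concretely, the u32 span halves the per-node overhead on 64-bit targets while still indexing the source text just fine (a small sketch, assuming a 64-bit target; the struct names are illustrative):

```rust
use std::mem::size_of;

// The old shape: two machine words per span.
#[allow(dead_code)]
struct SpanUsize {
    start: usize,
    end: usize,
}

// u32 offsets cover any source file under 4 GiB, halving every span.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

fn main() {
    // On a 64-bit target: 16 bytes vs 8 bytes per span.
    assert_eq!(size_of::<SpanUsize>(), 16);
    assert_eq!(size_of::<Span>(), 8);

    // Slicing the source still works with a cheap cast at the boundary.
    let source = "let x = 1;";
    let span = Span { start: 4, end: 5 };
    assert_eq!(&source[span.start as usize..span.end as usize], "x");
}
```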
Strings and Identifiers
Inside the AST, one may attempt to use string references into the source text for identifier names and string literals.
```rust
pub struct StringLiteral<'a> {
    pub value: &'a str,
}

pub struct Identifier<'a> {
    pub name: &'a str,
}
```

Unfortunately, strings and identifiers in JavaScript can contain escape sequences, i.e. '\251', '\xA9' and '©' all denote the copyright symbol.
This implies that we must compute the escaped values and allocate a new String.
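To see why a borrow cannot survive, here is a toy decoder that only handles the \xHH form (a hedged sketch, not the real lexer): once a single escape is decoded, the output differs from the source bytes, so it must be a fresh String.

```rust
// Decodes only `\xHH` escapes; everything else is copied through.
// Once any escape is decoded, the result can no longer borrow from
// the source text - hence the String allocation.
fn decode_hex_escapes(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && chars.peek() == Some(&'x') {
            chars.next(); // consume the 'x'
            let hi = chars.next().expect("truncated escape");
            let lo = chars.next().expect("truncated escape");
            let code = u32::from_str_radix(&format!("{hi}{lo}"), 16)
                .expect("invalid hex digits");
            out.push(char::from_u32(code).expect("invalid code point"));
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // "\xA9" in the source text denotes the copyright sign.
    assert_eq!(decode_hex_escapes(r"\xA9"), "©");
    // No escapes: the bytes are unchanged, but we still allocated.
    assert_eq!(decode_hex_escapes("plain"), "plain");
}
```

The unescaped case allocating anyway is exactly the waste that the Cow technique later in this post avoids.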
String Interning
When there are lots of heap-allocated strings, a technique called string interning can reduce total memory usage by storing only one copy of each distinct string value.
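The core of the technique fits in a few lines (a minimal single-threaded sketch, not any particular library's API): each distinct string is stored once, and callers pass around a cheap u32 symbol instead.

```rust
use std::collections::HashMap;

// A minimal string interner: `intern` returns the same u32 symbol for
// equal strings, and `resolve` maps a symbol back to its text.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id; // already stored: no new allocation
        }
        let id = self.strings.len() as u32;
        self.map.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("foo");
    let b = interner.intern("foo");
    let c = interner.intern("bar");
    assert_eq!(a, b); // "foo" is stored only once
    assert_ne!(a, c);
    assert_eq!(interner.resolve(a), "foo");
}
```

Comparing symbols is a u32 comparison instead of a string comparison, which is the other half of the appeal.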
string-cache is a popular and widely used library published by the Servo team. Initially, I used string-cache in the AST for identifiers and strings. The parser was fast single-threaded, but when I started implementing the linter, with multiple parsers running in parallel via rayon, overall CPU utilization was only about 50% across all cores.
Upon profiling, a method called parking_lot::raw_mutex::RawMutex::lock_slow showed up at the top of the execution time. I did not know much about locks and multi-core programming, but a global lock was strange to begin with, so I decided to remove the string-cache library to enable full CPU utilization.
Removing string-cache from the AST improved parallel parsing performance by about 30%.
string-cache
Half a year later, while working on another performance-critical project, the string-cache library resurfaced: it was blocking all the threads during parallel text parsing.
I decided to study what string-cache actually does, because this time I was prepared, having read the book Rust Atomics and Locks (https://marabos.nl/atomics/) by Mara Bos.
Here is the relevant code around the lock. Please note that this code was written eight years ago, in 2015.
```rust
pub(crate) static DYNAMIC_SET: Lazy<Mutex<Set>> = Lazy::new(|| {
    Mutex::new({
        // ...
    })
});

// ... in another place
let ptr: std::ptr::NonNull<Entry> =
    DYNAMIC_SET.lock().insert(string_to_add, hash.g);
```

So this is straightforward: it locks the data structure Set every time a string is inserted. As this routine is called frequently within a parser, its performance suffers from the synchronization.
Now let's take a look at the Set data structure and see what it does:
```rust
pub(crate) fn insert(&mut self, string: Cow<str>, hash: u32) -> NonNull<Entry> {
    let bucket_index = (hash & BUCKET_MASK) as usize;
    {
        let mut ptr: Option<&mut Box<Entry>> = self.buckets[bucket_index].as_mut();
        while let Some(entry) = ptr.take() {
            if entry.hash == hash && *entry.string == *string {
                if entry.ref_count.fetch_add(1, SeqCst) > 0 {
                    return NonNull::from(&mut **entry);
                }
                entry.ref_count.fetch_sub(1, SeqCst);
                break;
            }
            ptr = entry.next_in_bucket.as_mut();
        }
    }
    debug_assert!(mem::align_of::<Entry>() >= ENTRY_ALIGNMENT);
    let string = string.into_owned();
    let mut entry = Box::new(Entry {
        next_in_bucket: self.buckets[bucket_index].take(),
        hash,
        ref_count: AtomicIsize::new(1),
        string: string.into_boxed_str(),
    });
    let ptr = NonNull::from(&mut *entry);
    self.buckets[bucket_index] = Some(entry);
    ptr
}
```

It looks for a bucket to store the string, and inserts the string if it is not already in the bucket.
💡 Is this linear probing? If so, this Set is just a HashMap that doesn't call itself one.
💡 And if it is a HashMap, then Mutex<HashMap> is a concurrent hashmap.
Although the solution seems straightforward once you know what to look for, it took me a month to figure out because I was unaware of the issue. Once it became evident that this is just a concurrent hashmap, applying the Mutex to the individual buckets instead of the entire hashmap was the clear and logical solution. Within an hour of implementing this change, I submitted a pull request and was happy with the outcome 😃.
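The shape of the fix can be sketched with the standard library alone (this is an illustration of per-bucket locking, not the actual string-cache patch): shard the map into buckets and lock only the bucket the key hashes to, so threads inserting different strings rarely contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Instead of one Mutex around the whole set, each bucket has its own
// Mutex; an insert locks only the shard its key hashes into.
struct ShardedSet {
    buckets: Vec<Mutex<HashMap<String, ()>>>,
}

impl ShardedSet {
    fn new(shards: usize) -> Self {
        ShardedSet {
            buckets: (0..shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Returns true if the string was newly inserted.
    fn insert(&self, s: &str) -> bool {
        let mut hasher = DefaultHasher::new();
        s.hash(&mut hasher);
        let index = (hasher.finish() as usize) % self.buckets.len();
        // Only this one bucket is locked; all other shards stay free.
        self.buckets[index]
            .lock()
            .unwrap()
            .insert(s.to_owned(), ())
            .is_none()
    }
}

fn main() {
    let set = ShardedSet::new(16);
    assert!(set.insert("foo"));  // newly inserted
    assert!(!set.insert("foo")); // already present
}
```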
https://github.com/servo/string-cache/pull/268

It is worth mentioning that string interning is a battlefield within the Rust community. For the use case shown in this blog post, there are single-threaded libraries such as string-interner, lasso, lalrpop-intern, intaglio and strena.
Since we are parsing files in parallel, one option is to use a multi-threaded string interner such as ustr. However, after profiling both ustr and the patched string-cache, it became apparent that their performance still fell short compared to the approach I will explain below.
Some preliminary guesses for the sub-par performance:
- Hashing: an interner needs to hash the string in order to deduplicate it
- Indirection: we need to read the string values from a "far away" heap, which is not cache-friendly
String Inlining
So we are back to the initial problem of having to allocate lots of strings. Fortunately, there is a partial solution if we look at the kind of data we are dealing with: short JavaScript variable names and mostly short strings. A technique called string inlining stores all the bytes of a short string on the stack.
In essence, we want the following enum to store our strings:
```rust
enum Str {
    Static(&'static str),
    Inline(InlineRepresentation),
    Heap(String),
}
```

To minimize the size of the enum, InlineRepresentation should have the same size as String.
```rust
#[cfg(all(target_arch = "x86_64", target_pointer_width = "64"))]
#[test]
fn test_size() {
    use std::mem::size_of;
    assert_eq!(size_of::<String>(), size_of::<InlineRepresentation>());
}
```

Many crates in the Rust community aim for this optimization. This is yet another battlefield within the community; the most popular ones are smol_str, smartstring, flexstr and compact_str.
Each of these crates has unique characteristics and approaches to memory optimization, leading to various trade-offs when choosing which one to use. For example, smol_str and flexstr clones are O(1); flexstr can store 22 bytes inline, smol_str and smartstring 23 bytes, and compact_str 24 bytes on 64-bit systems.
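A hand-rolled version makes the trade-off concrete (illustrative only, in the spirit of these crates but not any of their actual layouts): up to 23 bytes live directly in the enum, and longer strings fall back to a heap allocation.

```rust
use std::mem::size_of;

// A toy inline string: short strings are stored directly in the enum,
// longer ones fall back to a heap-allocated String.
enum Str {
    Inline { len: u8, bytes: [u8; 23] },
    Heap(String),
}

impl Str {
    fn new(s: &str) -> Self {
        if s.len() <= 23 {
            let mut bytes = [0u8; 23];
            bytes[..s.len()].copy_from_slice(s.as_bytes());
            Str::Inline { len: s.len() as u8, bytes }
        } else {
            Str::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            Str::Inline { len, bytes } => {
                // Safe to unwrap: we only ever copy in valid UTF-8 whole.
                std::str::from_utf8(&bytes[..*len as usize]).unwrap()
            }
            Str::Heap(s) => s,
        }
    }
}

fn main() {
    let short = Str::new("foo");          // no heap allocation
    let long = Str::new(&"x".repeat(40)); // falls back to String
    assert_eq!(short.as_str(), "foo");
    assert_eq!(long.as_str().len(), 40);
    assert!(matches!(short, Str::Inline { .. }));
    assert!(size_of::<Str>() <= 32); // tag + 24-byte payload
}
```

Caveat: slicing multi-byte UTF-8 on a byte length boundary would panic in `new`; the real crates handle this, which is part of why it is worth using them instead.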
https://fasterthanli.me has a deep dive on this topic.
Changing String to compact_str::CompactStr reduced memory allocations by a large amount.
Lexer
Token
The job of the lexer (also known as the tokenizer) is to turn source text into structured data called tokens.
```rust
pub struct Token {
    pub kind: Kind,
}
```

To make tokens easier to work with, the token kind is typically defined as an enum in Rust, where each variant holds the corresponding data for its token.
```rust
pub enum Kind {
    // Keywords
    For,
    While,
    // ...
    // Literals
    String(String),
    Num(f64),
    // ...
}
```

This enum is currently 32 bytes, and a lexer often needs to construct millions of these token kinds. Every time it constructs a Kind::For or Kind::While, it has to reserve 32 bytes on the stack.
A clever way to improve this is to break up the enum: keep Kind at a single byte, and move the values into a separate enum,
```rust
pub struct Token {
    pub kind: Kind,
    pub value: TokenValue,
}

pub enum TokenValue {
    None,
    String(String),
    Num(f64),
}
```

Since we control all the parsing code, it is our job to keep this safe by always constructing the corresponding token value for each kind.
While a 32-byte TokenValue is already quite small, it may still hurt performance because it is constructed so frequently.
Let's take a look at the String type and see what we can find. Using "go to definition" in our code editor, we pass through String -> Vec -> RawVec:
```rust
pub struct String {
    vec: Vec<u8>,
}

pub struct Vec<T, A: Allocator> {
    buf: RawVec<T, A>,
    len: usize,
}

pub struct RawVec<T, A: Allocator> {
    ptr: Unique<T>,
    cap: usize,
    alloc: A,
}
```

As advertised, a String is just a Vec of u8s, and a Vec has a length and a capacity field. Since we are never going to mutate this string, we can save memory by dropping the cap field and using a string slice (&str) instead.
```rust
pub enum TokenValue<'a> {
    None,
    String(&'a str),
    Num(f64),
}
```

TokenValue becomes 24 bytes.
Using a string slice instead of String in TokenValue reduces memory usage, but it comes with the downside of a lifetime annotation. This can lead to borrow checker issues, and the lifetime propagates to the rest of the codebase, making the code somewhat harder to manage. I lost the borrow checking game 8 months ago, but finally won when I revisited the problem.
When it makes sense, we can always reach for an owned but immutable version of the data instead of a reference: for example, Box<str> instead of String, and Box<[u8]> instead of Vec<u8>.
In summary, we can always find tricks to keep our data structures small, and doing so sometimes rewards us with performance improvements.
Cow
I first encountered the term Cow while studying jsparagus's code, which has an infrastructure called AutoCow.
I vaguely understood what the code was doing: when a JavaScript string is tokenized, it allocates a new string if it encounters an escape sequence, and otherwise returns the original string slice:
```rust
fn finish(&mut self, lexer: &Lexer<'alloc>) -> &'alloc str {
    match self.value.take() {
        Some(arena_string) => arena_string.into_bump_str(),
        None => &self.start[..self.start.len() - lexer.chars.as_str().len()],
    }
}
```

This is clever: 99.9% of the time it will not allocate a new string, because escaped strings are rare.
But the term Cow, the "clone-on-write smart pointer", never made sense to me.

The type Cow is a smart pointer providing clone-on-write functionality: it can enclose and provide immutable access to borrowed data, and clone the data lazily when mutation or ownership is required. The type is designed to work with general borrowed data via the Borrow trait.
If you are new to Rust (as I was), this description just doesn't help (I still don't understand what it is talking about).
It was pointed out to me that clone-on-write is just one use case of this data structure. A better name would be RefOrOwned, because it is a type that holds either owned data or a reference.
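Read as RefOrOwned, std::borrow::Cow maps directly onto the AutoCow trick above. The sketch below only handles a single escape form (`\n`) for brevity; the point is the two return paths, not the unescaping itself.

```rust
use std::borrow::Cow;

// Return a borrow of the source when nothing needs to change;
// allocate a new String only in the rare escaped case.
fn unescape(raw: &str) -> Cow<'_, str> {
    if raw.contains('\\') {
        // Rare path: escapes present, build an owned String.
        Cow::Owned(raw.replace("\\n", "\n"))
    } else {
        // Common path: borrow the source text, zero allocation.
        Cow::Borrowed(raw)
    }
}

fn main() {
    assert!(matches!(unescape("hello"), Cow::Borrowed(_)));
    let owned = unescape("a\\nb");
    assert!(matches!(&owned, Cow::Owned(_)));
    assert_eq!(owned, "a\nb");
}
```

Callers treat both cases uniformly through Deref, so the 99.9% unescaped path stays allocation-free without a separate API.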
SIMD
When I was going through old Rust blogs, Announcing the Portable SIMD Project Group caught my attention. I had always wanted to play with SIMD but never had the chance. After some research, I found a use case that may apply to a parser: How quickly can you remove spaces from a string? by Daniel Lemire. It turns out this had been done before, in a JSON parser called RapidJSON, which uses SIMD to skip whitespace.
Eventually, with the help of portable-SIMD and RapidJSON's code, I managed to skip not only whitespace but multi-line comments as well.
Both changes improved performance by a few percent.
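Since std::simd is nightly-only, here is a stable-Rust sketch of the same idea (not the actual Oxc code): examine the input in 16-byte chunks, compute a whitespace bitmask per chunk, and use trailing_ones to find the first non-whitespace byte, instead of branching on every character. In real SIMD, the 16 comparisons per chunk collapse into a single vector instruction.

```rust
// Skip leading whitespace by processing 16-byte chunks at a time.
fn skip_whitespace(src: &[u8]) -> usize {
    let mut pos = 0;
    while pos + 16 <= src.len() {
        let mut mask: u16 = 0;
        for (i, &b) in src[pos..pos + 16].iter().enumerate() {
            // With portable SIMD this whole loop is one lane-wise compare.
            if matches!(b, b' ' | b'\t' | b'\n' | b'\r') {
                mask |= 1 << i;
            }
        }
        if mask == u16::MAX {
            pos += 16; // the whole chunk is whitespace, keep going
        } else {
            // Count the run of whitespace at the start of the chunk.
            return pos + mask.trailing_ones() as usize;
        }
    }
    // Scalar tail for the final, short chunk.
    while pos < src.len() && matches!(src[pos], b' ' | b'\t' | b'\n' | b'\r') {
        pos += 1;
    }
    pos
}

fn main() {
    assert_eq!(skip_whitespace(b"   let x = 1;"), 3);
    assert_eq!(skip_whitespace(b"x"), 0);
    assert_eq!(skip_whitespace(b"                    x"), 20);
}
```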
Keyword Matching
At the top of the performance profile, there was a hot code path taking about 1-2% of total execution time.
It tries to match a string against the JavaScript keywords:
```rust
fn match_keyword(s: &str) -> Self {
    match s {
        "as" => As,
        "do" => Do,
        "if" => If,
        // ...
        "constructor" => Constructor,
        _ => Ident,
    }
}
```

With the addition of TypeScript, there are 84 strings to match against. After some research, I found a blog post from V8, Blazingly fast parsing, part 1: optimizing the scanner, which describes its keyword-matching code in detail.
Since the list of keywords is static, we can compute a perfect hash function that gives us at most one candidate keyword for each identifier. V8 uses gperf to compute this function. The result computes a hash from the length and the first two characters of the identifier to find the single candidate keyword, and only compares the identifier against the keyword if the keyword's length matches the input identifier's length.
So a quick hash plus an integer comparison should be faster than 84 string comparisons. But we tried again and again, to no avail.
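The scheme we were attempting looks roughly like this (a toy with a hypothetical five-keyword subset, not the real 84-entry table): a cheap pre-filter on length and first byte selects at most one candidate, so only a single string comparison remains.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    As,
    Do,
    If,
    While,
    Constructor,
    Ident,
}

fn match_keyword(s: &str) -> Kind {
    // Pre-filter: (length, first byte) picks at most one candidate,
    // in the spirit of V8's (length, first two characters) hash.
    let candidate = match (s.len(), s.bytes().next()) {
        (2, Some(b'a')) => ("as", Kind::As),
        (2, Some(b'd')) => ("do", Kind::Do),
        (2, Some(b'i')) => ("if", Kind::If),
        (5, Some(b'w')) => ("while", Kind::While),
        (11, Some(b'c')) => ("constructor", Kind::Constructor),
        _ => return Kind::Ident,
    };
    // A single full comparison confirms or rejects the candidate.
    if s == candidate.0 {
        candidate.1
    } else {
        Kind::Ident
    }
}

fn main() {
    assert_eq!(match_keyword("while"), Kind::While);
    assert_eq!(match_keyword("whale"), Kind::Ident); // pre-filter hit, comparison miss
    assert_eq!(match_keyword("foo"), Kind::Ident);
}
```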
As it turns out, LLVM had already optimized our code. By passing --emit=llvm-ir to rustc, we found the relevant code:
```llvm
switch i64 %s.1, label %bb6 [
  i64 2, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit.i"
  i64 3, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit280.i"
  i64 4, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit325.i"
  i64 5, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit380.i"
  i64 6, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit450.i"
  i64 7, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit540.i"
  i64 8, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit590.i"
  i64 9, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit625.i"
  i64 10, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit655.i"
  i64 11, label %"_ZN4core5slice3cmp81_$LT$impl$u20$core..cmp..PartialEq$LT$$u5b$B$u5d$$GT$$u20$for$u20$$u5b$A$u5d$$GT$2eq17h46d405acb5da4997E.exit665.i"
], !dbg !191362
```

%s is the string and %s.1 is its length... it is already branching on the string length! The compiler is smarter than us 😃.

(Yes, we got quite serious about this and started studying LLVM IR and assembly code.)
Later on, @strager posted a very educational YouTube video, Faster than Rust and C++: the PERFECT hash table, on this topic. The video taught us a systematic approach to reasoning about fine-tuning performance problems.
In the end, we concluded that the simple keyword match is good enough for us: it accounted for only about 1-2% of total time, and after spending a few days on it the effort was no longer worth it; Rust does not have all the pieces we need to build this perfect hashmap.
Linter
A linter is a program that analyzes source code for problems.
The simplest linter visits each AST node and checks rules against it. The visitor pattern can be used:
```rust
pub trait Visit<'a>: Sized {
    // ... lots of visit functions
    fn visit_debugger_statement(&mut self, stmt: &'a DebuggerStatement) {
        // report error
    }
}
```

Parent Pointing Tree
It is easy to go down the AST with visitors, but what if we want to go up the tree to collect information?
This problem is particularly challenging in Rust, because it is not possible to simply add a parent pointer to the AST nodes.
Let's forget about ASTs for a second and focus on generic trees where each node points to its parent. To build such a tree, every node needs to be the same type, Node, and we can reference the parent by using Rc:
```rust
struct Node {
    parent: Option<Rc<Node>>,
}
```

This pattern is tedious to work with if we need mutation, and it is not performant, because the nodes have to be dropped at different times.
A more efficient solution is to use a Vec as the backing storage and use indexes as pointers.
```rust
struct Tree {
    nodes: Vec<Node>,
}

struct Node {
    parent: Option<usize>, // index into `nodes`
}
```

indextree is a nice library for this task.
Back to our AST: we can build an indextree by having its nodes point into an enum that wraps every single kind of AST node. We call this the untyped AST.
```rust
struct Node<'a> {
    kind: AstKind<'a>,
}

enum AstKind<'a> {
    BlockStatement(&'a BlockStatement<'a>),
    // ...
    ArrayExpression(&'a ArrayExpression<'a>),
    // ...
    Class(&'a Class<'a>),
    // ...
}
```

The last missing piece is adding callbacks to the visitor that builds this tree.
```rust
pub trait Visit<'a> {
    fn enter_node(&mut self, _kind: AstKind<'a>) {}
    fn leave_node(&mut self, _kind: AstKind<'a>) {}

    fn visit_block_statement(&mut self, stmt: &'a BlockStatement<'a>) {
        let kind = AstKind::BlockStatement(stmt);
        self.enter_node(kind);
        self.visit_statements(&stmt.body);
        self.leave_node(kind);
    }
}

impl<'a> Visit<'a> for TreeBuilder<'a> {
    fn enter_node(&mut self, kind: AstKind<'a>) {
        self.push_ast_node(kind);
    }

    fn leave_node(&mut self, kind: AstKind<'a>) {
        self.pop_ast_node();
    }
}
```

The final data structure becomes indextree::Arena<Node<'a>>, where each Node holds a pointer to an AstKind<'a>. indextree::Node::parent can be called to get the parent of any node.
A nice benefit of building this parent pointing tree is that visiting AST nodes becomes convenient without implementing any visitors. A linter becomes a simple loop over all the nodes in the indextree:
```rust
for node in nodes {
    match node.get().kind {
        AstKind::DebuggerStatement(stmt) => {
            // report error
        }
        _ => {}
    }
}
```

A full example is provided here.
At first glance, this process may seem slow and inefficient. However, visiting the typed AST through a memory arena and pushing pointers into the indextree are both linear memory access patterns. The current benchmark shows this approach is 84 times faster than ESLint, so it is certainly fast enough for our purposes.
Processing Files in Parallel
The linter uses the ignore crate for directory traversal; it supports .gitignore and additional ignore files such as .eslintignore.
A small problem with this crate is that it has no parallel interface: there is no par_iter for ignore::Walk::new(".").
Instead, we need to use primitives:
```rust
let walk = Walk::new(&self.options);
rayon::spawn(move || {
    walk.iter().for_each(|path| {
        tx_path.send(path).unwrap();
    });
});

let linter = Arc::clone(&self.linter);
rayon::spawn(move || {
    while let Ok(path) = rx_path.recv() {
        let tx_error = tx_error.clone();
        let linter = Arc::clone(&linter);
        rayon::spawn(move || {
            if let Some(diagnostics) = Self::lint_path(&linter, &path) {
                tx_error.send(diagnostics).unwrap();
            }
            drop(tx_error);
        });
    }
});
```

This unlocks a useful feature: we can print all diagnostics from a single thread, which leads us to the final topic of this article.
Printing is slow
Printing the diagnostics was fast, but I had been working on this project for so long that it felt like an eternity to print thousands of diagnostic messages every time I ran the linter on a huge monorepo. So I searched through the Rust GitHub issues and eventually found the relevant ones:
In summary, a println! call locks stdout every time it encounters a newline; this is called line buffering. To make printing faster, we need to opt in to block buffering, as documented here.
```rust
use std::io::{self, Write};

let stdout = io::stdout(); // get the global stdout entity
let mut handle = io::BufWriter::new(stdout); // optional: wrap that handle in a buffer
writeln!(handle, "foo: {}", 42); // add `?` if you care about errors here
```

Or acquire the lock on stdout:
```rust
let stdout = io::stdout(); // get the global stdout entity
let mut handle = stdout.lock(); // acquire a lock on it
writeln!(handle, "foo: {}", 42); // add `?` if you care about errors here
```