Learning LLVM and JIT with Rust
Using Rust as the base language to build everything I need from scratch is quite an attractive option. From aerodynamics computation to statistical tools, there exist many possibilities for enhancement and understanding the technology. This document outlines a plan to learn LLVM and Just-In-Time (JIT) compilation by building a small compiler/interpreter for a simple language using Rust.
I. Project Goal
Develop a toy language compiler that takes a simple arithmetic expression language as input, compiles it to LLVM Intermediate Representation (IR), and then uses LLVM’s JIT capabilities to execute the code.
II. Prerequisites
Before diving in, ensure you have a foundational understanding of:
- Rust Programming:
- Proficiency with Rust syntax, ownership, borrowing, traits, and error handling.
- Experience with Cargo and managing dependencies.
- Basic Compiler Concepts:
- Lexical Analysis (Lexing/Tokenizing): Breaking down source code into tokens.
- Syntax Analysis (Parsing): Building an Abstract Syntax Tree (AST) from tokens.
- Intermediate Representation (IR): Understanding the role of IRs like LLVM IR.
- LLVM (Conceptual):
- A high-level understanding of what LLVM is and its architecture (frontends, optimizer, backends).
- Familiarity with the concept of LLVM IR.
- Development Environment:
- Rust toolchain installed (
rustup
). - LLVM installed on your system. The version should be compatible with the LLVM bindings you choose for Rust.
- A good code editor (like VS Code with Rust Analyzer).
- Rust toolchain installed (
III. Detailed Plan & Guide
Phase 1: Foundations & Setup
- Deep Dive into LLVM IR:
- Action: Study the LLVM Language Reference Manual (LangRef). Focus on understanding modules, functions, basic blocks, and common instructions (arithmetic, memory, control flow).
- Goal: Be able to read and understand simple LLVM IR.
- Understanding JIT Compilation:
- Action: Read articles and LLVM documentation about JIT compilation. Understand its benefits (e.g., dynamic optimization, runtime code generation) and trade-offs.
- Goal: Grasp how JIT engines work at a high level.
- Choose LLVM Bindings for Rust:
- Action: Research available Rust crates for LLVM. Popular choices include:
llvm-sys
: Raw FFI bindings to LLVM. Requires more manual work but offers full control.inkwell
: A safer, more idiomatic Rust wrapper aroundllvm-sys
. Often recommended for ease of use.
- Goal: Select and understand the basics of your chosen LLVM binding.
- Action: Research available Rust crates for LLVM. Popular choices include:
- Project Setup:
- Action: Create a new Rust project using
cargo new my_jit_compiler
. - Add your chosen LLVM binding crate to
Cargo.toml
. - Goal: A compilable Rust project with LLVM dependencies correctly configured.
- Action: Create a new Rust project using
- “Hello, LLVM” in Rust:
- Action: Write a minimal Rust program that uses the LLVM bindings to:
- Create an LLVM context and module.
- Define a simple function (e.g., one that returns a constant).
- Print the generated LLVM IR to the console.
- Goal: Successfully generate and view LLVM IR from Rust.
- Action: Write a minimal Rust program that uses the LLVM bindings to:
Phase 2: Building the Language Frontend
Our toy language will initially support basic arithmetic expressions (e.g., 1 + 2 * 3
).
- Lexer (Tokenizer):
- Action: Implement a lexer that takes a string input and produces a stream of tokens (e.g.,
NUMBER(1)
,PLUS
,NUMBER(2)
,STAR
,NUMBER(3)
). - Tools: You can write this manually or explore parser-generator crates like
nom
orpest
for this and the next step. - Goal: A working lexer for your simple arithmetic language.
- Action: Implement a lexer that takes a string input and produces a stream of tokens (e.g.,
- Parser & AST:
- Action: Implement a parser that consumes the token stream from the lexer and builds an Abstract Syntax Tree (AST) representing the expressions.
- Example AST nodes:
Number(f64)
,BinaryOp(Box<Expr>, Op, Box<Expr>)
.
- Example AST nodes:
- Handle operator precedence (e.g.,
*
before+
). - Goal: A parser that correctly transforms token streams into an AST.
- Action: Implement a parser that consumes the token stream from the lexer and builds an Abstract Syntax Tree (AST) representing the expressions.
Phase 3: Code Generation (AST to LLVM IR)
This is where you connect your language frontend to LLVM.
- AST Traversal for Code Generation:
- Action: Write a component (e.g., a
CodeGen
struct) that traverses the AST. - For each AST node, generate the corresponding LLVM IR instructions using your chosen LLVM bindings.
Number(val)
->ConstantFP::get(context, val)
BinaryOp(lhs, op, rhs)
-> Recursively generate IR forlhs
andrhs
, then use LLVM builder methods for arithmetic operations (e.g.,build_float_add
,build_float_mul
).
- Goal: A function that takes an AST and an LLVM
Module
(orFunctionBuilder
) and populates it with LLVM IR.
- Action: Write a component (e.g., a
- Function Definition:
- Action: Wrap your expression evaluation in an LLVM function. This function will take no arguments (for now) and return the result of the expression.
- Goal: Generate a complete LLVM function for a given expression.
Phase 4: JIT Execution
- Setting up the LLVM JIT Engine:
- Action: Use your LLVM bindings to create an
ExecutionEngine
. LLVM provides different JIT engines (e.g., MCJIT, ORC JIT). ORC JIT is more modern and flexible. - Goal: An initialized JIT execution engine.
- Action: Use your LLVM bindings to create an
- Compiling and Running:
- Action:
- Add the LLVM module (containing your generated function) to the JIT engine.
- Look up the compiled function by name from the JIT engine.
- Cast the function pointer to the correct Rust function signature (e.g.,
unsafe extern "C" fn() -> f64
). - Call the function and print its result.
- Goal: Successfully execute the JIT-compiled code and get the correct output for your arithmetic expressions.
- Action:
Phase 5: Enhancements and Further Learning
Once the basic pipeline is working, consider these extensions:
- Adding Variables:
- Extend the language to support variable declarations and usage.
- Requires managing a symbol table during code generation.
- Use LLVM’s
alloca
instruction for stack allocation andload
/store
for variable access.
- Control Flow:
- Add
if/else
statements. - Requires generating LLVM basic blocks and conditional/unconditional branches (
br
instruction).
- Add
- User-Defined Functions:
- Allow defining and calling functions within your toy language.
- This is a significant step, involving managing function signatures, arguments, and return values in LLVM IR.
- LLVM Optimization Passes:
- Action: Explore and apply LLVM’s rich set of optimization passes to your generated IR before JIT compilation.
- Goal: Observe the effect of optimizations on the IR and potentially performance.
- Error Handling:
- Implement robust error reporting for lexing, parsing, and code generation phases.
- Ahead-of-Time (AOT) Compilation:
- Modify your compiler to output an object file or executable instead of JITing.
- Exploring ORC JIT:
- If you started with an older JIT engine, migrate to or explore the features of ORC JIT for more advanced dynamic compilation scenarios.
IV. Resources
- LLVM Language Reference Manual (LangRef): The definitive guide to LLVM IR.
- LLVM Kaleidoscope Tutorial: A classic LLVM tutorial for building a compiler for a toy language (available in C++, can be adapted for Rust concepts).
- Rust LLVM Bindings Documentation:
inkwell
docs: https://thedan64.github.io/inkwell/llvm-sys
docs: https://docs.rs/llvm-sys/
- Books:
- “Engineering a Compiler” by Keith Cooper & Linda Torczon.
- “Compilers: Principles, Techniques, and Tools” (The Dragon Book) by Aho, Lam, Sethi, & Ullman.
- Online Articles and Blogs: Many resources discuss compiler construction and LLVM.
V. Milestones (Example)
- M1: Successfully generate LLVM IR for a single constant value and print it.
- M2: Implement lexer and parser for basic arithmetic expressions (e.g.,
a + b
). - M3: Generate LLVM IR for simple arithmetic expressions.
- M4: JIT compile and execute an arithmetic expression, printing the result.
- M5: (Stretch) Add support for variables.