Learning LLVM and JIT with Rust

Using Rust as the base language to build everything I need from scratch is quite an attractive option. From aerodynamics computation to statistical tools, there exist many possibilities for enhancement and understanding the technology. This document outlines a plan to learn LLVM and Just-In-Time (JIT) compilation by building a small compiler/interpreter for a simple language using Rust.

I. Project Goal

Develop a toy language compiler that takes a simple arithmetic expression language as input, compiles it to LLVM Intermediate Representation (IR), and then uses LLVM’s JIT capabilities to execute the code.

II. Prerequisites

Before diving in, ensure you have a foundational understanding of:

  1. Rust Programming:
    • Proficiency with Rust syntax, ownership, borrowing, traits, and error handling.
    • Experience with Cargo and managing dependencies.
  2. Basic Compiler Concepts:
    • Lexical Analysis (Lexing/Tokenizing): Breaking down source code into tokens.
    • Syntax Analysis (Parsing): Building an Abstract Syntax Tree (AST) from tokens.
    • Intermediate Representation (IR): Understanding the role of IRs like LLVM IR.
  3. LLVM (Conceptual):
    • A high-level understanding of what LLVM is and its architecture (frontends, optimizer, backends).
    • Familiarity with the concept of LLVM IR.
  4. Development Environment:
    • Rust toolchain installed (rustup).
    • LLVM installed on your system. The version should be compatible with the LLVM bindings you choose for Rust.
    • A good code editor (like VS Code with Rust Analyzer).

III. Detailed Plan & Guide

Phase 1: Foundations & Setup

  1. Deep Dive into LLVM IR:
    • Action: Study the LLVM Language Reference Manual (LangRef). Focus on understanding modules, functions, basic blocks, and common instructions (arithmetic, memory, control flow).
    • Goal: Be able to read and understand simple LLVM IR.
  2. Understanding JIT Compilation:
    • Action: Read articles and LLVM documentation about JIT compilation. Understand its benefits (e.g., dynamic optimization, runtime code generation) and trade-offs.
    • Goal: Grasp how JIT engines work at a high level.
  3. Choose LLVM Bindings for Rust:
    • Action: Research available Rust crates for LLVM. Popular choices include:
      • llvm-sys: Raw FFI bindings to LLVM. Requires more manual work but offers full control.
      • inkwell: A safer, more idiomatic Rust wrapper around llvm-sys. Often recommended for ease of use.
    • Goal: Select and understand the basics of your chosen LLVM binding.
  4. Project Setup:
    • Action: Create a new Rust project using cargo new my_jit_compiler.
    • Add your chosen LLVM binding crate to Cargo.toml.
    • Goal: A compilable Rust project with LLVM dependencies correctly configured.
  5. “Hello, LLVM” in Rust:
    • Action: Write a minimal Rust program that uses the LLVM bindings to:
      • Create an LLVM context and module.
      • Define a simple function (e.g., one that returns a constant).
      • Print the generated LLVM IR to the console.
    • Goal: Successfully generate and view LLVM IR from Rust.

Phase 2: Building the Language Frontend

Our toy language will initially support basic arithmetic expressions (e.g., 1 + 2 * 3).

  1. Lexer (Tokenizer):
    • Action: Implement a lexer that takes a string input and produces a stream of tokens (e.g., NUMBER(1), PLUS, NUMBER(2), STAR, NUMBER(3)).
    • Tools: You can write this manually or explore parser-generator crates like nom or pest for this and the next step.
    • Goal: A working lexer for your simple arithmetic language.
  2. Parser & AST:
    • Action: Implement a parser that consumes the token stream from the lexer and builds an Abstract Syntax Tree (AST) representing the expressions.
      • Example AST nodes: Number(f64), BinaryOp(Box<Expr>, Op, Box<Expr>).
    • Handle operator precedence (e.g., * before +).
    • Goal: A parser that correctly transforms token streams into an AST.

Phase 3: Code Generation (AST to LLVM IR)

This is where you connect your language frontend to LLVM.

  1. AST Traversal for Code Generation:
    • Action: Write a component (e.g., a CodeGen struct) that traverses the AST.
    • For each AST node, generate the corresponding LLVM IR instructions using your chosen LLVM bindings.
      • Number(val) -> ConstantFP::get(context, val)
      • BinaryOp(lhs, op, rhs) -> Recursively generate IR for lhs and rhs, then use LLVM builder methods for arithmetic operations (e.g., build_float_add, build_float_mul).
    • Goal: A function that takes an AST and an LLVM Module (or FunctionBuilder) and populates it with LLVM IR.
  2. Function Definition:
    • Action: Wrap your expression evaluation in an LLVM function. This function will take no arguments (for now) and return the result of the expression.
    • Goal: Generate a complete LLVM function for a given expression.

Phase 4: JIT Execution

  1. Setting up the LLVM JIT Engine:
    • Action: Use your LLVM bindings to create an ExecutionEngine. LLVM provides different JIT engines (e.g., MCJIT, ORC JIT). ORC JIT is more modern and flexible.
    • Goal: An initialized JIT execution engine.
  2. Compiling and Running:
    • Action:
      • Add the LLVM module (containing your generated function) to the JIT engine.
      • Look up the compiled function by name from the JIT engine.
      • Cast the function pointer to the correct Rust function signature (e.g., unsafe extern "C" fn() -> f64).
      • Call the function and print its result.
    • Goal: Successfully execute the JIT-compiled code and get the correct output for your arithmetic expressions.

Phase 5: Enhancements and Further Learning

Once the basic pipeline is working, consider these extensions:

  1. Adding Variables:
    • Extend the language to support variable declarations and usage.
    • Requires managing a symbol table during code generation.
    • Use LLVM’s alloca instruction for stack allocation and load/store for variable access.
  2. Control Flow:
    • Add if/else statements.
    • Requires generating LLVM basic blocks and conditional/unconditional branches (br instruction).
  3. User-Defined Functions:
    • Allow defining and calling functions within your toy language.
    • This is a significant step, involving managing function signatures, arguments, and return values in LLVM IR.
  4. LLVM Optimization Passes:
    • Action: Explore and apply LLVM’s rich set of optimization passes to your generated IR before JIT compilation.
    • Goal: Observe the effect of optimizations on the IR and potentially performance.
  5. Error Handling:
    • Implement robust error reporting for lexing, parsing, and code generation phases.
  6. Ahead-of-Time (AOT) Compilation:
    • Modify your compiler to output an object file or executable instead of JITing.
  7. Exploring ORC JIT:
    • If you started with an older JIT engine, migrate to or explore the features of ORC JIT for more advanced dynamic compilation scenarios.

IV. Resources

V. Milestones (Example)

  1. M1: Successfully generate LLVM IR for a single constant value and print it.
  2. M2: Implement lexer and parser for basic arithmetic expressions (e.g., a + b).
  3. M3: Generate LLVM IR for simple arithmetic expressions.
  4. M4: JIT compile and execute an arithmetic expression, printing the result.
  5. M5: (Stretch) Add support for variables.