Introduction

Tuffy is an experimental optimizing compiler written in Rust, developed with LLM assistance.

Note: This project is currently in an experimental stage.

Key Design Ideas

  • Infinite precision integers — The IR uses a single int type with no fixed bitwidth. Arithmetic operates on mathematical integers; signedness and minimum required bits are derived at use sites. This eliminates zext/sext/trunc noise and lets optimization passes focus on mathematical equivalence.
  • Analysis is also a transformation — Static analyses are expressed as transformations on the IR itself, unifying analysis and optimization and reducing phase ordering problems.
  • Declarative, verified optimizations — Transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4 against the formal IR semantics. A code generator produces Rust implementations from the verified rules, minimizing the trusted code base.
  • Formal-first — Correctness of optimization passes is backed by formal verification from the start, not bolted on after the fact.
  • Hardware-friendly IR — IR representations designed to be cache-friendly and suitable for modern hardware, reducing pointer chasing and improving data locality during compilation.
  • Policy-mechanism separation — Optimization strategies are decoupled from the compiler engine. Users can specify or pin a particular optimization policy to prevent performance regressions when upgrading the compiler.

Documentation Overview

Tuffy IR Language Reference

This document is the reference manual for the Tuffy IR, the intermediate representation used by the tuffy compiler. It describes the type system, instruction set, control flow structure, and textual format of the IR.

Status: This spec describes the currently implemented IR. Features described in the IR Design RFC but not yet implemented are noted as planned.

Source of truth: The formal definitions in Lean 4 (lean/TuffyLean/IR/) are authoritative. This spec is a human-readable description that must conform to the Lean definitions. When in doubt, consult the Lean code.

Introduction

Tuffy IR is a typed, SSA-based intermediate representation organized around a hierarchical control flow graph with SESE (Single Entry, Single Exit) regions.

Key design principles:

  • Infinite precision integers — no fixed bit widths; range constraints via assert nodes
  • Block arguments instead of PHI nodes (Cranelift/MLIR style)
  • Hierarchical CFG — loops and structured control flow are explicit SESE regions
  • Poison-only UB model — no undef; only poison
  • Single data structure — one IR across all compilation stages

Type System

Tuffy IR has a minimal type system supporting scalar types and aggregate types (structs, arrays). Aggregate types flow through the IR as single values; ABI-specific decomposition is handled by the backend legalize pass.

int

Infinite precision integer. No fixed bit width, no signedness. Arithmetic on int is mathematical: add 3, 5 produces 8 with no possibility of overflow.

Range constraints are expressed via value annotations rather than the type itself. The decision of sign-extension vs zero-extension is deferred to instruction selection.
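As a sketch of how this plays out, the following Python model (Python ints are arbitrary precision, like the IR's int) shows mathematical addition combined with a hypothetical result-side :s32 annotation check; the function names and the use of None for poison are illustrative, not the compiler's API.

```python
# Model of infinite precision `add` plus a result-side :s32 annotation.
# Poison is modeled as None. Names here are illustrative only.
S32_MIN, S32_MAX = -(1 << 31), (1 << 31) - 1

def add(a, b):
    return a + b  # never overflows: mathematical integers

def annotate_s32(v):
    # result-side :s32 — an out-of-range result becomes poison
    return v if S32_MIN <= v <= S32_MAX else None

assert add(3, 5) == 8
assert annotate_s32(add(3, 5)) == 8
assert annotate_s32(add(S32_MAX, 1)) is None  # would overflow i32 → poison
```

Note that the addition itself is always exact; only the annotation introduces poison, which is what lets passes reason about the arithmetic as pure mathematics.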

bool

Boolean type with two values: true and false. Distinct from integers — boolean logic is not conflated with integer arithmetic. Comparison instructions (icmp) return bool, and control flow instructions (brif, select) require bool conditions.

To convert between bool and int, use bool_to_int (true → 1, false → 0). The reverse conversion uses icmp.ne val, 0.

byte(N)

Raw memory data of N bytes. Distinct from integers. The byte type preserves pointer fragments and supports per-byte poison tracking. Loads and stores operate on byte types.

Note: byte is defined in the type system but not yet used by any implemented instructions. Memory operations currently use int directly.

ptr(AS)

Pointer with address space AS. Address space 0 is the default. Pointers carry provenance (allocation ID + offset). Different address spaces may have different pointer sizes and semantics.

mem

Abstract memory state token for Memory SSA. Memory tokens flow through the function as regular SSA values, making memory dependencies explicit. The entry block receives the initial memory state as a block parameter of type mem; ret carries the final mem token as an operand.

Memory-producing operations (store, call, fence, atomics) consume a mem token and produce a new one. Plain load consumes a mem token but does not produce one. At CFG join points, memory versions merge via regular block parameters of type mem.

float

IEEE 754 floating point type. Variants: bf16 (bfloat16), f16 (half), f32 (single), f64 (double). Floating point operations (fadd, fsub, etc.) operate on float-typed values. The result type of a float operation matches the operand float width.

vec(VT)

Vector type parameterized by total bit-width. Variants: vec<N> (fixed-width, e.g., 128 for SSE), vec<vscale x N> (scalable, e.g., SVE, RVV). Element count is derived from the bit-width and element size.

struct{T0, T1, ...}

Struct type with field types. Represents a heterogeneous aggregate with indexed fields. Struct values flow through the IR as single SSA values. Field access uses extractvalue and insertvalue instructions with index paths.

Padding must be explicitly represented in the struct type. Padding fields are typically represented as [byte(N); M] (array of N-byte chunks) or byte(N) for individual padding bytes.

Example: struct{int, [byte(8); 1], bool} is a struct with an int field, 8 bytes of padding, and a bool field. The padding is explicit and must be accounted for in field indices.

The backend legalize pass expands struct values into register-sized pieces according to the target ABI (e.g., System V AMD64 ABI rules for struct passing and return).

[T; N]

Array type with element type T and count N. Represents a homogeneous aggregate with N elements of type T. Array values flow through the IR as single SSA values. Element access uses extractvalue and insertvalue instructions with index paths.

Example: [int; 10] is an array of 10 integers.

The backend legalize pass expands array values into register-sized pieces according to the target ABI.

Values

A value in tuffy IR is one of five kinds, as defined in Lean (TuffyLean.IR.Value):

Kind       Description
int(v)     An infinite precision integer with mathematical value v
bool(v)    A boolean value (true or false)
bytes(bs)  A sequence of abstract bytes
ptr(p)     A pointer with provenance (allocation ID + offset)
poison     A poison value — the result of undefined behavior

Abstract Bytes

Each byte in memory is represented as an abstract byte (TuffyLean.IR.AbstractByte):

State                        Description
bits(val)                    A concrete byte value (0–255)
ptrFragment(allocId, index)  The index-th byte of a pointer to allocation allocId
uninit                       Uninitialized memory (a distinct abstract state, not “random bits”)
poison                       A poisoned byte

Pointers and Provenance

A pointer consists of an allocation ID and an integer offset (TuffyLean.IR.Pointer). The allocation ID tracks which allocation the pointer belongs to, enabling provenance-based alias analysis.

Functions

A function is the top-level unit of compilation. It has a name, typed parameters, an optional return type, and a body organized as a hierarchical CFG.

func @name(param_types...) -> ret_type {
  ...
}
  • Name: prefixed with @ in text format.
  • Parameters: a list of types. Parameter values are created via param instructions. Parameters may optionally carry source-level names for readability (see below).
  • Return type: optional. Functions with no return type return void.
  • Body: blocks and nested regions directly inside the function braces. The implicit top-level function region is not written in the text format.

Named Parameters

When source-level parameter names are available (e.g., extracted from MIR debug info), they are displayed in the function signature and param instructions:

func @add(%a: int, %b: int) -> int {
  bb0:
    v0 = param %a
    v1 = param %b
    v2 = add v0, v1
    ret v2
}

Parameter names are prefixed with % in text format. When no name is available, the numeric ABI index is used as a fallback:

func @add(int, int) -> int {
  bb0:
    v0 = param 0
    v1 = param 1

Named parameters are display-only — the internal representation uses numeric ABI indices (Op::Param(u32)) for code generation.

Regions

The CFG is organized as a tree of SESE (Single Entry, Single Exit) regions. Each region contains an ordered sequence of basic blocks and child regions.

Region Kinds

Kind      Description
function  Top-level function region. Every function has exactly one.
loop      Loop region. The entry block is the loop header. Backedges use continue.
plain     Generic SESE region for structured control flow.

Nesting and Implicit Capture

Regions nest to form a tree. Values defined in an outer region are visible in inner regions via implicit capture (VPlan style). There is no explicit capture list — the live-in set is computed on demand.

func @example(int) -> int {
  bb0:
    v0 = iconst 10       // v0 defined in function body

  region loop {
    bb1(v1: int):
      v2 = add v1, v0    // v0 captured implicitly from outer scope
      ...
  }
}

Loop Regions

A loop region’s entry block is the loop header. The continue terminator transfers control back to the loop header, passing new values for the header’s block arguments. Exiting a loop is done by branching to a block outside the loop region.

Basic Blocks

A basic block is a straight-line sequence of instructions with no internal control flow. Blocks are named bb0, bb1, etc. in text format.

Block Arguments

Tuffy IR uses block arguments instead of PHI nodes to represent values that merge at control flow join points. Block arguments are declared in the block header and receive values from predecessor branches:

bb0:
  v0 = iconst 1
  br bb1(v0)

bb1(v1: int):       // v1 is a block argument
  ret v1

Each branch instruction (br, brif, continue) passes values to the target block’s arguments. The number and types of passed values must match the target block’s argument list.

Instruction Ordering

Instructions within a block execute sequentially. The last instruction in a block must be a terminator (ret, br, brif, continue, region_yield).

Instructions

Every instruction produces a value (named vN in text format) and has a result type. Terminators are not separated from regular instructions — they are simply the last instruction in a block by convention.

Constants

iconst

vN = iconst <immediate>

Produces an integer constant. The immediate is an arbitrary-precision integer, matching the Lean spec’s Int type.

Semantics: Value.int(immediate)

param

vN = param <index>
vN = param %name

References a function parameter by index (0-based) or by name. When source-level parameter names are available, the named form param %name is used in the text format. The internal representation always uses the numeric ABI index. This is not a true constant but creates a named value for the parameter within the instruction stream.

Arithmetic

All integer arithmetic operates on mathematical integers with infinite precision. There is no overflow. The formal semantics are defined in TuffyLean.IR.Semantics.

add

vN = add vA, vB

Integer addition. Semantics: evalAdd(a, b) = a + b

sub

vN = sub vA, vB

Integer subtraction. Semantics: evalSub(a, b) = a - b

mul

vN = mul vA, vB

Integer multiplication. Semantics: evalMul(a, b) = a * b

div

vN = div vA, vB

Integer division. Produces poison if vB is zero. Signedness is a property of operand annotations, not the operation itself — in the infinite precision integer model, signed and unsigned division are mathematically identical. Semantics: evalDiv(a, b) = if b = 0 then poison else a / b

rem

vN = rem vA, vB

Integer remainder. Produces poison if vB is zero. Signedness is a property of operand annotations, not the operation itself. Semantics: evalRem(a, b) = if b = 0 then poison else a % b

and

vN = and vA, vB

Bitwise AND on infinite precision two’s complement integers. Semantics: evalAnd(a, b) = Int.land a b

or

vN = or vA, vB

Bitwise OR on infinite precision two’s complement integers. Semantics: evalOr(a, b) = Int.lor a b

xor

vN = xor vA, vB

Bitwise XOR on infinite precision two’s complement integers. Semantics: evalXor(a, b) = Int.xor a b

shl

vN = shl vA, vB

Left shift. Produces poison if the shift amount vB is negative. Semantics: evalShl(a, b) = if b < 0 then poison else a <<< b

shr

vN = shr vA, vB

Right shift. Produces poison if the shift amount vB is negative. Signedness is a property of operand annotations, not the operation. In infinite precision, logical and arithmetic right shift are identical. Semantics: evalShr(a, b) = if b < 0 then poison else a >>> b
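A small Python sketch of both shifts under this model (Python's >> on ints is an arithmetic/floor shift, which matches the single shr; None stands in for poison):

```python
# Sketch of shl/shr on mathematical integers; poison modeled as None.
def shl(a, b):
    return None if b < 0 else a << b

def shr(a, b):
    # In infinite precision there are no "top bits" to fill, so logical and
    # arithmetic right shift coincide; Python's >> (floor shift) matches.
    return None if b < 0 else a >> b

assert shl(3, 4) == 48
assert shr(-8, 1) == -4   # floor(-8 / 2)
assert shr(-7, 1) == -4   # floor division, not truncation toward zero
assert shr(5, -1) is None # negative shift amount → poison
```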

count_ones

vN = count_ones vA

Population count: returns the number of set bits in the binary representation of vA. Produces poison if vA is negative. Semantics: evalCountOnes(a) = if a < 0 then poison else popcount(a)

merge

vN = merge.<width> vA, vB

Replace the low width bits of vA with the low width bits of vB, producing a single integer. Produces poison if width is 0. Semantics: evalMerge(a, b, width) = if width = 0 then poison else (a with low width bits cleared) | (b AND (2^width - 1))

split

vHi, vLo = split.<width> vA

Decompose vA at bit position width. Produces two results:

  • vLo = the low width bits of vA (zero-extended)
  • vHi = vA right-shifted by width bits

Produces poison if width is 0. This is a multi-result instruction. Semantics: evalSplitHi(a, width) = a >>> width, evalSplitLo(a, width) = a mod 2^width
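The interplay between merge and split can be sketched in Python as follows; None models poison, and the function names are illustrative:

```python
# Sketch of merge/split semantics on mathematical integers.
def merge(a, b, width):
    if width == 0:
        return None  # poison
    mask = (1 << width) - 1
    # clear the low `width` bits of a, then insert the low bits of b
    return ((a >> width) << width) | (b & mask)

def split(a, width):
    if width == 0:
        return None  # poison
    return a >> width, a % (1 << width)  # (hi, lo); lo is zero-extended

assert merge(0x1200, 0xAB, 8) == 0x12AB
assert split(0x12AB, 8) == (0x12, 0xAB)
# split recovers what merge put in each half:
assert split(merge(0x1234, 0xCD, 8), 8) == (0x12, 0xCD)
```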

clmul

vN = clmul vA, vB

Carry-less multiplication (polynomial multiplication over GF(2)). Multiplies vA and vB using XOR instead of addition for accumulating partial products. Produces poison if either operand is negative. Semantics: evalClmul(a, b) = if a < 0 ∨ b < 0 then poison else clmulNat(a, b)
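The shift-and-XOR structure of carry-less multiplication can be sketched directly in Python (None models poison; this is an illustrative model, not the compiler's implementation):

```python
# Carry-less multiplication: schoolbook multiply, but partial products
# accumulate with XOR instead of addition.
def clmul(a, b):
    if a < 0 or b < 0:
        return None  # poison
    acc = 0
    while b:
        if b & 1:
            acc ^= a  # XOR, not +, when accumulating
        a <<= 1
        b >>= 1
    return acc

assert clmul(0b11, 0b11) == 0b101  # (x+1)^2 = x^2+1 over GF(2)
assert clmul(2, 3) == 6            # agrees with mul when no carries occur
assert clmul(-1, 2) is None        # negative operand → poison
```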

Floating Point Arithmetic

Floating point instructions operate on values of float type (bf16, f16, f32, f64). The formal semantics use Lean 4’s native Float type (IEEE 754 binary64) as a placeholder model; it will be refined when full NaN payload and denormal semantics are decided.

Rewrite flags

Binary floating-point arithmetic instructions (fadd, fsub, fmul, fdiv) may carry optional rewrite flags that grant the optimizer permission to apply algebraic transformations. These flags do not change the instruction’s operational semantics — they only widen the set of legal rewrites.

Flag      Meaning
reassoc   Allow associativity / commutativity reordering
contract  Allow contraction (e.g., a*b+c → fma(a,b,c))

Flags appear between the opcode and the operands:

vN = fadd reassoc contract vA, vB

When no flags are present the instruction follows strict IEEE 754 semantics. Mirrors TuffyLean.IR.FpRewriteFlags.

fadd

vN = fadd [flags] vA, vB

Floating point addition. Semantics: evalFAdd(a, b) = a + b

fsub

vN = fsub [flags] vA, vB

Floating point subtraction. Semantics: evalFSub(a, b) = a - b

fmul

vN = fmul [flags] vA, vB

Floating point multiplication. Semantics: evalFMul(a, b) = a * b

fdiv

vN = fdiv [flags] vA, vB

Floating point division. Semantics: evalFDiv(a, b) = a / b

fneg

vN = fneg vA

Floating point negation. Semantics: evalFNeg(a) = -a

fabs

vN = fabs vA

Floating point absolute value. Semantics: evalFAbs(a) = Float.abs a

copysign

vN = copysign vMag, vSign

Produce a value with the magnitude of vMag and the sign bit of vSign. Semantics: evalCopySign(mag, sign) = if sign < 0 then -(abs mag) else abs mag

Comparison

icmp

vN = icmp.<pred> vA, vB

Integer comparison. Returns a bool value: true if the comparison holds, false otherwise.

Predicates:

Predicate  Description
eq         Equal
ne         Not equal
lt         Less than (signedness from annotation)
le         Less than or equal (signedness from annotation)
gt         Greater than (signedness from annotation)
ge         Greater than or equal (signedness from annotation)

Semantics: evalICmp(op, a, b) = bool(op(a, b))

select

vN = select vCond, vTrue, vFalse

Conditional select. If vCond is true, produces vTrue; otherwise produces vFalse. The condition must be of type bool. The result type matches the type of vTrue/vFalse.

Semantics: evalSelect(cond, tv, fv) = if cond then tv else fv

bool_to_int

vN = bool_to_int vA

Convert a bool to an int: true → 1, false → 0.

Semantics: evalBoolToInt(b) = if b then 1 else 0

Value Annotations

Range constraints and bit-level facts are encoded as annotations on value definitions (result-side) and use edges (use-side), rather than as standalone instructions. See RFC: at-use-annotations for the full design.

Annotation types

Annotation             Meaning
:s<N>                  Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1]
:u<N>                  Value is in unsigned N-bit range [0, 2^N-1]
:known(<ternary>)      Per-bit four-state constraint
:nofpclass(<classes>)  Float value must not belong to the listed FP classes
(none)                 No constraint; unconstrained int

Annotations are composable: :s32:known(...) applies both constraints simultaneously.

Known bits encoding

Each bit in a known annotation is one of four states:

Symbol  Meaning
0       Bit is known to be 0
1       Bit is known to be 1
?       Unknown — bit is demanded but value not determined
x       Don’t-care — bit is not demanded
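A minimal Python sketch of checking a concrete value against such a pattern; the pattern is written MSB-first over its length, bits above the pattern are treated as don't-care here, and the function name is illustrative:

```python
# Check a value against a four-state :known(...) pattern ('0','1','?','x').
def matches_known(value, pattern):
    for i, sym in enumerate(reversed(pattern)):  # index i = bit i (LSB first)
        bit = (value >> i) & 1
        if sym == '0' and bit != 0:
            return False
        if sym == '1' and bit != 1:
            return False
        # '?' (demanded, unknown) and 'x' (not demanded) accept either bit
    return True

assert matches_known(0b1010, "10?0")      # the '?' bit may be 0 or 1
assert matches_known(0b1000, "10?0")
assert not matches_known(0b1011, "10?0")  # lowest bit must be 0
```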

nofpclass

Constrains which IEEE 754 floating-point value classes a float value may belong to. If the value falls into an excluded class, the result is poison. This is separate from integer annotations (:s<N>, :u<N>) and mirrors LLVM’s nofpclass attribute.

The 10 individual FP classes (mirroring LLVM’s FPClassTest):

Class  Description
snan   Signaling NaN
qnan   Quiet NaN
ninf   Negative infinity
nnorm  Negative normal
nsub   Negative subnormal
nzero  Negative zero
pzero  Positive zero
psub   Positive subnormal
pnorm  Positive normal
pinf   Positive infinity

Convenience groups:

Group  Expands to
nan    snan qnan
inf    ninf pinf
zero   nzero pzero

Syntax:

v0:nofpclass(nan inf) = fadd v1, v2   // result must not be NaN or ±Inf
v1 = fsub v0:nofpclass(nzero), v2     // use-side: v0 must not be -0.0

Mirrors TuffyLean.IR.FpClassMask.
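As an illustration of the class taxonomy, the following Python sketch classifies a binary64 value into these classes and checks it against an excluded set; snan/qnan cannot be distinguished from Python, so both collapse to a single 'nan' here, and all names are illustrative:

```python
import math

SMALLEST_NORMAL = 2.2250738585072014e-308  # min positive normal double

def fp_class(x):
    if math.isnan(x):
        return 'nan'                      # snan/qnan not distinguished here
    if math.isinf(x):
        return 'ninf' if x < 0 else 'pinf'
    neg = math.copysign(1.0, x) < 0       # catches the sign of -0.0 too
    if x == 0.0:
        return 'nzero' if neg else 'pzero'
    if abs(x) < SMALLEST_NORMAL:
        return 'nsub' if neg else 'psub'
    return 'nnorm' if neg else 'pnorm'

def violates_nofpclass(x, excluded):
    # a value falling into an excluded class would make the result poison
    return fp_class(x) in excluded

assert fp_class(-0.0) == 'nzero'
assert fp_class(1.0) == 'pnorm'
assert fp_class(5e-324) == 'psub'        # smallest positive subnormal
assert violates_nofpclass(float('inf'), {'ninf', 'pinf'})
```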

Result-side annotations

Declared on the value produced by an instruction. Violation causes the instruction to produce poison.

v0:s32 = add v1, v2    // if true result outside [-2^31, 2^31-1], v0 is poison

Use-side annotations

Declared on an operand of a consuming instruction. Violation causes the consuming instruction to produce poison. May be stronger than the result-side annotation.

v1 = foo v0:u8          // if v0 is outside [0, 255], foo produces poison

Function signature annotations

Function signatures carry annotations as ABI-level contracts:

func @add_i32(int:s32, int:s32) -> int:s32 { ... }

Parameter annotations are caller guarantees; return annotations are callee guarantees.

Memory Operations

All memory operations participate in Memory SSA. Operations that modify memory state consume a mem token and produce a new one. Plain loads consume a mem token but do not produce one. See types.md for details on the mem type.

load

vN = load.<size> vPtr, vMem

Load size bytes from the address pointed to by vPtr. Takes a mem token as input (MemoryUse). Returns a data value only — does not produce a new mem token.

Semantics: evalLoad(mem, addr, size) = bytes [mem[addr], mem[addr+1], ..., mem[addr+size-1]]

store

vN = store.<size> vVal, vPtr, vMem

Store a value to the address pointed to by vPtr. Takes a mem token as input (MemoryDef) and produces a new mem token as its result.

Semantics: evalStore(mem, addr, bs) = mem[addr..addr+len(bs)] := bs

stack_slot

vN = stack_slot <bytes>

Allocate bytes bytes on the stack. Returns a ptr(0) to the allocated memory.

memcpy

vN = memcpy vDst, vSrc, vCount, align=<N>, vMem

Copy vCount bytes from vSrc to vDst. The source and destination regions must not overlap (non-overlapping memcpy semantics). align is the minimum alignment hint for both pointers (must be a power of two). Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemCopy(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count))

memmove

vN = memmove vDst, vSrc, vCount, align=<N>, vMem

Copy vCount bytes from vSrc to vDst. Unlike memcpy, overlapping regions are handled correctly: all source bytes are read before any destination bytes are written. align is the minimum alignment hint for both pointers. Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemMove(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count)) (bytes are captured from mem before the store, making overlap safe)

memset

vN = memset vDst, vVal, vCount, align=<N>, vMem

Fill vCount bytes at vDst with the byte value vVal (low 8 bits). align is the minimum alignment hint for the destination pointer. Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemSet(mem, dst, val, count) = evalStore(mem, dst, replicate(count, val & 0xFF))

Atomic Operations

Atomic operations provide thread-safe memory access with explicit memory ordering constraints. The formal semantics in Lean define sequential-only behavior; a formal concurrency model (e.g., based on C11) is TBD.

Enumerations

MemoryOrdering — memory ordering constraints for atomic operations:

Ordering  Description
relaxed   No ordering constraints
acquire   Subsequent reads see writes from the releasing thread
release   Prior writes are visible to the acquiring thread
acqrel    Combined acquire + release
seqcst    Sequentially consistent total order

AtomicRmwOp — read-modify-write operation kinds:

Op    Description
xchg  Exchange (swap)
add   Integer addition
sub   Integer subtraction
and   Bitwise AND
or    Bitwise OR
xor   Bitwise XOR

load.atomic

vM, vN = load.atomic.<ordering> vPtr, vMem

Atomic load from pointer vPtr with the specified memory ordering. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the loaded data value.

store.atomic

vN = store.atomic.<ordering> vVal, vPtr, vMem

Atomic store of vVal to pointer vPtr with the specified memory ordering. Takes a mem token as input (MemoryDef) and produces a new mem token.

rmw

vM, vN = rmw.<op>.<ordering> vPtr, vVal, vMem

Atomic read-modify-write. Atomically reads the value at vPtr, applies <op> with vVal, writes the result back. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the original value.

cmpxchg

vM, vN = cmpxchg.<success_ord>.<failure_ord> vPtr, vExpected, vDesired, vMem

Atomic compare-and-exchange. Atomically compares the value at vPtr with vExpected; if equal, writes vDesired. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the old value. The caller uses icmp to determine whether the exchange succeeded. <success_ord> applies on successful exchange; <failure_ord> applies on failure.
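Since the formal semantics are sequential-only for now, rmw and cmpxchg can be sketched as plain read-then-write steps in Python, ignoring ordering entirely; memory is a dict and a new dict stands in for the new mem token (illustrative names):

```python
import operator

# Sequential sketch of rmw/cmpxchg; ordering annotations are ignored.
RMW_OPS = {'xchg': lambda old, v: v, 'add': operator.add, 'sub': operator.sub,
           'and': operator.and_, 'or': operator.or_, 'xor': operator.xor}

def rmw(mem, addr, op, val):
    old = mem[addr]
    new = dict(mem)
    new[addr] = RMW_OPS[op](old, val)
    return new, old                    # (new mem token, original value)

def cmpxchg(mem, addr, expected, desired):
    old = mem[addr]
    new = dict(mem)
    if old == expected:
        new[addr] = desired
    return new, old                    # caller compares old vs expected

m, old = rmw({0: 5}, 0, 'add', 3)
assert (m[0], old) == (8, 5)
m, old = cmpxchg(m, 0, 8, 42)          # succeeds: old == expected
assert (m[0], old) == (42, 8)
m, old = cmpxchg(m, 0, 0, 99)          # fails: memory unchanged
assert (m[0], old) == (42, 42)
```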

fence

vN = fence.<ordering> vMem

Memory fence. Establishes ordering constraints without accessing memory. Takes a mem token as input (MemoryDef) and produces a new mem token.

Type Conversion

bitcast

vN = bitcast vA

Reinterpret the bits of vA as a different type. The result type is determined by the instruction’s type annotation.

sext

vN = sext vA, <bits>

Sign-extend vA to bits bits. Used during lowering to make bit widths explicit for instruction selection.

zext

vN = zext vA, <bits>

Zero-extend vA to bits bits. Used during lowering to make bit widths explicit for instruction selection.

Pointer Operations

Pointer operations manipulate pointers with explicit provenance tracking. The formal semantics are defined in TuffyLean.IR.Semantics.

ptradd

vN = ptradd vPtr, vOffset

Pointer addition. Offsets the pointer vPtr by vOffset bytes. The result preserves the provenance of the base pointer. Semantics: evalPtrAdd(base, offset) = ptr { allocId = base.allocId, offset = base.offset + offset }

ptrdiff

vN = ptrdiff vA, vB

Pointer difference. Computes the byte offset between two pointers. Both pointers must belong to the same allocation; otherwise the result is poison. Semantics: evalPtrDiff(a, b) = if a.allocId = b.allocId then a.offset - b.offset else poison
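Following the stated semantics, provenance-carrying pointers can be modeled as (allocation ID, offset) pairs; this Python sketch (None as poison, names illustrative) shows how ptradd preserves provenance while ptrdiff demands matching allocations:

```python
from dataclasses import dataclass

# Pointer = allocation ID + byte offset (provenance is the alloc_id).
@dataclass(frozen=True)
class Ptr:
    alloc_id: int
    offset: int

def ptradd(p, off):
    # offset moves; provenance (alloc_id) is preserved
    return Ptr(p.alloc_id, p.offset + off)

def ptrdiff(a, b):
    if a.alloc_id != b.alloc_id:
        return None  # cross-allocation difference → poison
    return a.offset - b.offset

p = Ptr(alloc_id=7, offset=0)
q = ptradd(p, 16)
assert ptrdiff(q, p) == 16
assert ptrdiff(q, Ptr(alloc_id=8, offset=16)) is None
```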

ptrtoint

vN = ptrtoint vPtr

Convert a pointer to an integer. The provenance is captured — the resulting integer retains knowledge that it came from a pointer. Semantics: evalPtrToInt(p) = p.offset

ptrtoaddr

vN = ptrtoaddr vPtr

Extract the address from a pointer, discarding provenance. Returns a plain integer with no provenance information. Semantics: evalPtrToAddr(p) = p.offset

inttoptr

vN = inttoptr vAddr

Create a pointer from an integer address. The resulting pointer has no valid provenance (wildcard provenance). Semantics: evalIntToPtr(addr, allocId) = ptr { allocId, offset = addr }

Function Calls

symbol_addr

vN = symbol_addr @name

Load the address of a named symbol (function or static data). The @name refers to an entry in the module’s symbol table. The result type is ptr.

This instruction makes function calls and static data references self-contained in the IR — the callee of a call is a value produced by symbol_addr rather than a side-channel mapping.

Semantics: Produces a pointer to the symbol identified by @name.

call

vM = call vCallee(vArg0, vArg1, ...), vMem
vM, vN = call vCallee(vArg0, vArg1, ...), vMem -> <ret_type>

Call the function pointed to by vCallee with the given arguments. Takes a mem token as input (MemoryDef). For void calls, produces a single mem token result. For non-void calls, produces two results: a new mem token and the return value.

Terminators

Terminators are instructions that end a basic block by transferring control flow. By convention, they are the last instruction in a block.

ret

ret vA, vMem
ret vMem

Return from the function. Takes a mem token as the final memory state operand. Optionally returns a data value. A function with a return type must return a value; a void function uses ret vMem with only the mem token.

br

br bbN(vA, vB, ...)
br bbN

Unconditional branch to block bbN, passing values as block arguments. If the target block has no arguments, the argument list is omitted.

brif

brif vCond, bbThen(args...), bbElse(args...)
brif vCond, bbThen, bbElse

Conditional branch. The condition vCond must be of type bool. If true, branches to bbThen; otherwise branches to bbElse. Each target may receive block arguments.

continue

continue vA, vB, ...
continue

Loop backedge. Transfers control back to the entry block of the enclosing loop region, passing values as the header block’s arguments. Only valid inside a loop region.

region_yield

region_yield vA, vB, ...
region_yield

Exit the enclosing region, yielding values to the parent region. Used for structured control flow where a region produces result values.

unreachable

unreachable

Indicates that control flow should never reach this point. If executed, the behavior is undefined (the optimizer may assume this path is dead).

trap

trap

Unconditionally abort execution. Used for runtime checks such as failed assertions (e.g., division-by-zero guards). Unlike unreachable, reaching a trap is not UB — it is a well-defined program abort.

Text Format

Tuffy IR uses a Cranelift-inspired text format for human-readable output.

Naming Conventions

Entity     Format         Example
Functions  @name          @add, @factorial
Values     vN             v0, v1, v2
Blocks     bbN            bb0, bb1
Regions    region <kind>  region loop, region plain

Values are numbered sequentially (v0, v1, v2, …) in region tree order. Within each block, block arguments are numbered before instruction results.

Grammar

function     ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list   ::= (type annotation (',' type annotation)*)?
body         ::= (block | region)*
region       ::= 'region' region_kind '{' region_body '}'
region_kind  ::= 'loop' | 'plain'
region_body  ::= (block | region)*
block        ::= block_header instruction*
block_header ::= 'bb' N block_args? ':'
block_args   ::= '(' block_arg (',' block_arg)* ')'
block_arg    ::= value ':' type
instruction  ::= (value annotation '=')? opcode operands
operand      ::= value annotation
annotation   ::= (':' range_ann)*
range_ann    ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary      ::= [01?x_]+
value        ::= 'v' N
type         ::= 'int' | 'byte' | 'ptr'

Complete Example

A factorial function demonstrating nested regions, block arguments, and control flow:

func @factorial(int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1 = iconst 1
    v2 = iconst 1
    br bb1(v2, v1)

  region loop {
    bb1(v4: int, v5: int):
      v6 = icmp.le v5:s32, v0:s32
      brif v6, bb2, bb3

    bb2:
      v8 = mul v4, v5
      v9 = add v5, v1    // increment the counter toward v0
      continue v8, v9
  }

  bb3:
    ret v4:s32
}

Planned Features

The following features are described in the IR Design RFC but are not yet implemented in the Rust codebase.

Floating Point Semantics

The IR adopts IEEE 754-2008 as the floating-point semantics standard. Basic operations (fadd, fsub, fmul, fdiv, fneg, fabs, copysign) are implemented. Per-instruction rewrite flags (reassoc, contract) and float value class constraints (nofpclass) are defined.

IEEE 754 Lean library

The current Lean model uses Lean 4’s native Float type (IEEE 754 binary64) as a placeholder. Float is opaque — it maps directly to hardware double and cannot be unfolded in proofs. Properties that follow from the IEEE 754 standard (e.g., “fadd produces -0.0 only when both operands are -0.0”) must be axiomatized (see IR/FloatAxioms.lean). This is unsatisfactory for a formal-first project.

We need a Lean 4 library that provides a transparent, proof-friendly model of IEEE 754 floating-point arithmetic covering all widths (f16, bf16, f32, f64), rounding modes, NaN payloads, subnormals, and signed zeros. Existing efforts:

  • HOLFloat-Lean — formalizes FP semantics in Lean 4, building on prior HOL formalizations.
  • Flean — individual effort for IEEE 754 in Lean; requires rewrite.
  • Flocq port — the Coq Flocq library is the most mature FP formalization; porting it to Lean 4 is a community proposal.

Until a suitable library is available, tuffy will continue axiomatizing IEEE 754 properties as needed. The axiom set should be kept minimal and well-documented so that it can be discharged once a real library is adopted.

Scalable Vector Types

Vector types parameterized by total bit-width, not element count. Element count is derived: count = total_bits / element_bits. Scalable vectors use vscale × base_width, where vscale is a runtime constant determined by hardware (cf. SVE, RVV). Well-formedness: total_bits % element_bits == 0 and lane count must be a power of two. See the scalable vector RFC (planned).
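The lane-count derivation and well-formedness rule can be sketched as follows (None marks an ill-formed type; the function name is illustrative):

```python
# Derive the lane count of a vector type from total and element bit-widths.
def lane_count(total_bits, element_bits):
    if total_bits % element_bits != 0:
        return None  # ill-formed: elements don't tile the vector exactly
    n = total_bits // element_bits
    if n & (n - 1) != 0:
        return None  # ill-formed: lane count must be a power of two
    return n

assert lane_count(128, 32) == 4   # e.g., 4 lanes of f32 in a 128-bit vector
assert lane_count(128, 24) is None
assert lane_count(96, 32) is None # 3 lanes: not a power of two
```

For scalable vectors the same rule would apply to the base width, with the runtime vscale multiplying the lane count uniformly.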

Memory SSA

Memory dependencies encoded directly into the IR as memory version tokens on load/store operations, enabling progressive refinement as alias analysis improves.

Atomic Operations

Atomic operations (load.atomic, store.atomic, rmw, cmpxchg, fence) are implemented with MemoryOrdering and AtomicRmwOp enumerations. See Atomic Operations. The formal concurrency model is TBD — current Lean semantics define sequential behavior only.

Control Flow Structuralization

Translation from rustc MIR to tuffy IR with automatic structuralization of loops and branches into SESE regions. Irreducible control flow remains as flat CFG within a region.

Initial Design Discussion

  • Created: 2026-02-07
  • Status: Frozen — This document is a historical record. Do not modify.

Background

Tuffy is an experimental optimizing compiler written in Rust with LLM assistance. This document records the initial design discussion and key decisions made before implementation begins.

Long-term Vision

Tuffy aims to be a production-grade, language-agnostic optimizing compiler backend:

  • Multi-language backend — Serve as a codegen backend for multiple programming languages, not just Rust
  • Custom IR — Design and implement a custom intermediate representation with optimization passes
  • Multi-platform — Support multiple target architectures from the ground up
  • Balanced optimization — Pursue both fast compilation and competitive runtime performance
  • Formal verification — Provide formal correctness guarantees for optimization passes

Short-term Milestone

The first milestone is rustc_codegen_tuffy, a codegen backend for rustc as an alternative to LLVM.

Key Design Decisions

  1. Target platform: Multi-platform architecture from the start, with x86-64 as the initial implementation target
  2. Optimization direction: Balance between compilation speed and runtime performance (not exclusively pursuing either)
  3. IR design: Custom IR — build our own intermediate representation, perform optimizations on it, then lower to machine code (rather than translating directly from rustc MIR)
  4. IR interpreter: Implement a tuffy IR interpreter for testing and validation, similar to Miri
  5. Backend strategy: Implement rustc_codegen_ssa traits to reuse existing MIR lowering code (~10,000 lines). Builder generates tuffy IR instead of LLVM IR. Same approach as rustc_codegen_cranelift. May switch to direct MIR traversal later if more control is needed.

IR Design

Structure:

  • Single data structure for all stages — no separate SDAG/MIR like LLVM. Different stages have different normalization (legalization) constraints on the same IR, progressively lowering toward machine code.

UB model (following Alive2):

  • Only poison — no undef. Simplifies semantics and avoids the well-known undef/poison confusion in LLVM.
  • Uninitialized memory is allowed at the memory level, but values are either concrete or poison.

Byte type (b<N>):

  • Introduced from the start, based on the LLVM byte type RFC. Represents raw memory data distinct from integers, with per-byte poison tracking. Enables sound memcpy lowering and load merging.

Integer type:

  • A single int type with infinite precision — no fixed bitwidth (no i8/i32/i64).
  • No overflow: arithmetic operations are mathematical (e.g., add is true addition).
  • Range constraints via assert nodes: e.g., assertsext(val, 32) constrains a value to the 32-bit signed range, producing poison if violated.
  • zext/sext/trunc instructions are eliminated from the IR.
  • Bitwise operations (and/or/xor/shift) are also defined on infinite precision integers.

At-use bit-level analysis (encoded in IR):

  • Each use edge carries bit-level annotations: which bits are demanded, and which bits are known to be 0 or 1.
  • UB constraints (from assert nodes, memory operations, ABI boundaries) propagate backward through the IR, refining known bits on upstream values.
  • This is analogous to LLVM’s KnownBits and DemandedBits analyses, but promoted to first-class IR constructs rather than side analyses.
  • Follows the “analysis is also a transformation” principle: analysis results are encoded directly into the IR, not stored in external tables.
  • Instruction selection in the final stage derives concrete machine types from these at-use annotations.

Representation form:

  • Hierarchical CFG with nested SESE regions. Loops and branches are explicitly represented as regions, not inferred from CFG back-edges.
  • Region internals use traditional SSA basic blocks with fixed instruction ordering.
  • Translation from rustc MIR requires control flow structuralization to recover nested regions from the flat CFG.
  • Enables early loop optimizations (vectorization, LICM, unrolling) by preserving loop structure in the IR.
  • Inspired by LLVM VPlan’s hierarchical CFG and recipe-based transformation model.

Memory model:

  • Abstract pointer model: pointer = allocation ID + offset, preserving full provenance.
  • Two ptr-to-int instructions with distinct semantics:
    • ptrtoint: converts pointer to integer, capturing provenance. inttoptr(ptrtoint(p)) is a safe roundtrip.
    • ptrtoaddr: extracts only the address portion, discarding provenance.
  • Four-state byte representation: Bits(u8) | PtrFragment(Pointer, n) | Uninit | Poison. Each byte in memory is one of these states.
  • Uninitialized memory (Uninit) is a distinct abstract state, not “random bits”. Any operation on uninit values is UB (produces poison).
  • Poison can exist at the byte level in memory, aligned with the per-byte poison tracking of the byte type.

Testing Strategy

Validation and benchmarking will use the following targets (in progressive order):

  1. Internal unit tests — small test cases for basic codegen correctness (arithmetic, control flow, function calls)
  2. no_std programs — minimal programs to validate the backend without standard library dependencies
  3. serde — widely used crate with heavy generics and macros, tests monomorphization and trait dispatch
  4. rustc test suite — rustc’s ui tests for broad correctness coverage
  5. ripgrep — real-world CLI project for end-to-end correctness and performance comparison against LLVM backend
  6. clippy — later-stage validation target due to tight coupling with rustc internals

Notes

  • Self-hosting is not a separate goal since tuffy is written in Rust — once the backend can compile Rust, self-hosting follows naturally
  • JIT support and outperforming LLVM are not current goals but may be revisited later

Target Abstraction

  • Triple-based target identification, following LLVM’s convention.
  • TargetLowering design follows LLVM’s approach (legal types, operation lowering, register classes).
  • Sub-method override via command-line: individual TargetLowering properties (pointer size, legal integer widths, max vector width, etc.) can be overridden via --target-override key=value flags. This enables testing IR-level passes against specific target properties without implementing a full architecture backend.
  • Cost model: deferred, not yet designed.

IR Memory Layout

  • Encapsulation: memory layout is an implementation detail, hidden behind a stable API. The rest of the compiler operates on opaque handles, not raw pointers. Layout can be changed without affecting pass code.
  • Arena + index: IR nodes are allocated in arenas. References are u32 indices (opaque handles), not pointers. Reduces reference size and enables serialization.
  • Contiguous instruction storage: instructions within the same basic block are stored contiguously in the arena. Iterating a BB’s instructions is a linear memory scan, not pointer chasing.
  • Tombstone + periodic compaction: deleted instructions are marked as tombstones. Periodic compaction restores contiguity.
  • No SOA: struct-of-arrays layout rejected; standard AOS with contiguous storage is sufficient.
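The arena, opaque-handle, and tombstone ideas above can be sketched as follows. All names here are illustrative stand-ins, and handle remapping after compaction is omitted; tuffy's real data structures will differ.

```rust
// Minimal sketch of the arena-plus-index layout described above. Instructions
// live contiguously in a Vec, references are opaque u32 handles rather than
// pointers, and deletion leaves a tombstone for a later compaction pass.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct InstId(u32); // opaque handle, not a pointer

#[derive(Clone, Debug, PartialEq)]
enum Slot {
    Live(String), // stand-in for a real instruction payload
    Tombstone,    // deleted instruction awaiting compaction
}

struct Arena {
    slots: Vec<Slot>,
}

impl Arena {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    fn alloc(&mut self, inst: String) -> InstId {
        self.slots.push(Slot::Live(inst));
        InstId((self.slots.len() - 1) as u32)
    }

    fn delete(&mut self, id: InstId) {
        self.slots[id.0 as usize] = Slot::Tombstone;
    }

    /// Periodic compaction: drop tombstones to restore contiguity.
    /// (Remapping surviving handles is omitted in this sketch.)
    fn compact(&mut self) {
        self.slots.retain(|s| *s != Slot::Tombstone);
    }

    fn live_count(&self) -> usize {
        self.slots.iter().filter(|s| matches!(s, Slot::Live(_))).count()
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.alloc("add".into());
    let _b = arena.alloc("mul".into());
    arena.delete(a); // marks a tombstone; storage stays contiguous
    assert_eq!(arena.live_count(), 1);
    arena.compact(); // reclaims the tombstone
    assert_eq!(arena.slots.len(), 1);
    println!("ok");
}
```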

Crate Structure

Tuffy is a language-agnostic compiler backend. rustc_codegen_tuffy is one frontend adapter; future language frontends reuse the entire optimization and backend stack.

tuffy/
├── tuffy_ir/              — IR definition: types, instructions, CFG, memory model
├── tuffy_ir_interp/       — IR interpreter (miri-like, UB-aware)
├── tuffy_opt/             — optimization passes (generated from declarative rewrite rules)
├── tuffy_target/          — target abstraction layer (trait + CLI override)
├── tuffy_target_x86/      — x86 backend (i386 + x86-64)
├── tuffy_codegen/         — instruction selection, register allocation, machine code emission
├── tuffy_trace/           — rewrite path tracing
├── tuffy_tv/              — translation validation (Alive2-style SMT verifier)
├── rustc_codegen_tuffy/   — rustc frontend adapter (MIR → tuffy IR)

Dependencies: tuffy_ir is the bottom layer with no internal dependencies. All other crates depend on it directly or transitively.

Object file emission: use the object crate (same as Cranelift and rustc) for writing ELF, Mach-O, COFF object files. No custom object file handling.

References

A curated list of reference documents and resources for the tuffy project.

Compiler Optimization

  • Future Directions for Optimizing Compilers — Nuno P. Lopes, John Regehr, 2018. Discusses challenges in making optimizing compilers faster, less buggy, and more capable.
  • Cranelift: Using E-Graphs for Verified, Cooperating Middle-End Optimizations — Chris Fallin, Bytecode Alliance. Proposes acyclic e-graphs (aegraphs) as a unified middle-end optimization framework for Cranelift, replacing separate GVN/LICM/constant-folding passes. Key ideas: pure nodes float above a side-effect skeleton (the original CFG), scoped elaboration naturally subsumes GVN and LICM, and ISLE-based declarative rewrite rules enable verifiable optimizations. Directly relevant to tuffy’s declarative rewrite rules and “analysis is also a transformation” design.
  • optir — Jamey Sharp. Research prototype combining e-graph equality saturation (via egg) with RVSDG (Regionalized Value State Dependence Graph). Control flow is encoded structurally via switch/loop/func operators rather than explicit branches, enabling optimizations across complex control flow including irreducible CFGs. Relevant to tuffy’s hierarchical CFG regions and declarative rewrite approach.
  • egg: Fast and Extensible Equality Saturation — Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, Pavel Panchekha, POPL 2021. Introduces the egg Rust library for e-graphs and equality saturation with novel techniques: rebuilding for amortized invariant maintenance, and e-class analyses for domain-specific data attached to equivalence classes. Relevant to tuffy’s declarative rewrite rules — equality saturation is a candidate execution model for applying verified rewrite rules without phase-ordering concerns.

Memory Analysis

  • MemorySSA — LLVM. Virtual SSA form for memory operations using a single memory partition. Three node types: MemoryDef (stores, calls, fences, atomics), MemoryUse (loads), MemoryPhi (CFG joins). Each MemoryDef creates a new version of “all memory”; precision is recovered via a clobber walker backed by alias analysis rather than upfront partitioning. Key design insight: precise memory partitioning is impractical and not worth the cost — a single chain with on-demand disambiguation is more efficient. Directly informs tuffy’s MemSSA design: single memory token chain encoded in IR, with progressive refinement as alias analysis improves.

IR Design and Semantics

  • RVSDG: An Intermediate Representation for Optimizing Compilers — Nico Reissmann, Jan Christian Meyer, Helge Bahmann, Magnus Själander, 2019. Formalizes the Regionalized Value State Dependence Graph, a data-flow-centric IR where nodes represent computations, edges represent dependencies, and regions capture hierarchical program structure. Demonstrates Dead Node and Common Node Elimination on the representation. Directly relevant to tuffy’s hierarchical CFG with SESE regions and the optir approach built on top of RVSDG.

  • RFC: Add a New Byte Type to LLVM IR — Juneyoung Lee, Pedro Lobo, Nuno Lopes, George Mitenkov. Proposes b<N> byte type to represent raw memory data, enabling per-byte poison tracking and paving the way for undef removal.

  • Leaving the Sea of Nodes — V8 team. Post-mortem on 10 years of Sea of Nodes in TurboFan: effect chain complexity, poor cache locality, limited floating benefits for effectful operations, compilation speed issues. Replaced with CFG-based Turboshaft, halving compile time.

  • A Simple Reply — Cliff Click. Rebuttal to V8’s SoN criticism: argues problems stem from JS’s lack of strong typing, not SoN itself. Strong typing (Java, Rust) provides ECA naturally, simplifying effect management. Worklist algorithms solve visitation order. Dead code elimination is near-free with reference counting.

  • VPlan: Vectorization Plan — LLVM. Hierarchical CFG with nested SESE regions and recipe-based transformations for vectorization. Key ideas: plan-then-execute (analysis doesn’t modify IR), multiple candidates with cost estimation, recipes naturally record transformation provenance. Relevant to tuffy’s “analysis is also a transformation” and rewrite path tracing goals.

Memory Model and Pointer Semantics

  • Pointers Are Complicated, or: What’s in a Byte? — Ralf Jung, 2018. Argues pointers are not integers; proposes abstract pointer model (allocation ID + offset) preserving provenance. Defines bytes as Bits(u8) | PtrFragment(Pointer, n) | Uninit. Directly relevant to tuffy’s byte type and uninitialized memory model.
  • Pointers Are Complicated II, or: We Need Better Language Specs — Ralf Jung, 2020. Shows three individually-correct LLVM optimizations that produce wrong results when combined, due to unspecified pointer provenance. Concludes: pointers have provenance, integers do not; ptr-to-int-to-ptr roundtrips are not no-ops.
  • Pointers Are Complicated III, or: Pointer-Integer Casts Exposed — Ralf Jung, 2022. Proposes “provenance exposure” model: ptr-to-int casts have side effects (exposing provenance), int-to-ptr casts are non-deterministic. Rust’s Strict Provenance API (ptr.addr(), with_addr()) offers a cleaner alternative. Relevant to tuffy’s pointer representation design.
  • Uninit — Ralf Jung, 2019. Uninitialized memory is not “random bits” but a distinct abstract state (None). Any operation on uninit values is UB. Validates tuffy’s decision to allow uninitialized memory at the memory level while keeping values as concrete-or-poison.
  • Do We Really Need Undefined Behavior? — Ralf Jung, 2021. Argues unrestricted UB is necessary for practical optimization (register allocation, constant folding). Platform-specific UB would make even basic codegen impossible. Supports tuffy’s poison-based UB model as a principled middle ground.
  • Clarifying the Semantics of ptrtoint — LLVM discussion. Proposes separating ptrtoint (full bitwise representation) from ptrtoaddr (address only). Highlights that compiler-level provenance and hardware provenance (CHERI) are distinct. Relevant to tuffy’s pointer representation: need to decide whether ptr-to-int exposes provenance or just address.
  • RFC: Replacing getelementptr with ptradd — nikic (Nikita Popov). Proposes replacing type-based GEP with offset-based ptradd instruction. Eliminates redundant encodings, makes address arithmetic explicit and visible to CSE/LICM, simplifies analysis by removing expensive GEP decomposition. Tuffy adopts ptradd directly instead of GEP.

ABI and Calling Conventions

  • RFC: An ABI Lowering Library for LLVM — nikic (Nikita Popov). Proposes a standalone LLVMABI library with its own type system dedicated to ABI lowering, extracting logic currently duplicated across frontends (Clang, Rust, etc.). Key design: independent type system capturing only ABI-relevant information, per-target calling convention implementations, ABIArgInfo results guiding IR generation. Layered to depend only on LLVM Support, with a C API for non-C++ frontends. Directly relevant to tuffy’s plan for a dedicated ABI library that maps int parameters/returns to concrete register classes and calling conventions.

Testing and Validation

  • RFC: Upstreaming LLVM UB-Aware Interpreter — dtcxzyw. Proposes upstreaming llubi, a UB-aware LLVM IR interpreter that detects undefined behavior during execution and properly propagates poison values. Key motivation: reducing miscompilation reproducers directly on LLVM IR (minutes vs hours with C-level tools). Separate implementation from legacy ExecutionEngine because GenericValue cannot represent poison or pointer provenance. Relevant to tuffy’s IR interpreter design.
  • llvm-ub-aware-interpreter (llubi) — dtcxzyw. Implementation of UB-aware LLVM IR interpreter. Detects guardable UBs at execution time, tracks poison propagation, integrates with llvm-reduce for test case minimization and Csmith/Rustlantis for fuzzing. Treats undef as zero (practical simplification). Does not model pointer aliasing or full provenance. Directly informs tuffy’s interpreter design, though tuffy’s poison-only model (no undef) and four-state bytes enable a cleaner implementation.

Rustc Backend Integration

  • Lowering MIR — rustc dev guide. MIR lowering flow: codegen_crate → codegen_mir per function, with modules for blocks, statements, operands, places, rvalues. SSA analysis runs before translation. One MIR basic block generally maps to one backend basic block.
  • Backend Agnostic Codegen — rustc dev guide. Trait-based backend interface: CodegenBackend, ExtraBackendMethods, CodegenMethods (CodegenCx), BuilderMethods (Builder). Associated types: Value, BasicBlock, Type, Function. ~10,000 lines of reusable backend-agnostic code in rustc_codegen_ssa. Custom backends implement these traits to plug in.

Feature Name: tuffy_ir

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the intermediate representation (IR) for the tuffy compiler. The IR uses a single data structure across all compilation stages with stage-dependent normalization, infinite precision integers with at-use bit-level analysis, a four-state byte memory model, and a hierarchical CFG with nested SESE regions.

Motivation

Existing compiler IRs (LLVM IR, Cranelift CLIF) carry historical design baggage that limits optimization potential and complicates formal verification:

  • Fixed-width integers force premature commitment to bit widths, introducing zext/sext/trunc noise that obscures mathematical equivalences.
  • undef/poison duality in LLVM creates well-documented semantic confusion and miscompilation bugs. Approximately 18% of detected LLVM miscompilations stem from undef alone.
  • Flat CFGs lose structural information (loops, branches) that must be expensively recovered for high-level optimizations.
  • Side-table analyses (KnownBits, DemandedBits) are disconnected from the IR, creating phase ordering problems and stale analysis results.
  • Multiple IR layers (LLVM IR → SelectionDAG → MachineIR) require costly translations with potential information loss at each boundary.

Tuffy’s IR addresses these issues from first principles, designed for formal verification, efficient compilation, and clean semantics.

Guide-level explanation

Type system

Tuffy IR has a minimal type system:

  • int — Infinite precision integer. No bit width, no signedness. Arithmetic is mathematical: add(3, 5) is 8, period. There is no overflow.
  • b<N> — Byte type representing raw memory data, where N is a multiple of 8. Distinct from integers; preserves pointer fragments and per-byte poison tracking.
  • ptr<AS> — Pointer type carrying provenance (allocation ID + offset) in address space AS. Default address space is 0. Different address spaces may have different pointer sizes and semantics.
  • float / double — IEEE 754 floating point types.
  • vec<vscale x N x T> — Scalable vector type. vscale is a runtime constant determined by hardware; N is the minimum number of elements; T is the element type (int, float, double, b<N>, ptr<AS>). Fixed-width vectors are represented as vscale=1 (e.g., vec<1 x 4 x b32> is a 128-bit fixed vector of 4 x 32-bit elements). When the element type is int, the actual vector width is unknown until instruction selection, since int has no fixed bit width; the element count is determined by the IR, but the physical vector register mapping is deferred.

Infinite precision integers

Instead of i32 or i64, all integer values are simply int. Range constraints are expressed via at-use annotations on value definitions and use edges (see RFC: at-use-annotations):

%raw = load b32, ptr %p         // load 32 bits of raw bytes from memory
%x:s32 = bitcast %raw to int   // cast to int, result annotated as signed 32-bit
%y:s32 = add %x:s32, 1         // mathematical addition, no overflow; result in s32 range
%out = bitcast %y:s32 to b32   // cast back to bytes for storing
store b32 %out, ptr %q

Loads always operate on byte types (b<N>). The byte type carries the bit width; integers do not. The decision of sign-extension vs zero-extension, and whether to use a GPR or FPR, is deferred entirely to instruction selection based on at-use annotations.

There are no zext, sext, or trunc instructions. Bitwise operations (and, or, xor, shift) are defined on infinite precision integers.

At-use annotations

Range constraints and bit-level analysis results are encoded directly into the IR as annotations on value definitions (result-side) and use edges (use-side). See RFC: at-use-annotations for the full design.

Each annotation carries a four-state per-bit encoding:

  • 0: bit is known to be 0
  • 1: bit is known to be 1
  • ?: bit is demanded but value unknown
  • x: bit is don’t-care (not demanded)

Shorthand annotations :s32, :u8 etc. are sugar over the full known-bits representation.

Result-side annotations are generated by the frontend from source language type information. Use-side annotations may be added by optimization passes when they discover stronger constraints at specific use sites. This follows the “analysis is also a transformation” principle — analysis results are encoded directly into the IR, not stored in side tables.
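The shorthand desugaring can be sketched concretely. This sketch uses u128 in place of an arbitrary-precision integer and hypothetical names; for :u<N>, the low N bits are demanded but unknown (?), and every higher bit is known to be 0.

```rust
// Illustrative desugaring of the :u<N> shorthand into the per-bit form
// described above (u128 stands in for a bignum; not tuffy's actual types).
#[derive(Debug, PartialEq)]
struct Annotation {
    known_ones: u128,  // bits known to be 1
    known_zeros: u128, // bits known to be 0
    demanded: u128,    // bits that are demanded
}

fn desugar_unsigned(n: u32) -> Annotation {
    let low = (1u128 << n) - 1; // mask of the low n bits
    Annotation {
        known_ones: 0,
        known_zeros: !low, // everything above bit n-1 is known zero
        demanded: low,     // the low n bits are demanded but unknown ('?')
    }
}

fn main() {
    let u8_ann = desugar_unsigned(8); // :u8
    assert_eq!(u8_ann.demanded, 0xFF);
    assert_eq!(u8_ann.known_zeros, !0xFFu128);
    // No bit may be simultaneously known-1 and known-0.
    assert_eq!(u8_ann.known_ones & u8_ann.known_zeros, 0);
    println!("ok");
}
```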

Hierarchical CFG

The IR uses a hierarchical control flow graph:

func @example(%arg: int:s32) -> int:s32 {
  region loop {
    bb0(%i: int):
      %cond = cmp.lt %i:s32, %arg:s32
      br %cond, bb1, exit

    bb1:
      %i.next = add %i, 1
      br bb0(%i.next)
  }
  exit:
    ret %i:s32
}

Loops and branches are explicitly represented as nested SESE (Single-Entry Single-Exit) regions. Inside each region, instructions are organized in traditional SSA basic blocks with fixed ordering.

Memory model

Pointers are abstract values with provenance:

%p = alloca int              // %p has provenance tied to this allocation
%addr = ptrtoaddr %p         // extracts address only, discards provenance
%bits = ptrtoint %p          // converts to integer, captures provenance
%q = inttoptr %bits          // safe roundtrip: recovers provenance

Each byte in memory is one of four states:

  • Bits(u8): a concrete byte value (0–255)
  • PtrFragment(Pointer, n): the nth byte of an abstract pointer
  • Uninit: uninitialized memory (not “random bits”; a distinct abstract state)
  • Poison: a poisoned byte

Memory SSA

Memory dependencies are encoded directly into the IR as Memory SSA (MemSSA). Each memory operation (load, store, call) carries a memory version token representing the memory state it depends on or produces.

%mem0 = entry_mem                    // initial memory state
%mem1 = store b32 %v, ptr %p, %mem0 // store produces new memory version
%raw = load b32, ptr %q, %mem1      // load depends on memory version

MemSSA supports progressive refinement: initially a conservative single memory chain, then split into finer-grained versions as alias analysis improves. This follows the “analysis is also a transformation” principle — memory dependency information lives in the IR, not in side tables.

Atomic operations

Atomic memory operations are regular load/store/rmw/cmpxchg instructions annotated with a memory ordering:

%raw = load.atomic b32, ptr %p, ordering=acquire, %mem0
%mem1 = store.atomic b32 %v, ptr %p, ordering=release, %mem0
%old = rmw.add b32, ptr %p, %delta, ordering=seqcst, %mem0
%res, %ok = cmpxchg b32, ptr %p, %expected, %desired, success=acqrel, failure=acquire, %mem0

Supported orderings: relaxed, acquire, release, acqrel, seqcst. Fence instructions are also supported as standalone operations.

Atomic operations participate in MemSSA like any other memory operation, with ordering constraints restricting reordering during optimization.

UB model

The IR uses only poison for undefined behavior. There is no undef.

  • Operations with UB (e.g., division by zero) produce poison.
  • Annotation violations produce poison: result-side violations at the defining instruction, use-side violations at the consuming instruction.
  • Any operation on Uninit memory produces poison.
  • poison propagates through computations but is only UB when “observed” (e.g., used as a branch condition, stored to observable memory).

Single data structure, multiple stages

The IR uses one data structure throughout compilation. Different stages impose different normalization constraints:

  1. High-level: close to MIR translation, high-level operations allowed (e.g., aggregate operations, generic intrinsics).
  2. Optimized: after optimization passes, IR is normalized but still target-independent.
  3. Legalized: operations are lowered to target-legal forms (e.g., 64-bit multiply on 32-bit targets becomes a sequence of 32-bit operations).
  4. Pre-emission: fully scheduled, register-allocated, ready for machine code emission.

No separate SDAG or MachineIR. The same graph structure is progressively constrained.
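The legalization example above (a 64-bit multiply lowered for a 32-bit target) can be sketched as schoolbook multiplication on 32-bit halves. This is an illustration of the decomposition only, not tuffy's actual legalizer output; Rust's u64 arithmetic stands in for the widening 32×32→64 multiplies a 32-bit target would provide.

```rust
// 64-bit multiply via 32-bit halves: a*b mod 2^64 needs only three partial
// products, since a_hi*b_hi contributes solely to bits >= 64.
fn mul64_via_32(a: u64, b: u64) -> u64 {
    let a_lo = a & 0xFFFF_FFFF;
    let a_hi = a >> 32;
    let b_lo = b & 0xFFFF_FFFF;
    let b_hi = b >> 32;
    // Cross terms; only their low 32 bits survive the shift into bits 32..64.
    let cross = a_lo.wrapping_mul(b_hi).wrapping_add(a_hi.wrapping_mul(b_lo));
    a_lo.wrapping_mul(b_lo).wrapping_add(cross.wrapping_shl(32))
}

fn main() {
    for &(a, b) in &[(3u64, 5u64), (0xDEAD_BEEF_CAFE_BABE, 0x1234_5678_9ABC_DEF0)] {
        // Matches native wrapping 64-bit multiplication.
        assert_eq!(mul64_via_32(a, b), a.wrapping_mul(b));
    }
    println!("ok");
}
```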

MIR translation

Translation from rustc MIR to tuffy IR involves:

  • Control flow structuralization: attempt to normalize to natural loops and promote to SESE regions. Irreducible control flow is left as flat CFG within a region.
  • mem2reg: MIR uses mutable Place values (not pure SSA). A mem2reg pass promotes stack allocations to SSA values where possible.
  • Drop/unwind: Rust’s drop glue and panic unwind edges are preserved in CFG form, not promoted to SESE regions. Unwind paths remain as explicit CFG edges.
  • Aggregate scalarization: structs, enums, arrays, and tuples are fully decomposed into scalar operations (analogous to LLVM’s SROA). No aggregate types exist in tuffy IR.
  • Type translation: MIR fixed-width integers (u8, i32, u64, etc.) are translated to int with corresponding :u<N>/:s<N> result-side annotations at definition points.

Machine code lowering

Lowering from tuffy IR to machine code:

  • ABI: a dedicated ABI library maps int parameters/returns to concrete register classes and calling conventions based on the target triple.
  • Stack frame layout: derived from the target triple; alloca sizes are resolved after instruction selection determines concrete bit widths.
  • Instruction selection: at-use bit-level annotations determine concrete types; the instruction selector chooses sext/zext and GPR/FPR placement.
  • Debug info: DWARF generation follows LLVM’s approach.
  • Unwind tables: exception handling table generation follows LLVM’s approach.

Reference-level explanation

Type system details

  • int — Infinite precision integer, no fixed bit width
  • b<N> — Byte type, N must be a multiple of 8
  • ptr — Pointer with provenance (allocation ID + offset)
  • float — IEEE 754 single precision
  • double — IEEE 754 double precision
  • vec<vscale x N x T> — Scalable vector; vscale is a runtime constant, N is the minimum element count, T is the element type

Integer operations

All integer arithmetic operates on mathematical integers:

  • add %a, %b — Mathematical addition
  • sub %a, %b — Mathematical subtraction
  • mul %a, %b — Mathematical multiplication
  • div %a, %b — Integer division (poison if %b == 0)
  • rem %a, %b — Integer remainder (poison if %b == 0)
  • and %a, %b — Bitwise AND on infinite precision
  • or %a, %b — Bitwise OR on infinite precision
  • xor %a, %b — Bitwise XOR on infinite precision
  • shl %a, %b — Left shift
  • shr %a, %b — Right shift (signedness from annotation)

Value annotations

Annotations on value definitions and use edges:

  • :s<N> — Poison if value outside signed N-bit range [-2^(N-1), 2^(N-1)-1]
  • :u<N> — Poison if value outside unsigned N-bit range [0, 2^N-1]
  • :known(<ternary>) — Per-bit four-state constraint (0, 1, ?, x)

Result-side violation → instruction produces poison. Use-side violation → consuming instruction produces poison.
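The range checks above can be modeled directly. This sketch uses i128 in place of tuffy's infinite precision int and hypothetical names; it shows a value becoming poison when an annotation is violated, and poison propagating through subsequent checks.

```rust
// Illustrative model of :s<N> / :u<N> annotation checks: a value is either a
// concrete integer or poison (i128 stands in for the infinite precision int).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Int(i128),
    Poison,
}

/// :s<N> — poison if the value lies outside [-2^(N-1), 2^(N-1) - 1].
fn check_signed(v: Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            let min = -(1i128 << (n - 1));
            let max = (1i128 << (n - 1)) - 1;
            if x < min || x > max { Value::Poison } else { v }
        }
        Value::Poison => Value::Poison, // poison propagates
    }
}

/// :u<N> — poison if the value lies outside [0, 2^N - 1].
fn check_unsigned(v: Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            if x < 0 || x > (1i128 << n) - 1 { Value::Poison } else { v }
        }
        Value::Poison => Value::Poison,
    }
}

fn main() {
    assert_eq!(check_signed(Value::Int(127), 8), Value::Int(127));
    assert_eq!(check_signed(Value::Int(128), 8), Value::Poison); // out of s8 range
    assert_eq!(check_unsigned(Value::Int(255), 8), Value::Int(255));
    assert_eq!(check_unsigned(Value::Int(-1), 8), Value::Poison); // negative → poison
    println!("ok");
}
```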

Pointer operations

  • ptrtoint %p — Convert pointer to integer, capturing provenance
  • ptrtoaddr %p — Extract address only, discard provenance
  • inttoptr %i — Convert integer to pointer, recovering captured provenance
  • ptradd %p, %offset — Pointer arithmetic within an allocation (provenance-preserving offset addition)
  • ptrdiff %p, %q — Pointer difference; returns %p.offset - %q.offset as int; poison if %p and %q do not belong to the same allocation
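The ptradd and ptrdiff semantics follow directly from the abstract pointer model (allocation ID + offset). A minimal sketch, with illustrative types rather than tuffy's:

```rust
// Abstract pointer: allocation ID (provenance) plus byte offset.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ptr {
    alloc: u32,  // allocation ID carrying provenance
    offset: i64, // byte offset within the allocation
}

/// ptradd: offset addition that preserves provenance.
fn ptradd(p: Ptr, off: i64) -> Ptr {
    Ptr { alloc: p.alloc, offset: p.offset + off }
}

/// ptrdiff: Some(offset difference) within one allocation,
/// None (modeling poison) across distinct allocations.
fn ptrdiff(p: Ptr, q: Ptr) -> Option<i64> {
    if p.alloc == q.alloc { Some(p.offset - q.offset) } else { None }
}

fn main() {
    let p = Ptr { alloc: 1, offset: 16 };
    let q = ptradd(p, 8); // same provenance, offset 24
    assert_eq!(ptrdiff(q, p), Some(8));
    let r = Ptr { alloc: 2, offset: 0 };
    assert_eq!(ptrdiff(p, r), None); // different allocations → poison
    println!("ok");
}
```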

Byte type operations

  • bitcast %v to b<N> — Type-punning conversion between byte type and other types
  • load b<N>, ptr %p — Load raw bytes from memory
  • store b<N> %v, ptr %p — Store raw bytes to memory

Vector operations

All vector operations work on vec<vscale x N x T> types. A mask operand (vec<vscale x N x int>) is optional on most operations to support predicated execution (SVE/RVV style).

Elementwise arithmetic:

  • vec.add %a, %b — Elementwise addition
  • vec.sub %a, %b — Elementwise subtraction
  • vec.mul %a, %b — Elementwise multiplication
  • vec.div %a, %b — Elementwise division
  • vec.rem %a, %b — Elementwise remainder
  • vec.and %a, %b — Elementwise bitwise AND
  • vec.or %a, %b — Elementwise bitwise OR
  • vec.xor %a, %b — Elementwise bitwise XOR
  • vec.shl %a, %b — Elementwise left shift
  • vec.shr %a, %b — Elementwise right shift (signedness from annotation)
  • vec.fadd %a, %b — Elementwise floating point addition
  • vec.fmul %a, %b — Elementwise floating point multiplication

Horizontal reductions:

  • vec.reduce.add %v — Sum all elements
  • vec.reduce.mul %v — Multiply all elements
  • vec.reduce.smin %v — Signed minimum across elements
  • vec.reduce.smax %v — Signed maximum across elements
  • vec.reduce.umin %v — Unsigned minimum across elements
  • vec.reduce.umax %v — Unsigned maximum across elements
  • vec.reduce.and %v — Bitwise AND across elements
  • vec.reduce.or %v — Bitwise OR across elements
  • vec.reduce.xor %v — Bitwise XOR across elements
  • vec.reduce.fadd %v — Floating point sum (ordered or unordered via flag)
  • vec.reduce.fmin %v — Floating point minimum across elements
  • vec.reduce.fmax %v — Floating point maximum across elements

Scatter/gather (indexed vector memory access):

  • vec.gather %base, %indices, %mask, %mem — Load elements from base + indices[i] for each lane where mask is true; produces vector + memory token
  • vec.scatter %v, %base, %indices, %mask, %mem — Store elements to base + indices[i] for each lane where mask is true; produces memory token

Masking/predication:

  • vec.select %mask, %a, %b — Per-lane select: mask[i] ? a[i] : b[i]
  • vec.compress %v, %mask — Pack active lanes (mask=true) to front
  • vec.expand %v, %mask — Scatter contiguous elements to active lanes
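Scalar models of the compress and expand operations may make their lane behavior concrete. This is a sketch assuming the usual SVE/RVV convention of packing active lanes in order; what tuffy places in inactive lanes is not specified here, so the sketch takes an explicit fill value.

```rust
/// vec.compress model: pack lanes where the mask is true to the front,
/// preserving their original order.
fn compress(v: &[i64], mask: &[bool]) -> Vec<i64> {
    v.iter().zip(mask).filter(|(_, &m)| m).map(|(&x, _)| x).collect()
}

/// vec.expand model: place consecutive source elements into the active
/// lanes; inactive lanes receive `fill` (an assumption of this sketch).
fn expand(src: &[i64], mask: &[bool], fill: i64) -> Vec<i64> {
    let mut it = src.iter();
    mask.iter()
        .map(|&m| if m { *it.next().unwrap() } else { fill })
        .collect()
}

fn main() {
    let mask = [true, false, true, false];
    assert_eq!(compress(&[10, 20, 30, 40], &mask), vec![10, 30]);
    assert_eq!(expand(&[10, 30], &mask, 0), vec![10, 0, 30, 0]);
    println!("ok");
}
```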

Lane manipulation:

  • vec.splat %scalar — Broadcast scalar to all lanes
  • vec.extract %v, %idx — Extract scalar from lane
  • vec.insert %v, %scalar, %idx — Insert scalar into lane
  • vec.shuffle %a, %b, %indices — Permute lanes from two vectors

Contiguous vector memory access:

  • vec.load %ptr, %mask, %mem — Masked contiguous load
  • vec.store %v, %ptr, %mask, %mem — Masked contiguous store

Platform intrinsics:

Platform-specific intrinsics (e.g., std::arch::x86_64::*) support two modes, selectable via compiler flag or function-level attribute:

  • Map mode (default): recognized intrinsics are mapped to tuffy vector operations, enabling further optimization by the compiler.
  • Passthrough mode: intrinsics are preserved as opaque target-specific calls (target.intrinsic "x86.avx2.add.ps" %a, %b), emitted as-is during instruction selection. Suitable for hand-tuned SIMD code where the user wants exact instruction control.

Unrecognized intrinsics always use passthrough mode.

Memory byte states

Each byte in the abstract memory is represented as:

enum AbstractByte {
    Bits(u8),                    // concrete value 0-255
    PtrFragment(Pointer, usize), // nth byte of an abstract pointer (allocation ID + offset)
    Uninit,                      // uninitialized
    Poison,                      // poisoned
}

Hierarchical CFG structure

enum CfgNode {
    BasicBlock(Vec<Instruction>, Terminator),
    Region(RegionKind, Vec<CfgNode>),
}

enum RegionKind {
    Loop { header: BasicBlockId },
    Branch { condition: Value, then_region: RegionId, else_region: RegionId },
    Sequence,
}

Regions are always SESE. Translation from rustc MIR requires control flow structuralization.

At-use annotation representation

Each use edge and each instruction result carries an optional annotation:

struct Annotation {
    known_ones: BigUint,     // bits known to be 1
    known_zeros: BigUint,    // bits known to be 0
    demanded: BigUint,       // bits that are demanded
}

Invariant: known_ones & known_zeros == 0.

Shorthand :s<N> and :u<N> are sugar over this representation. No annotation means unconstrained int; the verifier reports an error if instruction selection cannot determine a concrete machine type.
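As a sketch of the desugaring, the :u<N> shorthand can be expanded into an explicit annotation as follows. For brevity this uses u128 masks in place of the spec's BigUint (an assumption); note the invariant known_ones & known_zeros == 0 holds by construction:

```rust
// Sketch: an unsigned N-bit value has all bits >= N known to be zero, and
// only the low N bits demanded. No bit is known to be one.
#[derive(Debug, PartialEq)]
pub struct Annotation {
    pub known_ones: u128,  // bits known to be 1
    pub known_zeros: u128, // bits known to be 0
    pub demanded: u128,    // bits that are demanded
}

pub fn unsigned_shorthand(n: u32) -> Annotation {
    let low = (1u128 << n) - 1; // mask of the low N bits
    Annotation {
        known_ones: 0,
        known_zeros: !low,
        demanded: low,
    }
}
```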

Drawbacks

  • Infinite precision integers add complexity to instruction selection. The final stage must derive concrete bit widths from at-use annotations, which is a non-trivial analysis.
  • Hierarchical CFG requires control flow structuralization from MIR, which may introduce extra blocks for irreducible control flow.
  • At-use annotations on every edge increase memory footprint of the IR.
  • Single data structure means late-stage IR carries fields irrelevant to early stages and vice versa, potentially wasting memory.
  • Novel design means less existing tooling and literature to draw from compared to traditional SSA CFG.

Rationale and alternatives

Why infinite precision integers?

Two equally important motivations:

  1. Formal verification: Infinite precision integers are mathematical integers — exactly what theorem provers (Lean 4, Mathlib) natively support. Rewrite rule proofs operate on standard integer arithmetic with induction, algebraic properties, and equivalences available out of the box. Fixed-width integers would require every proof to reason about overflow, wrapping semantics, and sign extension, dramatically increasing proof complexity.

  2. Pattern recognition: Source languages (especially C) introduce redundant integer conversions due to implicit promotion rules (e.g., C’s integer promotion widens char and short to int). In LLVM IR, these become chains of zext/sext/trunc that obscure the underlying mathematical equivalence. Pattern matching in optimization passes fails to see through the conversion noise, causing missed optimizations. Infinite precision integers eliminate this noise entirely — the mathematical structure is directly visible in the IR.

Together, these make both the proof side and the optimization side simpler and more effective.

Alternative: arbitrary-width integers (LLVM-style). Rejected because they still require zext/sext/trunc and overflow reasoning in every pass and every proof.

Alternative: fixed set of widths (Cranelift-style). Simpler but still carries the same fundamental problems.

Why hierarchical CFG?

Flat CFGs lose loop structure, requiring expensive loop detection passes. Since tuffy aims to perform loop optimizations early, preserving this structure in the IR is essential.

Alternative: flat SSA CFG. Rejected because loop detection is expensive and error-prone.

Alternative: Sea of Nodes. Considered seriously. It offers advantages for strongly-typed languages (Cliff Click’s arguments), but the engineering complexity (scheduling, debugging, cache locality) outweighs the benefits. V8’s experience is cautionary even if their problems were partly JS-specific.

Alternative: E-graph. Attractive for formal verification but immature for handling side effects and control flow at scale.

Why four-state bytes?

The Uninit state (separate from Poison) correctly models uninitialized memory as an abstract concept rather than “random bits.” The PtrFragment state preserves pointer provenance through byte-level memory operations like memcpy. Poison at the byte level enables per-byte poison tracking aligned with the byte type design.

Why ptrtoint captures provenance?

This enables safe inttoptr(ptrtoint(p)) roundtrips while ptrtoaddr provides a provenance-stripping alternative. This avoids the LLVM problem where three individually-correct optimizations produce wrong results when combined (Ralf Jung, 2020).

Prior art

  • LLVM IR: SSA CFG with fixed-width integers, undef/poison duality, flat CFG. Tuffy’s IR addresses known LLVM pain points (undef removal, byte type, phase ordering).
  • Cranelift CLIF: Single IR with stage-dependent legalization. Tuffy adopts this “single data structure” approach but adds hierarchical CFG and infinite precision integers.
  • LLVM VPlan: Hierarchical CFG with SESE regions for vectorization. Inspires tuffy’s region-based structure.
  • V8 TurboFan / Turboshaft: Sea of Nodes was abandoned in favor of a CFG after ten years. Informs tuffy’s decision to use SSA basic blocks inside regions.
  • Cliff Click’s Sea of Nodes: Strong typing (Java, Rust) mitigates many of V8’s problems. Tuffy acknowledges this but prioritizes engineering simplicity.
  • Alive2: Poison-only UB model, byte type proposal. Directly informs tuffy’s UB and type design.
  • Miri: Abstract pointer model (allocation ID + offset), abstract byte representation. Informs tuffy’s memory model.

Unresolved questions

  • Floating point semantics: How should NaN payloads and denormals be handled? Are there opportunities for infinite-precision-like simplification?
  • SIMD/vector types: How do vector operations fit into the type system and hierarchical CFG?
  • Aggregate types: Should structs/arrays exist in the IR or be fully decomposed into scalar operations?
  • Exception handling: How do Rust panics and unwinding interact with SESE regions?
  • Concurrency model: How are atomic operations and memory ordering represented?
  • Stack slot lifetime and coloring: stack_slot currently takes only a size parameter. To support stack coloring (overlapping non-live slots), lifetime information is needed. The preferred direction is region-based: stack slots belong to a SESE region and are automatically dead when control exits the region. Open question: after lowering to a flat CFG, how is region-based lifetime information preserved? Options include (a) materializing lifetime markers at region boundaries during lowering, or (b) maintaining a region tree alongside the flat CFG. The lifetime.start/end intrinsic approach (LLVM-style) is not preferred.

Future possibilities

  • E-graph integration: Equality saturation could be applied as an optimization strategy on subgraphs within the hierarchical CFG, without replacing the core IR representation.
  • Superoptimizer: The formal-first design and rewrite path tracing make it natural to integrate solver-based optimization discovery (Souper-style).
  • Multi-language frontend: The IR is designed to be language-agnostic. Future frontends for other languages can target the same IR.
  • JIT compilation: The single data structure with stage-dependent normalization could support a JIT mode that skips late-stage optimizations for faster compilation.
  • Translation validation: Rewrite path traces provide the foundation for end-to-end translation validation of the compilation pipeline.

Feature Name: tuffy_pipeline

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the compilation pipeline design for the tuffy compiler, focusing on four core aspects: value-level analysis encoded as IR transformations, automatic debug info preservation via origin chains, automatic profiling info preservation with merging semantics, and a hybrid pass ordering strategy with fine-grained user control.

Motivation

Compiler pipelines face recurring engineering problems that erode correctness and developer productivity:

  • Phase ordering: In LLVM, analyses like KnownBits and DemandedBits live in side tables disconnected from the IR. When a transform modifies the IR without updating these tables, analysis results become stale. Subsequent passes either use stale data (miscompilation) or must conservatively discard and recompute (lost optimization opportunities). This is the root cause of many phase ordering bugs.
  • Debug info loss: LLVM requires each optimization pass to manually preserve debug info. In practice, many passes silently drop it. The result is poor debugging experience at higher optimization levels — a well-known pain point for all LLVM-based compilers.
  • Profiling info loss: Profile-guided optimization data (branch weights, execution counts) suffers the same manual-maintenance problem. Transforms that restructure control flow or merge/split blocks often discard profile annotations.

Tuffy’s pipeline addresses these by making preservation automatic at the infrastructure level, rather than relying on per-pass discipline.

Guide-level explanation

Pass classification

Tuffy distinguishes two categories of analysis:

Structural analyses (side table, on-demand):

  • Dominator tree
  • Loop tree
  • Post-dominator tree

These are cheaply recomputable from the CFG and do not carry semantic information that could be irreversibly lost by transforms. They live in traditional side tables and are invalidated/recomputed as needed.

Value-level analyses (encoded in IR, transform passes):

  • KnownBits — which bits of a value are provably 0 or 1
  • DemandedBits — which bits of a value are actually needed at each use site
  • MemSSA refinement — splitting memory version chains as alias analysis improves

These carry semantic facts about values that, if lost, cannot be recovered without re-deriving from first principles. In tuffy, these are transform passes — they read the IR, compute results, and write those results back into the IR (on use-edge annotations or memory tokens). They are scheduled in the pipeline alongside optimization passes, not invoked as side analyses.

This follows the “analysis is also a transformation” principle: there is no distinction between “analysis pass” and “optimization pass” for value-level information. Both modify the IR.

Origin chain

Every instruction in the IR carries an origin — a reference to the instruction(s) it was derived from. When a transform creates a new instruction, the builder API requires specifying an origin. This is a mandatory parameter, not optional metadata.

// Builder API enforces origin
let new_add = builder.build_add(lhs, rhs, old_mul.origin());

The origin chain serves two purposes:

  1. Debug info derivation: Debug information (source locations, variable mappings) is not stored on every instruction. Instead, it is computed from the origin chain and current IR values. As long as the chain is unbroken, debug info can be derived for any instruction by tracing back to its origin.

  2. Profiling info association: Profiling data (execution counts, branch weights) is associated with instructions via origin. When transforms merge or split instructions, profile data is merged accordingly — this is valid because profile data ultimately serves code generation decisions, not source-level tracing.

Origin in practice

Most transforms have a natural origin for new instructions:

Transform                 New instruction                          Origin source
Strength reduction        shl %x, 1 replacing mul %x, 2            The original mul
Loop canonicalization     Preheader branch                         Latch branch
Instruction combining     add %a, %c from add(add(%a, %b), %c)     Both original add instructions
Dead code elimination     (no new instructions)                    N/A
Loop unrolling            Duplicated loop body                     Original loop body instructions

When multiple origins contribute, the origin is a set. The debug info derivation picks the most specific source location from the set.
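As a sketch of the multi-origin case, "most specific" could be approximated as the narrowest source span; this heuristic and the SourceLocation fields are assumptions for illustration, not the documented rule:

```rust
// Hypothetical sketch: from an origin set's locations, pick the one with
// the narrowest column span as the "most specific" location.
#[derive(Clone, Debug, PartialEq)]
pub struct SourceLocation {
    pub line: u32,
    pub col_start: u32,
    pub col_end: u32,
}

pub fn most_specific(locs: &[SourceLocation]) -> Option<SourceLocation> {
    locs.iter()
        .min_by_key(|l| l.col_end - l.col_start) // narrowest span wins
        .cloned()
}
```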

Pipeline structure

A tuffy compilation pipeline is a sequence of passes:

MIR translation
  → [value-level analysis: initial KnownBits/DemandedBits]
  → [optimization passes interleaved with value-level analysis updates]
  → legalization
  → instruction selection
  → register allocation
  → machine code emission

Value-level analysis passes can be scheduled at any point. Since they are transforms (they modify the IR), the pass manager treats them identically to optimization passes.

Pass ordering

Tuffy uses a hybrid pass ordering strategy: a fixed high-level framework with local iteration, combined with fine-grained user control to prevent performance regressions.

High-level framework: The pipeline has a fixed sequence of major phases (MIR translation → optimization → legalization → instruction selection → register allocation → emission). Within the optimization phase, passes are grouped into stages (e.g., early simplification, loop optimization, late cleanup), each stage iterating locally until the IR stabilizes or an iteration limit is reached.

User control granularity:

Level               Mechanism                                     Example
Version lock        Pin entire pipeline to a versioned strategy   --opt-policy=v1.2.3
Preset profile      Select a predefined pipeline                  -O2, -Os, -Ofast
Pass-level          Enable/disable individual passes              --disable-pass=loop-unroll
Iteration control   Cap local iteration counts                    --max-instcombine-iterations=3
Function-level      Per-function attribute overrides              #[tuffy::opt_policy("v1.2.3")]

Function-level control enables mixed strategies within a single compilation unit:

#[tuffy::opt_policy("v1.2.3")]      // lock to specific version
fn stable_hot_path() { ... }

#[tuffy::opt_level("aggressive")]   // aggressive optimization
fn compute_kernel() { ... }

#[tuffy::opt_level("fast-compile")] // minimal passes, fast compilation
fn cold_error_handler() { ... }

Dirty bit mechanism

Value-level analyses (KnownBits, DemandedBits, MemSSA) are scheduled on demand using a dirty bit protocol:

  1. When an optimization pass modifies an instruction, the infrastructure automatically marks affected use edges as dirty.
  2. When an optimization pass performs a lossy transform (refinement, not equivalence), downstream annotations are also marked dirty — the existing annotations may be overly precise and no longer sound.
  3. A subsequent pass that reads a dirty annotation triggers lazy recomputation on that annotation only. Clean annotations are never recomputed.

This is incremental lazy evaluation: mark dirty on write, recompute on read.
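The read path of this protocol can be sketched for a single annotation as follows; the closure stands in for the real KnownBits computation, and the field layout is simplified:

```rust
// Sketch: writes set the dirty flag (elsewhere); reads recompute only when
// the flag is set. Clean annotations are returned without recomputation.
pub struct UseAnnotation {
    pub known_zeros: u64,
    pub dirty: bool,
}

pub fn read_known_zeros(a: &mut UseAnnotation, recompute: impl Fn() -> u64) -> u64 {
    if a.dirty {
        a.known_zeros = recompute(); // lazy recomputation on read
        a.dirty = false;
    }
    a.known_zeros
}
```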

Transform equivalence declaration

Each transform pass declares whether it performs equivalence-preserving or lossy (refinement) transformations:

  • Equivalence: the transform preserves all semantic properties (e.g., strength reduction mul %x, 2 → shl %x, 1). Existing value-level annotations remain sound; no dirty marking needed beyond modified instructions.
  • Lossy refinement: the transform narrows behavior (e.g., range narrowing, speculative execution). Downstream annotations may be invalidated and must be marked dirty.

This declaration is manual, but the formal verification / proof framework can verify its correctness.

Other pipeline aspects

The following aspects follow LLVM’s established approach and are not novel to tuffy:

  • Legalization: progressive lowering of target-illegal operations to target-legal forms, following LLVM’s legalization model.
  • Pass manager: pass registration, scheduling, and region-aware execution follow LLVM’s new pass manager design.
  • Interprocedural optimization: inlining, cross-function analysis, and call graph management follow LLVM’s approach.
  • Verification passes: IR invariant checks inserted between passes in debug builds, following LLVM’s verifier model.

Reference-level explanation

Origin representation

struct Origin {
    sources: SmallVec<[InstructionId; 1]>,
}

Most instructions have a single origin (1-element SmallVec). Merged instructions carry multiple origins. The SmallVec<[_; 1]> avoids heap allocation in the common case.

Debug info derivation

Debug info is stored in a separate table indexed by InstructionId:

struct DebugInfoTable {
    /// Maps original (pre-optimization) instruction IDs to source locations
    source_locations: HashMap<InstructionId, SourceLocation>,
    /// Maps original instruction IDs to variable bindings
    variable_bindings: HashMap<InstructionId, Vec<VariableBinding>>,
}

To query the source location of a current instruction:

  1. Follow its origin chain to find original instruction IDs
  2. Look up those IDs in the debug info table
  3. If multiple origins, pick the most specific (innermost) source location

This design means optimization passes never touch debug info directly. The table is write-once (populated during MIR translation) and read-only thereafter.
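The query above can be sketched as a walk up the origin chain until an ID present in the write-once table is found. This sketch handles single origins only (the multi-origin case would apply the most-specific rule) and uses plain u32 IDs and strings as stand-ins:

```rust
use std::collections::HashMap;

// Sketch: follow the origin chain back to an original instruction ID and
// look it up in the debug info table. None if the chain is broken.
pub fn source_location(
    mut inst: u32,
    origins: &HashMap<u32, u32>,  // instruction -> its origin
    table: &HashMap<u32, String>, // original instruction -> source location
) -> Option<String> {
    loop {
        if let Some(loc) = table.get(&inst) {
            return Some(loc.clone());
        }
        inst = *origins.get(&inst)?; // no further origin: no debug info
    }
}
```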

Profiling info

Profile data is stored similarly:

struct ProfileData {
    /// Execution counts per original basic block
    block_counts: HashMap<BasicBlockId, u64>,
    /// Branch weights per original terminator
    branch_weights: HashMap<InstructionId, Vec<(BasicBlockId, u32)>>,
}

When a transform merges blocks or restructures control flow, the profile data for new blocks is computed by summing or averaging the profile data of their origins. This is done automatically by the infrastructure when blocks are created via the builder API.
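One plausible merging rule can be sketched as summing the origins' execution counts; the exact policy is listed later as an unresolved question, so summing here is an assumption:

```rust
use std::collections::HashMap;

// Sketch: a block created from several origin blocks takes the sum of their
// counts; origins with no recorded profile contribute zero.
pub fn merged_count(origin_blocks: &[u64], counts: &HashMap<u64, u64>) -> u64 {
    origin_blocks
        .iter()
        .map(|b| counts.get(b).copied().unwrap_or(0))
        .sum()
}
```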

Builder API enforcement

The instruction builder requires origin at construction:

impl Builder {
    fn build_add(&mut self, lhs: Value, rhs: Value, origin: Origin) -> Value;
    fn build_load(&mut self, ty: ByteType, ptr: Value, mem: MemToken, origin: Origin) -> Value;
    // ... all build methods require origin
}

There is no way to create an instruction without specifying an origin. This is a compile-time guarantee, not a runtime check.

Value-level analysis as transform

A KnownBits analysis pass operates as:

fn run_known_bits_analysis(func: &mut Function) {
    for inst in func.instructions_reverse_postorder() {
        let known = compute_known_bits(inst, func);
        for use_edge in func.uses_of(inst) {
            use_edge.update_known_bits(known);  // modifies IR in-place
        }
    }
}

This is registered in the pass manager as a regular transform pass. The pass manager does not distinguish it from an optimization pass.

Dirty bit protocol

Each use-edge annotation carries a dirty flag:

struct UseAnnotation {
    demanded_bits: BitSet,
    known_zeros: BitSet,
    known_ones: BitSet,
    dirty: bool,
}

The infrastructure manages dirty flags automatically:

impl Function {
    /// Called by infrastructure when an instruction is modified
    fn mark_uses_dirty(&mut self, inst: InstructionId) {
        for use_edge in self.uses_of(inst) {
            use_edge.annotation.dirty = true;
        }
    }

    /// Called by infrastructure for lossy transforms
    fn mark_downstream_dirty(&mut self, inst: InstructionId) {
        // Transitively marks all downstream use edges dirty
        let mut worklist = vec![inst];
        while let Some(id) = worklist.pop() {
            for user in self.users_of(id) {
                for use_edge in self.uses_of(user) {
                    if !use_edge.annotation.dirty {
                        use_edge.annotation.dirty = true;
                        worklist.push(user);
                    }
                }
            }
        }
    }
}

Transform equivalence declaration

Each pass declares its transform category via a trait:

enum TransformKind {
    Equivalence,
    Refinement,
}

trait Pass {
    fn name(&self) -> &str;
    fn transform_kind(&self) -> TransformKind;
    fn run(&self, func: &mut Function);
}

The pass manager uses transform_kind() to determine dirty propagation scope after each pass runs. The proof framework can verify that a pass declared as Equivalence does not perform lossy transforms.
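A pass implementing this trait might look as follows; the StrengthReduce pass and its empty body are invented for illustration:

```rust
// Hypothetical pass declaring itself equivalence-preserving via the trait.
#[derive(Debug, PartialEq)]
pub enum TransformKind {
    Equivalence,
    Refinement,
}

pub struct Function; // stand-in for the real IR function type

pub trait Pass {
    fn name(&self) -> &str;
    fn transform_kind(&self) -> TransformKind;
    fn run(&self, func: &mut Function);
}

pub struct StrengthReduce;

impl Pass for StrengthReduce {
    fn name(&self) -> &str { "strength-reduce" }
    // Rewriting mul-by-power-of-two to shl preserves semantics exactly,
    // so existing annotations stay sound.
    fn transform_kind(&self) -> TransformKind { TransformKind::Equivalence }
    fn run(&self, _func: &mut Function) { /* rewrite rules would run here */ }
}
```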

Pass ordering configuration

Pipeline configuration is a serializable data structure:

struct PipelineConfig {
    version: String,
    stages: Vec<Stage>,
}

struct Stage {
    name: String,
    passes: Vec<PassConfig>,
    max_iterations: u32,
}

struct PassConfig {
    name: String,
    enabled: bool,
    params: HashMap<String, String>,
}

Users can export, modify, and lock pipeline configurations. Function-level attributes override the global config for specific functions.
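A locked single-stage configuration might be constructed like this; the pass name, stage name, and iteration cap are invented for illustration:

```rust
use std::collections::HashMap;

// Mirrors the PipelineConfig/Stage/PassConfig structs defined above.
struct PassConfig { name: String, enabled: bool, params: HashMap<String, String> }
struct Stage { name: String, passes: Vec<PassConfig>, max_iterations: u32 }
struct PipelineConfig { version: String, stages: Vec<Stage> }

fn example_config() -> PipelineConfig {
    PipelineConfig {
        version: "v1.2.3".to_string(), // locked strategy version
        stages: vec![Stage {
            name: "early-simplify".to_string(),
            max_iterations: 3, // local iteration cap for this stage
            passes: vec![PassConfig {
                name: "instcombine".to_string(),
                enabled: true,
                params: HashMap::new(),
            }],
        }],
    }
}
```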

Drawbacks

  • Origin overhead: Every instruction carries an origin reference. For single-origin instructions this is one InstructionId (typically 4 bytes), but merged instructions allocate. In practice, most instructions have single origins, so the overhead is modest.
  • Debug info derivation cost: Computing debug info on demand (rather than storing it per-instruction) adds cost at debug info emission time. This is acceptable because debug info emission happens once at the end of compilation.
  • Profile merging heuristics: Automatically merging profile data during transforms may produce inaccurate estimates. However, this is no worse than the status quo (where profile data is often silently dropped), and the merged data is still useful for code generation heuristics.
  • Dirty bit propagation cost: Lossy transforms trigger transitive downstream dirty marking, which in the worst case touches a large portion of the IR. In practice, most transforms are equivalence-preserving, limiting the frequency of transitive propagation.
  • Transform declaration burden: Pass authors must correctly declare equivalence vs refinement. Incorrect declarations compromise soundness (under-marking) or efficiency (over-marking). The proof framework mitigates this but adds verification overhead.

Rationale and alternatives

Why encode value-level analysis in IR?

The alternative is LLVM’s approach: side-table analyses invalidated by transforms. This creates phase ordering problems — the order in which analyses and transforms run affects the final result. By encoding analysis results in the IR, every pass sees the most recent information, and transforms that don’t affect a particular annotation leave it intact.

Why origin chains instead of per-instruction debug info?

Alternative: LLVM’s DebugLoc approach. Each instruction carries a DebugLoc that passes must manually preserve. Rejected because it relies on per-pass discipline and fails in practice — many LLVM passes silently drop debug locations.

Alternative: Debug info as IR instructions (LLVM’s dbg.value). Intrinsic calls represent debug info in the IR. Rejected because these pseudo-instructions complicate every transform (must be moved, cloned, or deleted alongside real instructions) and are still frequently mishandled.

Origin chains move the burden from pass authors to the infrastructure. Pass authors specify “where did this instruction come from” (which they always know), and the infrastructure handles debug/profile info automatically.

Why allow profile merging?

Profile data is inherently approximate (sampled, potentially stale). Merging during transforms preserves the relative hotness information that matters for code generation, even if absolute counts shift. The alternative — dropping profile data when transforms can’t preserve it exactly — is strictly worse.

Prior art

  • LLVM DebugLoc: Per-instruction debug location, manually maintained by passes. Known to be fragile; significant ongoing effort to improve preservation.
  • LLVM dbg.value/dbg.declare: Debug info as intrinsic calls in IR. Adds complexity to every transform.
  • GCC tree-ssa: Similar side-table analysis approach with phase ordering issues.
  • MLIR: Operation-level location tracking with fusion semantics for merged operations. Closer to tuffy’s origin chain concept.

Unresolved questions

  • Origin chain compression: Long optimization pipelines may create deep origin chains. Should origins be transitively compressed (point directly to the original MIR instruction) or kept as a full chain?
  • Profile data merging policy: What specific merging rules apply for different transform types (block merging, loop unrolling, function inlining)?
  • Dirty bit granularity: Should dirty marking be per-use-edge (current design) or per-instruction? Per-instruction is coarser but cheaper to maintain.
  • Pipeline config format: What serialization format for pipeline configurations? TOML, JSON, or a custom DSL?

Future possibilities

  • Differential debugging: Origin chains enable automatic bisection — given a miscompiled instruction, trace its origin chain to identify which transform introduced the bug.
  • Optimization reports: Origin chains combined with rewrite path traces provide detailed optimization reports showing what happened to each source-level operation.
  • Selective optimization: Profile data preservation enables fine-grained PGO where hot and cold paths receive different optimization strategies throughout the pipeline.

Feature Name: tuffy_interpreter

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the design of the tuffy IR interpreter, a miri-like execution engine that directly interprets tuffy IR with full UB detection, poison tracking, pointer provenance enforcement, and four-state byte memory semantics. The interpreter serves as the primary testing and validation tool for transform correctness.

Motivation

Compiler testing traditionally relies on end-to-end execution: compile a program, run it, check the output. This approach has two fundamental limitations:

  • Coarse granularity: end-to-end tests only detect bugs that manifest as observable output differences. Silent UB, provenance violations, and poison mishandling may not produce wrong outputs on any particular test input.
  • Slow reduction: when a miscompilation is found, reducing the reproducer from a full program to a minimal IR fragment requires expensive tools (creduce, llvm-reduce) operating at the source or LLVM IR level.

The LLVM UB-aware interpreter (llubi) demonstrated that an IR-level interpreter with UB detection can reduce miscompilation reproducers from hours to minutes. Tuffy’s IR design — poison-only UB model, four-state bytes, abstract pointer provenance — is inherently more amenable to precise interpretation than LLVM IR.

Guide-level explanation

Core capabilities

The tuffy interpreter executes IR functions directly, maintaining a concrete machine state:

  • Infinite precision integers: implemented via arbitrary-precision integer library (e.g., num-bigint). No overflow, no wrapping — arithmetic is mathematical.
  • Four-state byte memory: each byte in the interpreter’s memory is one of Bits(u8), PtrFragment(AllocId, usize), Uninit, or Poison.
  • Pointer provenance tracking: every pointer carries an (AllocId, offset) pair. Out-of-bounds access, use-after-free, and provenance violations are detected.
  • Poison propagation: poison values propagate through computations. Using poison as a branch condition, storing poison to observable memory, or violating an assert node is flagged as UB.
  • Assert node enforcement: assertsext(v, N) and assertzext(v, N) are checked at runtime; violations produce poison (and are reported as UB).

Execution modes

The interpreter supports multiple execution modes:

Mode            Purpose
Normal          Execute and produce output, report UB as warnings
Strict          Abort on first UB detection
Differential    Execute two IR versions (pre/post optimization), compare results
Reduce          Deterministic execution for use with IR reducer tools

Differential testing

The primary use case for transform validation:

tuffy-interp --differential --before=pre_opt.tir --after=post_opt.tir --input=test_case

The interpreter runs both IR versions on the same input and reports any behavioral divergence. Combined with a fuzzer, this provides high-confidence testing of individual transforms.
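The core of the differential check can be sketched as follows, with closures standing in for the interpreter executing each IR version:

```rust
// Sketch: run the pre- and post-optimization versions on the same input;
// any behavioral difference indicates a transform bug.
pub fn diverges<F, G>(before: F, after: G, input: i64) -> bool
where
    F: Fn(i64) -> i64,
    G: Fn(i64) -> i64,
{
    before(input) != after(input)
}
```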

UB detection

The interpreter detects and reports:

UB category             Detection mechanism
Assert node violation   Runtime range check on assertsext/assertzext
Uninit read             Memory byte is Uninit at load time
Poison observation      Poison used as branch condition or stored to observable memory
Out-of-bounds access    Pointer offset exceeds allocation bounds
Use-after-free          Access to freed allocation
Provenance violation    Pointer used with wrong AllocId
Division by zero        Divisor is zero
Null dereference        Pointer with null address

Integration with fuzzing

The interpreter integrates with fuzzing tools:

  • Rustlantis: generates random Rust programs, compiles with tuffy, runs through interpreter
  • IR-level fuzzer: generates random tuffy IR fragments, runs through optimization pipeline, differential-tests with interpreter
  • Reducer: when a bug is found, an IR reducer (similar to llvm-reduce) minimizes the reproducer using the interpreter as the oracle

Reference-level explanation

Value representation

enum Value {
    Int(BigInt),
    Bytes(Vec<AbstractByte>),
    Ptr(Pointer),
    Float(f32),
    Double(f64),
    Vec(Vec<Value>),
    Poison,
}

enum AbstractByte {
    Bits(u8),
    PtrFragment(AllocId, usize),
    Uninit,
    Poison,
}

struct Pointer {
    alloc_id: AllocId,
    offset: usize,
    address_space: u32,
}

Memory model

struct Memory {
    allocations: HashMap<AllocId, Allocation>,
    next_alloc_id: AllocId,
}

struct Allocation {
    bytes: Vec<AbstractByte>,
    size: usize,
    alive: bool,
}

impl Memory {
    fn alloc(&mut self, size: usize) -> Pointer;
    fn free(&mut self, ptr: Pointer) -> Result<(), UBError>;
    fn load(&self, ptr: Pointer, size: usize) -> Result<Vec<AbstractByte>, UBError>;
    fn store(&mut self, ptr: Pointer, bytes: Vec<AbstractByte>) -> Result<(), UBError>;
}

Every memory operation checks:

  1. ptr.alloc_id refers to a live allocation
  2. ptr.offset + size <= allocation.size
  3. Provenance matches (the pointer was derived from this allocation)
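Checks 1 and 2 can be sketched directly against the Allocation fields above; check 3 (provenance) is the alloc_id lookup itself and is omitted here:

```rust
// Sketch: liveness and bounds checks performed before every load/store.
pub struct Allocation {
    pub size: usize,
    pub alive: bool,
}

pub fn check_access(a: &Allocation, offset: usize, size: usize) -> Result<(), &'static str> {
    if !a.alive {
        return Err("use-after-free"); // allocation already freed
    }
    if offset.checked_add(size).map_or(true, |end| end > a.size) {
        return Err("out-of-bounds"); // access exceeds allocation bounds
    }
    Ok(())
}
```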

Instruction execution

Each instruction maps directly to a Rust function:

fn exec_add(a: &Value, b: &Value) -> Value {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Value::Int(x + y),
        (Value::Poison, _) | (_, Value::Poison) => Value::Poison,
        _ => panic!("type error"),
    }
}

fn exec_assertsext(v: &Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            let min = -(BigInt::one() << (n - 1));
            let max = (BigInt::one() << (n - 1)) - 1;
            if *x >= min && *x <= max {
                Value::Int(x.clone())
            } else {
                Value::Poison  // UB: report to user
            }
        }
        Value::Poison => Value::Poison,
        _ => panic!("type error"),
    }
}

Hierarchical CFG execution

The interpreter walks the hierarchical CFG:

  • Region.Loop: execute the loop body repeatedly until the exit branch is taken
  • Region.Branch: evaluate the condition, execute the corresponding sub-region
  • Region.Sequence: execute sub-regions in order
  • BasicBlock: execute instructions sequentially, then the terminator
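The walk can be sketched with a simplified region tree. Here a block is reduced to a function that mutates state and returns true to request loop exit; real execution follows terminators rather than this boolean:

```rust
// Minimal sketch of the region walk: Sequence runs sub-regions in order,
// Loop repeats its body until the body signals exit.
pub enum Region {
    Sequence(Vec<Region>),
    Loop(Box<Region>),
    Block(fn(&mut i64) -> bool), // stand-in block; true = exit enclosing loop
}

pub fn exec(r: &Region, state: &mut i64) -> bool {
    match r {
        Region::Block(f) => f(state),
        // Propagate an exit request up to the nearest enclosing loop.
        Region::Sequence(subs) => subs.iter().any(|s| exec(s, state)),
        Region::Loop(body) => {
            while !exec(body, state) {} // repeat until exit branch taken
            false                       // exit is consumed by the loop
        }
    }
}
```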

MemSSA handling

The interpreter ignores MemSSA tokens — they are compile-time analysis artifacts. The interpreter uses its own concrete memory state for all load/store operations.

Drawbacks

  • Performance: arbitrary-precision integers and per-byte tracking are slower than native execution. The interpreter is not suitable for running large programs at scale.
  • Incomplete semantics: some IR features (atomics, concurrency) are difficult to interpret faithfully. Single-threaded sequential execution is a simplification.
  • Maintenance burden: the interpreter must be kept in sync with IR semantics changes. Every new instruction or semantic change requires interpreter updates.

Rationale and alternatives

Why a custom interpreter instead of lowering to LLVM and using lli?

Lowering to LLVM IR loses tuffy-specific semantics (infinite precision integers, four-state bytes, provenance). The whole point of the interpreter is to validate tuffy IR semantics directly, not after translation.

Why arbitrary-precision integers in the interpreter?

The IR defines int as infinite precision. The interpreter must faithfully implement this — using fixed-width integers would mask overflow bugs that the IR semantics are designed to prevent.

Why not use Miri directly?

Miri interprets Rust MIR, not tuffy IR. The tuffy interpreter operates on tuffy IR after optimization, catching bugs introduced by transforms. Miri and the tuffy interpreter serve complementary roles: Miri validates Rust semantics, the tuffy interpreter validates tuffy IR semantics.

Prior art

  • Miri: Rust MIR interpreter with UB detection, pointer provenance tracking, and abstract memory model. Primary inspiration for tuffy’s interpreter design.
  • llubi (LLVM UB-aware interpreter): LLVM IR interpreter with poison tracking and UB detection. Demonstrated the value of IR-level interpretation for miscompilation debugging. Limited by LLVM’s undef/poison duality and lack of provenance tracking.
  • lli (LLVM interpreter): basic LLVM IR interpreter without UB awareness. Insufficient for correctness validation.
  • KLEE: symbolic execution engine for LLVM IR. More powerful but much slower; tuffy’s interpreter focuses on concrete execution for speed.

Unresolved questions

  • Atomic semantics: how should the interpreter handle atomic operations? Sequential consistency by default, or model relaxed memory?
  • External function calls: how to handle calls to external functions (libc, system calls)? Stub library, or FFI to native implementations?
  • Vector operations: should the interpreter support scalable vectors, or only fixed-width vectors with a concrete vscale?
  • Performance optimization: are there opportunities to fast-path common cases (e.g., when all bytes are Bits, skip per-byte tracking)?

Future possibilities

  • Symbolic execution: extend the interpreter to support symbolic values, enabling exhaustive exploration of execution paths for bounded verification.
  • Interactive debugging: step-through execution with inspection of IR values, memory state, and provenance at each instruction.
  • Coverage-guided fuzzing: instrument the interpreter to report coverage, guiding the fuzzer toward unexplored paths.
  • Regression test generation: automatically generate minimal test cases from fuzzer-discovered bugs, adding them to the test suite.

Feature Name: tuffy_rewrite_trace

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the rewrite path tracing infrastructure for the tuffy compiler. Every transform emits a machine-readable trace entry recording the rewrite rule, local context, before/after IR fragments, and decision rationale. A replay tool can re-execute the optimization sequence from a trace, enabling regression bisection, policy locking, and transform auditing.

Motivation

When a compiler optimization causes a performance regression or miscompilation, diagnosing the root cause is notoriously difficult:

  • Black-box pipeline: the optimization pipeline is opaque — developers see input IR and output IR, but not the chain of decisions that produced the result.
  • Non-reproducible decisions: heuristic decisions (cost model thresholds, unrolling factors) depend on analysis state that is not recorded. Changing the compiler version may silently change decisions.
  • Manual bisection: finding which transform caused a regression requires manually enabling/disabling passes, a tedious and error-prone process.

Rewrite path tracing makes the optimization pipeline fully transparent and reproducible. Every decision is recorded, and the entire optimization sequence can be replayed mechanically.

Guide-level explanation

Trace entries

Each time a transform fires, it emits a trace entry containing:

  1. Rule identity: which rewrite rule fired, and where in the pipeline
  2. Local context: the preconditions that enabled the transform (known bits, ranges, constants)
  3. Before/after IR: the local IR fragment before and after the rewrite
  4. Decision rationale: for equivalence transforms, the proof reference; for heuristic transforms, the cost model inputs and outputs

Trace format

Traces are structured, machine-readable data. Example (equivalence transform):

{
  "seq": 42,
  "rule": "strength_reduce_mul_pow2",
  "stage": "early_simplification",
  "iteration": 2,
  "function": "@compute",
  "region": "loop.0",
  "context": {
    "%C": {"kind": "const", "value": 8, "properties": ["is_power_of_2"]},
    "%x": {"kind": "known_bits", "known_zeros": "0xFFFFFFFF00000000"}
  },
  "before": {"instructions": ["%r = mul %x, %C"]},
  "after": {"instructions": ["%r = shl %x, 3"]},
  "decision": {
    "kind": "equivalence",
    "proof_ref": "Tuffy.Rewrites.StrengthReduce.mul_pow2"
  }
}

Example (heuristic transform):

{
  "seq": 87,
  "rule": "loop_unroll",
  "stage": "loop_optimization",
  "function": "@process",
  "region": "loop.2",
  "context": {
    "trip_count": 4,
    "body_instruction_count": 12,
    "body_has_calls": false
  },
  "before": {"instructions": ["region.loop { ... }"]},
  "after": {"instructions": ["region.sequence { ... }"]},
  "decision": {
    "kind": "cost_model",
    "before_cost": {"estimated_cycles": 48, "code_size": 24},
    "after_cost": {"estimated_cycles": 16, "code_size": 96},
    "threshold": "unroll if speedup > 2x",
    "speedup": 3.0
  }
}
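The threshold check recorded in this entry can be sketched as a small helper. This is illustrative only; `Cost` and `should_unroll` are stand-ins, not the compiler's real cost-model types.

```rust
// Hypothetical sketch of the "unroll if speedup > 2x" gate; field names
// mirror the trace entry above but the helper itself is illustrative.
#[derive(Debug)]
struct Cost {
    estimated_cycles: u64,
    code_size: u64, // recorded in the trace, not consulted by this gate
}

/// Returns Some(speedup) if unrolling clears the speedup threshold.
fn should_unroll(before: &Cost, after: &Cost) -> Option<f64> {
    let speedup = before.estimated_cycles as f64 / after.estimated_cycles as f64;
    if speedup > 2.0 { Some(speedup) } else { None }
}

fn main() {
    let before = Cost { estimated_cycles: 48, code_size: 24 };
    let after = Cost { estimated_cycles: 16, code_size: 96 };
    // 48 / 16 = 3.0 > 2.0, so the transform fires with speedup 3.0
    println!("{:?}", should_unroll(&before, &after)); // prints Some(3.0)
}
```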

Replay

A replay tool reads a trace and re-executes the same transforms on a given IR:

tuffy-replay --trace=opt_trace.json --input=before.tir --output=after.tir

At each step, the replay tool:

  1. Matches the pattern from the trace entry against the current IR
  2. Verifies the preconditions still hold
  3. Applies the rewrite
  4. If a precondition fails or the pattern doesn’t match, reports a divergence

Divergences indicate that the IR or compiler has changed in a way that invalidates the recorded optimization sequence.

Use cases

  • Regression bisect: replay the trace step by step against a new compiler version, compare decisions, and find the first divergence.
  • Policy locking: save the trace as a “policy file”; replay forces the same decisions on recompilation.
  • Transform audit: review each trace entry to verify the optimization rationale.
  • Optimization report: generate a human-readable report from the trace showing what happened to each hot function.
  • Debugging: when a miscompilation is found, the trace pinpoints which transform introduced the bug.

Overhead control

Tracing is optional, controlled by a compiler flag:

  • --emit-trace=none (default): no tracing, zero overhead
  • --emit-trace=decisions: record rule + decision only (lightweight)
  • --emit-trace=full: record rule + context + before/after IR + decision (detailed)

When disabled, trace emission code is compiled out via conditional compilation, ensuring zero runtime cost.
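A minimal sketch of the conditional-compilation approach, assuming a hypothetical `trace` Cargo feature; the real flag plumbing is not specified here.

```rust
// Hypothetical sketch: the "trace" feature name is an assumption,
// not the real build flag.
#[cfg(feature = "trace")]
fn emit_trace(entry: &str) {
    eprintln!("trace: {entry}");
}

#[cfg(not(feature = "trace"))]
#[inline(always)]
fn emit_trace(_entry: &str) {
    // Compiled to nothing: zero runtime cost when tracing is disabled.
}

fn main() {
    emit_trace("rule=strength_reduce_mul_pow2");
}
```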

Relationship to other features

  • Origin chain: tracks instruction-level lineage (where did this instruction come from). Trace tracks transform-level decisions (why was this rewrite applied). Complementary, not overlapping.
  • Policy-mechanism separation: trace is the execution record of a policy. Replaying a trace = enforcing a policy. Locking to a versioned policy = saving and replaying a trace.
  • Transform equivalence declaration: trace entries reference the declaration (equivalence or refinement) and, for verified rules, the Lean 4 proof.

Reference-level explanation

Trace data structures

struct Trace {
    metadata: TraceMetadata,
    entries: Vec<TraceEntry>,
}

struct TraceMetadata {
    compiler_version: String,
    pipeline_config: PipelineConfig,
    target_triple: String,
    timestamp: String,
}

struct TraceEntry {
    seq: u64,
    rule: String,
    stage: String,
    iteration: u32,
    function: String,
    region: Option<String>,
    context: Vec<ContextFact>,
    before: IrFragment,
    after: IrFragment,
    decision: Decision,
}

enum ContextFact {
    Const { name: String, value: BigInt, properties: Vec<String> },
    KnownBits { name: String, known_zeros: BitSet, known_ones: BitSet },
    Range { name: String, min: BigInt, max: BigInt },
    AssertSext { name: String, width: u32 },
    AssertZext { name: String, width: u32 },
}

struct IrFragment {
    instructions: Vec<String>,
}

enum Decision {
    Equivalence {
        proof_ref: Option<String>,
    },
    CostModel {
        before_cost: CostEstimate,
        after_cost: CostEstimate,
        threshold: String,
    },
}

struct CostEstimate {
    estimated_cycles: Option<u64>,
    code_size: Option<u64>,
    custom: HashMap<String, String>,
}

Trace emission

The pass infrastructure provides a trace emitter that transforms call to record entries:

impl TraceEmitter {
    fn emit(&mut self, entry: TraceEntry);
    fn is_enabled(&self) -> bool;
}

In declarative rewrite rules, trace emission is automatic — the codegen generator produces trace emission code alongside the transform code. The context is extracted from the pattern match bindings, and the decision is derived from the rule’s TransformKind.

For the lightweight --emit-trace=decisions mode, before/after IR fragments and detailed context are omitted.
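A sketch of how generated transform code might use the emitter. The struct fields and the string entry are simplifications; the real emitter takes a `TraceEntry` as declared above.

```rust
// Illustrative stand-in for the pass-infrastructure emitter; the real
// emit() takes a structured TraceEntry, reduced here to a String.
struct TraceEmitter {
    enabled: bool,
    entries: Vec<String>,
}

impl TraceEmitter {
    fn is_enabled(&self) -> bool {
        self.enabled
    }

    fn emit(&mut self, entry: String) {
        self.entries.push(entry);
    }
}

fn apply_strength_reduce(emitter: &mut TraceEmitter) {
    // ... pattern match and rewrite would happen here ...
    if emitter.is_enabled() {
        // Building the entry only when enabled avoids paying the
        // construction cost in untraced runs.
        emitter.emit("rule=strength_reduce_mul_pow2 seq=42".to_string());
    }
}

fn main() {
    let mut emitter = TraceEmitter { enabled: true, entries: Vec::new() };
    apply_strength_reduce(&mut emitter);
    assert_eq!(emitter.entries.len(), 1);
}
```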

Replay engine

struct ReplayEngine {
    trace: Trace,
    current_step: usize,
}

enum ReplayResult {
    Success,
    Divergence {
        step: u64,
        expected_rule: String,
        reason: DivergenceReason,
    },
}

enum DivergenceReason {
    PatternNotFound,
    PreconditionFailed { fact: String, expected: String, actual: String },
    DifferentDecision { expected: Decision, actual: Decision },
}

The replay engine processes trace entries sequentially. At each step, it attempts to apply the recorded transform. If the pattern matches and preconditions hold, the transform is applied. Otherwise, a divergence is reported with diagnostic information.
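The sequential loop can be sketched as follows. `Step` and its `applies` flag are placeholders for the real pattern-matching and precondition machinery described above.

```rust
// Hedged sketch of the replay loop; a real step would carry the pattern,
// context facts, and rewrite rather than a precomputed `applies` flag.
struct Step {
    rule: &'static str,
    applies: bool, // stand-in for "pattern matched and preconditions held"
}

/// Applies steps in order; on failure, reports the diverging step and rule.
fn replay(steps: &[Step]) -> Result<(), (usize, &'static str)> {
    for (i, step) in steps.iter().enumerate() {
        if !step.applies {
            // Divergence: the recorded transform no longer applies here.
            return Err((i, step.rule));
        }
        // ... apply the rewrite to the IR ...
    }
    Ok(())
}

fn main() {
    let trace = [
        Step { rule: "strength_reduce_mul_pow2", applies: true },
        Step { rule: "loop_unroll", applies: false },
    ];
    // Diverges at step 1 on "loop_unroll"
    assert_eq!(replay(&trace), Err((1, "loop_unroll")));
}
```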

Serialization

Traces are serialized as JSON for interoperability. For large traces, streaming JSON (one entry per line, JSONL format) avoids loading the entire trace into memory.

tuffy --emit-trace=full -o output program.tir > trace.jsonl
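Streaming emission can be sketched as below; the serialization is hand-rolled to stay dependency-free, where the real compiler would presumably use a JSON library.

```rust
use std::io::Write;

// Sketch of streaming JSONL emission: one trace entry per line, so a
// reader never needs the whole trace in memory. Only two fields are
// shown; the real entry carries context, IR fragments, and decision.
fn write_entry(out: &mut impl Write, seq: u64, rule: &str) -> std::io::Result<()> {
    writeln!(out, r#"{{"seq": {seq}, "rule": "{rule}"}}"#)
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    write_entry(&mut buf, 42, "strength_reduce_mul_pow2")?;
    write_entry(&mut buf, 87, "loop_unroll")?;
    // Two lines, each an independent JSON object
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```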

Drawbacks

  • Trace size: full traces with before/after IR fragments can be large for complex programs. The decisions mode mitigates this.
  • Replay brittleness: replaying a trace on a different compiler version may diverge early if IR normalization changes, even if the final result is equivalent.
  • Code complexity: even with conditional compilation, the trace data structures and emission paths add code complexity.

Rationale and alternatives

Why machine-readable structured traces?

Alternative: human-readable logs (like LLVM’s -pass-remarks). Rejected because logs cannot be replayed. They are useful for human inspection but do not enable mechanical regression bisection or policy locking.

Alternative: IR snapshots at each stage. Recording full IR after each pass is expensive and doesn’t capture the decision rationale. Traces are more compact and more informative.

Why JSON/JSONL?

Widely supported, human-inspectable, easy to process with standard tools. Binary formats would be more compact but harder to debug and less interoperable.

Why optional with zero overhead when disabled?

Tracing is a development and debugging tool. Production builds should not pay any cost for it. Conditional compilation ensures this.

Prior art

  • LLVM -pass-remarks: human-readable optimization remarks. Useful but not machine-replayable.
  • LLVM opt-bisect-limit: binary search for which pass causes a bug. Coarser than trace-based bisection (pass-level vs transform-level).
  • GCC -fdump-tree-*: dumps IR at various stages. Snapshots, not decision traces.
  • Alive2: extracts local function-to-function rewrites for verification. Inspires tuffy’s local context extraction in trace entries.

Unresolved questions

  • Trace stability: how to handle trace replay across compiler versions when IR normalization changes? Should traces include a format version for compatibility?
  • Partial replay: should the replay tool support replaying only a subset of trace entries (e.g., only transforms in a specific function)?
  • Trace diffing: should there be a tool to diff two traces (e.g., from two compiler versions) and highlight decision changes?

Future possibilities

  • Trace-guided optimization: use traces from previous compilations to warm-start the optimizer, skipping exploration of decisions already known to be good.
  • Trace visualization: graphical tool showing the optimization pipeline as a sequence of local rewrites, with drill-down into context and decisions.
  • Community trace sharing: share traces for known-good optimization sequences on specific workloads, enabling reproducible performance.

Feature Name: tuffy_verification

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the correctness guarantee framework for the tuffy compiler. All optimization transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4. Instruction selection (isel) — the lowering from tuffy IR to machine instructions — is also formally verified in Lean 4, proving that the generated machine code faithfully implements the IR semantics. An Alive2-style automatic verifier and traditional testing (interpreter + fuzzer) serve as auxiliary discovery tools, not as correctness foundations.

Motivation

Compiler optimization bugs are a persistent source of real-world software defects. LLVM’s InstCombine alone has accumulated hundreds of miscompilation bugs, many stemming from subtle semantic errors in hand-written transforms. The root causes are:

  • Imperative transforms: hand-written code that directly manipulates IR is error-prone. Each transform must correctly handle edge cases (poison, overflow, pointer provenance) that are easy to overlook.
  • Informal correctness arguments: most transforms are justified by informal reasoning or review, not machine-checked proofs. Subtle errors survive review.
  • Disconnected verification: tools like Alive2 verify transforms after the fact, but are not integrated into the development workflow as a hard gate.

Instruction selection has similar risks. The lowering from IR to machine instructions must preserve semantics — incorrect register sizing, wrong sign/zero extension, or mishandled non-standard bit-widths (e.g., u17 annotations) can silently produce wrong code. Traditional compilers rely on testing to catch these bugs; tuffy formally verifies isel correctness.

Tuffy takes a formal-first approach: every rewrite rule and every isel lowering rule must have a Lean 4 proof before it enters the production pipeline.

Guide-level explanation

Three-layer verification

Tuffy uses three layers of verification, with clear trust hierarchy:

  • Lean 4 proofs (authoritative): gold standard. Machine-checked correctness of every rewrite rule, MemSSA refinement rule, annotation propagation rule, isel lowering rule, and MIR translation rule.
  • Alive2-style verifier (auxiliary): SMT-based quick check for discovering missed optimizations and regressions.
  • Interpreter + fuzzer + test suite (auxiliary): end-to-end behavioral testing.

Only Lean 4 proofs gate production inclusion. The other two layers are discovery tools — they find bugs and missed optimizations, but do not constitute correctness evidence.

Verified optimization transforms

Transforms are not hand-written imperative IR manipulation. They are declarative rewrite rules:

rule strength_reduce_mul_pow2 {
    pattern:  mul %x, const(C) where is_power_of_2(C)
    replace:  shl %x, log2(C)
    kind:     equivalence
}

rule narrow_add_known_bits {
    pattern:  add %x, %y where known_zeros(%x, [32:63]) && known_zeros(%y, [32:63])
    replace:  add %x, %y  // same operation, but refine known_zeros on result
    refine:   known_zeros(result, [32:63])
    kind:     refinement
}

Each rule specifies:

  • pattern: what IR fragment to match
  • replace: what to produce
  • kind: equivalence or refinement (used by dirty bit mechanism)
  • preconditions: constraints that must hold for the rule to apply
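For a rule as simple as strength_reduce_mul_pow2, the Lean 4 obligation is nearly definitional on mathematical integers. A hypothetical sketch, where `shl` is a stand-in for the formal semantics' definition, not the real Tuffy development:

```lean
-- Hypothetical sketch; the real rule and semantics live in the Tuffy
-- Lean development. On infinite precision integers, shift-left is
-- multiplication by 2^k, so the strength reduction is definitional.
def shl (x : Int) (k : Nat) : Int := x * 2 ^ k

theorem strength_reduce_mul_pow2 (x : Int) (k : Nat) :
    x * 2 ^ k = shl x k := rfl
```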

Workflow

The development workflow for a new optimization:

  1. Define the rewrite rule and prove correctness in Lean 4 — the rule definition and its proof live together in the same Lean file, against the formal IR semantics. This ensures the proof and the rule are always consistent.
  2. Lean exports declarative transform description — Lean generates a machine-readable description (JSON) of the verified rule, including pattern, replacement, preconditions, and transform kind.
  3. Codegen generator produces Rust code from the exported description
  4. Generated code uses the builder API (origin, dirty bit handled automatically by Rust type system)
  5. Alive2-style verifier cross-checks the rule for additional confidence
  6. Fuzzer exercises the rule on diverse inputs

Steps 1–3 are mandatory. Steps 5–6 are recommended but not gating.

The key insight: by having Lean own both the rule definition and the proof, there is no possibility of the proof and the implementation diverging. The codegen generator only reads what Lean exports.
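The exact schema of the exported description is not fixed by this RFC; a plausible shape for the strength-reduction rule above might be:

```json
{
  "rule": "strength_reduce_mul_pow2",
  "kind": "equivalence",
  "pattern": "mul %x, const(C)",
  "preconditions": ["is_power_of_2(C)"],
  "replace": "shl %x, log2(C)",
  "proof_ref": "Tuffy.Rewrites.StrengthReduce.mul_pow2"
}
```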

Verified MemSSA refinement

MemSSA refinement is a transform that rewrites a load’s mem operand to point to an earlier MemoryDef when alias analysis proves that intervening defs do not clobber the load’s memory location. This follows the “analysis is also a transformation” principle — refinement modifies the IR, not a side table.

Each refinement step is a rewrite rule with a proof obligation:

Given a load from location L depending on mem token %memN, and an intervening MemoryDef %memK → %memN that does not alias L, the load’s mem operand can be rewritten from %memN to %memK without changing observable behavior.

These rules are proven in Lean 4 against the formal memory semantics (evalLoad, evalStore, and the four-state byte model). The proof must account for pointer provenance — two pointers with different AllocIds never alias, but pointers derived via inttoptr may alias any exposed allocation.

Verified annotation propagation

Value-level analyses (KnownBits, DemandedBits) are transforms that write analysis results into IR annotations. Each propagation rule must be proven sound: the annotation produced must be a valid over-approximation of the actual value set.

Examples of proof obligations:

  • KnownBits through add: if %x has known_zeros at bits [32:63] and %y has known_zeros at bits [32:63], then add %x, %y has known_zeros at bits [33:63] (not [32:63] — carry may set bit 32).
  • DemandedBits through shift: if only bits [0:7] of shl %x, 8 are demanded, then no bits of %x are demanded (the result is always zero in those positions).
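The carry subtlety in the add example can be checked concretely. This is an illustrative counterexample, not the Lean proof:

```rust
// Both operands have bits [32:63] known-zero (each fits in 32 bits),
// yet their sum sets bit 32 via the carry, so the result only has
// bits [33:63] known-zero.
fn main() {
    let x: u64 = 0xFFFF_FFFF; // bits [32:63] are zero
    let y: u64 = 1;           // bits [32:63] are zero
    let sum = x + y;
    assert_eq!(sum, 0x1_0000_0000); // carry sets bit 32
    assert_ne!(sum >> 32, 0); // claiming [32:63] known-zero would be unsound
    assert_eq!(sum >> 33, 0); // bits [33:63] are still zero
}
```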

Unsound annotations cause silent miscompilation: downstream passes and instruction selection trust annotations to select machine operations. An incorrect :s32 annotation on a value that exceeds 32-bit signed range produces wrong code with no runtime error.

Annotation propagation rules follow the same workflow as optimization rewrite rules: defined and proven in Lean 4, exported as declarative descriptions, codegen’d to Rust.

Verified instruction selection

Instruction selection lowers tuffy IR (infinite precision integers with annotations) to target machine instructions (fixed-width registers). This lowering must preserve semantics: the machine code must produce the same observable behavior as the IR.

Key properties to verify:

  • Annotation-driven sizing: an :s32 annotation lowers to 32-bit machine operations; non-standard widths (e.g., :u17) lower to the correct shift/mask sequences.
  • Signedness correctness: signed annotations select signed machine operations (SAR, IDIV, signed condition codes) and unsigned annotations select unsigned ones (SHR, DIV, unsigned condition codes).
  • Extension correctness: sext/zext from annotation width N produces the correct result, including the no-op cases (zext from unsigned, sext from signed) and the cases requiring explicit shift/mask sequences.
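The shift/mask sequences can be sketched for a hypothetical :u17 add and :s17 sign extension on 32-bit registers. This is illustrative, not the real isel output:

```rust
// Lowering a :u17 add on a 32-bit register: the machine add may carry
// past bit 16, so a mask (AND with (1 << 17) - 1) re-establishes the
// unsigned 17-bit range.
fn add_u17(a: u32, b: u32) -> u32 {
    a.wrapping_add(b) & 0x1FFFF
}

// Sign-extending a 17-bit payload held in a 32-bit register: shift the
// payload to the top (SHL), then arithmetic-shift back down (SAR).
// Inputs are assumed to occupy only bits [0:16].
fn sext_s17(x: u32) -> i32 {
    ((x << 15) as i32) >> 15
}

fn main() {
    assert_eq!(add_u17(0x1FFFF, 1), 0); // wraps in the 17-bit domain
    assert_eq!(sext_s17(0x1FFFF), -1);  // all 17 bits set means -1
    assert_eq!(sext_s17(0x0FFFF), 0xFFFF); // bit 16 clear: positive
}
```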

Each isel lowering rule is defined in Lean 4 as a relation between IR instruction semantics and machine instruction semantics. The proof obligation is:

For all inputs satisfying the annotation constraints, the machine instruction sequence produces the same result as the IR instruction under the formal semantics.

The Lean formalization requires a machine-level semantics model (x86-64 integer operations on fixed-width registers) in addition to the existing IR semantics.

Verified MIR translation

MIR translation converts rustc MIR to tuffy IR. If this translation is wrong, all downstream verification is meaningless. Key non-trivial translation steps:

  • Integer type translation: MIR fixed-width integers (u8, i32, u64) → int with :s<N> / :u<N> annotations. Incorrect annotations cause silent miscompilation.
  • Control flow structuralization: flat MIR CFG → hierarchical SESE regions.
  • Aggregate scalarization: structs, enums, arrays → scalar operations.
  • mem2reg: MIR mutable Places → SSA values.

Approach: define a MIR semantic subset in Lean 4 that covers the MIR constructs tuffy consumes. This is not a full MIR formalization — it models only the subset relevant to translation, independent of rustc’s internal MIR definition. Each translation rule is proven correct against this subset model and the tuffy IR formal semantics.

When rustc’s MIR semantics change, the Lean MIR subset model is updated and affected translation proofs are re-verified. The interpreter + fuzzer layer provides additional confidence that the subset model faithfully reflects actual MIR behavior.

Trust boundary

┌──────────────────────────────────────────┐
│           Trusted components             │
│                                          │
│  ┌───────────────┐  ┌────────────────┐   │
│  │ Lean 4 kernel │  │ IR formal      │   │
│  │               │  │ semantics      │   │
│  └───────────────┘  └────────────────┘   │
│  ┌────────────────┐  ┌───────────────┐   │
│  │ x86-64 machine │  │ MIR semantic  │   │
│  │ semantics      │  │ subset model  │   │
│  └────────────────┘  └───────────────┘   │
│  ┌───────────────┐                       │
│  │ Codegen       │                       │
│  │ generator     │                       │
│  └───────────────┘                       │
├──────────────────────────────────────────┤
│           Verified components            │
│                                          │
│  ┌────────────────────────────────────┐  │
│  │ Optimization rewrite rules         │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ MemSSA refinement rules            │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ Annotation propagation rules       │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ Isel lowering rules                │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ MIR translation rules              │  │
│  │ (proven against MIR subset model)  │  │
│  └────────────────────────────────────┘  │
├──────────────────────────────────────────┤
│           Constrained components         │
│                                          │
│  ┌────────────────────────────────────┐  │
│  │ Generated Rust transform code      │  │
│  │ (type system enforces invariants)  │  │
│  └────────────────────────────────────┘  │
└──────────────────────────────────────────┘

IR formal semantics in Lean 4

The foundation of all proofs is a formal semantics of tuffy IR defined in Lean 4. This includes:

  • Type system semantics (infinite precision integers, byte types, pointers, mem tokens, vectors)
  • Instruction semantics (each operation’s mathematical definition)
  • Poison propagation rules
  • Memory model (four-state bytes, provenance, MemSSA token semantics)
  • Annotation semantics (range constraints, KnownBits, DemandedBits soundness conditions)
  • Assert node semantics (assertsext, assertzext)

A separate MIR semantic subset model defines the MIR constructs that tuffy consumes, serving as the source semantics for MIR translation proofs.

This formalization is the single source of truth for what the IR means. All rewrite rule, MemSSA refinement, annotation propagation, isel, and MIR translation proofs reference these definitions.

Alive2-style verifier

A separate SMT-based tool that can:

  • Automatically verify simple rewrite rules (bounded bit widths, no loops)
  • Discover missed optimizations by enumerating candidate rewrites
  • Regression-test existing rules against IR semantics changes
  • Run in CI as a fast feedback loop

This tool is valuable for development velocity but is not a substitute for Lean 4 proofs. It may miss edge cases that the theorem prover catches, and it cannot handle unbounded reasoning (infinite precision integers require induction).

Interpreter and fuzzer

The tuffy IR interpreter (miri-like) combined with fuzzing provides end-to-end validation:

  • Compile Rust programs with tuffy, run the output, compare against reference (rustc + LLVM)
  • Fuzz IR fragments through the optimization pipeline, check that interpreter results are preserved
  • Exercise edge cases in poison propagation, pointer provenance, and memory model

Reference-level explanation

Rewrite rule representation

struct RewriteRule {
    name: String,
    pattern: Pattern,
    replacement: Replacement,
    preconditions: Vec<Precondition>,
    kind: TransformKind,
    proof_ref: LeanProofRef,
}

struct LeanProofRef {
    module: String,      // Lean 4 module path
    theorem: String,     // theorem name
    verified: bool,      // checked by build system
}

Codegen pipeline

Lean 4 rewrite rule definition + proof
        │
        ▼
  Codegen generator (trusted)
        │
        ▼
  Generated Rust code:
    - Pattern matching function
    - Replacement construction (via Builder API)
    - TransformKind declaration
    - Precondition checks

The generator produces Rust code that:

  1. Implements the Pass trait
  2. Uses the builder API (origin is mandatory, dirty bit is automatic)
  3. Declares TransformKind matching the Lean 4 proof (equivalence or refinement)

Build system integration

The build system verifies that:

  1. Every rewrite rule in the production pipeline has a corresponding Lean 4 proof
  2. The proof references a theorem that is successfully checked by Lean 4
  3. The generated Rust code is up-to-date with the rule definition

Rules without proofs can exist in an experimental pipeline for development, but cannot be included in release builds.

Drawbacks

  • Lean 4 learning curve: contributors must learn Lean 4 to add new optimizations. This raises the barrier to contribution.
  • Proof effort: proving correctness of complex transforms (loop optimizations, vectorization) requires significant effort. Some transforms may be difficult to formalize.
  • Formalization maintenance: changes to IR semantics require updating the Lean 4 formalization and potentially re-proving existing theorems.
  • Codegen generator trust: the generator is in the trusted computing base. Bugs in the generator could produce incorrect Rust code from correct proofs.
  • Development velocity: requiring proofs for every optimization may slow down the pace of adding new optimizations compared to traditional compilers.

Rationale and alternatives

Why Lean 4?

  • Active ecosystem with Mathlib providing extensive mathematical foundations
  • Good metaprogramming support (tactics, macros) for proof automation
  • Overlap with Rust community (shared interest in formal methods)
  • Compiled language with reasonable performance for proof checking

Alternative: Coq. Mature ecosystem but less active development. Lean 4’s metaprogramming is more ergonomic.

Alternative: Isabelle. Strong automation (Sledgehammer) but less mainstream adoption and weaker programming language integration.

Why declarative transforms?

Alternative: imperative transforms with post-hoc verification. Write transforms in Rust, then verify them with Alive2 or similar. Rejected because imperative code has a large surface area for bugs that post-hoc tools may miss, and it’s harder to establish a formal correspondence between the Rust code and the verified property.

Declarative rules have a narrow semantic gap between the rule definition and the Lean 4 proof. The codegen generator bridges this gap with minimal trusted code.

Why not rely solely on Alive2-style verification?

SMT-based verification is bounded — it checks finite bit widths and cannot perform induction. Tuffy’s infinite precision integers require inductive proofs that only an interactive theorem prover can provide. Additionally, SMT solvers may time out on complex transforms, giving no answer rather than a proof.

Prior art

  • CompCert: fully verified C compiler in Coq. Proves correctness of the entire compilation pipeline, not just individual transforms. Tuffy takes a similar approach for isel (verified lowering) while using declarative rules for optimizations.
  • Alive / Alive2: SMT-based verification of LLVM InstCombine transforms. Discovered numerous bugs. Tuffy uses this as an auxiliary tool, not the primary correctness mechanism.
  • CakeML: verified ML compiler in HOL4. Similar full-pipeline verification to CompCert.
  • Lean 4 + Mathlib: growing library of formalized mathematics. Provides foundations for reasoning about integers, bit operations, and algebraic properties.
  • Souper: superoptimizer for LLVM that discovers optimizations via SMT. Could complement tuffy’s approach by suggesting candidate rewrite rules for human proof.

Unresolved questions

  • Declarative rule expressiveness: can all useful transforms be expressed declaratively, or will some require escape hatches for imperative logic (e.g., complex loop transforms)?
  • Proof automation: how much of the proof burden can be automated via Lean 4 tactics? Custom tactics for common proof patterns (bit manipulation, poison propagation) would significantly reduce effort.
  • Codegen generator verification: should the generator itself be verified (in Lean 4), or is manual audit sufficient given its narrow scope?
  • Incremental proof checking: how to efficiently re-check proofs when IR semantics change? Full re-verification may be expensive.

Future possibilities

  • Proof-carrying optimizations: distribute verified rewrite rules as packages, enabling a community-driven optimization library with machine-checked correctness.
  • Superoptimizer integration: use SMT-based search (Souper-style) to discover candidate rewrites, then prove them in Lean 4 for inclusion in the production pipeline.
  • End-to-end verification: with optimizations and isel now verified, extend verification to the remaining pipeline stages (MIR translation, register allocation, machine code emission) to close the full compilation gap.
  • LLM-assisted proofs: use language models to draft Lean 4 proofs for new rewrite rules, with human review and machine checking.

Feature Name: at-use-annotations

  • Status: Draft
  • Created: 2026-02-08
  • Completed: N/A

Summary

Replace standalone assert_sext/assert_zext instructions with annotations on value definitions (result-side) and use edges (use-side). Range constraints, known bits, and demanded bits are encoded directly on the IR edges rather than as separate instructions. Additionally, simplify the text format by eliding the implicit top-level region function.

Motivation

The current IR uses assert_sext and assert_zext as standalone instructions to express range constraints on infinite precision integers:

v0 = param 0
v1 = assert_sext v0, 32
v2 = param 1
v3 = assert_sext v2, 32
v4 = add v1, v3
v5 = assert_sext v4, 32
ret v5

This has several problems:

  1. Instruction bloat — Every constrained value requires a separate assert instruction, roughly doubling the instruction count for typical integer-heavy code.
  2. Artificial data dependencies — Assert nodes create new SSA values (v1 instead of v0), forcing all downstream uses to reference the asserted value. This complicates pattern matching in optimization passes.
  3. No place for analysis results — The planned at-use bit-level analysis (known bits, demanded bits) has no natural home. Assert nodes only express range constraints, not arbitrary bit-level facts.
  4. Phase ordering — Analysis results stored in side tables become stale after transformations. Encoding them in the IR itself keeps them synchronized.

Additionally, the text format wraps every function body in region function { ... }, which is redundant since every function has exactly one top-level function region.

Guide-level explanation

Annotations on values

Every value reference in tuffy IR can carry an optional annotation describing constraints on that value. Annotations appear in two positions:

  • Result-side: on the value defined by an instruction, declaring the range of the output.
  • Use-side: on an operand of an instruction, declaring a constraint the consumer requires.

Annotation types

| Annotation | Meaning |
|---|---|
| :s<N> | Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1] |
| :u<N> | Value is in unsigned N-bit range [0, 2^N-1] |
| :known(<ternary>) | Per-bit four-state constraint (see below) |
| (none) | No constraint; value is an unconstrained int |

Annotations are composable: :s32:known(...) applies both constraints simultaneously.

Known bits ternary encoding

Each bit in a known annotation is one of four states:

| Symbol | Meaning |
|---|---|
| 0 | Bit is known to be 0 |
| 1 | Bit is known to be 1 |
| ? | Unknown — bit is demanded but its value has not been determined |
| x | Don’t-care — bit is not demanded by the consumer |

Example: :known(xxxx_??11_1111_0000) means (bit 0 is the rightmost symbol):

  • Bits [0:3] are known-0
  • Bits [4:9] are known-1
  • Bits [10:11] are demanded but unknown
  • Bits [12:] are don’t-care (not demanded)

The distinction between ? and x matters for optimization:

  • ? (unknown): the consumer needs this bit, so the producer must supply a well-defined value. An analysis pass may later refine ? to 0 or 1.
  • x (don’t-care): the consumer ignores this bit entirely. The producer is free to supply any value, enabling dead-bit elimination and narrowing optimizations.
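The four symbols map naturally onto bitmask triples. As an illustration only — parse_ternary is a hypothetical helper, and this sketch uses a fixed 64-bit universe where the real IR would use arbitrary precision — the example above decodes as:

```rust
/// Decode a ternary string like "xxxx_??11_1111_0000" into
/// (ones, zeros, demanded) bitmasks. Hypothetical helper; fixed-width
/// u64 sketch of what an arbitrary-precision version would do.
fn parse_ternary(s: &str) -> (u64, u64, u64) {
    let (mut ones, mut zeros, mut demanded) = (0u64, 0u64, 0u64);
    // Rightmost character is bit 0; '_' separators are ignored.
    for (bit, c) in s.chars().rev().filter(|&c| c != '_').enumerate() {
        match c {
            '1' => { ones |= 1 << bit; demanded |= 1 << bit; }
            '0' => { zeros |= 1 << bit; demanded |= 1 << bit; }
            '?' => { demanded |= 1 << bit; }
            'x' => {} // don't-care: neither known nor demanded
            _ => panic!("invalid ternary symbol: {c}"),
        }
    }
    (ones, zeros, demanded)
}

fn main() {
    let (ones, zeros, demanded) = parse_ternary("xxxx_??11_1111_0000");
    assert_eq!(zeros, 0x00F);    // bits [0:3] known-0
    assert_eq!(ones, 0x3F0);     // bits [4:9] known-1
    assert_eq!(demanded, 0xFFF); // bits [0:11] demanded
}
```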

Example: i32 a + i32 b (C signed addition)

Before (assert nodes):

func @add_i32(int, int) -> int {
  region function {
    bb0:
      v0 = param 0
      v1 = assert_sext v0, 32
      v2 = param 1
      v3 = assert_sext v2, 32
      v4 = add v1, v3
      v5 = assert_sext v4, 32
      ret v5
  }
}

After (at-use annotations, top-level region elided):

func @add_i32(int:s32, int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1:s32 = param 1
    v2:s32 = add v0:s32, v1:s32
    ret v2:s32
}

7 instructions reduced to 4. No artificial SSA values from assert nodes.

Example: optimization strengthening a use-side annotation

After an analysis pass discovers that v1 is always in [0, 100] at a particular use, the use can be annotated with the enclosing unsigned 7-bit range :u7 ([0, 127]):

func @example(int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1:s32 = add v0:s32, v0:s32
    v2:s32 = foo v1:u7              // use-side: optimizer proved v1 in [0, 127] here
    ret v2:s32
}

The :u7 on the use of v1 is stronger than the :s32 on its definition. This is valid because the use-side annotation is a local fact about this specific use site.

Example: known bits after optimization

v3:s32:known(xxxx_xxxx_xxxx_xxxx_????_????_????_???0) = and v1:s32, v2:s32

The and result has bit 0 known to be 0 (discovered by analysis), bits [1:15] are demanded but unknown, and bits [16:31] are don’t-care.
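The bit-0 fact follows from the standard known-bits transfer rule for and: a result bit is known-1 only when both operand bits are known-1, and known-0 when either operand bit is known-0. A minimal fixed-width sketch (the implementation would operate on BigUint masks, not u64):

```rust
/// Known-bits transfer for `and` over (ones, zeros) mask pairs.
/// Fixed-width u64 sketch of the arbitrary-precision rule.
fn and_known(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let ones = a.0 & b.0;  // known-1 only if both inputs are known-1
    let zeros = a.1 | b.1; // known-0 if either input is known-0
    (ones, zeros)
}

fn main() {
    // One operand has bit 0 known-0; the other knows nothing about bit 0.
    let v1 = (0b0000, 0b0001); // (ones, zeros)
    let v2 = (0b0100, 0b0000);
    let (ones, zeros) = and_known(v1, v2);
    assert_eq!(zeros & 1, 1); // result bit 0 is known-0, as in the example
    assert_eq!(ones, 0);      // no bit is known-1 in both operands
}
```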

Semantics

Result-side annotation violation: the instruction itself produces poison.

v0:s8 = add v1, v2    // if the true result is 200, v0 becomes poison

Use-side annotation violation: the consuming instruction produces poison.

v1 = foo v0:u4        // if v0 is 20, then foo produces poison

Constraint strength: a use-side annotation may be stronger (narrower) than the result-side annotation on the referenced value. It must not be weaker — a weaker use-side annotation is redundant and should be omitted.
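These poison rules amount to a range check at each annotated edge. A sketch, assuming a hypothetical check_signed helper with i128 standing in for the infinite precision integers of the real semantics:

```rust
/// Result of evaluating an annotated value: a concrete integer or poison.
#[derive(Debug, PartialEq)]
enum Val {
    Int(i128), // i128 stands in for infinite precision integers
    Poison,
}

/// Check a value against an `:s<N>` annotation; out-of-range values
/// become poison. Hypothetical sketch, not the compiler's API.
fn check_signed(v: i128, n: u32) -> Val {
    let half = 1i128 << (n - 1);
    if (-half..half).contains(&v) { Val::Int(v) } else { Val::Poison }
}

fn main() {
    assert_eq!(check_signed(100, 8), Val::Int(100)); // within [-128, 127]
    assert_eq!(check_signed(200, 8), Val::Poison);   // as in the v0:s8 example
}
```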

No annotation: the value is an unconstrained infinite precision int. During instruction selection, the verifier reports an error if it cannot determine a concrete machine type for an unconstrained value.

Function signatures

Function signatures carry annotations on parameter types and the return type. These are ABI-level contracts between caller and callee:

func @add_i32(int:s32, int:s32) -> int:s32 { ... }

  • Parameter annotation: the caller guarantees the argument satisfies the constraint. Inside the function body, param result-side annotations match the signature.
  • Return annotation: the callee guarantees the return value satisfies the constraint. The ret use-side annotation must be at least as strong as the signature’s return annotation.

Text format: elide top-level region

Every function has exactly one top-level region function. The text format elides it:

// Before:
func @f(int) -> int {
  region function {
    bb0:
      ...
  }
}

// After:
func @f(int) -> int {
  bb0:
    ...
}

Inner regions (loop, plain) are still written explicitly.

Reference-level explanation

Annotation representation

In the Rust implementation, annotations are stored alongside each operand (use-side) and each instruction result (result-side). A compact representation:

/// Per-bit four-state annotation.
struct KnownBits {
    /// Bits known to be 1.
    ones: BigUint,
    /// Bits known to be 0.
    zeros: BigUint,
    /// Bits that are demanded by the consumer.
    demanded: BigUint,
}

Invariant: ones & zeros == 0 (a bit cannot be simultaneously known-0 and known-1).

The :s<N> and :u<N> shorthand annotations are sugar over KnownBits:

  • :s32 implies demanded bits [0:31], with the sign bit propagation constraint (bits [32:] conceptually equal bit 31).
  • :u32 implies demanded bits [0:31], with bits [32:] known-0.
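Concretely, the :u<N> desugaring might look like the following over a fixed 64-bit universe (a sketch with a hypothetical helper name; with BigUint the known-0 region above bit N is conceptually unbounded):

```rust
/// Desugar `:u<N>` (N < 64) into (zeros, demanded) masks.
/// Fixed-width sketch; hypothetical helper, not the real API.
fn unsigned_annotation(n: u32) -> (u64, u64) {
    assert!(n < 64, "sketch limited to a 64-bit universe");
    let demanded = (1u64 << n) - 1; // bits [0 : N-1] are demanded
    let zeros = !demanded;          // bits [N :] are known-0
    (zeros, demanded)
}

fn main() {
    let (zeros, demanded) = unsigned_annotation(8);
    assert_eq!(demanded, 0xFF);
    assert_eq!(zeros, !0xFFu64);
}
```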

Interaction with other features

Block arguments: block arguments carry result-side annotations. Branch instructions carry use-side annotations on the passed values:

bb0:
  v0:s32 = iconst 42
  br bb1(v0:s32)

bb1(v1:s32: int):
  ret v1:s32

Terminators: ret, br, brif, continue, region_yield all carry use-side annotations on their operands.

Memory operations: load result carries a result-side annotation. store value operand carries a use-side annotation. Pointer operands are not annotated (pointers have concrete types).
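For example (illustrative IR text following the annotation syntax above, assuming a `v = load <ptr>` / `store <ptr>, <value>` operand order):

```
bb0:
  v1:s32 = load v0       // result-side annotation on the loaded value
  store v2, v1:s32       // use-side annotation on the stored value; v0, v2 are unannotated pointers
```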

Instruction selection: the verifier walks all values and checks that every int value has sufficient annotation (result-side or use-side) to determine a concrete machine type. Missing annotations are reported as errors.

Grammar changes

annotation   ::= (':' range_ann)*
range_ann    ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary      ::= [01?x_]+

function     ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list   ::= (type annotation (',' type annotation)*)?
body         ::= (block | region)*
instruction  ::= (value annotation '=')? opcode operands
operand      ::= value annotation

The top-level region function { ... } wrapper is removed from the grammar. The function body directly contains blocks and regions.

Drawbacks

  • Annotation storage overhead — Every use edge now carries optional annotation data. For dense integer code this may increase memory usage compared to standalone assert instructions (one annotation per use vs one instruction per value).
  • Complexity in the builder API — The builder must accept annotations on every operand, making the API surface larger.
  • Parsing complexity — The text format parser must handle annotations on every value reference, increasing grammar complexity.

Rationale and alternatives

Why annotations on edges instead of standalone instructions?

The core insight is that range constraints and bit-level facts are properties of how a value is used, not properties of the value itself. A value v0 may be in the full int range, but a particular consumer may only need 8 bits of it. Encoding this on the use edge is the natural representation.

Alternative: keep assert nodes, add side-table analyses

This is the LLVM approach — KnownBits and DemandedBits live in analysis caches outside the IR. The problem is phase ordering: after a transformation, the side tables are stale and must be recomputed. Tuffy’s “analysis is also a transformation” principle rejects this approach.

Alternative: assert nodes with use-side annotations (hybrid)

Keep assert nodes for result-side constraints, add use-side annotations only for analysis results. This avoids changing the instruction set but retains the instruction bloat problem and the artificial SSA value issue.

Why signed/unsigned is a binary choice

A natural question is whether annotations should support expressing both signed and unsigned constraints simultaneously — e.g., a value that is both :s32 and :u16. The answer is no, because the intersection of any two range annotations is always expressible as a single annotation:

| Intersection | Result | Reasoning |
|---|---|---|
| :s<M> ∩ :s<N> | :s<min(M,N)> | Smaller signed range is a subset of the larger |
| :u<M> ∩ :u<N> | :u<min(M,N)> | Smaller unsigned range is a subset of the larger |
| :s<M> ∩ :u<N> | :u<min(M-1, N)> | The non-negative portion of :s<M> is [0, 2^(M-1)-1], which equals :u<M-1> |

The third case is the key insight. A signed N-bit value is in [-2^(N-1), 2^(N-1)-1]. An unsigned M-bit value is in [0, 2^M-1]. Their intersection is [0, min(2^(N-1)-1, 2^M-1)], which is always an unsigned range expressible as :u<min(N-1, M)>.

Since any combination of signed and unsigned constraints collapses to a single annotation, supporting both simultaneously adds no expressive power — it only adds implementation complexity.
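The collapse rules can be written down directly. A sketch (Range is a hypothetical type for illustration, not the compiler's representation):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Range {
    S(u32), // `:s<N>`
    U(u32), // `:u<N>`
}

/// Intersect two range annotations into a single equivalent annotation,
/// following the collapse rules above (assumes N >= 1).
fn intersect(a: Range, b: Range) -> Range {
    use Range::*;
    match (a, b) {
        (S(m), S(n)) => S(m.min(n)),
        (U(m), U(n)) => U(m.min(n)),
        // Mixed case: the non-negative part of :s<M> is :u<M-1>.
        (S(m), U(n)) | (U(n), S(m)) => U((m - 1).min(n)),
    }
}

fn main() {
    assert_eq!(intersect(Range::S(32), Range::S(16)), Range::S(16));
    assert_eq!(intersect(Range::S(8), Range::U(32)), Range::U(7));
    assert_eq!(intersect(Range::U(16), Range::S(32)), Range::U(16));
}
```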

For constraints that range annotations genuinely cannot express (e.g., “bit 3 is known to be 1” or “bits above 8 are don’t-care”), the KnownBits extension is the correct generalization. KnownBits subsumes range annotations entirely: :s32 and :u32 are sugar over specific KnownBits patterns.

Impact of not doing this

Without this change, the IR would need both assert nodes (for range constraints) and a separate mechanism (side tables or new annotation syntax) for known/demanded bits. Two parallel systems for the same concept.

Prior art

  • LLVM KnownBits / DemandedBits — Side-table analyses that track per-bit information. Effective but disconnected from the IR, causing phase ordering issues and stale results after transformations. Tuffy’s approach promotes these to first-class IR constructs.
  • Cranelift aegraph annotations — Cranelift’s e-graph framework attaches facts to values (range constraints for proof-carrying code). Similar in spirit to result-side annotations, but not on use edges.
  • MLIR type refinement — MLIR dialects can carry refined type information on values. However, MLIR’s approach is type-based (the type itself changes), not annotation-based (the type stays int, annotations are metadata).

Unresolved questions

  • Compact storage for common cases — Most annotations will be simple :s32 or :u64. Should the implementation use a small-enum optimization to avoid allocating BigUint for these common cases?
  • Annotation canonicalization — Should redundant annotations (e.g., :s32 on a use when the result already has :s32) be stripped automatically, or preserved for readability?
  • Lean formalization — How should annotations be represented in the Lean formal model? As metadata on the operational semantics, or as part of the value type?

Future possibilities

  • Annotation inference pass — A dedicated pass that propagates known bits forward and demanded bits backward, filling in ? bits with 0/1 and narrowing ? to x where possible.
  • Annotation-driven instruction selection — The instruction selector reads annotations directly to choose machine instructions, register classes, and sign/zero extension without a separate legalization phase.
  • Annotation-aware rewrite rules — Declarative rewrite rules that match on annotation patterns, e.g., “if all bits above bit 7 are don’t-care, narrow to 8-bit operation”.
  • Floating point annotations — Extend the annotation system to floating point values (e.g., known-NaN, known-finite, known-positive).

SOP: Modifying the Tuffy IR

  • Created: 2026-02-08
  • Status: Active

Overview

This document describes the standard operating procedure for making changes to the Tuffy IR. Because the IR is the core data structure shared across the entire compiler, changes must be propagated to multiple components in a specific order.

Guiding Principle

The Lean 4 formal definitions are the source of truth. All other artifacts (Rust code, spec document) must conform to the Lean definitions. When in doubt, the Lean code is authoritative.

Checklist

1. Lean 4 Formal Definitions (Source of Truth)

Update the formal model first. Depending on the nature of the change:

| What changed | File to update |
|---|---|
| Types, values, memory model | lean/TuffyLean/IR/Types.lean |
| Instruction semantics | lean/TuffyLean/IR/Semantics.lean |

After editing, verify with:

cd lean && lake build

If the change affects existing proofs, fix all proof breakages before proceeding.

2. Rust Implementation (tuffy_ir)

Translate the Lean changes into the Rust IR crate. Files to consider:

| What changed | File to update |
|---|---|
| Type system (new type or type change) | tuffy_ir/src/types.rs |
| New/modified instruction or opcode | tuffy_ir/src/instruction.rs |
| Reference handles (ValueRef, BlockRef, etc.) | tuffy_ir/src/value.rs |
| Function/block/region structure | tuffy_ir/src/function.rs |
| Builder API for constructing IR | tuffy_ir/src/builder.rs |
| Textual format / display | tuffy_ir/src/display.rs |

Run after changes:

cargo test -p tuffy_ir
cargo clippy -p tuffy_ir

3. Downstream Crates

IR changes typically break downstream consumers. Update in dependency order:

  1. Interpreter (tuffy_ir_interp/src/lib.rs) — Add/update evaluation logic for new or changed instructions.

  2. Optimizer (tuffy_opt/) — Update any passes that pattern-match on changed instructions.

  3. Code Generation (tuffy_codegen/) — Update instruction selection (src/isel.rs), encoding (src/encode.rs), or emission (src/emit.rs) as needed.

  4. Trace / TV (tuffy_trace/, tuffy_tv/) — Update if the change affects tracing or visualization.

4. Spec Document

Update the human-readable reference:

  • docs/spec/ (split into multiple files; see docs/spec/index.md)

Ensure the spec accurately reflects the new Lean definitions and Rust implementation. The spec is not authoritative — it is documentation derived from the Lean model.

5. Tests

Add or update tests at every layer:

| Layer | Location |
|---|---|
| Rust unit tests | tuffy_ir/src/tests.rs |
| Interpreter tests | tuffy_ir_interp/tests/ |
| Codegen tests | tuffy_codegen/src/tests.rs |
| UI tests | rustc_codegen_tuffy/tests/ (run via rustc_codegen_tuffy/tests/run-ui-tests.sh) |

6. Final Verification

Before committing, all of the following must pass:

cd lean && lake build
cargo test
cargo clippy
rustc_codegen_tuffy/tests/run-ui-tests.sh

Commit Strategy

Each component should be committed separately per the project’s commit convention. A typical IR change produces commits in this order:

  1. feat(lean): ... — Lean formal definitions
  2. feat(ir): ... — Rust tuffy_ir crate
  3. feat(interp): ... / feat(codegen): ... — downstream crates
  4. docs: ... — spec update

Every individual commit must pass all tests.

SOP: Updating Documentation

  • Created: 2026-02-09
  • Status: Active

Overview

This document describes the checks required whenever documentation files under docs/ are added or modified. Skipping these checks can cause broken links, missing pages on the published site, or stale content that contradicts the source of truth.

Checklist

1. SUMMARY.md

Every Markdown file that should appear on the published site must be listed in docs/SUMMARY.md. mdBook only builds pages present in the summary — unlisted files will result in 404 errors.

When adding a new page:

  • Add an entry to docs/SUMMARY.md in the correct section and nesting level.
  • If the page belongs to an existing group (e.g., spec/), add it as a nested entry under the group’s parent.

2. Spec Index

If the change adds or removes a page under docs/spec/, update the Table of Contents in docs/spec/index.md to match.

3. Source of Truth Conformance

The spec documents (docs/spec/) describe the IR but are not authoritative. The Lean 4 definitions (lean/TuffyLean/IR/) are the source of truth.

When updating spec docs, verify that:

  • Descriptions match the current Lean definitions.
  • Semantics formulas reference the correct eval* function names from TuffyLean.IR.Semantics.
  • Type names and instruction opcodes match the Rust implementation in tuffy_ir/.

4. Cross-References

Check that all cross-references between documentation pages are valid:

  • Relative links (e.g., [types](types.md)) point to files that exist.
  • Anchor links (e.g., instructions.md#comparison) reference headings that exist in the target file.

5. Build Verification

After making changes, build the documentation locally and verify:

mdbook build

The build must complete without errors. Optionally preview with:

mdbook serve

6. Commit

Documentation-only changes use the docs commit type:

docs: <description>
docs(spec): <description>    # for spec-specific changes

If the documentation change accompanies a code change (e.g., adding a new instruction), the docs update should be a separate commit from the code change, per the project’s commit convention.