Introduction
Tuffy is an experimental optimizing compiler written in Rust, developed with LLM assistance.
Note: This project is currently in an experimental stage.
Key Design Ideas
- Infinite precision integers — The IR uses a single int type with no fixed bitwidth. Arithmetic operates on mathematical integers; signedness and minimum required bits are derived at use sites. This eliminates zext/sext/trunc noise and lets optimization passes focus on mathematical equivalence.
- Analysis is also a transformation — Static analyses are expressed as transformations on the IR itself, unifying analysis and optimization and reducing phase-ordering problems.
- Declarative, verified optimizations — Transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4 against the formal IR semantics. A codegen generator produces Rust implementations from the verified rules, minimizing the trusted code base.
- Formal-first — Correctness of optimization passes is backed by formal verification from the start, not bolted on after the fact.
- Hardware-friendly IR — IR representations designed to be cache-friendly and suitable for modern hardware, reducing pointer chasing and improving data locality during compilation.
- Policy-mechanism separation — Optimization strategies are decoupled from the compiler engine. Users can specify or pin a particular optimization policy to prevent performance regressions when upgrading the compiler.
Documentation Overview
- IR Language Reference — The textual IR format, type system, and instruction set.
- Initial Design — Early design discussions and architectural decisions.
- References — Curated list of reference documents and resources.
- RFCs — Design proposals for major features.
- Development Tasks — Task tracking for current milestones.
Tuffy IR Language Reference
This document is the reference manual for the Tuffy IR, the intermediate representation used by the tuffy compiler. It describes the type system, instruction set, control flow structure, and textual format of the IR.
Status: This spec describes the currently implemented IR. Features described in the IR Design RFC but not yet implemented are noted as planned.
Source of truth: The formal definitions in Lean 4 (lean/TuffyLean/IR/) are authoritative. This spec is a human-readable description that must conform to the Lean definitions. When in doubt, consult the Lean code.
Introduction
Tuffy IR is a typed, SSA-based intermediate representation organized around a hierarchical control flow graph with SESE (Single Entry, Single Exit) regions.
Key design principles:
- Infinite precision integers — no fixed bit widths; range constraints via assert nodes
- Block arguments instead of PHI nodes (Cranelift/MLIR style)
- Hierarchical CFG — loops and structured control flow are explicit SESE regions
- Poison-only UB model — no undef; only poison
- Single data structure — one IR across all compilation stages
Type System
Tuffy IR has a minimal type system supporting scalar types and aggregate types (structs, arrays). Aggregate types flow through the IR as single values; ABI-specific decomposition is handled by the backend legalize pass.
int
Infinite precision integer. No fixed bit width, no signedness. Arithmetic on int is
mathematical: add 3, 5 produces 8 with no possibility of overflow.
Range constraints are expressed via value annotations rather than the type itself. The decision of sign-extension vs zero-extension is deferred to instruction selection.
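For instance, a computation that would wrap a fixed 32-bit register is an ordinary addition here; the bit-width requirement surfaces only through annotations. An illustrative sketch in the textual IR (using the value-annotation syntax described later in this reference):

```
v0 = iconst 3000000000
v1 = add v0, v0          // 6000000000 — no wraparound, no zext/trunc
v2:s64 = add v0, v0      // annotated: result must lie in the signed 64-bit range
```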
bool
Boolean type with two values: true and false. Distinct from integers — boolean
logic is not conflated with integer arithmetic. Comparison instructions (icmp) return
bool, and control flow instructions (brif, select) require bool conditions.
To convert between bool and int, use bool_to_int (true → 1, false → 0). The
reverse conversion uses icmp.ne val, 0.
byte(N)
Raw memory data of N bytes. Distinct from integers. The byte type preserves pointer fragments and supports per-byte poison tracking. Loads and stores operate on byte types.
Note: byte is defined in the type system but not yet used by any implemented
instructions. Memory operations currently use int directly.
ptr(AS)
Pointer with address space AS. Address space 0 is the default. Pointers carry provenance (allocation ID + offset). Different address spaces may have different pointer sizes and semantics.
mem
Abstract memory state token for Memory SSA. Memory tokens flow through the function as
regular SSA values, making memory dependencies explicit. The entry block receives the
initial memory state as a block parameter of type mem; ret carries the final mem
token as an operand.
Memory-producing operations (store, call, fence, atomics) consume a mem token
and produce a new one. Plain load consumes a mem token but does not produce one.
At CFG join points, memory versions merge via regular block parameters of type mem.
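The token threading can be sketched in the textual format (an illustrative example; the spelling of the entry mem block parameter follows the description above):

```
func @store_then_load(ptr, int) -> int {
bb0(v0: mem):
  v1 = param 0
  v2 = param 1
  v3 = store.8 v2, v1, v0   // consumes mem token v0, defines a new token v3
  v4 = load.8 v1, v3        // reads through v3; produces no new token
  ret v4, v3                // the final memory state travels with ret
}
```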
float
IEEE 754 floating point type. Variants: bf16 (bfloat16), f16 (half), f32 (single),
f64 (double). Floating point operations (fadd, fsub, etc.) operate on float-typed
values. The result type of a float operation matches the operand float width.
vec(VT)
Vector type parameterized by total bit-width. Variants: vec<N> (fixed-width, e.g., 128 for SSE),
vec<vscale x N> (scalable, e.g., SVE, RVV). Element count is derived from the bit-width and
element size.
struct{T0, T1, ...}
Struct type with field types. Represents a heterogeneous aggregate with indexed fields.
Struct values flow through the IR as single SSA values. Field access uses extractvalue and
insertvalue instructions with index paths.
Padding must be explicitly represented in the struct type. Padding fields are typically
represented as [byte(N); M] (array of N-byte chunks) or byte(N) for individual padding bytes.
Example: struct{int, [byte(8); 1], bool} is a struct with an int field, 8 bytes of padding,
and a bool field. The padding is explicit and must be accounted for in field indices.
The backend legalize pass expands struct values into register-sized pieces according to the target ABI (e.g., System V AMD64 ABI rules for struct passing and return).
[T; N]
Array type with element type T and count N. Represents a homogeneous aggregate with N elements
of type T. Array values flow through the IR as single SSA values. Element access uses
extractvalue and insertvalue instructions with index paths.
Example: [int; 10] is an array of 10 integers.
The backend legalize pass expands array values into register-sized pieces according to the target ABI.
Values
A value in tuffy IR is one of five kinds, as defined in Lean (TuffyLean.IR.Value):
| Kind | Description |
|---|---|
int(v) | An infinite precision integer with mathematical value v |
bool(v) | A boolean value (true or false) |
bytes(bs) | A sequence of abstract bytes |
ptr(p) | A pointer with provenance (allocation ID + offset) |
poison | A poison value — the result of undefined behavior |
Abstract Bytes
Each byte in memory is represented as an abstract byte (TuffyLean.IR.AbstractByte):
| State | Description |
|---|---|
bits(val) | A concrete byte value (0–255) |
ptrFragment(allocId, index) | The index-th byte of a pointer to allocation allocId |
uninit | Uninitialized memory (a distinct abstract state, not “random bits”) |
poison | A poisoned byte |
Pointers and Provenance
A pointer consists of an allocation ID and an integer offset
(TuffyLean.IR.Pointer). The allocation ID tracks which allocation the pointer
belongs to, enabling provenance-based alias analysis.
Functions
A function is the top-level unit of compilation. It has a name, typed parameters, an optional return type, and a body organized as a hierarchical CFG.
func @name(param_types...) -> ret_type {
...
}
- Name: prefixed with @ in text format.
- Parameters: a list of types. Parameter values are created via param instructions. Parameters may optionally carry source-level names for readability (see below).
- Return type: optional. Functions with no return type return void.
- Body: blocks and nested regions directly inside the function braces. The implicit top-level function region is not written in the text format.
Named Parameters
When source-level parameter names are available (e.g., extracted from MIR debug info),
they are displayed in the function signature and param instructions:
func @add(%a: int, %b: int) -> int {
bb0:
v0 = param %a
v1 = param %b
v2 = add v0, v1
ret v2
}
Parameter names are prefixed with % in text format. When no name is available,
the numeric ABI index is used as a fallback:
func @add(int, int) -> int {
bb0:
v0 = param 0
v1 = param 1
Named parameters are display-only — the internal representation uses numeric ABI
indices (Op::Param(u32)) for code generation.
Regions
The CFG is organized as a tree of SESE (Single Entry, Single Exit) regions. Each region contains an ordered sequence of basic blocks and child regions.
Region Kinds
| Kind | Description |
|---|---|
function | Top-level function region. Every function has exactly one. |
loop | Loop region. The entry block is the loop header. Backedges use continue. |
plain | Generic SESE region for structured control flow. |
Nesting and Implicit Capture
Regions nest to form a tree. Values defined in an outer region are visible in inner regions via implicit capture (VPlan style). There is no explicit capture list — the live-in set is computed on demand.
func @example(int) -> int {
bb0:
v0 = iconst 10 // v0 defined in function body
region loop {
bb1(v1: int):
v2 = add v1, v0 // v0 captured implicitly from outer scope
...
}
}
Loop Regions
A loop region’s entry block is the loop header. The continue terminator transfers
control back to the loop header, passing new values for the header’s block arguments.
Exiting a loop is done by branching to a block outside the loop region.
Basic Blocks
A basic block is a straight-line sequence of instructions with no internal control flow.
Blocks are named bb0, bb1, etc. in text format.
Block Arguments
Tuffy IR uses block arguments instead of PHI nodes to represent values that merge at control flow join points. Block arguments are declared in the block header and receive values from predecessor branches:
bb0:
v0 = iconst 1
br bb1(v0)
bb1(v1: int): // v1 is a block argument
ret v1
Each branch instruction (br, brif, continue) passes values to the target block’s
arguments. The number and types of passed values must match the target block’s argument
list.
Instruction Ordering
Instructions within a block execute sequentially. The last instruction in a block must
be a terminator (ret, br, brif, continue, region_yield).
Instructions
Every instruction produces a value (named vN in text format) and has a result type.
Terminators are not separated from regular instructions — they are simply the last
instruction in a block by convention.
Constants
iconst
vN = iconst <immediate>
Produces an integer constant. The immediate is an arbitrary-precision integer,
matching the Lean spec’s Int type.
Semantics: Value.int(immediate)
param
vN = param <index>
vN = param %name
References a function parameter by index (0-based) or by name. When source-level
parameter names are available, the named form param %name is used in the text
format. The internal representation always uses the numeric ABI index. This is not
a true constant but creates a named value for the parameter within the instruction
stream.
Arithmetic
All integer arithmetic operates on mathematical integers with infinite precision.
There is no overflow. The formal semantics are defined in TuffyLean.IR.Semantics.
add
vN = add vA, vB
Integer addition. Semantics: evalAdd(a, b) = a + b
sub
vN = sub vA, vB
Integer subtraction. Semantics: evalSub(a, b) = a - b
mul
vN = mul vA, vB
Integer multiplication. Semantics: evalMul(a, b) = a * b
div
vN = div vA, vB
Integer division. Produces poison if vB is zero. Signedness is a property of
operand annotations, not the operation itself — in the infinite precision integer
model, signed and unsigned division are mathematically identical.
Semantics: evalDiv(a, b) = if b = 0 then poison else a / b
rem
vN = rem vA, vB
Integer remainder. Produces poison if vB is zero. Signedness is a property of
operand annotations, not the operation itself.
Semantics: evalRem(a, b) = if b = 0 then poison else a % b
and
vN = and vA, vB
Bitwise AND on infinite precision two’s complement integers.
Semantics: evalAnd(a, b) = Int.land a b
or
vN = or vA, vB
Bitwise OR on infinite precision two’s complement integers.
Semantics: evalOr(a, b) = Int.lor a b
xor
vN = xor vA, vB
Bitwise XOR on infinite precision two’s complement integers.
Semantics: evalXor(a, b) = Int.xor a b
shl
vN = shl vA, vB
Left shift. Produces poison if the shift amount vB is negative.
Semantics: evalShl(a, b) = if b < 0 then poison else a <<< b
shr
vN = shr vA, vB
Right shift. Produces poison if the shift amount vB is negative.
Signedness is a property of operand annotations, not the operation.
In infinite precision, logical and arithmetic right shift are identical.
Semantics: evalShr(a, b) = if b < 0 then poison else a >>> b
count_ones
vN = count_ones vA
Population count: returns the number of set bits in the binary representation
of vA. Produces poison if vA is negative.
Semantics: evalCountOnes(a) = if a < 0 then poison else popcount(a)
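Python's arbitrary-precision integers use the same infinite two's-complement model, so the shift and popcount semantics above can be sketched directly. This is an illustrative model of the documented semantics, not the Lean definition; POISON is a stand-in sentinel:

```python
# Executable sketch of shl/shr/count_ones on infinite-precision
# two's-complement integers. Python ints already use this model,
# so <<, >>, &, |, ^ behave like the documented bitwise semantics.
POISON = object()  # illustrative sentinel for the poison value

def eval_shl(a, b):
    # left shift; negative shift amounts are poison
    return POISON if b < 0 else a << b

def eval_shr(a, b):
    # logical and arithmetic right shift coincide at infinite precision
    return POISON if b < 0 else a >> b

def eval_count_ones(a):
    # population count; undefined (poison) for negative values
    return POISON if a < 0 else bin(a).count("1")

assert eval_shr(-8, 1) == -4        # arithmetic shift on a negative value
assert eval_shl(3, 2) == 12
assert eval_count_ones(0b1011) == 3
assert eval_count_ones(-1) is POISON
```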
merge
vN = merge.<width> vA, vB
Replace the low width bits of vA with the low width bits of vB, producing a
single integer. Produces poison if width is 0.
Semantics: evalMerge(a, b, width) = if width = 0 then poison else (a with low width bits cleared) | (b AND (2^width - 1))
split
vHi, vLo = split.<width> vA
Decompose vA at bit position width. Produces two results:
- vLo = the low width bits of vA (zero-extended)
- vHi = vA right-shifted by width bits
Produces poison if width is 0. This is a multi-result instruction.
Semantics: evalSplitHi(a, width) = a >>> width, evalSplitLo(a, width) = a mod 2^width
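The merge/split pair can be modeled executably on Python's infinite-precision integers (a hedged sketch of the documented formulas, not the Lean definition):

```python
# Executable sketch of merge/split; split then merge round-trips a value.
POISON = object()  # illustrative sentinel for the poison value

def eval_merge(a, b, width):
    # replace the low `width` bits of a with the low `width` bits of b
    if width == 0:
        return POISON
    mask = (1 << width) - 1
    return (a & ~mask) | (b & mask)

def eval_split(a, width):
    # (hi, lo): hi = a >> width, lo = a mod 2^width (zero-extended)
    if width == 0:
        return POISON, POISON
    return a >> width, a % (1 << width)

hi, lo = eval_split(0x1234, 8)
assert (hi, lo) == (0x12, 0x34)
assert eval_merge(0x1200, 0x34, 8) == 0x1234
# negative values work too: -1 is ...1111 in two's complement
assert eval_split(-1, 4) == (-1, 0b1111)
```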
clmul
vN = clmul vA, vB
Carry-less multiplication (polynomial multiplication over GF(2)). Multiplies vA
and vB using XOR instead of addition for accumulating partial products. Produces
poison if either operand is negative.
Semantics: evalClmul(a, b) = if a < 0 ∨ b < 0 then poison else clmulNat(a, b)
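A minimal executable sketch of carry-less multiplication (modeling clmulNat on non-negative integers; illustrative, not the Lean definition):

```python
# Carry-less (GF(2) polynomial) multiply: partial products are
# accumulated with XOR instead of addition.
POISON = object()  # illustrative sentinel for the poison value

def eval_clmul(a, b):
    if a < 0 or b < 0:
        return POISON
    acc, shift = 0, 0
    while b:
        if b & 1:
            acc ^= a << shift   # XOR-accumulate the shifted partial product
        b >>= 1
        shift += 1
    return acc

assert eval_clmul(0b11, 0b11) == 0b101  # (x+1)*(x+1) = x^2+1 over GF(2)
assert eval_clmul(2, 3) == 6
assert eval_clmul(-1, 2) is POISON
```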
Floating Point Arithmetic
Floating point operations operate on values of float type (bf16, f16, f32, f64).
The formal semantics use Lean 4’s native Float type (IEEE 754 binary64) as a placeholder
model. The formal float model will be refined when full NaN payload and denormal semantics
are decided.
Rewrite flags
Binary floating-point arithmetic instructions (fadd, fsub, fmul, fdiv) may carry
optional rewrite flags that grant the optimizer permission to apply algebraic
transformations. These flags do not change the instruction’s operational semantics — they
only widen the set of legal rewrites.
| Flag | Meaning |
|---|---|
reassoc | Allow associativity / commutativity reordering |
contract | Allow contraction (e.g., a*b+c → fma(a,b,c)) |
Flags appear between the opcode and the operands:
vN = fadd reassoc contract vA, vB
When no flags are present the instruction follows strict IEEE 754 semantics.
Mirrors TuffyLean.IR.FpRewriteFlags.
fadd
vN = fadd [flags] vA, vB
Floating point addition. Semantics: evalFAdd(a, b) = a + b
fsub
vN = fsub [flags] vA, vB
Floating point subtraction. Semantics: evalFSub(a, b) = a - b
fmul
vN = fmul [flags] vA, vB
Floating point multiplication. Semantics: evalFMul(a, b) = a * b
fdiv
vN = fdiv [flags] vA, vB
Floating point division. Semantics: evalFDiv(a, b) = a / b
fneg
vN = fneg vA
Floating point negation. Semantics: evalFNeg(a) = -a
fabs
vN = fabs vA
Floating point absolute value. Semantics: evalFAbs(a) = Float.abs a
copysign
vN = copysign vMag, vSign
Produce a value with the magnitude of vMag and the sign bit of vSign.
Semantics: evalCopySign(mag, sign) = if sign < 0 then -(abs mag) else abs mag
Comparison
icmp
vN = icmp.<pred> vA, vB
Integer comparison. Returns a bool value: true if the comparison holds, false
otherwise.
Predicates:
| Predicate | Description |
|---|---|
eq | Equal |
ne | Not equal |
lt | Less than (signedness from annotation) |
le | Less than or equal (signedness from annotation) |
gt | Greater than (signedness from annotation) |
ge | Greater than or equal (signedness from annotation) |
Semantics: evalICmp(op, a, b) = bool(op(a, b))
select
vN = select vCond, vTrue, vFalse
Conditional select. If vCond is true, produces vTrue; otherwise produces vFalse.
The condition must be of type bool. The result type matches the type of vTrue/vFalse.
Semantics: evalSelect(cond, tv, fv) = if cond then tv else fv
bool_to_int
vN = bool_to_int vA
Convert a bool to an int: true → 1, false → 0.
Semantics: evalBoolToInt(b) = if b then 1 else 0
Value Annotations
Range constraints and bit-level facts are encoded as annotations on value definitions (result-side) and use edges (use-side), rather than as standalone instructions. See RFC: at-use-annotations for the full design.
Annotation types
| Annotation | Meaning |
|---|---|
:s<N> | Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1] |
:u<N> | Value is in unsigned N-bit range [0, 2^N-1] |
:known(<ternary>) | Per-bit four-state constraint |
:nofpclass(<classes>) | Float value must not belong to the listed FP classes |
| (none) | No constraint; unconstrained int |
Annotations are composable: :s32:known(...) applies both constraints simultaneously.
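The range annotations can be read as checks applied after the mathematical operation: if the true result falls outside the annotated range, the value becomes poison. An executable sketch of this reading (illustrative helper names, not the implementation):

```python
# Model of :sN / :uN result-side annotations: range check after the
# infinite-precision operation, poison on violation.
POISON = object()  # illustrative sentinel for the poison value

def check_s(n, v):
    # signed N-bit range [-2^(N-1), 2^(N-1)-1]
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return v if lo <= v <= hi else POISON

def check_u(n, v):
    # unsigned N-bit range [0, 2^N - 1]
    return v if 0 <= v < (1 << n) else POISON

# v0:s32 = add v1, v2  -- mathematical add, then the constraint check
assert check_s(32, 2**31 - 1) == 2**31 - 1
assert check_s(32, 2**31) is POISON   # true result left the annotated range
assert check_u(8, 255) == 255
assert check_u(8, -1) is POISON
```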
Known bits encoding
Each bit in a known annotation is one of four states:
| Symbol | Meaning |
|---|---|
0 | Bit is known to be 0 |
1 | Bit is known to be 1 |
? | Unknown — bit is demanded but value not determined |
x | Don’t-care — bit is not demanded |
nofpclass
Constrains which IEEE 754 floating-point value classes a float value may belong to.
If the value falls into an excluded class, the result is poison. This is separate
from integer annotations (:s<N>, :u<N>) and mirrors LLVM’s nofpclass attribute.
The 10 individual FP classes (mirroring LLVM’s FPClassTest):
| Class | Description |
|---|---|
snan | Signaling NaN |
qnan | Quiet NaN |
ninf | Negative infinity |
nnorm | Negative normal |
nsub | Negative subnormal |
nzero | Negative zero |
pzero | Positive zero |
psub | Positive subnormal |
pnorm | Positive normal |
pinf | Positive infinity |
Convenience groups:
| Group | Expands to |
|---|---|
nan | snan qnan |
inf | ninf pinf |
zero | nzero pzero |
Syntax:
v0:nofpclass(nan inf) = fadd v1, v2 // result must not be NaN or ±Inf
v1 = fsub v0:nofpclass(nzero), v2 // use-side: v0 must not be -0.0
Mirrors TuffyLean.IR.FpClassMask.
Result-side annotations
Declared on the value produced by an instruction. Violation causes the instruction
to produce poison.
v0:s32 = add v1, v2 // if true result outside [-2^31, 2^31-1], v0 is poison
Use-side annotations
Declared on an operand of a consuming instruction. Violation causes the consuming
instruction to produce poison. May be stronger than the result-side annotation.
v1 = foo v0:u8 // if v0 is outside [0, 255], foo produces poison
Function signature annotations
Function signatures carry annotations as ABI-level contracts:
func @add_i32(int:s32, int:s32) -> int:s32 { ... }
Parameter annotations are caller guarantees; return annotations are callee guarantees.
Memory Operations
All memory operations participate in Memory SSA. Operations that modify memory state
consume a mem token and produce a new one. Plain loads consume a mem token but do
not produce one. See types.md for details on the mem type.
load
vN = load.<size> vPtr, vMem
Load size bytes from the address pointed to by vPtr. Takes a mem token as input
(MemoryUse). Returns a data value only — does not produce a new mem token.
Semantics: evalLoad(mem, addr, size) = bytes [mem[addr], mem[addr+1], ..., mem[addr+size-1]]
store
vN = store.<size> vVal, vPtr, vMem
Store a value to the address pointed to by vPtr. Takes a mem token as input
(MemoryDef) and produces a new mem token as its result.
Semantics: evalStore(mem, addr, bs) = mem[addr..addr+len(bs)] := bs
stack_slot
vN = stack_slot <bytes>
Allocate bytes bytes on the stack. Returns a ptr(0) to the allocated memory.
memcpy
vN = memcpy vDst, vSrc, vCount, align=<N>, vMem
Copy vCount bytes from vSrc to vDst. The source and destination regions must
not overlap (non-overlapping memcpy semantics). align is the minimum alignment hint
for both pointers (must be a power of two). Takes a mem token as input (MemoryDef)
and produces a new mem token.
Semantics: evalMemCopy(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count))
memmove
vN = memmove vDst, vSrc, vCount, align=<N>, vMem
Copy vCount bytes from vSrc to vDst. Unlike memcpy, overlapping regions are
handled correctly: all source bytes are read before any destination bytes are written.
align is the minimum alignment hint for both pointers. Takes a mem token as input
(MemoryDef) and produces a new mem token.
Semantics: evalMemMove(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count))
(bytes are captured from mem before the store, making overlap safe)
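The memcpy/memmove distinction can be made concrete with a small byte-array model (an illustrative sketch, not the Lean semantics): memmove snapshots the source bytes before writing, while a naive byte-at-a-time copy corrupts overlapping regions.

```python
# Overlap-safe copy: read all source bytes, then write.
def eval_memmove(mem, dst, src, count):
    bs = bytes(mem[src:src + count])   # capture source bytes first
    mem[dst:dst + count] = bs
    return mem

# What a byte-at-a-time copy does when regions overlap.
def naive_copy(mem, dst, src, count):
    for i in range(count):
        mem[dst + i] = mem[src + i]

mem = bytearray([1, 2, 3, 4, 5])
eval_memmove(mem, 1, 0, 3)                 # overlapping copy [0..3) -> [1..4)
assert mem == bytearray([1, 1, 2, 3, 5])   # source preserved correctly

mem2 = bytearray([1, 2, 3, 4, 5])
naive_copy(mem2, 1, 0, 3)
assert mem2 == bytearray([1, 1, 1, 1, 5])  # overlap corrupted the copy
```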
memset
vN = memset vDst, vVal, vCount, align=<N>, vMem
Fill vCount bytes at vDst with the byte value vVal (low 8 bits). align is
the minimum alignment hint for the destination pointer. Takes a mem token as input
(MemoryDef) and produces a new mem token.
Semantics: evalMemSet(mem, dst, val, count) = evalStore(mem, dst, replicate(count, val & 0xFF))
Atomic Operations
Atomic operations provide thread-safe memory access with explicit memory ordering constraints. The formal semantics in Lean define sequential-only behavior; a formal concurrency model (e.g., based on C11) is TBD.
Enumerations
MemoryOrdering — memory ordering constraints for atomic operations:
| Ordering | Description |
|---|---|
relaxed | No ordering constraints |
acquire | Subsequent reads see writes from the releasing thread |
release | Prior writes are visible to the acquiring thread |
acqrel | Combined acquire + release |
seqcst | Sequentially consistent total order |
AtomicRmwOp — read-modify-write operation kinds:
| Op | Description |
|---|---|
xchg | Exchange (swap) |
add | Integer addition |
sub | Integer subtraction |
and | Bitwise AND |
or | Bitwise OR |
xor | Bitwise XOR |
load.atomic
vM, vN = load.atomic.<ordering> vPtr, vMem
Atomic load from pointer vPtr with the specified memory ordering. Takes a mem
token as input (MemoryDef) and produces two results: a new mem token and the
loaded data value.
store.atomic
vN = store.atomic.<ordering> vVal, vPtr, vMem
Atomic store of vVal to pointer vPtr with the specified memory ordering.
Takes a mem token as input (MemoryDef) and produces a new mem token.
rmw
vM, vN = rmw.<op>.<ordering> vPtr, vVal, vMem
Atomic read-modify-write. Atomically reads the value at vPtr, applies <op>
with vVal, writes the result back. Takes a mem token as input (MemoryDef) and
produces two results: a new mem token and the original value.
cmpxchg
vM, vN = cmpxchg.<success_ord>.<failure_ord> vPtr, vExpected, vDesired, vMem
Atomic compare-and-exchange. Atomically compares the value at vPtr with
vExpected; if equal, writes vDesired. Takes a mem token as input (MemoryDef)
and produces two results: a new mem token and the old value. The caller uses
icmp to determine whether the exchange succeeded.
<success_ord> applies on successful exchange; <failure_ord> applies on failure.
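A sketch of the success check in the textual format (value names illustrative):

```
v1, v2 = cmpxchg.seqcst.seqcst v0, v3, v4, v5   // v1: new mem token, v2: old value
v6 = icmp.eq v2, v3                             // true iff the exchange succeeded
```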
fence
vN = fence.<ordering> vMem
Memory fence. Establishes ordering constraints without accessing memory.
Takes a mem token as input (MemoryDef) and produces a new mem token.
Type Conversion
bitcast
vN = bitcast vA
Reinterpret the bits of vA as a different type. The result type is determined by
the instruction’s type annotation.
sext
vN = sext vA, <bits>
Sign-extend vA to bits bits. Used during lowering to make bit widths explicit
for instruction selection.
zext
vN = zext vA, <bits>
Zero-extend vA to bits bits. Used during lowering to make bit widths explicit
for instruction selection.
Pointer Operations
Pointer operations manipulate pointers with explicit provenance tracking. The formal
semantics are defined in TuffyLean.IR.Semantics.
ptradd
vN = ptradd vPtr, vOffset
Pointer addition. Offsets the pointer vPtr by vOffset bytes. The result preserves
the provenance of the base pointer.
Semantics: evalPtrAdd(base, offset) = ptr { allocId = base.allocId, offset = base.offset + offset }
ptrdiff
vN = ptrdiff vA, vB
Pointer difference. Computes the byte offset between two pointers. Both pointers must
belong to the same allocation; otherwise the result is poison.
Semantics: evalPtrDiff(a, b) = if a.allocId = b.allocId then a.offset - b.offset else poison
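An executable sketch of the provenance model behind ptradd/ptrdiff (an illustrative Python rendering of the documented (allocation ID, offset) pairs, not the Lean definition):

```python
# A pointer is an (allocation id, byte offset) pair; arithmetic
# preserves provenance, and cross-allocation subtraction is poison.
POISON = object()  # illustrative sentinel for the poison value

class Ptr:
    def __init__(self, alloc_id, offset):
        self.alloc_id = alloc_id
        self.offset = offset

def eval_ptradd(base, off):
    # offset the pointer; provenance (alloc_id) is preserved
    return Ptr(base.alloc_id, base.offset + off)

def eval_ptrdiff(a, b):
    if a.alloc_id != b.alloc_id:
        return POISON              # different allocations: poison
    return a.offset - b.offset

p = Ptr(7, 16)
q = eval_ptradd(p, 8)
assert eval_ptrdiff(q, p) == 8
assert eval_ptrdiff(q, Ptr(9, 0)) is POISON
```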
ptrtoint
vN = ptrtoint vPtr
Convert a pointer to an integer. The provenance is captured — the resulting integer
retains knowledge that it came from a pointer.
Semantics: evalPtrToInt(p) = p.offset
ptrtoaddr
vN = ptrtoaddr vPtr
Extract the address from a pointer, discarding provenance. Returns a plain integer
with no provenance information.
Semantics: evalPtrToAddr(p) = p.offset
inttoptr
vN = inttoptr vAddr
Create a pointer from an integer address. The resulting pointer has no valid
provenance (wildcard provenance).
Semantics: evalIntToPtr(addr, allocId) = ptr { allocId, offset = addr }
Function Calls
symbol_addr
vN = symbol_addr @name
Load the address of a named symbol (function or static data). The @name refers
to an entry in the module’s symbol table. The result type is ptr.
This instruction makes function calls and static data references self-contained
in the IR — the callee of a call is a value produced by symbol_addr rather
than a side-channel mapping.
Semantics: Produces a pointer to the symbol identified by @name.
call
vM = call vCallee(vArg0, vArg1, ...), vMem
vM, vN = call vCallee(vArg0, vArg1, ...), vMem -> <ret_type>
Call the function pointed to by vCallee with the given arguments. Takes a mem
token as input (MemoryDef). For void calls, produces a single mem token result.
For non-void calls, produces two results: a new mem token and the return value.
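A sketch tying symbol_addr and call together (symbol name and operand numbering illustrative):

```
v0 = symbol_addr @callee              // callee address from the symbol table
v1, v2 = call v0(v3, v4), v5 -> int   // v1: new mem token, v2: return value
```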
Terminators
Terminators are instructions that end a basic block by transferring control flow. By convention, they are the last instruction in a block.
ret
ret vA, vMem
ret vMem
Return from the function. Takes a mem token as the final memory state operand.
Optionally returns a data value. A function with a return type must return a value;
a void function uses ret vMem with only the mem token.
br
br bbN(vA, vB, ...)
br bbN
Unconditional branch to block bbN, passing values as block arguments.
If the target block has no arguments, the argument list is omitted.
brif
brif vCond, bbThen(args...), bbElse(args...)
brif vCond, bbThen, bbElse
Conditional branch. The condition vCond must be of type bool. If true, branches
to bbThen; otherwise branches to bbElse. Each target may receive block arguments.
continue
continue vA, vB, ...
continue
Loop backedge. Transfers control back to the entry block of the enclosing loop
region, passing values as the header block’s arguments. Only valid inside a loop
region.
region_yield
region_yield vA, vB, ...
region_yield
Exit the enclosing region, yielding values to the parent region. Used for structured control flow where a region produces result values.
unreachable
unreachable
Indicates that control flow should never reach this point. If executed, the behavior is undefined (the optimizer may assume this path is dead).
trap
trap
Unconditionally abort execution. Used for runtime checks such as failed assertions
(e.g., division-by-zero guards). Unlike unreachable, reaching a trap is not UB —
it is a well-defined program abort.
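A typical division-by-zero guard, sketched in the textual format (memory tokens omitted, as in the factorial example later in this reference):

```
bb0:
  v0 = param 0
  v1 = param 1
  v2 = iconst 0
  v3 = icmp.eq v1, v2
  brif v3, bb1, bb2
bb1:
  trap                  // divisor is zero: well-defined abort, not UB
bb2:
  v4 = div v0, v1       // v1 is known nonzero on this path
  ret v4
```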
Text Format
Tuffy IR uses a Cranelift-inspired text format for human-readable output.
Naming Conventions
| Entity | Format | Example |
|---|---|---|
| Functions | @name | @add, @factorial |
| Values | vN | v0, v1, v2 |
| Blocks | bbN | bb0, bb1 |
| Regions | region <kind> | region loop, region plain |
Values are numbered sequentially (v0, v1, v2, …) in region tree order. Within each block, block arguments are numbered before instruction results.
Grammar
function ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list ::= (type annotation (',' type annotation)*)?
body ::= (block | region)*
region ::= 'region' region_kind '{' region_body '}'
region_kind ::= 'loop' | 'plain'
region_body ::= (block | region)*
block ::= block_header instruction*
block_header ::= 'bb' N block_args? ':'
block_args ::= '(' block_arg (',' block_arg)* ')'
block_arg ::= value ':' type
instruction ::= (value annotation '=')? opcode operands
operand ::= value annotation
annotation ::= (':' range_ann)*
range_ann ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary ::= [01?x_]+
value ::= 'v' N
type ::= 'int' | 'byte' | 'ptr'
Complete Example
A factorial function demonstrating nested regions, block arguments, and control flow:
func @factorial(int:s32) -> int:s32 {
bb0:
v0:s32 = param 0
v1 = iconst 1
v2 = iconst 1
br bb1(v2, v1)
region loop {
bb1(v4: int, v5: int):
v6 = icmp.le v5:s32, v0:s32
brif v6, bb2, bb3
bb2:
v8 = mul v4, v5
v9 = add v5, v1
continue v8, v9
}
bb3:
ret v4:s32
}
Planned Features
The following features are described in the IR Design RFC but are not yet implemented in the Rust codebase.
Floating Point Semantics
The IR adopts IEEE 754-2008 as the floating-point semantics standard. Basic
operations (fadd, fsub, fmul, fdiv, fneg, fabs, copysign) are
implemented. Per-instruction rewrite flags (reassoc, contract) and float
value class constraints (nofpclass) are defined.
IEEE 754 Lean library
The current Lean model uses Lean 4’s native Float type (IEEE 754 binary64) as a
placeholder. Float is opaque — it maps directly to hardware double and cannot be
unfolded in proofs. Properties that follow from the IEEE 754 standard (e.g.,
“fadd produces -0.0 only when both operands are -0.0”) must be axiomatized
(see IR/FloatAxioms.lean). This is unsatisfactory for a formal-first project.
We need a Lean 4 library that provides a transparent, proof-friendly model of IEEE 754 floating-point arithmetic covering all widths (f16, bf16, f32, f64), rounding modes, NaN payloads, subnormals, and signed zeros. Existing efforts:
- HOLFloat-Lean — formalizes FP semantics in Lean 4, building on prior HOL formalizations.
- Flean — individual effort for IEEE 754 in Lean; requires rewrite.
- Flocq port — the Coq Flocq library is the most mature FP formalization; porting it to Lean 4 is a community proposal.
Until a suitable library is available, tuffy will continue axiomatizing IEEE 754 properties as needed. The axiom set should be kept minimal and well-documented so that it can be discharged once a real library is adopted.
Scalable Vector Types
Vector types parameterized by total bit-width, not element count. Element count is
derived: count = total_bits / element_bits. Scalable vectors use
vscale × base_width, where vscale is a runtime constant determined by hardware
(cf. SVE, RVV). Well-formedness: total_bits % element_bits == 0 and lane count
must be a power of two. See the scalable vector RFC (planned).
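The well-formedness rules above can be sketched executably (an assumed helper for illustration, not part of the implemented IR):

```python
# Derive the lane count of a vector type from total and element bit-widths,
# rejecting ill-formed combinations per the stated rules.
def vec_lane_count(total_bits, element_bits):
    if total_bits % element_bits != 0:
        return None                      # width must divide evenly
    count = total_bits // element_bits
    if count & (count - 1) != 0:
        return None                      # lane count must be a power of two
    return count

assert vec_lane_count(128, 32) == 4      # e.g. 4 lanes of f32 in 128 bits
assert vec_lane_count(128, 24) is None   # 24 does not divide 128
assert vec_lane_count(96, 32) is None    # 3 lanes: not a power of two
```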
Memory SSA
Memory dependencies encoded directly into the IR as memory version tokens on load/store operations, enabling progressive refinement as alias analysis improves.
Atomic Operations
Atomic operations (load.atomic, store.atomic, rmw, cmpxchg, fence) are
implemented with MemoryOrdering and AtomicRmwOp enumerations. See
Atomic Operations. The formal concurrency model is TBD — current
Lean semantics define sequential behavior only.
Control Flow Structuralization
Translation from rustc MIR to tuffy IR with automatic structuralization of loops and branches into SESE regions. Irreducible control flow remains as flat CFG within a region.
Initial Design Discussion
- Created: 2026-02-07
- Status: Frozen — This document is a historical record. Do not modify.
Background
Tuffy is an experimental optimizing compiler written in Rust with LLM assistance. This document records the initial design discussion and key decisions made before implementation begins.
Long-term Vision
Tuffy aims to be a production-grade, language-agnostic optimizing compiler backend:
- Multi-language backend — Serve as a codegen backend for multiple programming languages, not just Rust
- Custom IR — Design and implement a custom intermediate representation with optimization passes
- Multi-platform — Support multiple target architectures from the ground up
- Balanced optimization — Pursue both fast compilation and competitive runtime performance
- Formal verification — Provide formal correctness guarantees for optimization passes
Short-term Milestone
The first milestone is rustc_codegen_tuffy, a codegen backend for rustc as an alternative to LLVM.
Key Design Decisions
- Target platform: Multi-platform architecture from the start, with x86-64 as the initial implementation target
- Optimization direction: Balance between compilation speed and runtime performance (not exclusively pursuing either)
- IR design: Custom IR — build our own intermediate representation, perform optimizations on it, then lower to machine code (rather than translating directly from rustc MIR)
- IR interpreter: Implement a tuffy IR interpreter for testing and validation, similar to Miri
- Backend strategy: Implement
rustc_codegen_ssatraits to reuse existing MIR lowering code (~10,000 lines). Builder generates tuffy IR instead of LLVM IR. Same approach asrustc_codegen_cranelift. May switch to direct MIR traversal later if more control is needed.
IR Design
Structure:
- Single data structure for all stages — no separate SDAG/MIR like LLVM. Different stages have different normalization (legalization) constraints on the same IR, progressively lowering toward machine code.
UB model (following Alive2):
- Only
poison— noundef. Simplifies semantics and avoids the well-known undef/poison confusion in LLVM. - Uninitialized memory is allowed at the memory level, but values are either concrete or poison.
Byte type (b<N>):
- Introduced from the start, based on the LLVM byte type RFC. Represents raw memory data distinct from integers, with per-byte poison tracking. Enables sound memcpy lowering and load merging.
Integer type:
- A single
inttype with infinite precision — no fixed bitwidth (no i8/i32/i64). - No overflow: arithmetic operations are mathematical (e.g., add is true addition).
- Range constraints via assert nodes: e.g.,
assertsext(val, 32)constrains a value to the 32-bit signed range, producing poison if violated. zext/sext/truncinstructions are eliminated from the IR.- Bitwise operations (and/or/xor/shift) are also defined on infinite precision integers.
At-use bit-level analysis (encoded in IR):
- Each use edge carries bit-level annotations: which bits are demanded, and which bits are known to be 0 or 1.
- UB constraints (from assert nodes, memory operations, ABI boundaries) propagate backward through the IR, refining known bits on upstream values.
- This is analogous to LLVM’s
KnownBitsandDemandedBitsanalyses, but promoted to first-class IR constructs rather than side analyses. - Follows the “analysis is also a transformation” principle: analysis results are encoded directly into the IR, not stored in external tables.
- Instruction selection in the final stage derives concrete machine types from these at-use annotations.
Representation form:
- Hierarchical CFG with nested SESE regions. Loops and branches are explicitly represented as regions, not inferred from CFG back-edges.
- Region internals use traditional SSA basic blocks with fixed instruction ordering.
- Translation from rustc MIR requires control flow structuralization to recover nested regions from the flat CFG.
- Enables early loop optimizations (vectorization, LICM, unrolling) by preserving loop structure in the IR.
- Inspired by LLVM VPlan’s hierarchical CFG and recipe-based transformation model.
Memory model:
- Abstract pointer model: pointer = allocation ID + offset, preserving full provenance.
- Two ptr-to-int instructions with distinct semantics:
ptrtoint: converts pointer to integer, capturing provenance.inttoptr(ptrtoint(p))is a safe roundtrip.ptrtoaddr: extracts only the address portion, discarding provenance.
- Four-state byte representation:
Bits(u8) | PtrFragment(Pointer, n) | Uninit | Poison. Each byte in memory is one of these states. - Uninitialized memory (
Uninit) is a distinct abstract state, not “random bits”. Any operation on uninit values is UB (produces poison). - Poison can exist at the byte level in memory, aligned with the per-byte poison tracking of the byte type.
Testing Strategy
Validation and benchmarking will use the following targets (in progressive order):
- Internal unit tests — small test cases for basic codegen correctness (arithmetic, control flow, function calls)
no_stdprograms — minimal programs to validate the backend without standard library dependencies- serde — widely used crate with heavy generics and macros, tests monomorphization and trait dispatch
- rustc test suite — rustc’s ui tests for broad correctness coverage
- ripgrep — real-world CLI project for end-to-end correctness and performance comparison against LLVM backend
- clippy — later-stage validation target due to tight coupling with rustc internals
Notes
- Self-hosting is not a separate goal since tuffy is written in Rust — once the backend can compile Rust, self-hosting follows naturally
- JIT support and outperforming LLVM are not current goals but may be revisited later
Target Abstraction
- Triple-based target identification, following LLVM’s convention.
- TargetLowering design follows LLVM’s approach (legal types, operation lowering, register classes).
- Sub-method override via command-line: individual TargetLowering properties (pointer size, legal integer widths, max vector width, etc.) can be overridden via
--target-override key=valueflags. This enables testing IR-level passes against specific target properties without implementing a full architecture backend. - Cost model: deferred, not yet designed.
IR Memory Layout
- Encapsulation: memory layout is an implementation detail, hidden behind a stable API. The rest of the compiler operates on opaque handles, not raw pointers. Layout can be changed without affecting pass code.
- Arena + index: IR nodes are allocated in arenas. References are
u32indices (opaque handles), not pointers. Reduces reference size and enables serialization. - Contiguous instruction storage: instructions within the same basic block are stored contiguously in the arena. Iterating a BB’s instructions is a linear memory scan, not pointer chasing.
- Tombstone + periodic compaction: deleted instructions are marked as tombstones. Periodic compaction restores contiguity.
- No SOA: struct-of-arrays layout rejected; standard AOS with contiguous storage is sufficient.
Crate Structure
Tuffy is a language-agnostic compiler backend. rustc_codegen_tuffy is one frontend adapter; future language frontends reuse the entire optimization and backend stack.
tuffy/
├── tuffy_ir/ — IR definition: types, instructions, CFG, memory model
├── tuffy_ir_interp/ — IR interpreter (miri-like, UB-aware)
├── tuffy_opt/ — optimization passes (generated from declarative rewrite rules)
├── tuffy_target/ — target abstraction layer (trait + CLI override)
├── tuffy_target_x86/ — x86 backend (i386 + x86-64)
├── tuffy_codegen/ — instruction selection, register allocation, machine code emission
├── tuffy_trace/ — rewrite path tracing
├── tuffy_tv/ — translation validation (Alive2-style SMT verifier)
├── rustc_codegen_tuffy/ — rustc frontend adapter (MIR → tuffy IR)
Dependencies: tuffy_ir is the bottom layer with no internal dependencies. All other crates depend on it directly or transitively.
Object file emission: use the object crate (same as Cranelift and rustc) for writing ELF, Mach-O, COFF object files. No custom object file handling.
References
A curated list of reference documents and resources for the tuffy project.
Compiler Optimization
- Future Directions for Optimizing Compilers — Nuno P. Lopes, John Regehr, 2018. Discusses challenges in making optimizing compilers faster, less buggy, and more capable.
- Cranelift: Using E-Graphs for Verified, Cooperating Middle-End Optimizations — Chris Fallin, Bytecode Alliance. Proposes acyclic e-graphs (aegraphs) as a unified middle-end optimization framework for Cranelift, replacing separate GVN/LICM/constant-folding passes. Key ideas: pure nodes float above a side-effect skeleton (the original CFG), scoped elaboration naturally subsumes GVN and LICM, and ISLE-based declarative rewrite rules enable verifiable optimizations. Directly relevant to tuffy’s declarative rewrite rules and “analysis is also a transformation” design.
- optir — Jamey Sharp. Research prototype combining e-graph equality saturation (via
egg) with RVSDG (Regionalized Value State Dependence Graph). Control flow is encoded structurally viaswitch/loop/funcoperators rather than explicit branches, enabling optimizations across complex control flow including irreducible CFGs. Relevant to tuffy’s hierarchical CFG regions and declarative rewrite approach. - egg: Fast and Extensible Equality Saturation — Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, Pavel Panchekha, POPL 2021. Introduces the
eggRust library for e-graphs and equality saturation with novel techniques: rebuilding for amortized invariant maintenance, and e-class analyses for domain-specific data attached to equivalence classes. Relevant to tuffy’s declarative rewrite rules — equality saturation is a candidate execution model for applying verified rewrite rules without phase-ordering concerns.
Memory Analysis
- MemorySSA — LLVM. Virtual SSA form for memory operations using a single memory partition. Three node types: MemoryDef (stores, calls, fences, atomics), MemoryUse (loads), MemoryPhi (CFG joins). Each MemoryDef creates a new version of “all memory”; precision is recovered via a clobber walker backed by alias analysis rather than upfront partitioning. Key design insight: precise memory partitioning is impractical and not worth the cost — a single chain with on-demand disambiguation is more efficient. Directly informs tuffy’s MemSSA design: single memory token chain encoded in IR, with progressive refinement as alias analysis improves.
IR Design and Semantics
-
RVSDG: An Intermediate Representation for Optimizing Compilers — Nico Reissmann, Jan Christian Meyer, Helge Bahmann, Magnus Själander, 2019. Formalizes the Regionalized Value State Dependence Graph, a data-flow-centric IR where nodes represent computations, edges represent dependencies, and regions capture hierarchical program structure. Demonstrates Dead Node and Common Node Elimination on the representation. Directly relevant to tuffy’s hierarchical CFG with SESE regions and the optir approach built on top of RVSDG.
-
RFC: Add a New Byte Type to LLVM IR — Juneyoung Lee, Pedro Lobo, Nuno Lopes, George Mitenkov. Proposes
b<N>byte type to represent raw memory data, enabling per-byte poison tracking and paving the way for undef removal. -
Leaving the Sea of Nodes — V8 team. Post-mortem on 10 years of Sea of Nodes in TurboFan: effect chain complexity, poor cache locality, limited floating benefits for effectful operations, compilation speed issues. Replaced with CFG-based Turboshaft, halving compile time.
-
A Simple Reply — Cliff Click. Rebuttal to V8’s SoN criticism: argues problems stem from JS’s lack of strong typing, not SoN itself. Strong typing (Java, Rust) provides ECA naturally, simplifying effect management. Worklist algorithms solve visitation order. Dead code elimination is near-free with reference counting.
-
VPlan: Vectorization Plan — LLVM. Hierarchical CFG with nested SESE regions and recipe-based transformations for vectorization. Key ideas: plan-then-execute (analysis doesn’t modify IR), multiple candidates with cost estimation, recipes naturally record transformation provenance. Relevant to tuffy’s “analysis is also a transformation” and rewrite path tracing goals.
Memory Model and Pointer Semantics
- Pointers Are Complicated, or: What’s in a Byte? — Ralf Jung, 2018. Argues pointers are not integers; proposes abstract pointer model (allocation ID + offset) preserving provenance. Defines bytes as
Bits(u8) | PtrFragment(Pointer, n) | Uninit. Directly relevant to tuffy’s byte type and uninitialized memory model. - Pointers Are Complicated II, or: We Need Better Language Specs — Ralf Jung, 2020. Shows three individually-correct LLVM optimizations that produce wrong results when combined, due to unspecified pointer provenance. Concludes: pointers have provenance, integers do not; ptr-to-int-to-ptr roundtrips are not no-ops.
- Pointers Are Complicated III, or: Pointer-Integer Casts Exposed — Ralf Jung, 2022. Proposes “provenance exposure” model: ptr-to-int casts have side effects (exposing provenance), int-to-ptr casts are non-deterministic. Rust’s Strict Provenance API (
ptr.addr(),with_addr()) offers a cleaner alternative. Relevant to tuffy’s pointer representation design. - Uninit — Ralf Jung, 2019. Uninitialized memory is not “random bits” but a distinct abstract state (
None). Any operation on uninit values is UB. Validates tuffy’s decision to allow uninitialized memory at the memory level while keeping values as concrete-or-poison. - Do We Really Need Undefined Behavior? — Ralf Jung, 2021. Argues unrestricted UB is necessary for practical optimization (register allocation, constant folding). Platform-specific UB would make even basic codegen impossible. Supports tuffy’s poison-based UB model as a principled middle ground.
- Clarifying the Semantics of ptrtoint — LLVM discussion. Proposes separating
ptrtoint(full bitwise representation) fromptrtoaddr(address only). Highlights that compiler-level provenance and hardware provenance (CHERI) are distinct. Relevant to tuffy’s pointer representation: need to decide whether ptr-to-int exposes provenance or just address. - RFC: Replacing getelementptr with ptradd — nikic (Nikita Popov). Proposes replacing type-based GEP with offset-based
ptraddinstruction. Eliminates redundant encodings, makes address arithmetic explicit and visible to CSE/LICM, simplifies analysis by removing expensive GEP decomposition. Tuffy adoptsptradddirectly instead of GEP.
ABI and Calling Conventions
- RFC: An ABI Lowering Library for LLVM — nikic (Nikita Popov). Proposes a standalone LLVMABI library with its own type system dedicated to ABI lowering, extracting logic currently duplicated across frontends (Clang, Rust, etc.). Key design: independent type system capturing only ABI-relevant information, per-target calling convention implementations, ABIArgInfo results guiding IR generation. Layered to depend only on LLVM Support, with a C API for non-C++ frontends. Directly relevant to tuffy’s plan for a dedicated ABI library that maps
intparameters/returns to concrete register classes and calling conventions.
Testing and Validation
- RFC: Upstreaming LLVM UB-Aware Interpreter — dtcxzyw. Proposes upstreaming llubi, a UB-aware LLVM IR interpreter that detects undefined behavior during execution and properly propagates poison values. Key motivation: reducing miscompilation reproducers directly on LLVM IR (minutes vs hours with C-level tools). Separate implementation from legacy ExecutionEngine because GenericValue cannot represent poison or pointer provenance. Relevant to tuffy’s IR interpreter design.
- llvm-ub-aware-interpreter (llubi) — dtcxzyw. Implementation of UB-aware LLVM IR interpreter. Detects guardable UBs at execution time, tracks poison propagation, integrates with llvm-reduce for test case minimization and Csmith/Rustlantis for fuzzing. Treats undef as zero (practical simplification). Does not model pointer aliasing or full provenance. Directly informs tuffy’s interpreter design, though tuffy’s poison-only model (no undef) and four-state bytes enable a cleaner implementation.
Rustc Backend Integration
- Lowering MIR — rustc dev guide. MIR lowering flow:
codegen_crate→codegen_mirper function, with modules for blocks, statements, operands, places, rvalues. SSA analysis runs before translation. One MIR basic block generally maps to one backend basic block. - Backend Agnostic Codegen — rustc dev guide. Trait-based backend interface:
CodegenBackend,ExtraBackendMethods,CodegenMethods(CodegenCx),BuilderMethods(Builder). Associated types: Value, BasicBlock, Type, Function. ~10,000 lines of reusable backend-agnostic code inrustc_codegen_ssa. Custom backends implement these traits to plug in.
Feature Name: tuffy_ir
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the intermediate representation (IR) for the tuffy compiler. The IR uses a single data structure across all compilation stages with stage-dependent normalization, infinite precision integers with at-use bit-level analysis, a four-state byte memory model, and a hierarchical CFG with nested SESE regions.
Motivation
Existing compiler IRs (LLVM IR, Cranelift CLIF) carry historical design baggage that limits optimization potential and complicates formal verification:
- Fixed-width integers force premature commitment to bit widths, introducing zext/sext/trunc noise that obscures mathematical equivalences.
- undef/poison duality in LLVM creates well-documented semantic confusion and miscompilation bugs. Approximately 18% of detected LLVM miscompilations stem from
undefalone. - Flat CFGs lose structural information (loops, branches) that must be expensively recovered for high-level optimizations.
- Side-table analyses (KnownBits, DemandedBits) are disconnected from the IR, creating phase ordering problems and stale analysis results.
- Multiple IR layers (LLVM IR → SelectionDAG → MachineIR) require costly translations with potential information loss at each boundary.
Tuffy’s IR addresses these issues from first principles, designed for formal verification, efficient compilation, and clean semantics.
Guide-level explanation
Type system
Tuffy IR has a minimal type system:
int— Infinite precision integer. No bit width, no signedness. Arithmetic is mathematical:add(3, 5)is8, period. There is no overflow.b<N>— Byte type representing raw memory data, where N is a multiple of 8. Distinct from integers; preserves pointer fragments and per-byte poison tracking.ptr<AS>— Pointer type carrying provenance (allocation ID + offset) in address space AS. Default address space is 0. Different address spaces may have different pointer sizes and semantics.float/double— IEEE 754 floating point types.vec<vscale x N x T>— Scalable vector type.vscaleis a runtime constant determined by hardware;Nis the minimum number of elements;Tis the element type (int,float,double,b<N>,ptr<AS>). Fixed-width vectors are represented asvscale=1(e.g.,vec<1 x 4 x b32>is a 128-bit fixed vector of 4 x 32-bit elements). When the element type isint, the actual vector width is unknown until instruction selection, sinceinthas no fixed bit width; the element count is determined by the IR, but the physical vector register mapping is deferred.
Infinite precision integers
Instead of i32 or i64, all integer values are simply int. Range constraints are expressed via at-use annotations on value definitions and use edges (see RFC: at-use-annotations):
%raw = load b32, ptr %p // load 32 bits of raw bytes from memory
%x:s32 = bitcast %raw to int // cast to int, result annotated as signed 32-bit
%y:s32 = add %x:s32, 1 // mathematical addition, no overflow; result in s32 range
%out = bitcast %y:s32 to b32 // cast back to bytes for storing
store b32 %out, ptr %q
Loads always operate on byte types (b<N>). The byte type carries the bit width; integers do not. The decision of sign-extension vs zero-extension, and whether to use a GPR or FPR, is deferred entirely to instruction selection based on at-use annotations.
There are no zext, sext, or trunc instructions. Bitwise operations (and, or, xor, shift) are defined on infinite precision integers.
At-use annotations
Range constraints and bit-level analysis results are encoded directly into the IR as annotations on value definitions (result-side) and use edges (use-side). See RFC: at-use-annotations for the full design.
Each annotation carries a four-state per-bit encoding:
0: bit is known to be 01: bit is known to be 1?: bit is demanded but value unknownx: bit is don’t-care (not demanded)
Shorthand annotations :s32, :u8 etc. are sugar over the full known-bits representation.
Result-side annotations are generated by the frontend from source language type information. Use-side annotations may be added by optimization passes when they discover stronger constraints at specific use sites. This follows the “analysis is also a transformation” principle — analysis results are encoded directly into the IR, not stored in side tables.
Hierarchical CFG
The IR uses a hierarchical control flow graph:
func @example(int:s32) -> int:s32 {
region loop {
bb0(i: int):
%cond = cmp.lt %i:s32, %arg:s32
br %cond, bb1, exit
bb1:
%i.next = add %i, 1
br bb0(%i.next)
}
exit:
ret %i:s32
}
Loops and branches are explicitly represented as nested SESE (Single-Entry Single-Exit) regions. Inside each region, instructions are organized in traditional SSA basic blocks with fixed ordering.
Memory model
Pointers are abstract values with provenance:
%p = alloca int // %p has provenance tied to this allocation
%addr = ptrtoaddr %p // extracts address only, discards provenance
%bits = ptrtoint %p // converts to integer, captures provenance
%q = inttoptr %bits // safe roundtrip: recovers provenance
Each byte in memory is one of four states:
- Bits(u8): a concrete byte value (0–255)
- PtrFragment(Pointer, n): the nth byte of an abstract pointer
- Uninit: uninitialized memory (not “random bits”; a distinct abstract state)
- Poison: a poisoned byte
Memory SSA
Memory dependencies are encoded directly into the IR as Memory SSA (MemSSA). Each memory operation (load, store, call) carries a memory version token representing the memory state it depends on or produces.
%mem0 = entry_mem // initial memory state
%mem1 = store b32 %v, ptr %p, %mem0 // store produces new memory version
%raw = load b32, ptr %q, %mem1 // load depends on memory version
MemSSA supports progressive refinement: initially a conservative single memory chain, then split into finer-grained versions as alias analysis improves. This follows the “analysis is also a transformation” principle — memory dependency information lives in the IR, not in side tables.
Atomic operations
Atomic memory operations are regular load/store/rmw/cmpxchg instructions annotated with a memory ordering:
%raw = load.atomic b32, ptr %p, ordering=acquire, %mem0
%mem1 = store.atomic b32 %v, ptr %p, ordering=release, %mem0
%old = rmw.add b32, ptr %p, %delta, ordering=seqcst, %mem0
%res, %ok = cmpxchg b32, ptr %p, %expected, %desired, success=acqrel, failure=acquire, %mem0
Supported orderings: relaxed, acquire, release, acqrel, seqcst. Fence instructions are also supported as standalone operations.
Atomic operations participate in MemSSA like any other memory operation, with ordering constraints restricting reordering during optimization.
UB model
The IR uses only poison for undefined behavior. There is no undef.
- Operations with UB (e.g., division by zero) produce
poison. - Annotation violations produce
poison: result-side violations at the defining instruction, use-side violations at the consuming instruction. - Any operation on
Uninitmemory producespoison. poisonpropagates through computations but is only UB when “observed” (e.g., used as a branch condition, stored to observable memory).
Single data structure, multiple stages
The IR uses one data structure throughout compilation. Different stages impose different normalization constraints:
- High-level: close to MIR translation, high-level operations allowed (e.g., aggregate operations, generic intrinsics).
- Optimized: after optimization passes, IR is normalized but still target-independent.
- Legalized: operations are lowered to target-legal forms (e.g., 64-bit multiply on 32-bit targets becomes a sequence of 32-bit operations).
- Pre-emission: fully scheduled, register-allocated, ready for machine code emission.
No separate SDAG or MachineIR. The same graph structure is progressively constrained.
MIR translation
Translation from rustc MIR to tuffy IR involves:
- Control flow structuralization: attempt to normalize to natural loops and promote to SESE regions. Irreducible control flow is left as flat CFG within a region.
- mem2reg: MIR uses mutable
Placevalues (not pure SSA). A mem2reg pass promotes stack allocations to SSA values where possible. - Drop/unwind: Rust’s drop glue and panic unwind edges are preserved in CFG form, not promoted to SESE regions. Unwind paths remain as explicit CFG edges.
- Aggregate scalarization: structs, enums, arrays, and tuples are fully decomposed into scalar operations (analogous to LLVM’s SROA). No aggregate types exist in tuffy IR.
- Type translation: MIR fixed-width integers (
u8,i32,u64, etc.) are translated tointwith corresponding:u<N>/:s<N>result-side annotations at definition points.
Machine code lowering
Lowering from tuffy IR to machine code:
- ABI: a dedicated ABI library maps
intparameters/returns to concrete register classes and calling conventions based on the target triple. - Stack frame layout: derived from the target triple; alloca sizes are resolved after instruction selection determines concrete bit widths.
- Instruction selection: at-use bit-level annotations determine concrete types; the instruction selector chooses sext/zext and GPR/FPR placement.
- Debug info: DWARF generation follows LLVM’s approach.
- Unwind tables: exception handling table generation follows LLVM’s approach.
Reference-level explanation
Type system details
| Type | Description |
|---|---|
int | Infinite precision integer, no fixed bit width |
b<N> | Byte type, N must be a multiple of 8 |
ptr | Pointer with provenance (allocation ID + offset) |
float | IEEE 754 single precision |
double | IEEE 754 double precision |
vec<vscale x N x T> | Scalable vector, vscale is runtime constant, N is min element count, T is element type |
Integer operations
All integer arithmetic operates on mathematical integers:
| Operation | Semantics |
|---|---|
add %a, %b | Mathematical addition |
sub %a, %b | Mathematical subtraction |
mul %a, %b | Mathematical multiplication |
div %a, %b | Integer division (poison if %b == 0) |
rem %a, %b | Integer remainder (poison if %b == 0) |
and %a, %b | Bitwise AND on infinite precision |
or %a, %b | Bitwise OR on infinite precision |
xor %a, %b | Bitwise XOR on infinite precision |
shl %a, %b | Left shift |
shr %a, %b | Right shift (signedness from annotation) |
Value annotations
Annotations on value definitions and use edges:
| Annotation | Semantics |
|---|---|
:s<N> | Poison if value outside signed N-bit range [-2^(N-1), 2^(N-1)-1] |
:u<N> | Poison if value outside unsigned N-bit range [0, 2^N-1] |
:known(<ternary>) | Per-bit four-state constraint (0, 1, ?, x) |
Result-side violation → instruction produces poison. Use-side violation → consuming instruction produces poison.
Pointer operations
| Operation | Semantics |
|---|---|
ptrtoint %p | Convert pointer to integer, capturing provenance |
ptrtoaddr %p | Extract address only, discard provenance |
inttoptr %i | Convert integer to pointer, recovering captured provenance |
ptradd %p, %offset | Pointer arithmetic within an allocation (provenance-preserving offset addition) |
ptrdiff %p, %q | Pointer difference; returns %p.offset - %q.offset as int; poison if %p and %q do not belong to the same allocation |
Byte type operations
| Operation | Semantics |
|---|---|
bitcast %v to b<N> | Type-punning conversion between byte type and other types |
load b<N>, ptr %p | Load raw bytes from memory |
store b<N> %v, ptr %p | Store raw bytes to memory |
Vector operations
All vector operations work on vec<vscale x N x T> types. A mask operand (vec<vscale x N x int>) is optional on most operations to support predicated execution (SVE/RVV style).
Elementwise arithmetic:
| Operation | Semantics |
|---|---|
vec.add %a, %b | Elementwise addition |
vec.sub %a, %b | Elementwise subtraction |
vec.mul %a, %b | Elementwise multiplication |
vec.div %a, %b | Elementwise division |
vec.rem %a, %b | Elementwise remainder |
vec.and %a, %b | Elementwise bitwise AND |
vec.or %a, %b | Elementwise bitwise OR |
vec.xor %a, %b | Elementwise bitwise XOR |
vec.shl %a, %b | Elementwise left shift |
vec.shr %a, %b | Elementwise right shift (signedness from annotation) |
vec.fadd %a, %b | Elementwise floating point addition |
vec.fmul %a, %b | Elementwise floating point multiplication |
Horizontal reductions:
| Operation | Semantics |
|---|---|
vec.reduce.add %v | Sum all elements |
vec.reduce.mul %v | Multiply all elements |
vec.reduce.smin %v | Signed minimum across elements |
vec.reduce.smax %v | Signed maximum across elements |
vec.reduce.umin %v | Unsigned minimum across elements |
vec.reduce.umax %v | Unsigned maximum across elements |
vec.reduce.and %v | Bitwise AND across elements |
vec.reduce.or %v | Bitwise OR across elements |
vec.reduce.xor %v | Bitwise XOR across elements |
vec.reduce.fadd %v | Floating point sum (ordered or unordered via flag) |
vec.reduce.fmin %v | Floating point minimum across elements |
vec.reduce.fmax %v | Floating point maximum across elements |
Scatter/gather (indexed vector memory access):
| Operation | Semantics |
|---|---|
vec.gather %base, %indices, %mask, %mem | Load elements from base + indices[i] for each lane where mask is true; produces vector + memory token |
vec.scatter %v, %base, %indices, %mask, %mem | Store elements to base + indices[i] for each lane where mask is true; produces memory token |
Masking/predication:
| Operation | Semantics |
|---|---|
vec.select %mask, %a, %b | Per-lane select: mask[i] ? a[i] : b[i] |
vec.compress %v, %mask | Pack active lanes (mask=true) to front |
vec.expand %v, %mask | Scatter contiguous elements to active lanes |
Lane manipulation:
| Operation | Semantics |
|---|---|
vec.splat %scalar | Broadcast scalar to all lanes |
vec.extract %v, %idx | Extract scalar from lane |
vec.insert %v, %scalar, %idx | Insert scalar into lane |
vec.shuffle %a, %b, %indices | Permute lanes from two vectors |
Contiguous vector memory access:
| Operation | Semantics |
|---|---|
vec.load %ptr, %mask, %mem | Masked contiguous load |
vec.store %v, %ptr, %mask, %mem | Masked contiguous store |
Platform intrinsics:
Platform-specific intrinsics (e.g., std::arch::x86_64::*) support two modes, selectable via compiler flag or function-level attribute:
- Map mode (default): recognized intrinsics are mapped to tuffy vector operations, enabling further optimization by the compiler.
- Passthrough mode: intrinsics are preserved as opaque target-specific calls (
target.intrinsic "x86.avx2.add.ps" %a, %b), emitted as-is during instruction selection. Suitable for hand-tuned SIMD code where the user wants exact instruction control.
Unrecognized intrinsics always use passthrough mode.
Memory byte states
Each byte in the abstract memory is represented as:
enum AbstractByte {
Bits(u8), // concrete value 0-255
PtrFragment(AllocId, usize), // nth byte of a pointer
Uninit, // uninitialized
Poison, // poisoned
}
Hierarchical CFG structure
enum CfgNode {
BasicBlock(Vec<Instruction>, Terminator),
Region(RegionKind, Vec<CfgNode>),
}
enum RegionKind {
Loop { header: BasicBlockId },
Branch { condition: Value, then_region: RegionId, else_region: RegionId },
Sequence,
}
Regions are always SESE. Translation from rustc MIR requires control flow structuralization.
At-use annotation representation
Each use edge and each instruction result carries an optional annotation:
struct Annotation {
known_ones: BigUint, // bits known to be 1
known_zeros: BigUint, // bits known to be 0
demanded: BigUint, // bits that are demanded
}
Invariant: known_ones & known_zeros == 0.
Shorthand :s<N> and :u<N> are sugar over this representation. No annotation means unconstrained int; the verifier reports an error if instruction selection cannot determine a concrete machine type.
Drawbacks
- Infinite precision integers add complexity to instruction selection. The final stage must derive concrete bit widths from at-use annotations, which is a non-trivial analysis.
- Hierarchical CFG requires control flow structuralization from MIR, which may introduce extra blocks for irreducible control flow.
- At-use annotations on every edge increase memory footprint of the IR.
- Single data structure means late-stage IR carries fields irrelevant to early stages and vice versa, potentially wasting memory.
- Novel design means less existing tooling and literature to draw from compared to traditional SSA CFG.
Rationale and alternatives
Why infinite precision integers?
Two equally important motivations:
- Formal verification: Infinite precision integers are mathematical integers — exactly what theorem provers (Lean 4, Mathlib) natively support. Rewrite rule proofs operate on standard integer arithmetic with induction, algebraic properties, and equivalences available out of the box. Fixed-width integers would require every proof to reason about overflow, wrapping semantics, and sign extension, dramatically increasing proof complexity.
- Pattern recognition: Source languages (especially C) introduce redundant integer conversions due to implicit promotion rules (e.g., C's integer promotion widens `char` and `short` to `int`). In LLVM IR, these become chains of zext/sext/trunc that obscure the underlying mathematical equivalence. Pattern matching in optimization passes fails to see through the conversion noise, causing missed optimizations. Infinite precision integers eliminate this noise entirely — the mathematical structure is directly visible in the IR.
Together, these make both the proof side and the optimization side simpler and more effective.
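As a small illustration of the proof side, a strength-reduction rule stated over mathematical integers carries no overflow or sign-extension side conditions. A Lean 4 sketch (assuming Mathlib's `ring` tactic; the theorem name and statement are illustrative, not tuffy's actual rule format):

```lean
import Mathlib.Tactic.Ring

-- Over mathematical integers, `mul %x, 2` and `shl %x, 1` (doubling)
-- denote the same value, with no wrapping behavior to reason about.
theorem strength_reduce_mul_two (x : Int) : x * 2 = x + x := by ring
```

The fixed-width analogue would additionally need to establish that the rewrite agrees modulo 2^N for every bitwidth N at which the rule may fire.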
Alternative: arbitrary-width integers (LLVM-style). Rejected because they still require zext/sext/trunc and overflow reasoning in every pass and every proof.
Alternative: fixed set of widths (Cranelift-style). Simpler but still carries the same fundamental problems.
Why hierarchical CFG?
Flat CFGs lose loop structure, requiring expensive loop detection passes. Since tuffy aims to perform loop optimizations early, preserving this structure in the IR is essential.
Alternative: flat SSA CFG. Rejected because loop detection is expensive and error-prone.
Alternative: Sea of Nodes. Considered seriously. It has advantages for strongly-typed languages (Cliff Click's arguments), but the engineering complexity (scheduling, debugging, cache locality) outweighs the benefits. V8's experience is cautionary even if their problems were partly JS-specific.
Alternative: E-graph. Attractive for formal verification but immature for handling side effects and control flow at scale.
Why four-state bytes?
The Uninit state (separate from Poison) correctly models uninitialized memory as an abstract concept rather than “random bits.” The PtrFragment state preserves pointer provenance through byte-level memory operations like memcpy. Poison at the byte level enables per-byte poison tracking aligned with the byte type design.
Why does ptrtoint capture provenance?
This enables safe inttoptr(ptrtoint(p)) roundtrips while ptrtoaddr provides a provenance-stripping alternative. This avoids the LLVM problem where three individually-correct optimizations produce wrong results when combined (Ralf Jung, 2020).
Prior art
- LLVM IR: SSA CFG with fixed-width integers, undef/poison duality, flat CFG. Tuffy’s IR addresses known LLVM pain points (undef removal, byte type, phase ordering).
- Cranelift CLIF: Single IR with stage-dependent legalization. Tuffy adopts this “single data structure” approach but adds hierarchical CFG and infinite precision integers.
- LLVM VPlan: Hierarchical CFG with SESE regions for vectorization. Inspires tuffy’s region-based structure.
- V8 TurboFan / Turboshaft: Sea of Nodes abandoned in favor of CFG after 10 years. Informs tuffy’s decision to use SSA basic blocks inside regions.
- Cliff Click’s Sea of Nodes: Strong typing (Java, Rust) mitigates many of V8’s problems. Tuffy acknowledges this but prioritizes engineering simplicity.
- Alive2: Poison-only UB model, byte type proposal. Directly informs tuffy’s UB and type design.
- Miri: Abstract pointer model (allocation ID + offset), abstract byte representation. Informs tuffy’s memory model.
Unresolved questions
- Floating point semantics: How should NaN payloads and denormals be handled? Are there opportunities for infinite-precision-like simplification?
- SIMD/vector types: How do vector operations fit into the type system and hierarchical CFG?
- Aggregate types: Should structs/arrays exist in the IR or be fully decomposed into scalar operations?
- Exception handling: How do Rust panics and unwinding interact with SESE regions?
- Concurrency model: How are atomic operations and memory ordering represented?
- Stack slot lifetime and coloring: `stack_slot` currently takes only a size parameter. To support stack coloring (overlapping non-live slots), lifetime information is needed. The preferred direction is region-based: stack slots belong to a SESE region and are automatically dead when control exits the region. Open question: after lowering to a flat CFG, how is region-based lifetime information preserved? Options include (a) materializing lifetime markers at region boundaries during lowering, or (b) maintaining a region tree alongside the flat CFG. The lifetime.start/end intrinsic approach (LLVM-style) is not preferred.
Future possibilities
- E-graph integration: Equality saturation could be applied as an optimization strategy on subgraphs within the hierarchical CFG, without replacing the core IR representation.
- Superoptimizer: The formal-first design and rewrite path tracing make it natural to integrate solver-based optimization discovery (Souper-style).
- Multi-language frontend: The IR is designed to be language-agnostic. Future frontends for other languages can target the same IR.
- JIT compilation: The single data structure with stage-dependent normalization could support a JIT mode that skips late-stage optimizations for faster compilation.
- Translation validation: Rewrite path traces provide the foundation for end-to-end translation validation of the compilation pipeline.
- Feature Name: tuffy_pipeline
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the compilation pipeline design for the tuffy compiler, focusing on four core aspects: value-level analysis encoded as IR transformations, automatic debug info preservation via origin chains, automatic profiling info preservation with merging semantics, and a hybrid pass ordering strategy with fine-grained user control.
Motivation
Compiler pipelines face recurring engineering problems that erode correctness and developer productivity:
- Phase ordering: In LLVM, analyses like KnownBits and DemandedBits live in side tables disconnected from the IR. When a transform modifies the IR without updating these tables, analysis results become stale. Subsequent passes either use stale data (miscompilation) or must conservatively discard and recompute (lost optimization opportunities). This is the root cause of many phase ordering bugs.
- Debug info loss: LLVM requires each optimization pass to manually preserve debug info. In practice, many passes silently drop it. The result is poor debugging experience at higher optimization levels — a well-known pain point for all LLVM-based compilers.
- Profiling info loss: Profile-guided optimization data (branch weights, execution counts) suffers the same manual-maintenance problem. Transforms that restructure control flow or merge/split blocks often discard profile annotations.
Tuffy’s pipeline addresses these by making preservation automatic at the infrastructure level, rather than relying on per-pass discipline.
Guide-level explanation
Pass classification
Tuffy distinguishes two categories of analysis:
Structural analyses (side table, on-demand):
- Dominator tree
- Loop tree
- Post-dominator tree
These are cheaply recomputable from the CFG and do not carry semantic information that could be irreversibly lost by transforms. They live in traditional side tables and are invalidated/recomputed as needed.
Value-level analyses (encoded in IR, transform passes):
- KnownBits — which bits of a value are provably 0 or 1
- DemandedBits — which bits of a value are actually needed at each use site
- MemSSA refinement — splitting memory version chains as alias analysis improves
These carry semantic facts about values that, if lost, cannot be recovered without re-deriving from first principles. In tuffy, these are transform passes — they read the IR, compute results, and write those results back into the IR (on use-edge annotations or memory tokens). They are scheduled in the pipeline alongside optimization passes, not invoked as side analyses.
This follows the “analysis is also a transformation” principle: there is no distinction between “analysis pass” and “optimization pass” for value-level information. Both modify the IR.
Origin chain
Every instruction in the IR carries an origin — a reference to the instruction(s) it was derived from. When a transform creates a new instruction, the builder API requires specifying an origin. This is a mandatory parameter, not optional metadata.
// Builder API enforces origin: the final parameter is mandatory
let new_add = builder.build_add(lhs, rhs, old_mul.origin());
The origin chain serves two purposes:
1. Debug info derivation: Debug information (source locations, variable mappings) is not stored on every instruction. Instead, it is computed from the origin chain and current IR values. As long as the chain is unbroken, debug info can be derived for any instruction by tracing back to its origin.
2. Profiling info association: Profiling data (execution counts, branch weights) is associated with instructions via origin. When transforms merge or split instructions, profile data is merged accordingly — this is valid because profile data ultimately serves code generation decisions, not source-level tracing.
Origin in practice
Most transforms have a natural origin for new instructions:
| Transform | New instruction | Origin source |
|---|---|---|
| Strength reduction | shl %x, 1 replacing mul %x, 2 | The original mul |
| Loop canonicalization | Preheader branch | Latch branch |
| Instruction combining | add %a, %c from add(add(%a, %b), %c) | Both original add instructions |
| Dead code elimination | (no new instructions) | N/A |
| Loop unrolling | Duplicated loop body | Original loop body instructions |
When multiple origins contribute, the origin is a set. The debug info derivation picks the most specific source location from the set.
Pipeline structure
A tuffy compilation pipeline is a sequence of passes:
MIR translation
→ [value-level analysis: initial KnownBits/DemandedBits]
→ [optimization passes interleaved with value-level analysis updates]
→ legalization
→ instruction selection
→ register allocation
→ machine code emission
Value-level analysis passes can be scheduled at any point. Since they are transforms (they modify the IR), the pass manager treats them identically to optimization passes.
Pass ordering
Tuffy uses a hybrid pass ordering strategy: a fixed high-level framework with local iteration, combined with fine-grained user control to prevent performance regressions.
High-level framework: The pipeline has a fixed sequence of major phases (MIR translation → optimization → legalization → instruction selection → register allocation → emission). Within the optimization phase, passes are grouped into stages (e.g., early simplification, loop optimization, late cleanup), each stage iterating locally until the IR stabilizes or an iteration limit is reached.
User control granularity:
| Level | Mechanism | Example |
|---|---|---|
| Version lock | Pin entire pipeline to a versioned strategy | --opt-policy=v1.2.3 |
| Preset profile | Select a predefined pipeline | -O2, -Os, -Ofast |
| Pass-level | Enable/disable individual passes | --disable-pass=loop-unroll |
| Iteration control | Cap local iteration counts | --max-instcombine-iterations=3 |
| Function-level | Per-function attribute overrides | #[tuffy::opt_policy("v1.2.3")] |
Function-level control enables mixed strategies within a single compilation unit:
#[tuffy::opt_policy("v1.2.3")] // lock to specific version
fn stable_hot_path() { ... }
#[tuffy::opt_level("aggressive")] // aggressive optimization
fn compute_kernel() { ... }
#[tuffy::opt_level("fast-compile")] // minimal passes, fast compilation
fn cold_error_handler() { ... }
Dirty bit mechanism
Value-level analyses (KnownBits, DemandedBits, MemSSA) are scheduled on demand using a dirty bit protocol:
- When an optimization pass modifies an instruction, the infrastructure automatically marks affected use edges as dirty.
- When an optimization pass performs a lossy transform (refinement, not equivalence), downstream annotations are also marked dirty — the existing annotations may be overly precise and no longer sound.
- A subsequent pass that reads a dirty annotation triggers lazy recomputation on that annotation only. Clean annotations are never recomputed.
This is incremental lazy evaluation: mark dirty on write, recompute on read.
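The protocol can be sketched as a small lazy cell; the `Lazy` type and `recompute` closure below are illustrative stand-ins for the real annotation infrastructure, not tuffy's API:

```rust
// "Mark dirty on write, recompute on read" in miniature.
struct Lazy<T> {
    value: T,
    dirty: bool,
}

impl<T> Lazy<T> {
    /// Called by the infrastructure when the producing instruction is modified.
    fn write_occurred(&mut self) {
        self.dirty = true;
    }

    /// A pass reading the annotation triggers recomputation only if dirty.
    fn read(&mut self, recompute: impl FnOnce() -> T) -> &T {
        if self.dirty {
            self.value = recompute(); // lazy recomputation, once
            self.dirty = false;
        }
        &self.value
    }
}

fn main() {
    let mut ann = Lazy { value: 0u64, dirty: false };
    ann.write_occurred();
    assert_eq!(*ann.read(|| 0xFF), 0xFF); // dirty: recomputed
    assert_eq!(*ann.read(|| 0xAA), 0xFF); // clean: cached value reused
}
```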
Transform equivalence declaration
Each transform pass declares whether it performs equivalence-preserving or lossy (refinement) transformations:
- Equivalence: the transform preserves all semantic properties (e.g., strength reduction `mul %x, 2` → `shl %x, 1`). Existing value-level annotations remain sound; no dirty marking needed beyond modified instructions.
- Lossy refinement: the transform narrows behavior (e.g., range narrowing, speculative execution). Downstream annotations may be invalidated and must be marked dirty.
This declaration is manual, but the formal verification / proof framework can verify its correctness.
Other pipeline aspects
The following aspects follow LLVM’s established approach and are not novel to tuffy:
- Legalization: progressive lowering of target-illegal operations to target-legal forms, following LLVM’s legalization model.
- Pass manager: pass registration, scheduling, and region-aware execution follow LLVM’s new pass manager design.
- Interprocedural optimization: inlining, cross-function analysis, and call graph management follow LLVM’s approach.
- Verification passes: IR invariant checks inserted between passes in debug builds, following LLVM’s verifier model.
Reference-level explanation
Origin representation
struct Origin {
sources: SmallVec<[InstructionId; 1]>,
}
Most instructions have a single origin (1-element SmallVec). Merged instructions carry multiple origins. The SmallVec<[_; 1]> avoids heap allocation in the common case.
Debug info derivation
Debug info is stored in a separate table indexed by InstructionId:
struct DebugInfoTable {
/// Maps original (pre-optimization) instruction IDs to source locations
source_locations: HashMap<InstructionId, SourceLocation>,
/// Maps original instruction IDs to variable bindings
variable_bindings: HashMap<InstructionId, Vec<VariableBinding>>,
}
To query the source location of a current instruction:
- Follow its origin chain to find original instruction IDs
- Look up those IDs in the debug info table
- If multiple origins, pick the most specific (innermost) source location
This design means optimization passes never touch debug info directly. The table is write-once (populated during MIR translation) and read-only thereafter.
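The three-step query can be sketched as follows; `InstructionId`, the simplified `SourceLocation`, and the "higher depth is more specific" rule are assumptions for illustration:

```rust
use std::collections::HashMap;

type InstructionId = u32;

#[derive(Clone, Debug)]
struct SourceLocation {
    line: u32,
    depth: u32, // deeper inlining/lexical scope = more specific
}

struct DebugInfoTable {
    source_locations: HashMap<InstructionId, SourceLocation>,
}

/// Query the source location for an instruction whose origin chain has
/// already been followed to a set of original instruction IDs.
fn query_location(origins: &[InstructionId], table: &DebugInfoTable) -> Option<SourceLocation> {
    origins
        .iter()
        .filter_map(|id| table.source_locations.get(id)) // table lookup
        .max_by_key(|loc| loc.depth)                     // most specific wins
        .cloned()
}

fn main() {
    let mut table = DebugInfoTable { source_locations: HashMap::new() };
    table.source_locations.insert(1, SourceLocation { line: 10, depth: 1 });
    table.source_locations.insert(2, SourceLocation { line: 42, depth: 2 });
    assert_eq!(query_location(&[1, 2], &table).unwrap().line, 42);
}
```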
Profiling info
Profile data is stored similarly:
struct ProfileData {
/// Execution counts per original basic block
block_counts: HashMap<BasicBlockId, u64>,
/// Branch weights per original terminator
branch_weights: HashMap<InstructionId, Vec<(BasicBlockId, u32)>>,
}
When a transform merges blocks or restructures control flow, the profile data for new blocks is computed by summing or averaging the profile data of their origins. This is done automatically by the infrastructure when blocks are created via the builder API.
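A sketch of the summing rule for merged blocks; the exact merging policy per transform type is left open by this RFC, so treat this as one plausible instance rather than the defined behavior:

```rust
use std::collections::HashMap;

type BasicBlockId = u32;

/// A block formed by merging several origin blocks inherits the sum of
/// their execution counts (one candidate rule; averaging is another).
fn merged_block_count(
    origins: &[BasicBlockId],
    block_counts: &HashMap<BasicBlockId, u64>,
) -> u64 {
    origins.iter().filter_map(|id| block_counts.get(id)).sum()
}

fn main() {
    let mut counts = HashMap::new();
    counts.insert(1, 100u64);
    counts.insert(2, 40u64);
    // Merging blocks 1 and 2 yields a block counted 140 times.
    assert_eq!(merged_block_count(&[1, 2], &counts), 140);
}
```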
Builder API enforcement
The instruction builder requires origin at construction:
impl Builder {
fn build_add(&mut self, lhs: Value, rhs: Value, origin: Origin) -> Value;
fn build_load(&mut self, ty: ByteType, ptr: Value, mem: MemToken, origin: Origin) -> Value;
// ... all build methods require origin
}
There is no way to create an instruction without specifying an origin. This is a compile-time guarantee, not a runtime check.
Value-level analysis as transform
A KnownBits analysis pass operates as:
fn run_known_bits_analysis(func: &mut Function) {
for inst in func.instructions_reverse_postorder() {
let known = compute_known_bits(inst, func);
for use_edge in func.uses_of(inst) {
use_edge.update_known_bits(known); // modifies IR in-place
}
}
}
This is registered in the pass manager as a regular transform pass. The pass manager does not distinguish it from an optimization pass.
Dirty bit protocol
Each use-edge annotation carries a dirty flag:
struct UseAnnotation {
demanded_bits: BitSet,
known_zeros: BitSet,
known_ones: BitSet,
dirty: bool,
}
The infrastructure manages dirty flags automatically:
impl Function {
/// Called by infrastructure when an instruction is modified
fn mark_uses_dirty(&mut self, inst: InstructionId) {
for use_edge in self.uses_of(inst) {
use_edge.annotation.dirty = true;
}
}
/// Called by infrastructure for lossy transforms
fn mark_downstream_dirty(&mut self, inst: InstructionId) {
// Transitively marks all downstream use edges dirty
let mut worklist = vec![inst];
while let Some(id) = worklist.pop() {
for user in self.users_of(id) {
for use_edge in self.uses_of(user) {
if !use_edge.annotation.dirty {
use_edge.annotation.dirty = true;
worklist.push(user);
}
}
}
}
}
}
Transform equivalence declaration
Each pass declares its transform category via a trait:
enum TransformKind {
Equivalence,
Refinement,
}
trait Pass {
fn name(&self) -> &str;
fn transform_kind(&self) -> TransformKind;
fn run(&self, func: &mut Function);
}
The pass manager uses transform_kind() to determine dirty propagation scope after each pass runs. The proof framework can verify that a pass declared as Equivalence does not perform lossy transforms.
Pass ordering configuration
Pipeline configuration is a serializable data structure:
struct PipelineConfig {
version: String,
stages: Vec<Stage>,
}
struct Stage {
name: String,
passes: Vec<PassConfig>,
max_iterations: u32,
}
struct PassConfig {
name: String,
enabled: bool,
params: HashMap<String, String>,
}
Users can export, modify, and lock pipeline configurations. Function-level attributes override the global config for specific functions.
Drawbacks
- Origin overhead: Every instruction carries an origin reference. For single-origin instructions this is one `InstructionId` (typically 4 bytes), but merged instructions allocate. In practice, most instructions have single origins, so the overhead is modest.
- Debug info derivation cost: Computing debug info on demand (rather than storing it per-instruction) adds cost at debug info emission time. This is acceptable because debug info emission happens once at the end of compilation.
- Profile merging heuristics: Automatically merging profile data during transforms may produce inaccurate estimates. However, this is no worse than the status quo (where profile data is often silently dropped), and the merged data is still useful for code generation heuristics.
- Dirty bit propagation cost: Lossy transforms trigger transitive downstream dirty marking, which in the worst case touches a large portion of the IR. In practice, most transforms are equivalence-preserving, limiting the frequency of transitive propagation.
- Transform declaration burden: Pass authors must correctly declare equivalence vs refinement. Incorrect declarations compromise soundness (under-marking) or efficiency (over-marking). The proof framework mitigates this but adds verification overhead.
Rationale and alternatives
Why encode value-level analysis in IR?
The alternative is LLVM’s approach: side-table analyses invalidated by transforms. This creates phase ordering problems — the order in which analyses and transforms run affects the final result. By encoding analysis results in the IR, every pass sees the most recent information, and transforms that don’t affect a particular annotation leave it intact.
Why origin chains instead of per-instruction debug info?
Alternative: LLVM’s DebugLoc approach. Each instruction carries a DebugLoc that passes must manually preserve. Rejected because it relies on per-pass discipline and fails in practice — many LLVM passes silently drop debug locations.
Alternative: Debug info as IR instructions (LLVM’s dbg.value). Intrinsic calls represent debug info in the IR. Rejected because these pseudo-instructions complicate every transform (must be moved, cloned, or deleted alongside real instructions) and are still frequently mishandled.
Origin chains move the burden from pass authors to the infrastructure. Pass authors specify “where did this instruction come from” (which they always know), and the infrastructure handles debug/profile info automatically.
Why allow profile merging?
Profile data is inherently approximate (sampled, potentially stale). Merging during transforms preserves the relative hotness information that matters for code generation, even if absolute counts shift. The alternative — dropping profile data when transforms can’t preserve it exactly — is strictly worse.
Prior art
- LLVM DebugLoc: Per-instruction debug location, manually maintained by passes. Known to be fragile; significant ongoing effort to improve preservation.
- LLVM dbg.value/dbg.declare: Debug info as intrinsic calls in IR. Adds complexity to every transform.
- GCC tree-ssa: Similar side-table analysis approach with phase ordering issues.
- MLIR: Operation-level location tracking with fusion semantics for merged operations. Closer to tuffy’s origin chain concept.
Unresolved questions
- Origin chain compression: Long optimization pipelines may create deep origin chains. Should origins be transitively compressed (point directly to the original MIR instruction) or kept as a full chain?
- Profile data merging policy: What specific merging rules apply for different transform types (block merging, loop unrolling, function inlining)?
- Dirty bit granularity: Should dirty marking be per-use-edge (current design) or per-instruction? Per-instruction is coarser but cheaper to maintain.
- Pipeline config format: What serialization format for pipeline configurations? TOML, JSON, or a custom DSL?
Future possibilities
- Differential debugging: Origin chains enable automatic bisection — given a miscompiled instruction, trace its origin chain to identify which transform introduced the bug.
- Optimization reports: Origin chains combined with rewrite path traces provide detailed optimization reports showing what happened to each source-level operation.
- Selective optimization: Profile data preservation enables fine-grained PGO where hot and cold paths receive different optimization strategies throughout the pipeline.
- Feature Name: tuffy_interpreter
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the design of the tuffy IR interpreter, a miri-like execution engine that directly interprets tuffy IR with full UB detection, poison tracking, pointer provenance enforcement, and four-state byte memory semantics. The interpreter serves as the primary testing and validation tool for transform correctness.
Motivation
Compiler testing traditionally relies on end-to-end execution: compile a program, run it, check the output. This approach has two fundamental limitations:
- Coarse granularity: end-to-end tests only detect bugs that manifest as observable output differences. Silent UB, provenance violations, and poison mishandling may not produce wrong outputs on any particular test input.
- Slow reduction: when a miscompilation is found, reducing the reproducer from a full program to a minimal IR fragment requires expensive tools (creduce, llvm-reduce) operating at the source or LLVM IR level.
The LLVM UB-aware interpreter (llubi) demonstrated that an IR-level interpreter with UB detection can reduce miscompilation reproducers from hours to minutes. Tuffy’s IR design — poison-only UB model, four-state bytes, abstract pointer provenance — is inherently more amenable to precise interpretation than LLVM IR.
Guide-level explanation
Core capabilities
The tuffy interpreter executes IR functions directly, maintaining a concrete machine state:
- Infinite precision integers: implemented via an arbitrary-precision integer library (e.g., `num-bigint`). No overflow, no wrapping — arithmetic is mathematical.
- Four-state byte memory: each byte in the interpreter's memory is one of `Bits(u8)`, `PtrFragment(AllocId, usize)`, `Uninit`, or `Poison`.
- Pointer provenance tracking: every pointer carries an `(AllocId, offset)` pair. Out-of-bounds access, use-after-free, and provenance violations are detected.
- Poison propagation: poison values propagate through computations. Using poison as a branch condition, storing poison to observable memory, or violating an assert node is flagged as UB.
- Assert node enforcement: `assertsext(v, N)` and `assertzext(v, N)` are checked at runtime; violations produce poison (and are reported as UB).
Execution modes
The interpreter supports multiple execution modes:
| Mode | Purpose |
|---|---|
| Normal | Execute and produce output, report UB as warnings |
| Strict | Abort on first UB detection |
| Differential | Execute two IR versions (pre/post optimization), compare results |
| Reduce | Deterministic execution for use with IR reducer tools |
Differential testing
The primary use case for transform validation:
tuffy-interp --differential --before=pre_opt.tir --after=post_opt.tir --input=test_case
The interpreter runs both IR versions on the same input and reports any behavioral divergence. Combined with a fuzzer, this provides high-confidence testing of individual transforms.
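The divergence check can be sketched as a refinement comparison; the cut-down `Value` type and the rule that optimization may turn poison into a defined result (but never the reverse) are simplifying assumptions for illustration:

```rust
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Poison,
}

/// The optimized run must refine the unoptimized run: equal defined
/// results pass, and poison before optimization may become any concrete
/// value, but a defined result must not change or become poison.
fn refines(before: &Value, after: &Value) -> bool {
    match (before, after) {
        (Value::Poison, _) => true, // poison refines to anything
        (b, a) => b == a,           // defined values must match exactly
    }
}

fn main() {
    assert!(refines(&Value::Int(3), &Value::Int(3)));
    assert!(refines(&Value::Poison, &Value::Int(7)));
    assert!(!refines(&Value::Int(3), &Value::Int(4))); // divergence: report
}
```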
UB detection
The interpreter detects and reports:
| UB category | Detection mechanism |
|---|---|
| Assert node violation | Runtime range check on assertsext/assertzext |
| Uninit read | Memory byte is Uninit at load time |
| Poison observation | Poison used as branch condition or stored to observable memory |
| Out-of-bounds access | Pointer offset exceeds allocation bounds |
| Use-after-free | Access to freed allocation |
| Provenance violation | Pointer used with wrong AllocId |
| Division by zero | Divisor is zero |
| Null dereference | Pointer with null address |
Integration with fuzzing
The interpreter integrates with fuzzing tools:
- Rustlantis: generates random Rust programs, compiles with tuffy, runs through interpreter
- IR-level fuzzer: generates random tuffy IR fragments, runs through optimization pipeline, differential-tests with interpreter
- Reducer: when a bug is found, an IR reducer (similar to llvm-reduce) minimizes the reproducer using the interpreter as the oracle
Reference-level explanation
Value representation
enum Value {
Int(BigInt),
Bytes(Vec<AbstractByte>),
Ptr(Pointer),
Float(f32),
Double(f64),
Vec(Vec<Value>),
Poison,
}
enum AbstractByte {
Bits(u8),
PtrFragment(AllocId, usize),
Uninit,
Poison,
}
struct Pointer {
alloc_id: AllocId,
offset: usize,
address_space: u32,
}
Memory model
struct Memory {
allocations: HashMap<AllocId, Allocation>,
next_alloc_id: AllocId,
}
struct Allocation {
bytes: Vec<AbstractByte>,
size: usize,
alive: bool,
}
impl Memory {
fn alloc(&mut self, size: usize) -> Pointer;
fn free(&mut self, ptr: Pointer) -> Result<(), UBError>;
fn load(&self, ptr: Pointer, size: usize) -> Result<Vec<AbstractByte>, UBError>;
fn store(&mut self, ptr: Pointer, bytes: Vec<AbstractByte>) -> Result<(), UBError>;
}
Every memory operation checks:
- `ptr.alloc_id` refers to a live allocation
- `ptr.offset + size <= allocation.size`
- Provenance matches (the pointer was derived from this allocation)
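The checks above can be sketched as one gate in front of every load/store; the types are trimmed to what the checks need, and the error variants are illustrative:

```rust
use std::collections::HashMap;

type AllocId = u32;

struct Allocation { size: usize, alive: bool }
struct Pointer { alloc_id: AllocId, offset: usize }

#[derive(Debug, PartialEq)]
enum UBError { UnknownAllocation, UseAfterFree, OutOfBounds }

fn check_access(
    allocations: &HashMap<AllocId, Allocation>,
    ptr: &Pointer,
    size: usize,
) -> Result<(), UBError> {
    // Provenance: the pointer's AllocId must name a known allocation.
    let alloc = allocations.get(&ptr.alloc_id).ok_or(UBError::UnknownAllocation)?;
    // Liveness: the allocation must not have been freed.
    if !alloc.alive {
        return Err(UBError::UseAfterFree);
    }
    // Bounds: ptr.offset + size <= allocation.size.
    if ptr.offset + size > alloc.size {
        return Err(UBError::OutOfBounds);
    }
    Ok(())
}

fn main() {
    let mut allocs = HashMap::new();
    allocs.insert(0, Allocation { size: 8, alive: true });
    assert!(check_access(&allocs, &Pointer { alloc_id: 0, offset: 0 }, 8).is_ok());
    assert_eq!(
        check_access(&allocs, &Pointer { alloc_id: 0, offset: 4 }, 8),
        Err(UBError::OutOfBounds)
    );
}
```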
Instruction execution
Each instruction maps directly to a Rust function:
fn exec_add(a: &Value, b: &Value) -> Value {
match (a, b) {
(Value::Int(x), Value::Int(y)) => Value::Int(x + y),
(Value::Poison, _) | (_, Value::Poison) => Value::Poison,
_ => panic!("type error"),
}
}
fn exec_assertsext(v: &Value, n: u32) -> Value {
match v {
Value::Int(x) => {
let min = -(BigInt::one() << (n - 1));
let max = (BigInt::one() << (n - 1)) - 1;
if *x >= min && *x <= max {
Value::Int(x.clone())
} else {
Value::Poison // UB: report to user
}
}
Value::Poison => Value::Poison,
_ => panic!("type error"),
}
}
Hierarchical CFG execution
The interpreter walks the hierarchical CFG:
- Region.Loop: execute the loop body repeatedly until the exit branch is taken
- Region.Branch: evaluate the condition, execute the corresponding sub-region
- Region.Sequence: execute sub-regions in order
- BasicBlock: execute instructions sequentially, then the terminator
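The walk can be sketched as a recursive traversal; the node types are cut down from the design RFC's shape, Branch regions are elided, instruction execution is stubbed as accumulating integers, and a fixed trip count stands in for real terminator evaluation:

```rust
enum RegionKind { Loop, Sequence }

enum CfgNode {
    BasicBlock(Vec<i64>), // stand-in "instructions": values to accumulate
    Region(RegionKind, Vec<CfgNode>),
}

// A real interpreter threads full machine state; here `acc` stands in.
fn exec(node: &CfgNode, acc: &mut i64) {
    match node {
        CfgNode::BasicBlock(insts) => {
            for i in insts { *acc += i; } // execute sequentially
        }
        CfgNode::Region(RegionKind::Sequence, children) => {
            for c in children { exec(c, acc); } // sub-regions in order
        }
        CfgNode::Region(RegionKind::Loop, children) => {
            // Repeat the body until the exit branch is taken; a fixed
            // trip count of 2 replaces terminator evaluation here.
            for _ in 0..2 {
                for c in children { exec(c, acc); }
            }
        }
    }
}

fn main() {
    let cfg = CfgNode::Region(RegionKind::Sequence, vec![
        CfgNode::BasicBlock(vec![1, 2]),
        CfgNode::Region(RegionKind::Loop, vec![CfgNode::BasicBlock(vec![10])]),
    ]);
    let mut acc = 0;
    exec(&cfg, &mut acc);
    assert_eq!(acc, 23); // 1 + 2, then two loop iterations of 10
}
```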
MemSSA handling
The interpreter ignores MemSSA tokens — they are compile-time analysis artifacts. The interpreter uses its own concrete memory state for all load/store operations.
Drawbacks
- Performance: arbitrary-precision integers and per-byte tracking are slower than native execution. The interpreter is not suitable for running large programs at scale.
- Incomplete semantics: some IR features (atomics, concurrency) are difficult to interpret faithfully. Single-threaded sequential execution is a simplification.
- Maintenance burden: the interpreter must be kept in sync with IR semantics changes. Every new instruction or semantic change requires interpreter updates.
Rationale and alternatives
Why a custom interpreter instead of lowering to LLVM and using lli?
Lowering to LLVM IR loses tuffy-specific semantics (infinite precision integers, four-state bytes, provenance). The whole point of the interpreter is to validate tuffy IR semantics directly, not after translation.
Why arbitrary-precision integers in the interpreter?
The IR defines int as infinite precision. The interpreter must faithfully implement this — using fixed-width integers would mask overflow bugs that the IR semantics are designed to prevent.
Why not use Miri directly?
Miri interprets Rust MIR, not tuffy IR. The tuffy interpreter operates on tuffy IR after optimization, catching bugs introduced by transforms. Miri and the tuffy interpreter serve complementary roles: Miri validates Rust semantics, the tuffy interpreter validates tuffy IR semantics.
Prior art
- Miri: Rust MIR interpreter with UB detection, pointer provenance tracking, and abstract memory model. Primary inspiration for tuffy’s interpreter design.
- llubi (LLVM UB-aware interpreter): LLVM IR interpreter with poison tracking and UB detection. Demonstrated the value of IR-level interpretation for miscompilation debugging. Limited by LLVM’s undef/poison duality and lack of provenance tracking.
- lli (LLVM interpreter): basic LLVM IR interpreter without UB awareness. Insufficient for correctness validation.
- KLEE: symbolic execution engine for LLVM IR. More powerful but much slower; tuffy’s interpreter focuses on concrete execution for speed.
Unresolved questions
- Atomic semantics: how should the interpreter handle atomic operations? Sequential consistency by default, or model relaxed memory?
- External function calls: how to handle calls to external functions (libc, system calls)? Stub library, or FFI to native implementations?
- Vector operations: should the interpreter support scalable vectors, or only fixed-width vectors with a concrete vscale?
- Performance optimization: are there opportunities to fast-path common cases (e.g., when all bytes are `Bits`, skip per-byte tracking)?
Future possibilities
- Symbolic execution: extend the interpreter to support symbolic values, enabling exhaustive exploration of execution paths for bounded verification.
- Interactive debugging: step-through execution with inspection of IR values, memory state, and provenance at each instruction.
- Coverage-guided fuzzing: instrument the interpreter to report coverage, guiding the fuzzer toward unexplored paths.
- Regression test generation: automatically generate minimal test cases from fuzzer-discovered bugs, adding them to the test suite.
Feature Name: tuffy_rewrite_trace
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the rewrite path tracing infrastructure for the tuffy compiler. Every transform emits a machine-readable trace entry recording the rewrite rule, local context, before/after IR fragments, and decision rationale. A replay tool can re-execute the optimization sequence from a trace, enabling regression bisection, policy locking, and transform auditing.
Motivation
When a compiler optimization causes a performance regression or miscompilation, diagnosing the root cause is notoriously difficult:
- Black-box pipeline: the optimization pipeline is opaque — developers see input IR and output IR, but not the chain of decisions that produced the result.
- Non-reproducible decisions: heuristic decisions (cost model thresholds, unrolling factors) depend on analysis state that is not recorded. Changing the compiler version may silently change decisions.
- Manual bisection: finding which transform caused a regression requires manually enabling/disabling passes, a tedious and error-prone process.
Rewrite path tracing makes the optimization pipeline fully transparent and reproducible. Every decision is recorded, and the entire optimization sequence can be replayed mechanically.
Guide-level explanation
Trace entries
Each time a transform fires, it emits a trace entry containing:
- Rule identity: which rewrite rule fired, and where in the pipeline
- Local context: the preconditions that enabled the transform (known bits, ranges, constants)
- Before/after IR: the local IR fragment before and after the rewrite
- Decision rationale: for equivalence transforms, the proof reference; for heuristic transforms, the cost model inputs and outputs
Trace format
Traces are structured, machine-readable data. Example (equivalence transform):
{
"seq": 42,
"rule": "strength_reduce_mul_pow2",
"stage": "early_simplification",
"iteration": 2,
"function": "@compute",
"region": "loop.0",
"context": {
"%C": {"kind": "const", "value": 8, "properties": ["is_power_of_2"]},
"%x": {"kind": "known_bits", "known_zeros": "0xFFFFFFFF00000000"}
},
"before": {"instructions": ["%r = mul %x, %C"]},
"after": {"instructions": ["%r = shl %x, 3"]},
"decision": {
"kind": "equivalence",
"proof_ref": "Tuffy.Rewrites.StrengthReduce.mul_pow2"
}
}
Example (heuristic transform):
{
"seq": 87,
"rule": "loop_unroll",
"stage": "loop_optimization",
"function": "@process",
"region": "loop.2",
"context": {
"trip_count": 4,
"body_instruction_count": 12,
"body_has_calls": false
},
"before": {"instructions": ["region.loop { ... }"]},
"after": {"instructions": ["region.sequence { ... }"]},
"decision": {
"kind": "cost_model",
"before_cost": {"estimated_cycles": 48, "code_size": 24},
"after_cost": {"estimated_cycles": 16, "code_size": 96},
"threshold": "unroll if speedup > 2x",
"speedup": 3.0
}
}
Replay
A replay tool reads a trace and re-executes the same transforms on a given IR:
tuffy-replay --trace=opt_trace.json --input=before.tir --output=after.tir
At each step, the replay tool:
- Matches the pattern from the trace entry against the current IR
- Verifies the preconditions still hold
- Applies the rewrite
- If a precondition fails or the pattern doesn’t match, reports a divergence
Divergences indicate that the IR or compiler has changed in a way that invalidates the recorded optimization sequence.
Use cases
| Use case | How trace helps |
|---|---|
| Regression bisect | Replay trace step by step, compare decisions with new compiler version, find first divergence |
| Policy locking | Save trace as a “policy file”, replay forces the same decisions on recompilation |
| Transform audit | Review each trace entry to verify optimization rationale |
| Optimization report | Generate human-readable report from trace showing what happened to each hot function |
| Debugging | When a miscompilation is found, trace pinpoints which transform introduced the bug |
Overhead control
Tracing is optional, controlled by compiler flag:
- --emit-trace=none (default): no tracing, zero overhead
- --emit-trace=decisions: record rule + decision only (lightweight)
- --emit-trace=full: record rule + context + before/after IR + decision (detailed)
When disabled, trace emission code is compiled out via conditional compilation, ensuring zero runtime cost.
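As a minimal sketch of the zero-overhead idea (not the actual tuffy flag plumbing; the `trace` cargo feature name and `TraceSink` type are assumptions for illustration), emission can be guarded so the disabled build compiles the recording away:

```rust
// Hypothetical sink; the push disappears when the assumed `trace` feature is off.
struct TraceSink {
    entries: Vec<String>,
}

impl TraceSink {
    fn new() -> Self {
        TraceSink { entries: Vec::new() }
    }

    fn emit(&mut self, entry: &str) {
        // cfg! resolves at compile time; with the feature disabled the branch
        // is statically false and the push is eliminated. Production code
        // could instead use #[cfg(feature = "trace")] to remove the body entirely.
        if cfg!(feature = "trace") {
            self.entries.push(entry.to_string());
        }
    }
}

fn main() {
    let mut sink = TraceSink::new();
    sink.emit(r#"{"seq":42,"rule":"strength_reduce_mul_pow2"}"#);
    // Entry count depends on whether the feature was enabled at build time.
    let expected = if cfg!(feature = "trace") { 1 } else { 0 };
    assert_eq!(sink.entries.len(), expected);
}
```

Using `#[cfg(...)]` attributes rather than `cfg!` would remove even the dead branch from the compiled artifact, matching the "compiled out via conditional compilation" guarantee above.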
Relationship to other features
- Origin chain: tracks instruction-level lineage (where did this instruction come from). Trace tracks transform-level decisions (why was this rewrite applied). Complementary, not overlapping.
- Policy-mechanism separation: trace is the execution record of a policy. Replaying a trace = enforcing a policy. Locking to a versioned policy = saving and replaying a trace.
- Transform equivalence declaration: trace entries reference the declaration (equivalence or refinement) and, for verified rules, the Lean 4 proof.
Reference-level explanation
Trace data structures
struct Trace {
metadata: TraceMetadata,
entries: Vec<TraceEntry>,
}
struct TraceMetadata {
compiler_version: String,
pipeline_config: PipelineConfig,
target_triple: String,
timestamp: String,
}
struct TraceEntry {
seq: u64,
rule: String,
stage: String,
iteration: u32,
function: String,
region: Option<String>,
context: Vec<ContextFact>,
before: IrFragment,
after: IrFragment,
decision: Decision,
}
enum ContextFact {
Const { name: String, value: BigInt, properties: Vec<String> },
KnownBits { name: String, known_zeros: BitSet, known_ones: BitSet },
Range { name: String, min: BigInt, max: BigInt },
AssertSext { name: String, width: u32 },
AssertZext { name: String, width: u32 },
}
struct IrFragment {
instructions: Vec<String>,
}
enum Decision {
Equivalence {
proof_ref: Option<String>,
},
CostModel {
before_cost: CostEstimate,
after_cost: CostEstimate,
threshold: String,
},
}
struct CostEstimate {
estimated_cycles: Option<u64>,
code_size: Option<u64>,
custom: HashMap<String, String>,
}
Trace emission
The pass infrastructure provides a trace emitter that transforms invoke:
impl TraceEmitter {
fn emit(&mut self, entry: TraceEntry);
fn is_enabled(&self) -> bool;
}
In declarative rewrite rules, trace emission is automatic — the codegen generator produces trace emission code alongside the transform code. The context is extracted from the pattern match bindings, and the decision is derived from the rule’s TransformKind.
For the lightweight --emit-trace=decisions mode, before/after IR fragments and detailed context are omitted.
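A sketch of that trimming, over types simplified from the RFC's `TraceEntry` (the `filter_entry` helper and `TraceLevel` enum are illustrative assumptions, not the tuffy API):

```rust
#[derive(Clone, Debug, PartialEq)]
struct TraceEntry {
    rule: String,
    context: Vec<String>,
    before: Vec<String>,
    after: Vec<String>,
    decision: String,
}

#[derive(Clone, Copy, PartialEq)]
enum TraceLevel {
    None,      // --emit-trace=none
    Decisions, // --emit-trace=decisions
    Full,      // --emit-trace=full
}

// None models tracing disabled; Decisions keeps only rule identity + decision.
fn filter_entry(level: TraceLevel, entry: &TraceEntry) -> Option<TraceEntry> {
    match level {
        TraceLevel::None => None,
        TraceLevel::Decisions => Some(TraceEntry {
            context: Vec::new(),
            before: Vec::new(),
            after: Vec::new(),
            ..entry.clone()
        }),
        TraceLevel::Full => Some(entry.clone()),
    }
}

fn main() {
    let entry = TraceEntry {
        rule: "loop_unroll".into(),
        context: vec!["trip_count: 4".into()],
        before: vec!["region.loop { ... }".into()],
        after: vec!["region.sequence { ... }".into()],
        decision: "cost_model".into(),
    };
    let light = filter_entry(TraceLevel::Decisions, &entry).unwrap();
    assert!(light.before.is_empty() && light.context.is_empty());
    assert_eq!(light.rule, "loop_unroll");
}
```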
Replay engine
struct ReplayEngine {
trace: Trace,
current_step: usize,
}
enum ReplayResult {
Success,
Divergence {
step: u64,
expected_rule: String,
reason: DivergenceReason,
},
}
enum DivergenceReason {
PatternNotFound,
PreconditionFailed { fact: String, expected: String, actual: String },
DifferentDecision { expected: Decision, actual: Decision },
}
The replay engine processes trace entries sequentially. At each step, it attempts to apply the recorded transform. If the pattern matches and preconditions hold, the transform is applied. Otherwise, a divergence is reported with diagnostic information.
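One replay step might look like the following sketch. It matches textually over instruction strings for brevity; a real engine would match IR structurally and check the recorded preconditions. The `replay_step` function and `StepResult` type are illustrative assumptions:

```rust
#[derive(Debug, PartialEq)]
enum StepResult {
    Applied,
    Divergence(String),
}

// `ir` is the current function body as instruction strings; `before`/`after`
// come from the trace entry being replayed.
fn replay_step(ir: &mut Vec<String>, before: &[String], after: &[String]) -> StepResult {
    // Find the recorded "before" fragment as a contiguous window in the IR.
    match ir.windows(before.len()).position(|w| w == before) {
        Some(i) => {
            // Apply the rewrite in place.
            ir.splice(i..i + before.len(), after.iter().cloned());
            StepResult::Applied
        }
        None => StepResult::Divergence("pattern not found".to_string()),
    }
}

fn main() {
    let mut ir = vec!["%r = mul %x, %C".to_string()];
    let before = vec!["%r = mul %x, %C".to_string()];
    let after = vec!["%r = shl %x, 3".to_string()];
    assert_eq!(replay_step(&mut ir, &before, &after), StepResult::Applied);
    // Replaying the same entry again diverges: the pattern is gone.
    assert!(matches!(replay_step(&mut ir, &before, &after), StepResult::Divergence(_)));
}
```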
Serialization
Traces are serialized as JSON for interoperability. For large traces, streaming JSON (one entry per line, JSONL format) avoids loading the entire trace into memory.
tuffy --emit-trace=full -o output program.tir > trace.jsonl
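The streaming property falls out of writing one object per line. A sketch over simplified two-field entries (the `write_jsonl` helper is an assumption; a real emitter would serialize the full `TraceEntry`):

```rust
use std::io::Write;

// Stream trace entries as JSONL so a large trace is never held in memory.
fn write_jsonl<W: Write>(out: &mut W, entries: &[(u64, &str)]) -> std::io::Result<()> {
    for (seq, rule) in entries {
        // Hand-formatted for this two-field sketch.
        writeln!(out, r#"{{"seq":{},"rule":"{}"}}"#, seq, rule)?;
    }
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    write_jsonl(&mut buf, &[(42, "strength_reduce_mul_pow2"), (87, "loop_unroll")]).unwrap();
    let text = String::from_utf8(buf).unwrap();
    assert_eq!(text.lines().count(), 2); // one JSON object per line
}
```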
Drawbacks
- Trace size: full traces with before/after IR fragments can be large for complex programs. The decisions mode mitigates this.
- Replay brittleness: replaying a trace on a different compiler version may diverge early if IR normalization changes, even if the final result is equivalent.
- Code complexity: even with conditional compilation, the trace data structures and emission paths add code complexity.
Rationale and alternatives
Why machine-readable structured traces?
Alternative: human-readable logs (like LLVM’s -pass-remarks). Rejected because logs cannot be replayed. They are useful for human inspection but do not enable mechanical regression bisection or policy locking.
Alternative: IR snapshots at each stage. Recording full IR after each pass is expensive and doesn’t capture the decision rationale. Traces are more compact and more informative.
Why JSON/JSONL?
Widely supported, human-inspectable, easy to process with standard tools. Binary formats would be more compact but harder to debug and less interoperable.
Why optional with zero overhead when disabled?
Tracing is a development and debugging tool. Production builds should not pay any cost for it. Conditional compilation ensures this.
Prior art
- LLVM -pass-remarks: human-readable optimization remarks. Useful but not machine-replayable.
- LLVM opt-bisect-limit: binary search for which pass causes a bug. Coarser than trace-based bisection (pass-level vs transform-level).
- GCC -fdump-tree-*: dumps IR at various stages. Snapshots, not decision traces.
- Alive2: extracts local function-to-function rewrites for verification. Inspires tuffy’s local context extraction in trace entries.
Unresolved questions
- Trace stability: how to handle trace replay across compiler versions when IR normalization changes? Should traces include a format version for compatibility?
- Partial replay: should the replay tool support replaying only a subset of trace entries (e.g., only transforms in a specific function)?
- Trace diffing: should there be a tool to diff two traces (e.g., from two compiler versions) and highlight decision changes?
Future possibilities
- Trace-guided optimization: use traces from previous compilations to warm-start the optimizer, skipping exploration of decisions already known to be good.
- Trace visualization: graphical tool showing the optimization pipeline as a sequence of local rewrites, with drill-down into context and decisions.
- Community trace sharing: share traces for known-good optimization sequences on specific workloads, enabling reproducible performance.
Feature Name: tuffy_verification
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the correctness guarantee framework for the tuffy compiler. All optimization transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4. Instruction selection (isel) — the lowering from tuffy IR to machine instructions — is also formally verified in Lean 4, proving that the generated machine code faithfully implements the IR semantics. An Alive2-style automatic verifier and traditional testing (interpreter + fuzzer) serve as auxiliary discovery tools, not as correctness foundations.
Motivation
Compiler optimization bugs are a persistent source of real-world software defects. LLVM’s InstCombine alone has accumulated hundreds of miscompilation bugs, many stemming from subtle semantic errors in hand-written transforms. The root causes are:
- Imperative transforms: hand-written code that directly manipulates IR is error-prone. Each transform must correctly handle edge cases (poison, overflow, pointer provenance) that are easy to overlook.
- Informal correctness arguments: most transforms are justified by informal reasoning or review, not machine-checked proofs. Subtle errors survive review.
- Disconnected verification: tools like Alive2 verify transforms after the fact, but are not integrated into the development workflow as a hard gate.
Instruction selection has similar risks. The lowering from IR to machine instructions must preserve semantics — incorrect register sizing, wrong sign/zero extension, or mishandled non-standard bit-widths (e.g., u17 annotations) can silently produce wrong code. Traditional compilers rely on testing to catch these bugs; tuffy formally verifies isel correctness.
Tuffy takes a formal-first approach: every rewrite rule and every isel lowering rule must have a Lean 4 proof before it enters the production pipeline.
Guide-level explanation
Three-layer verification
Tuffy uses three layers of verification, with clear trust hierarchy:
| Layer | Role | Trust level |
|---|---|---|
| Lean 4 proofs | Gold standard. Machine-checked correctness of every rewrite rule, MemSSA refinement rule, annotation propagation rule, isel lowering rule, and MIR translation rule | Authoritative |
| Alive2-style verifier | SMT-based quick check for discovering missed optimizations and regressions | Auxiliary |
| Interpreter + fuzzer + test suite | End-to-end behavioral testing | Auxiliary |
Only Lean 4 proofs gate production inclusion. The other two layers are discovery tools — they find bugs and missed optimizations, but do not constitute correctness evidence.
Verified optimization transforms
Transforms are not hand-written imperative IR manipulation. They are declarative rewrite rules:
rule strength_reduce_mul_pow2 {
pattern: mul %x, const(C) where is_power_of_2(C)
replace: shl %x, log2(C)
kind: equivalence
}
rule narrow_add_known_bits {
pattern: add %x, %y where known_zeros(%x, [32:63]) && known_zeros(%y, [32:63])
replace: add %x, %y // same operation, but refine known_zeros on result
refine: known_zeros(result, [32:63])
kind: refinement
}
Each rule specifies:
- pattern: what IR fragment to match
- replace: what to produce
- kind: equivalence or refinement (used by dirty bit mechanism)
- preconditions: constraints that must hold for the rule to apply
Workflow
The development workflow for a new optimization:
1. Define the rewrite rule and prove correctness in Lean 4 — the rule definition and its proof live together in the same Lean file, against the formal IR semantics. This ensures the proof and the rule are always consistent.
2. Lean exports a declarative transform description — Lean generates a machine-readable description (JSON) of the verified rule, including pattern, replacement, preconditions, and transform kind.
3. The codegen generator produces Rust code from the exported description.
4. Generated code uses the builder API (origin, dirty bit handled automatically by the Rust type system)
5. The Alive2-style verifier cross-checks the rule for additional confidence
6. The fuzzer exercises the rule on diverse inputs
Steps 1–3 are mandatory. Steps 5–6 are recommended but not gating.
The key insight: by having Lean own both the rule definition and the proof, there is no possibility of the proof and the implementation diverging. The codegen generator only reads what Lean exports.
Verified MemSSA refinement
MemSSA refinement is a transform that rewrites a load’s mem operand to point to an
earlier MemoryDef when alias analysis proves that intervening defs do not clobber the
load’s memory location. This follows the “analysis is also a transformation” principle —
refinement modifies the IR, not a side table.
Each refinement step is a rewrite rule with a proof obligation:
Given a load from location L depending on mem token %memN, and an intervening MemoryDef %memK → %memN that does not alias L, the load’s mem operand can be rewritten from %memN to %memK without changing observable behavior.
These rules are proven in Lean 4 against the formal memory semantics (evalLoad, evalStore, and the four-state byte model). The proof must account for pointer provenance — two pointers with different AllocIds never alias, but pointers derived via inttoptr may alias any exposed allocation.
Verified annotation propagation
Value-level analyses (KnownBits, DemandedBits) are transforms that write analysis results into IR annotations. Each propagation rule must be proven sound: the annotation produced must be a valid over-approximation of the actual value set.
Examples of proof obligations:
- KnownBits through add: if %x has known_zeros at bits [32:63] and %y has known_zeros at bits [32:63], then add %x, %y has known_zeros at bits [33:63] (not [32:63] — carry may set bit 32).
- DemandedBits through shift: if only bits [0:7] of shl %x, 8 are demanded, then no bits of %x are demanded (the result is always zero in those positions).
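The carry fact behind the KnownBits-through-add obligation can be checked at its concrete worst case, a small demonstration rather than the proof itself:

```rust
fn main() {
    // Both operands have known_zeros at bits [32:63] (they fit in 32 bits).
    let x: u64 = u32::MAX as u64;
    let y: u64 = u32::MAX as u64;
    let sum = x + y; // mathematically exact: no u64 overflow at these values
    assert_eq!(sum >> 33, 0); // bits [33:63] really are zero in the sum
    assert_eq!(sum >> 32, 1); // but the carry out of bit 31 sets bit 32
}
```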
Unsound annotations cause silent miscompilation: downstream passes and instruction
selection trust annotations to select machine operations. An incorrect :s32 annotation
on a value that exceeds 32-bit signed range produces wrong code with no runtime error.
Annotation propagation rules follow the same workflow as optimization rewrite rules: defined and proven in Lean 4, exported as declarative descriptions, codegen’d to Rust.
Verified instruction selection
Instruction selection lowers tuffy IR (infinite precision integers with annotations) to target machine instructions (fixed-width registers). This lowering must preserve semantics: the machine code must produce the same observable behavior as the IR.
Key properties to verify:
- Annotation-driven sizing: an :s32 annotation lowers to 32-bit machine operations; non-standard widths (e.g., :u17) lower to the correct shift/mask sequences.
- Signedness correctness: signed annotations select signed machine operations (SAR, IDIV, signed condition codes) and unsigned annotations select unsigned ones (SHR, DIV, unsigned condition codes).
- Extension correctness: sext/zext from annotation width N produces the correct result, including the no-op cases (zext from unsigned, sext from signed) and the cases requiring explicit shift/mask sequences.
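The shift/mask sequences for a non-standard width can be sketched on a 64-bit register model; the helper names here are illustrative, not tuffy's isel output:

```rust
// Lowering a :u17 zext on a 64-bit register: a single AND-mask.
fn zext_u17(r: u64) -> u64 {
    r & ((1u64 << 17) - 1) // keep bits [0:16], clear the rest
}

// Lowering a :s17 sext: SHL to place bit 16 at bit 63, then SAR to
// replicate it through bits [17:63].
fn sext_s17(r: u64) -> i64 {
    ((r << (64 - 17)) as i64) >> (64 - 17)
}

fn main() {
    assert_eq!(zext_u17(0x3_FFFF), 0x1_FFFF); // bit 17 discarded
    assert_eq!(sext_s17(0x1_0000), -65536);   // bit 16 acts as the sign bit
    assert_eq!(sext_s17(0x0_FFFF), 65535);    // positive values unchanged
}
```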
Each isel lowering rule is defined in Lean 4 as a relation between IR instruction semantics and machine instruction semantics. The proof obligation is:
For all inputs satisfying the annotation constraints, the machine instruction sequence produces the same result as the IR instruction under the formal semantics.
The Lean formalization requires a machine-level semantics model (x86-64 integer operations on fixed-width registers) in addition to the existing IR semantics.
Verified MIR translation
MIR translation converts rustc MIR to tuffy IR. If this translation is wrong, all downstream verification is meaningless. Key non-trivial translation steps:
- Integer type translation: MIR fixed-width integers (u8, i32, u64) → int with :s<N>/:u<N> annotations. Incorrect annotations cause silent miscompilation.
- Control flow structuralization: flat MIR CFG → hierarchical SESE regions.
- Aggregate scalarization: structs, enums, arrays → scalar operations.
- mem2reg: MIR mutable Places → SSA values.
Approach: define a MIR semantic subset in Lean 4 that covers the MIR constructs tuffy consumes. This is not a full MIR formalization — it models only the subset relevant to translation, independent of rustc’s internal MIR definition. Each translation rule is proven correct against this subset model and the tuffy IR formal semantics.
When rustc’s MIR semantics change, the Lean MIR subset model is updated and affected translation proofs are re-verified. The interpreter + fuzzer layer provides additional confidence that the subset model faithfully reflects actual MIR behavior.
Trust boundary
┌─────────────────────────────────────────┐
│ Trusted components │
│ │
│ ┌───────────────┐ ┌────────────────┐ │
│ │ Lean 4 kernel │ │ IR formal │ │
│ │ │ │ semantics │ │
│ └───────────────┘ └────────────────┘ │
│ ┌────────────────┐ ┌───────────────┐ │
│ │ x86-64 machine │ │ MIR semantic │ │
│ │ semantics │ │ subset model │ │
│ └────────────────┘ └───────────────┘ │
│ ┌───────────────┐ │
│ │ Codegen │ │
│ │ generator │ │
│ └───────────────┘ │
├─────────────────────────────────────────┤
│ Verified components │
│ │
│ ┌────────────────────────────────────┐ │
│ │ Optimization rewrite rules │ │
│ │ (proven correct in Lean 4) │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ MemSSA refinement rules │ │
│ │ (proven correct in Lean 4) │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Annotation propagation rules │ │
│ │ (proven correct in Lean 4) │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Isel lowering rules │ │
│ │ (proven correct in Lean 4) │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ MIR translation rules │ │
│ │ (proven against MIR subset model) │ │
│ └────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ Constrained components │
│ │
│ ┌────────────────────────────────────┐ │
│ │ Generated Rust transform code │ │
│ │ (type system enforces invariants) │ │
│ └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
IR formal semantics in Lean 4
The foundation of all proofs is a formal semantics of tuffy IR defined in Lean 4. This includes:
- Type system semantics (infinite precision integers, byte types, pointers, mem tokens, vectors)
- Instruction semantics (each operation’s mathematical definition)
- Poison propagation rules
- Memory model (four-state bytes, provenance, MemSSA token semantics)
- Annotation semantics (range constraints, KnownBits, DemandedBits soundness conditions)
- Assert node semantics (assertsext, assertzext)
A separate MIR semantic subset model defines the MIR constructs that tuffy consumes, serving as the source semantics for MIR translation proofs.
This formalization is the single source of truth for what the IR means. All rewrite rule, MemSSA refinement, annotation propagation, isel, and MIR translation proofs reference these definitions.
Alive2-style verifier
A separate SMT-based tool that can:
- Automatically verify simple rewrite rules (bounded bit widths, no loops)
- Discover missed optimizations by enumerating candidate rewrites
- Regression-test existing rules against IR semantics changes
- Run in CI as a fast feedback loop
This tool is valuable for development velocity but is not a substitute for Lean 4 proofs. It may miss edge cases that the theorem prover catches, and it cannot handle unbounded reasoning (infinite precision integers require induction).
Interpreter and fuzzer
The tuffy IR interpreter (miri-like) combined with fuzzing provides end-to-end validation:
- Compile Rust programs with tuffy, run the output, compare against reference (rustc + LLVM)
- Fuzz IR fragments through the optimization pipeline, check that interpreter results are preserved
- Exercise edge cases in poison propagation, pointer provenance, and memory model
Reference-level explanation
Rewrite rule representation
struct RewriteRule {
name: String,
pattern: Pattern,
replacement: Replacement,
preconditions: Vec<Precondition>,
kind: TransformKind,
proof_ref: LeanProofRef,
}
struct LeanProofRef {
module: String, // Lean 4 module path
theorem: String, // theorem name
verified: bool, // checked by build system
}
Codegen pipeline
Lean 4 rewrite rule definition + proof
│
▼
Codegen generator (trusted)
│
▼
Generated Rust code:
- Pattern matching function
- Replacement construction (via Builder API)
- TransformKind declaration
- Precondition checks
The generator produces Rust code that:
- Implements the Pass trait
- Uses the builder API (origin is mandatory, dirty bit is automatic)
- Declares TransformKind matching the Lean 4 proof (equivalence or refinement)
Build system integration
The build system verifies that:
- Every rewrite rule in the production pipeline has a corresponding Lean 4 proof
- The proof references a theorem that is successfully checked by Lean 4
- The generated Rust code is up-to-date with the rule definition
Rules without proofs can exist in an experimental pipeline for development, but cannot be included in release builds.
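A sketch of that gate, with `proof_ref` made optional here to model experimental rules (the RFC's `RewriteRule` requires it; the `unproven_rules` helper and rule names are illustrative assumptions):

```rust
struct LeanProofRef {
    verified: bool, // set by the build system after Lean 4 checks the theorem
}

struct RewriteRule {
    name: String,
    proof_ref: Option<LeanProofRef>,
}

// Names of rules that would block inclusion in a release build.
fn unproven_rules(rules: &[RewriteRule]) -> Vec<&str> {
    rules
        .iter()
        .filter(|r| !matches!(&r.proof_ref, Some(p) if p.verified))
        .map(|r| r.name.as_str())
        .collect()
}

fn main() {
    let rules = vec![
        RewriteRule {
            name: "strength_reduce_mul_pow2".into(),
            proof_ref: Some(LeanProofRef { verified: true }),
        },
        // Hypothetical unproven rule: allowed in development, rejected at release.
        RewriteRule { name: "experimental_widen".into(), proof_ref: None },
    ];
    assert_eq!(unproven_rules(&rules), vec!["experimental_widen"]);
}
```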
Drawbacks
- Lean 4 learning curve: contributors must learn Lean 4 to add new optimizations. This raises the barrier to contribution.
- Proof effort: proving correctness of complex transforms (loop optimizations, vectorization) requires significant effort. Some transforms may be difficult to formalize.
- Formalization maintenance: changes to IR semantics require updating the Lean 4 formalization and potentially re-proving existing theorems.
- Codegen generator trust: the generator is in the trusted computing base. Bugs in the generator could produce incorrect Rust code from correct proofs.
- Development velocity: requiring proofs for every optimization may slow down the pace of adding new optimizations compared to traditional compilers.
Rationale and alternatives
Why Lean 4?
- Active ecosystem with Mathlib providing extensive mathematical foundations
- Good metaprogramming support (tactics, macros) for proof automation
- Overlap with Rust community (shared interest in formal methods)
- Compiled language with reasonable performance for proof checking
Alternative: Coq. Mature ecosystem but less active development. Lean 4’s metaprogramming is more ergonomic.
Alternative: Isabelle. Strong automation (Sledgehammer) but less mainstream adoption and weaker programming language integration.
Why declarative transforms?
Alternative: imperative transforms with post-hoc verification. Write transforms in Rust, then verify them with Alive2 or similar. Rejected because imperative code has a large surface area for bugs that post-hoc tools may miss, and it’s harder to establish a formal correspondence between the Rust code and the verified property.
Declarative rules have a narrow semantic gap between the rule definition and the Lean 4 proof. The codegen generator bridges this gap with minimal trusted code.
Why not rely solely on Alive2-style verification?
SMT-based verification is bounded — it checks finite bit widths and cannot perform induction. Tuffy’s infinite precision integers require inductive proofs that only an interactive theorem prover can provide. Additionally, SMT solvers may time out on complex transforms, giving no answer rather than a proof.
Prior art
- CompCert: fully verified C compiler in Coq. Proves correctness of the entire compilation pipeline, not just individual transforms. Tuffy takes a similar approach for isel (verified lowering) while using declarative rules for optimizations.
- Alive / Alive2: SMT-based verification of LLVM InstCombine transforms. Discovered numerous bugs. Tuffy uses this as an auxiliary tool, not the primary correctness mechanism.
- CakeML: verified ML compiler in HOL4. Similar full-pipeline verification to CompCert.
- Lean 4 + Mathlib: growing library of formalized mathematics. Provides foundations for reasoning about integers, bit operations, and algebraic properties.
- Souper: superoptimizer for LLVM that discovers optimizations via SMT. Could complement tuffy’s approach by suggesting candidate rewrite rules for human proof.
Unresolved questions
- Declarative rule expressiveness: can all useful transforms be expressed declaratively, or will some require escape hatches for imperative logic (e.g., complex loop transforms)?
- Proof automation: how much of the proof burden can be automated via Lean 4 tactics? Custom tactics for common proof patterns (bit manipulation, poison propagation) would significantly reduce effort.
- Codegen generator verification: should the generator itself be verified (in Lean 4), or is manual audit sufficient given its narrow scope?
- Incremental proof checking: how to efficiently re-check proofs when IR semantics change? Full re-verification may be expensive.
Future possibilities
- Proof-carrying optimizations: distribute verified rewrite rules as packages, enabling a community-driven optimization library with machine-checked correctness.
- Superoptimizer integration: use SMT-based search (Souper-style) to discover candidate rewrites, then prove them in Lean 4 for inclusion in the production pipeline.
- End-to-end verification: with optimizations and isel now verified, extend verification to the remaining pipeline stages (MIR translation, register allocation, machine code emission) to close the full compilation gap.
- LLM-assisted proofs: use language models to draft Lean 4 proofs for new rewrite rules, with human review and machine checking.
Feature Name: at-use-annotations
- Status: Draft
- Created: 2026-02-08
- Completed: N/A
Summary
Replace standalone assert_sext/assert_zext instructions with annotations on value
definitions (result-side) and use edges (use-side). Range constraints, known bits, and
demanded bits are encoded directly on the IR edges rather than as separate instructions.
Additionally, simplify the text format by eliding the implicit top-level region function.
Motivation
The current IR uses assert_sext and assert_zext as standalone instructions to express
range constraints on infinite precision integers:
v0 = param 0
v1 = assert_sext v0, 32
v2 = param 1
v3 = assert_sext v2, 32
v4 = add v1, v3
v5 = assert_sext v4, 32
ret v5
This has several problems:
- Instruction bloat — Every constrained value requires a separate assert instruction, roughly doubling the instruction count for typical integer-heavy code.
- Artificial data dependencies — Assert nodes create new SSA values (v1 instead of v0), forcing all downstream uses to reference the asserted value. This complicates pattern matching in optimization passes.
- No place for analysis results — The planned at-use bit-level analysis (known bits, demanded bits) has no natural home. Assert nodes only express range constraints, not arbitrary bit-level facts.
- Phase ordering — Analysis results stored in side tables become stale after transformations. Encoding them in the IR itself keeps them synchronized.
Additionally, the text format wraps every function body in region function { ... }, which
is redundant since every function has exactly one top-level function region.
Guide-level explanation
Annotations on values
Every value reference in tuffy IR can carry an optional annotation describing constraints on that value. Annotations appear in two positions:
- Result-side: on the value defined by an instruction, declaring the range of the output.
- Use-side: on an operand of an instruction, declaring a constraint the consumer requires.
Annotation types
| Annotation | Meaning |
|---|---|
| :s<N> | Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1] |
| :u<N> | Value is in unsigned N-bit range [0, 2^N-1] |
| :known(<ternary>) | Per-bit four-state constraint (see below) |
| (none) | No constraint; value is an unconstrained int |
Annotations are composable: :s32:known(...) applies both constraints simultaneously.
Known bits ternary encoding
Each bit in a known annotation is one of four states:
| Symbol | Meaning |
|---|---|
| 0 | Bit is known to be 0 |
| 1 | Bit is known to be 1 |
| ? | Unknown — bit is demanded but its value has not been determined |
| x | Don’t-care — bit is not demanded by the consumer |
Example: :known(xxxx_??11_1111_0000) means:
- Bits [0:3] are known-0
- Bits [4:9] are known-1
- Bits [10:11] are demanded but unknown
- Bits [12:15] are don’t-care (not demanded)
The distinction between ? and x matters for optimization:
- ? (unknown): the consumer needs this bit, so the producer must supply a well-defined value. An analysis pass may later refine ? to 0 or 1.
- x (don’t-care): the consumer ignores this bit entirely. The producer is free to supply any value, enabling dead-bit elimination and narrowing optimizations.
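The ternary string can be decoded into per-bit states; this parser is a sketch (the `BitState` enum and `parse_known` function are assumptions for illustration, not tuffy's representation). The rightmost character is bit 0 and underscores are visual separators:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum BitState {
    Zero,     // '0': known 0
    One,      // '1': known 1
    Unknown,  // '?': demanded, value undetermined
    DontCare, // 'x': not demanded
}

// Index 0 of the result is bit 0 (least significant).
fn parse_known(s: &str) -> Option<Vec<BitState>> {
    s.chars()
        .rev()
        .filter(|c| *c != '_')
        .map(|c| match c {
            '0' => Some(BitState::Zero),
            '1' => Some(BitState::One),
            '?' => Some(BitState::Unknown),
            'x' => Some(BitState::DontCare),
            _ => None, // any other character makes the annotation invalid
        })
        .collect()
}

fn main() {
    let bits = parse_known("xxxx_??11_1111_0000").unwrap();
    assert_eq!(bits.len(), 16);
    assert_eq!(bits[0], BitState::Zero);      // bits [0:3] known-0
    assert_eq!(bits[9], BitState::One);       // bits [4:9] known-1
    assert_eq!(bits[10], BitState::Unknown);  // bits [10:11] demanded, unknown
    assert_eq!(bits[15], BitState::DontCare); // bits [12:15] not demanded
}
```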
Example: i32 a + i32 b (C signed addition)
Before (assert nodes):
func @add_i32(int, int) -> int {
region function {
bb0:
v0 = param 0
v1 = assert_sext v0, 32
v2 = param 1
v3 = assert_sext v2, 32
v4 = add v1, v3
v5 = assert_sext v4, 32
ret v5
}
}
After (at-use annotations, top-level region elided):
func @add_i32(int:s32, int:s32) -> int:s32 {
bb0:
v0:s32 = param 0
v1:s32 = param 1
v2:s32 = add v0:s32, v1:s32
ret v2:s32
}
Seven instructions reduced to four. No artificial SSA values from assert nodes.
Example: optimization strengthening a use-side annotation
After an analysis pass discovers that v1 is always in [0, 100] at a particular use:
func @example(int:s32) -> int:s32 {
bb0:
v0:s32 = param 0
v1:s32 = add v0:s32, v0:s32
v2:s32 = foo v1:u7 // use-side: optimizer proved v1 in [0, 100], within the u7 range
ret v2:s32
}
The :u7 on the use of v1 is stronger than the :s32 on its definition. This is valid
because the use-side annotation is a local fact about this specific use site.
Example: known bits after optimization
v3:s32:known(xxxx_xxxx_xxxx_xxxx_????_????_????_???0) = and v1:s32, v2:s32
The and result has bit 0 known to be 0 (discovered by analysis), bits [1:15] are
demanded but unknown, and bits [16:31] are don’t-care.
Semantics
Result-side annotation violation: the instruction itself produces poison.
```
v0:s8 = add v1, v2   // if the true result is 200, v0 becomes poison
```
Use-side annotation violation: the consuming instruction produces poison.
```
v1 = foo v0:u4   // if v0 is 20, then foo produces poison
```
Constraint strength: a use-side annotation may be stronger (narrower) than the result-side annotation on the referenced value. It must not be weaker — a weaker use-side annotation is redundant and should be omitted.
No annotation: the value is an unconstrained infinite precision int. During
instruction selection, the verifier reports an error if it cannot determine a concrete
machine type for an unconstrained value.
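The result-side rule can be sketched as a range check that yields poison on violation. This is an illustrative model, not the actual `tuffy_ir` API:

```rust
/// Minimal model of an IR value: either a mathematical integer or poison.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i128),
    Poison,
}

/// Sketch of the result-side `:s<N>` rule: if the true (infinite
/// precision) result does not fit in N signed bits, the instruction
/// produces poison. Assumes 0 < n < 128 for the i128 model.
fn check_sn(v: i128, n: u32) -> Value {
    let lo = -(1i128 << (n - 1));
    let hi = (1i128 << (n - 1)) - 1;
    if v >= lo && v <= hi {
        Value::Int(v)
    } else {
        Value::Poison
    }
}

fn main() {
    // v0:s8 = add ... where the true result is 200: out of [-128, 127].
    assert_eq!(check_sn(200, 8), Value::Poison);
    // A result that fits :s8 passes through unchanged.
    assert_eq!(check_sn(-128, 8), Value::Int(-128));
}
```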
Function signatures
Function signatures carry annotations on parameter types and the return type. These are ABI-level contracts between caller and callee:
```
func @add_i32(int:s32, int:s32) -> int:s32 { ... }
```
- Parameter annotation: the caller guarantees the argument satisfies the constraint. Inside the function body, `param` result-side annotations match the signature.
- Return annotation: the callee guarantees the return value satisfies the constraint. The `ret` use-side annotation must be at least as strong as the signature’s return annotation.
Text format: elide top-level region
Every function has exactly one top-level `region function`. The text format elides it:
// Before:
func @f(int) -> int {
region function {
bb0:
...
}
}
// After:
func @f(int) -> int {
bb0:
...
}
Inner regions (loop, plain) are still written explicitly.
Reference-level explanation
Annotation representation
In the Rust implementation, annotations are stored alongside each operand (use-side) and each instruction result (result-side). A compact representation:
```rust
use num_bigint::BigUint; // arbitrary-width bit masks (num-bigint crate)

/// Per-bit four-state annotation.
struct KnownBits {
    /// Bits known to be 1.
    ones: BigUint,
    /// Bits known to be 0.
    zeros: BigUint,
    /// Bits that are demanded by the consumer.
    demanded: BigUint,
}
```
Invariant: `ones & zeros == 0` (a bit cannot be simultaneously known-0 and known-1).
The `:s<N>` and `:u<N>` shorthand annotations are sugar over `KnownBits`:
- `:s32` implies demanded bits [0:31], with the sign bit propagation constraint.
- `:u32` implies demanded bits [0:31], with bits [32:] known-0.
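The unsigned case of this desugaring can be sketched with plain `u64` masks (the real representation uses `BigUint`; `n < 64` is assumed here):

```rust
/// Sketch: desugar `:u<N>` into (ones, zeros, demanded) masks.
/// Assumes n < 64; the real implementation uses arbitrary-width integers.
fn unsigned_ann(n: u32) -> (u64, u64, u64) {
    let demanded = (1u64 << n) - 1; // bits [0:n-1] are demanded
    let zeros = !demanded;          // bits [n:] are known-0
    (0, zeros, demanded)            // no bits are known-1
}

fn main() {
    let (ones, zeros, demanded) = unsigned_ann(8); // :u8
    assert_eq!(ones, 0);
    assert_eq!(demanded, 0xFF);      // low 8 bits demanded
    assert_eq!(zeros, !0xFFu64);     // everything above bit 7 known-0
    assert_eq!(ones & zeros, 0);     // invariant holds
}
```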
Interaction with other features
Block arguments: block arguments carry result-side annotations. Branch instructions carry use-side annotations on the passed values:
```
bb0:
  v0:s32 = iconst 42
  br bb1(v0:s32)

bb1(v1:s32: int):
  ret v1:s32
```
Terminators: ret, br, brif, continue, region_yield all carry use-side
annotations on their operands.
Memory operations: load result carries a result-side annotation. store value
operand carries a use-side annotation. Pointer operands are not annotated (pointers
have concrete types).
Instruction selection: the verifier walks all values and checks that every int
value has sufficient annotation (result-side or use-side) to determine a concrete
machine type. Missing annotations are reported as errors.
Grammar changes
```
annotation  ::= (':' range_ann)*
range_ann   ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary     ::= [01?x_]+

function    ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list  ::= (type annotation (',' type annotation)*)?
body        ::= (block | region)*
instruction ::= (value annotation '=')? opcode operands
operand     ::= value annotation
```
The top-level region function { ... } wrapper is removed from the grammar. The
function body directly contains blocks and regions.
Drawbacks
- Annotation storage overhead — Every use edge now carries optional annotation data. For dense integer code this may increase memory usage compared to standalone assert instructions (one annotation per use vs one instruction per value).
- Complexity in the builder API — The builder must accept annotations on every operand, making the API surface larger.
- Parsing complexity — The text format parser must handle annotations on every value reference, increasing grammar complexity.
Rationale and alternatives
Why annotations on edges instead of standalone instructions?
The core insight is that range constraints and bit-level facts are properties of
how a value is used, not properties of the value itself. A value v0 may be in
the full int range, but a particular consumer may only need 8 bits of it. Encoding
this on the use edge is the natural representation.
Alternative: keep assert nodes, add side-table analyses
This is the LLVM approach — KnownBits and DemandedBits live in analysis caches
outside the IR. The problem is phase ordering: after a transformation, the side tables
are stale and must be recomputed. Tuffy’s “analysis is also a transformation” principle
rejects this approach.
Alternative: assert nodes with use-side annotations (hybrid)
Keep assert nodes for result-side constraints, add use-side annotations only for analysis results. This avoids changing the instruction set but retains the instruction bloat problem and the artificial SSA value issue.
Why signed/unsigned is a binary choice
A natural question is whether annotations should support expressing both signed and
unsigned constraints simultaneously — e.g., a value that is both :s32 and :u16.
The answer is no, because the intersection of any two range annotations is always
expressible as a single annotation:
| Intersection | Result | Reasoning |
|---|---|---|
| `:s<M> ∩ :s<N>` | `:s<min(M,N)>` | Smaller signed range is a subset of the larger |
| `:u<M> ∩ :u<N>` | `:u<min(M,N)>` | Smaller unsigned range is a subset of the larger |
| `:s<M> ∩ :u<N>` | `:u<min(M-1, N)>` | The non-negative portion of `:s<M>` is [0, 2^(M-1)-1], which equals `:u<M-1>` |
The third case is the key insight. A signed M-bit value is in [-2^(M-1), 2^(M-1)-1].
An unsigned N-bit value is in [0, 2^N-1]. Their intersection is
[0, min(2^(M-1)-1, 2^N-1)], which is always an unsigned range expressible as
`:u<min(M-1, N)>`.
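The signed/unsigned case can be spot-checked with a small range-intersection sketch (illustrative only; ranges are modeled as inclusive `(lo, hi)` pairs):

```rust
/// Inclusive value range of a signed M-bit annotation :s<M>.
fn s_range(m: u32) -> (i128, i128) {
    (-(1i128 << (m - 1)), (1i128 << (m - 1)) - 1)
}

/// Inclusive value range of an unsigned N-bit annotation :u<N>.
fn u_range(n: u32) -> (i128, i128) {
    (0, (1i128 << n) - 1)
}

/// Intersection of two inclusive ranges.
fn intersect(a: (i128, i128), b: (i128, i128)) -> (i128, i128) {
    (a.0.max(b.0), a.1.min(b.1))
}

fn main() {
    // :s32 ∩ :u16 = :u16, since min(32-1, 16) = 16.
    assert_eq!(intersect(s_range(32), u_range(16)), u_range(16));
    // :s8 ∩ :u16 = :u7, since min(8-1, 16) = 7.
    assert_eq!(intersect(s_range(8), u_range(16)), u_range(7));
}
```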
Since any combination of signed and unsigned constraints collapses to a single annotation, supporting both simultaneously adds no expressive power — it only adds implementation complexity.
For constraints that range annotations genuinely cannot express (e.g., “bit 3 is
known to be 1” or “bits above 8 are don’t-care”), the KnownBits extension is the
correct generalization. KnownBits subsumes range annotations entirely: :s32 and
:u32 are sugar over specific KnownBits patterns.
Impact of not doing this
Without this change, the IR would need both assert nodes (for range constraints) and a separate mechanism (side tables or new annotation syntax) for known/demanded bits. Two parallel systems for the same concept.
Prior art
- LLVM KnownBits / DemandedBits — Side-table analyses that track per-bit information. Effective but disconnected from the IR, causing phase ordering issues and stale results after transformations. Tuffy’s approach promotes these to first-class IR constructs.
- Cranelift aegraph annotations — Cranelift’s e-graph framework attaches facts to values (range constraints for proof-carrying code). Similar in spirit to result-side annotations, but not on use edges.
- MLIR type refinement — MLIR dialects can carry refined type information on values. However, MLIR’s approach is type-based (the type itself changes), not annotation-based (the type stays `int`; annotations are metadata).
Unresolved questions
- Compact storage for common cases — Most annotations will be simple `:s32` or `:u64`. Should the implementation use a small-enum optimization to avoid allocating `BigUint` for these common cases?
- Annotation canonicalization — Should redundant annotations (e.g., `:s32` on a use when the result already has `:s32`) be stripped automatically, or preserved for readability?
- Lean formalization — How should annotations be represented in the Lean formal model? As metadata on the operational semantics, or as part of the value type?
Future possibilities
- Annotation inference pass — A dedicated pass that propagates known bits forward and demanded bits backward, filling in `?` bits with `0`/`1` and narrowing `?` to `x` where possible.
- Annotation-driven instruction selection — The instruction selector reads annotations directly to choose machine instructions, register classes, and sign/zero extension without a separate legalization phase.
- Annotation-aware rewrite rules — Declarative rewrite rules that match on annotation patterns, e.g., “if all bits above bit 7 are don’t-care, narrow to 8-bit operation”.
- Floating point annotations — Extend the annotation system to floating point values (e.g., known-NaN, known-finite, known-positive).
SOP: Modifying the Tuffy IR
- Created: 2026-02-08
- Status: Active
Overview
This document describes the standard operating procedure for making changes to the Tuffy IR. Because the IR is the core data structure shared across the entire compiler, changes must be propagated to multiple components in a specific order.
Guiding Principle
The Lean 4 formal definitions are the source of truth. All other artifacts (Rust code, spec document) must conform to the Lean definitions. When in doubt, the Lean code is authoritative.
Checklist
1. Lean 4 Formal Definitions (Source of Truth)
Update the formal model first. Depending on the nature of the change:
| What changed | File to update |
|---|---|
| Types, values, memory model | lean/TuffyLean/IR/Types.lean |
| Instruction semantics | lean/TuffyLean/IR/Semantics.lean |
After editing, verify with:
```
cd lean && lake build
```
If the change affects existing proofs, fix all proof breakages before proceeding.
2. Rust Implementation (tuffy_ir)
Translate the Lean changes into the Rust IR crate. Files to consider:
| What changed | File to update |
|---|---|
| Type system (new type or type change) | tuffy_ir/src/types.rs |
| New/modified instruction or opcode | tuffy_ir/src/instruction.rs |
| Reference handles (ValueRef, BlockRef, etc.) | tuffy_ir/src/value.rs |
| Function/block/region structure | tuffy_ir/src/function.rs |
| Builder API for constructing IR | tuffy_ir/src/builder.rs |
| Textual format / display | tuffy_ir/src/display.rs |
Run after changes:
```
cargo test -p tuffy_ir
cargo clippy -p tuffy_ir
```
3. Downstream Crates
IR changes typically break downstream consumers. Update in dependency order:
- Interpreter (`tuffy_ir_interp/src/lib.rs`) — Add/update evaluation logic for new or changed instructions.
- Optimizer (`tuffy_opt/`) — Update any passes that pattern-match on changed instructions.
- Code Generation (`tuffy_codegen/`) — Update instruction selection (`src/isel.rs`), encoding (`src/encode.rs`), or emission (`src/emit.rs`) as needed.
- Trace / TV (`tuffy_trace/`, `tuffy_tv/`) — Update if the change affects tracing or visualization.
4. Spec Document
Update the human-readable reference:
`docs/spec/` (split into multiple files; see `docs/spec/index.md`)
Ensure the spec accurately reflects the new Lean definitions and Rust implementation. The spec is not authoritative — it is documentation derived from the Lean model.
5. Tests
Add or update tests at every layer:
| Layer | Location |
|---|---|
| Rust unit tests | tuffy_ir/src/tests.rs |
| Interpreter tests | tuffy_ir_interp/tests/ |
| Codegen tests | tuffy_codegen/src/tests.rs |
| UI tests | rustc_codegen_tuffy/tests/ (run via rustc_codegen_tuffy/tests/run-ui-tests.sh) |
6. Final Verification
Before committing, all of the following must pass:
```
cd lean && lake build
cargo test
cargo clippy
rustc_codegen_tuffy/tests/run-ui-tests.sh
```
Commit Strategy
Each component should be committed separately per the project’s commit convention. A typical IR change produces commits in this order:
- `feat(lean): ...` — Lean formal definitions
- `feat(ir): ...` — Rust `tuffy_ir` crate
- `feat(interp): ...` / `feat(codegen): ...` — downstream crates
- `docs: ...` — spec update
Every individual commit must pass all tests.
SOP: Updating Documentation
- Created: 2026-02-09
- Status: Active
Overview
This document describes the checks required whenever documentation files under docs/
are added or modified. Skipping these checks can cause broken links, missing pages on
the published site, or stale content that contradicts the source of truth.
Checklist
1. SUMMARY.md
Every Markdown file that should appear on the published site must be listed in
docs/SUMMARY.md. mdBook only builds pages present in the summary — unlisted files
will result in 404 errors.
When adding a new page:
- Add an entry to `docs/SUMMARY.md` in the correct section and nesting level.
- If the page belongs to an existing group (e.g., `spec/`), add it as a nested entry under the group’s parent.
2. Spec Index
If the change adds or removes a page under docs/spec/, update the Table of Contents
in docs/spec/index.md to match.
3. Source of Truth Conformance
The spec documents (docs/spec/) describe the IR but are not authoritative. The
Lean 4 definitions (lean/TuffyLean/IR/) are the source of truth.
When updating spec docs, verify that:
- Descriptions match the current Lean definitions.
- Semantics formulas reference the correct `eval*` function names from `TuffyLean.IR.Semantics`.
- Type names and instruction opcodes match the Rust implementation in `tuffy_ir/`.
4. Internal Links
Check that all cross-references between documentation pages are valid:
- Relative links (e.g., `[types](types.md)`) point to files that exist.
- Anchor links (e.g., `instructions.md#comparison`) reference headings that exist in the target file.
5. Build Verification
After making changes, build the documentation locally and verify:
```
mdbook build
```
The build must complete without errors. Optionally preview with:
```
mdbook serve
```
6. Commit
Documentation-only changes use the docs commit type:
```
docs: <description>
docs(spec): <description>   # for spec-specific changes
```
If the documentation change accompanies a code change (e.g., adding a new instruction), the docs update should be a separate commit from the code change, per the project’s commit convention.