Introduction

Tuffy is an experimental optimizing compiler written in Rust, developed with LLM assistance.

Note: This project is currently in an experimental stage.

Key Design Ideas

  • Infinite precision integers — The IR uses a single int type with no fixed bitwidth. Arithmetic operates on mathematical integers; signedness and minimum required bits are derived at use sites. This eliminates zext/sext/trunc noise and lets optimization passes focus on mathematical equivalence.
  • Analysis is also a transformation — Static analyses are expressed as transformations on the IR itself, unifying analysis and optimization and reducing phase ordering problems.
  • Declarative, verified optimizations — Transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4 against the formal IR semantics. A code generator produces Rust implementations from the verified rules, minimizing the trusted code base.
  • Formal-first — Correctness of optimization passes is backed by formal verification from the start, not bolted on after the fact.
  • Hardware-friendly IR — IR representations designed to be cache-friendly and suitable for modern hardware, reducing pointer chasing and improving data locality during compilation.
  • Policy-mechanism separation — Optimization strategies are decoupled from the compiler engine. Users can specify or pin a particular optimization policy to prevent performance regressions when upgrading the compiler.

Documentation Overview

Tuffy IR Language Reference

This document is the reference manual for the Tuffy IR, the intermediate representation used by the tuffy compiler. It describes the type system, instruction set, control flow structure, and textual format of the IR.

Status: This spec describes the currently implemented IR. Features described in the IR Design RFC but not yet implemented are noted as planned.

Source of truth: The formal definitions in Lean 4 (lean/TuffyLean/IR/) are authoritative. This spec is a human-readable description that must conform to the Lean definitions. When in doubt, consult the Lean code.

Introduction

Tuffy IR is a typed, SSA-based intermediate representation organized around a hierarchical control flow graph with SESE (Single Entry, Single Exit) regions.

Key design principles:

  • Infinite precision integers — no fixed bit widths; range constraints via assert nodes
  • Block arguments instead of PHI nodes (Cranelift/MLIR style)
  • Hierarchical CFG — loops and structured control flow are explicit SESE regions
  • Poison-only UB model — no undef; only poison
  • Single data structure — one IR across all compilation stages

Type System

Tuffy IR has a minimal type system supporting scalar types and aggregate types (structs, arrays). Aggregate types flow through the IR as single values; ABI-specific decomposition is handled by the backend legalize pass.

int

Infinite precision integer. No fixed bit width, no signedness. Arithmetic on int is mathematical: add 3, 5 produces 8 with no possibility of overflow.

Range constraints are expressed via value annotations rather than the type itself. The decision of sign-extension vs zero-extension is deferred to instruction selection.
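As a sketch of how this plays out, the following Python model (Python ints are arbitrary precision, like the IR's int) shows mathematical addition combined with a hypothetical result-side :s32 annotation check; the function names and the use of None for poison are illustrative, not the compiler's API.

```python
# Model of infinite precision `add` plus a result-side :s32 annotation.
# Poison is modeled as None. Names here are illustrative only.
S32_MIN, S32_MAX = -(1 << 31), (1 << 31) - 1

def add(a, b):
    return a + b  # never overflows: mathematical integers

def annotate_s32(v):
    # result-side :s32 — an out-of-range result becomes poison
    return v if S32_MIN <= v <= S32_MAX else None

assert add(3, 5) == 8
assert annotate_s32(add(3, 5)) == 8
assert annotate_s32(add(S32_MAX, 1)) is None  # would overflow i32 → poison
```

Note that the addition itself is always exact; only the annotation introduces poison, which is what lets passes reason about the arithmetic as pure mathematics.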

bool

Boolean type with two values: true and false. Distinct from integers — boolean logic is not conflated with integer arithmetic. Comparison instructions (icmp) return bool, and control flow instructions (brif, select) require bool conditions.

To convert between bool and int, use bool_to_int (true → 1, false → 0). The reverse conversion uses icmp.ne val, 0.

byte(N)

Raw memory data of N bytes. Distinct from integers. The byte type preserves pointer fragments and supports per-byte poison tracking. Loads and stores operate on byte types.

Note: byte is defined in the type system but not yet used by any implemented instructions. Memory operations currently use int directly.

ptr(AS)

Pointer with address space AS. Address space 0 is the default. Pointers carry provenance (allocation ID + offset). Different address spaces may have different pointer sizes and semantics.

mem

Abstract memory state token for Memory SSA. Memory tokens flow through the function as regular SSA values, making memory dependencies explicit. The entry block receives the initial memory state as a block parameter of type mem; ret carries the final mem token as an operand.

Memory-producing operations (store, call, fence, atomics) consume a mem token and produce a new one. Plain load consumes a mem token but does not produce one. At CFG join points, memory versions merge via regular block parameters of type mem.

float

IEEE 754 floating point type. Variants: bf16 (bfloat16), f16 (half), f32 (single), f64 (double). Floating point operations (fadd, fsub, etc.) operate on float-typed values. The result type of a float operation matches the operand float width.

vec(VT)

Vector type parameterized by total bit-width. Variants: vec<N> (fixed-width, e.g., 128 for SSE), vec<vscale x N> (scalable, e.g., SVE, RVV). Element count is derived from the bit-width and element size.

struct{T0, T1, ...}

Struct type with field types. Represents a heterogeneous aggregate with indexed fields. Struct values flow through the IR as single SSA values. Field access uses extractvalue and insertvalue instructions with index paths.

Padding must be explicitly represented in the struct type. Padding fields are typically represented as [byte(N); M] (array of N-byte chunks) or byte(N) for individual padding bytes.

Example: struct{int, [byte(8); 1], bool} is a struct with an int field, 8 bytes of padding, and a bool field. The padding is explicit and must be accounted for in field indices.

The backend legalize pass expands struct values into register-sized pieces according to the target ABI (e.g., System V AMD64 ABI rules for struct passing and return).

[T; N]

Array type with element type T and count N. Represents a homogeneous aggregate with N elements of type T. Array values flow through the IR as single SSA values. Element access uses extractvalue and insertvalue instructions with index paths.

Example: [int; 10] is an array of 10 integers.

The backend legalize pass expands array values into register-sized pieces according to the target ABI.

Values

A value in tuffy IR is one of five kinds, as defined in Lean (TuffyLean.IR.Value):

Kind       Description
int(v)     An infinite precision integer with mathematical value v
bool(v)    A boolean value (true or false)
bytes(bs)  A sequence of abstract bytes
ptr(p)     A pointer with provenance (allocation ID + offset)
poison     A poison value — the result of undefined behavior

Abstract Bytes

Each byte in memory is represented as an abstract byte (TuffyLean.IR.AbstractByte):

State                        Description
bits(val)                    A concrete byte value (0–255)
ptrFragment(allocId, index)  The index-th byte of a pointer to allocation allocId
uninit                       Uninitialized memory (a distinct abstract state, not “random bits”)
poison                       A poisoned byte

Pointers and Provenance

A pointer consists of an allocation ID and an integer offset (TuffyLean.IR.Pointer). The allocation ID tracks which allocation the pointer belongs to, enabling provenance-based alias analysis.

Functions

A function is the top-level unit of compilation. It has a name, typed parameters, an optional return type, and a body organized as a hierarchical CFG.

func @name(param_types...) -> ret_type {
  ...
}
  • Name: prefixed with @ in text format.
  • Parameters: a list of types. Parameter values are created via param instructions. Parameters may optionally carry source-level names for readability (see below).
  • Return type: optional. Functions with no return type return void.
  • Body: blocks and nested regions directly inside the function braces. The implicit top-level function region is not written in the text format.

Named Parameters

When source-level parameter names are available (e.g., extracted from MIR debug info), they are displayed in the function signature and param instructions:

func @add(%a: int, %b: int) -> int {
  bb0:
    v0 = param %a
    v1 = param %b
    v2 = add v0, v1
    ret v2
}

Parameter names are prefixed with % in text format. When no name is available, the numeric ABI index is used as a fallback:

func @add(int, int) -> int {
  bb0:
    v0 = param 0
    v1 = param 1

Named parameters are display-only — the internal representation uses numeric ABI indices (Op::Param(u32)) for code generation.

Regions

The CFG is organized as a tree of SESE (Single Entry, Single Exit) regions. Each region contains an ordered sequence of basic blocks and child regions.

Region Kinds

Kind      Description
function  Top-level function region. Every function has exactly one.
loop      Loop region. The entry block is the loop header. Backedges use continue.
plain     Generic SESE region for structured control flow.

Nesting and Implicit Capture

Regions nest to form a tree. Values defined in an outer region are visible in inner regions via implicit capture (VPlan style). There is no explicit capture list — the live-in set is computed on demand.

func @example(int) -> int {
  bb0:
    v0 = iconst 10       // v0 defined in function body

  region loop {
    bb1(v1: int):
      v2 = add v1, v0    // v0 captured implicitly from outer scope
      ...
  }
}

Loop Regions

A loop region’s entry block is the loop header. The continue terminator transfers control back to the loop header, passing new values for the header’s block arguments. Exiting a loop is done by branching to a block outside the loop region.

Basic Blocks

A basic block is a straight-line sequence of instructions with no internal control flow. Blocks are named bb0, bb1, etc. in text format.

Block Arguments

Tuffy IR uses block arguments instead of PHI nodes to represent values that merge at control flow join points. Block arguments are declared in the block header and receive values from predecessor branches:

bb0:
  v0 = iconst 1
  br bb1(v0)

bb1(v1: int):       // v1 is a block argument
  ret v1

Each branch instruction (br, brif, continue) passes values to the target block’s arguments. The number and types of passed values must match the target block’s argument list.

Instruction Ordering

Instructions within a block execute sequentially. The last instruction in a block must be a terminator (ret, br, brif, continue, region_yield).

Instructions

Every instruction produces a value (named vN in text format) and has a result type. Terminators are not separated from regular instructions — they are simply the last instruction in a block by convention.

Constants

iconst

vN = iconst <immediate>

Produces an integer constant. The immediate is an arbitrary-precision integer, matching the Lean spec’s Int type.

Semantics: Value.int(immediate)

param

vN = param <index>
vN = param %name

References a function parameter by index (0-based) or by name. When source-level parameter names are available, the named form param %name is used in the text format. The internal representation always uses the numeric ABI index. This is not a true constant but creates a named value for the parameter within the instruction stream.

Arithmetic

All integer arithmetic operates on mathematical integers with infinite precision. There is no overflow. The formal semantics are defined in TuffyLean.IR.Semantics.

add

vN = add vA, vB

Integer addition. Semantics: evalAdd(a, b) = a + b

sub

vN = sub vA, vB

Integer subtraction. Semantics: evalSub(a, b) = a - b

mul

vN = mul vA, vB

Integer multiplication. Semantics: evalMul(a, b) = a * b

div

vN = div vA, vB

Integer division. Produces poison if vB is zero. Signedness is a property of operand annotations, not the operation itself — in the infinite precision integer model, signed and unsigned division are mathematically identical. Semantics: evalDiv(a, b) = if b = 0 then poison else a / b

rem

vN = rem vA, vB

Integer remainder. Produces poison if vB is zero. Signedness is a property of operand annotations, not the operation itself. Semantics: evalRem(a, b) = if b = 0 then poison else a % b

and

vN = and vA, vB

Bitwise AND on infinite precision two’s complement integers. Semantics: evalAnd(a, b) = Int.land a b

or

vN = or vA, vB

Bitwise OR on infinite precision two’s complement integers. Semantics: evalOr(a, b) = Int.lor a b

xor

vN = xor vA, vB

Bitwise XOR on infinite precision two’s complement integers. Semantics: evalXor(a, b) = Int.xor a b

shl

vN = shl vA, vB

Left shift. Produces poison if the shift amount vB is negative. Semantics: evalShl(a, b) = if b < 0 then poison else a <<< b

shr

vN = shr vA, vB

Right shift. Produces poison if the shift amount vB is negative. Signedness is a property of operand annotations, not the operation. In infinite precision, logical and arithmetic right shift are identical. Semantics: evalShr(a, b) = if b < 0 then poison else a >>> b
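A small Python sketch of both shifts under this model (Python's >> on ints is an arithmetic/floor shift, which matches the single shr; None stands in for poison):

```python
# Sketch of shl/shr on mathematical integers; poison modeled as None.
def shl(a, b):
    return None if b < 0 else a << b

def shr(a, b):
    # In infinite precision there are no "top bits" to fill, so logical and
    # arithmetic right shift coincide; Python's >> (floor shift) matches.
    return None if b < 0 else a >> b

assert shl(3, 4) == 48
assert shr(-8, 1) == -4   # floor(-8 / 2)
assert shr(-7, 1) == -4   # floor division, not truncation toward zero
assert shr(5, -1) is None # negative shift amount → poison
```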

count_ones

vN = count_ones vA

Population count: returns the number of set bits in the binary representation of vA. Produces poison if vA is negative. Semantics: evalCountOnes(a) = if a < 0 then poison else popcount(a)

merge

vN = merge.<width> vA, vB

Replace the low width bits of vA with the low width bits of vB, producing a single integer. Produces poison if width is 0. Semantics: evalMerge(a, b, width) = if width = 0 then poison else (a with low width bits cleared) | (b AND (2^width - 1))

split

vHi, vLo = split.<width> vA

Decompose vA at bit position width. Produces two results:

  • vLo = the low width bits of vA (zero-extended)
  • vHi = vA right-shifted by width bits

Produces poison if width is 0. This is a multi-result instruction. Semantics: evalSplitHi(a, width) = a >>> width, evalSplitLo(a, width) = a mod 2^width
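The interplay between merge and split can be sketched in Python as follows; None models poison, and the function names are illustrative:

```python
# Sketch of merge/split semantics on mathematical integers.
def merge(a, b, width):
    if width == 0:
        return None  # poison
    mask = (1 << width) - 1
    # clear the low `width` bits of a, then insert the low bits of b
    return ((a >> width) << width) | (b & mask)

def split(a, width):
    if width == 0:
        return None  # poison
    return a >> width, a % (1 << width)  # (hi, lo); lo is zero-extended

assert merge(0x1200, 0xAB, 8) == 0x12AB
assert split(0x12AB, 8) == (0x12, 0xAB)
# split recovers what merge put in each half:
assert split(merge(0x1234, 0xCD, 8), 8) == (0x12, 0xCD)
```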

clmul

vN = clmul vA, vB

Carry-less multiplication (polynomial multiplication over GF(2)). Multiplies vA and vB using XOR instead of addition for accumulating partial products. Produces poison if either operand is negative. Semantics: evalClmul(a, b) = if a < 0 ∨ b < 0 then poison else clmulNat(a, b)
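The shift-and-XOR structure of carry-less multiplication can be sketched directly in Python (None models poison; this is an illustrative model, not the compiler's implementation):

```python
# Carry-less multiplication: schoolbook multiply, but partial products
# accumulate with XOR instead of addition.
def clmul(a, b):
    if a < 0 or b < 0:
        return None  # poison
    acc = 0
    while b:
        if b & 1:
            acc ^= a  # XOR, not +, when accumulating
        a <<= 1
        b >>= 1
    return acc

assert clmul(0b11, 0b11) == 0b101  # (x+1)^2 = x^2+1 over GF(2)
assert clmul(2, 3) == 6            # agrees with mul when no carries occur
assert clmul(-1, 2) is None        # negative operand → poison
```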

Floating Point Arithmetic

Floating point instructions operate on values of float type (bf16, f16, f32, f64). The formal semantics use Lean 4’s native Float type (IEEE 754 binary64) as a placeholder model; it will be refined when full NaN payload and denormal semantics are decided.

Rewrite flags

Binary floating-point arithmetic instructions (fadd, fsub, fmul, fdiv) may carry optional rewrite flags that grant the optimizer permission to apply algebraic transformations. These flags do not change the instruction’s operational semantics — they only widen the set of legal rewrites.

Flag      Meaning
reassoc   Allow associativity / commutativity reordering
contract  Allow contraction (e.g., a*b+c → fma(a,b,c))

Flags appear between the opcode and the operands:

vN = fadd reassoc contract vA, vB

When no flags are present the instruction follows strict IEEE 754 semantics. Mirrors TuffyLean.IR.FpRewriteFlags.

fadd

vN = fadd [flags] vA, vB

Floating point addition. Semantics: evalFAdd(a, b) = a + b

fsub

vN = fsub [flags] vA, vB

Floating point subtraction. Semantics: evalFSub(a, b) = a - b

fmul

vN = fmul [flags] vA, vB

Floating point multiplication. Semantics: evalFMul(a, b) = a * b

fdiv

vN = fdiv [flags] vA, vB

Floating point division. Semantics: evalFDiv(a, b) = a / b

fneg

vN = fneg vA

Floating point negation. Semantics: evalFNeg(a) = -a

fabs

vN = fabs vA

Floating point absolute value. Semantics: evalFAbs(a) = Float.abs a

copysign

vN = copysign vMag, vSign

Produce a value with the magnitude of vMag and the sign bit of vSign. Semantics: evalCopySign(mag, sign) = if sign < 0 then -(abs mag) else abs mag

Comparison

icmp

vN = icmp.<pred> vA, vB

Integer comparison. Returns a bool value: true if the comparison holds, false otherwise.

Predicates:

Predicate  Description
eq         Equal
ne         Not equal
lt         Less than (signedness from annotation)
le         Less than or equal (signedness from annotation)
gt         Greater than (signedness from annotation)
ge         Greater than or equal (signedness from annotation)

Semantics: evalICmp(op, a, b) = bool(op(a, b))

select

vN = select vCond, vTrue, vFalse

Conditional select. If vCond is true, produces vTrue; otherwise produces vFalse. The condition must be of type bool. The result type matches the type of vTrue/vFalse.

Semantics: evalSelect(cond, tv, fv) = if cond then tv else fv

bool_to_int

vN = bool_to_int vA

Convert a bool to an int: true → 1, false → 0.

Semantics: evalBoolToInt(b) = if b then 1 else 0

Value Annotations

Range constraints and bit-level facts are encoded as annotations on value definitions (result-side) and use edges (use-side), rather than as standalone instructions. See RFC: at-use-annotations for the full design.

Annotation types

Annotation             Meaning
:s<N>                  Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1]
:u<N>                  Value is in unsigned N-bit range [0, 2^N-1]
:known(<ternary>)      Per-bit four-state constraint
:nofpclass(<classes>)  Float value must not belong to the listed FP classes
(none)                 No constraint; unconstrained int

Annotations are composable: :s32:known(...) applies both constraints simultaneously.

Known bits encoding

Each bit in a known annotation is one of four states:

Symbol  Meaning
0       Bit is known to be 0
1       Bit is known to be 1
?       Unknown — bit is demanded but value not determined
x       Don’t-care — bit is not demanded
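A minimal Python sketch of checking a concrete value against such a pattern; the pattern is written MSB-first over its length, bits above the pattern are treated as don't-care here, and the function name is illustrative:

```python
# Check a value against a four-state :known(...) pattern ('0','1','?','x').
def matches_known(value, pattern):
    for i, sym in enumerate(reversed(pattern)):  # index i = bit i (LSB first)
        bit = (value >> i) & 1
        if sym == '0' and bit != 0:
            return False
        if sym == '1' and bit != 1:
            return False
        # '?' (demanded, unknown) and 'x' (not demanded) accept either bit
    return True

assert matches_known(0b1010, "10?0")      # the '?' bit may be 0 or 1
assert matches_known(0b1000, "10?0")
assert not matches_known(0b1011, "10?0")  # lowest bit must be 0
```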

nofpclass

Constrains which IEEE 754 floating-point value classes a float value may belong to. If the value falls into an excluded class, the result is poison. This is separate from integer annotations (:s<N>, :u<N>) and mirrors LLVM’s nofpclass attribute.

The 10 individual FP classes (mirroring LLVM’s FPClassTest):

Class  Description
snan   Signaling NaN
qnan   Quiet NaN
ninf   Negative infinity
nnorm  Negative normal
nsub   Negative subnormal
nzero  Negative zero
pzero  Positive zero
psub   Positive subnormal
pnorm  Positive normal
pinf   Positive infinity

Convenience groups:

Group  Expands to
nan    snan qnan
inf    ninf pinf
zero   nzero pzero

Syntax:

v0:nofpclass(nan inf) = fadd v1, v2   // result must not be NaN or ±Inf
v1 = fsub v0:nofpclass(nzero), v2     // use-side: v0 must not be -0.0

Mirrors TuffyLean.IR.FpClassMask.
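As an illustration of the class taxonomy, the following Python sketch classifies a binary64 value into these classes and checks it against an excluded set; snan/qnan cannot be distinguished from Python, so both collapse to a single 'nan' here, and all names are illustrative:

```python
import math

SMALLEST_NORMAL = 2.2250738585072014e-308  # min positive normal double

def fp_class(x):
    if math.isnan(x):
        return 'nan'                      # snan/qnan not distinguished here
    if math.isinf(x):
        return 'ninf' if x < 0 else 'pinf'
    neg = math.copysign(1.0, x) < 0       # catches the sign of -0.0 too
    if x == 0.0:
        return 'nzero' if neg else 'pzero'
    if abs(x) < SMALLEST_NORMAL:
        return 'nsub' if neg else 'psub'
    return 'nnorm' if neg else 'pnorm'

def violates_nofpclass(x, excluded):
    # a value falling into an excluded class would make the result poison
    return fp_class(x) in excluded

assert fp_class(-0.0) == 'nzero'
assert fp_class(1.0) == 'pnorm'
assert fp_class(5e-324) == 'psub'        # smallest positive subnormal
assert violates_nofpclass(float('inf'), {'ninf', 'pinf'})
```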

Result-side annotations

Declared on the value produced by an instruction. Violation causes the instruction to produce poison.

v0:s32 = add v1, v2    // if true result outside [-2^31, 2^31-1], v0 is poison

Use-side annotations

Declared on an operand of a consuming instruction. Violation causes the consuming instruction to produce poison. May be stronger than the result-side annotation.

v1 = foo v0:u8          // if v0 is outside [0, 255], foo produces poison

Function signature annotations

Function signatures carry annotations as ABI-level contracts:

func @add_i32(int:s32, int:s32) -> int:s32 { ... }

Parameter annotations are caller guarantees; return annotations are callee guarantees.

Memory Operations

All memory operations participate in Memory SSA. Operations that modify memory state consume a mem token and produce a new one. Plain loads consume a mem token but do not produce one. See types.md for details on the mem type.

load

vN = load.<size> vPtr, vMem

Load size bytes from the address pointed to by vPtr. Takes a mem token as input (MemoryUse). Returns a data value only — does not produce a new mem token.

Semantics: evalLoad(mem, addr, size) = bytes [mem[addr], mem[addr+1], ..., mem[addr+size-1]]

store

vN = store.<size> vVal, vPtr, vMem

Store a value to the address pointed to by vPtr. Takes a mem token as input (MemoryDef) and produces a new mem token as its result.

Semantics: evalStore(mem, addr, bs) = mem[addr..addr+len(bs)] := bs

stack_slot

vN = stack_slot <bytes>

Allocate bytes bytes on the stack. Returns a ptr(0) to the allocated memory.

memcpy

vN = memcpy vDst, vSrc, vCount, align=<N>, vMem

Copy vCount bytes from vSrc to vDst. The source and destination regions must not overlap (non-overlapping memcpy semantics). align is the minimum alignment hint for both pointers (must be a power of two). Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemCopy(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count))

memmove

vN = memmove vDst, vSrc, vCount, align=<N>, vMem

Copy vCount bytes from vSrc to vDst. Unlike memcpy, overlapping regions are handled correctly: all source bytes are read before any destination bytes are written. align is the minimum alignment hint for both pointers. Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemMove(mem, dst, src, count) = evalStore(mem, dst, readBytes(mem, src, count)) (bytes are captured from mem before the store, making overlap safe)

memset

vN = memset vDst, vVal, vCount, align=<N>, vMem

Fill vCount bytes at vDst with the byte value vVal (low 8 bits). align is the minimum alignment hint for the destination pointer. Takes a mem token as input (MemoryDef) and produces a new mem token.

Semantics: evalMemSet(mem, dst, val, count) = evalStore(mem, dst, replicate(count, val & 0xFF))

Atomic Operations

Atomic operations provide thread-safe memory access with explicit memory ordering constraints. The formal semantics in Lean define sequential-only behavior; a formal concurrency model (e.g., based on C11) is TBD.

Enumerations

MemoryOrdering — memory ordering constraints for atomic operations:

Ordering  Description
relaxed   No ordering constraints
acquire   Subsequent reads see writes from the releasing thread
release   Prior writes are visible to the acquiring thread
acqrel    Combined acquire + release
seqcst    Sequentially consistent total order

AtomicRmwOp — read-modify-write operation kinds:

Op    Description
xchg  Exchange (swap)
add   Integer addition
sub   Integer subtraction
and   Bitwise AND
or    Bitwise OR
xor   Bitwise XOR

load.atomic

vM, vN = load.atomic.<ordering> vPtr, vMem

Atomic load from pointer vPtr with the specified memory ordering. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the loaded data value.

store.atomic

vN = store.atomic.<ordering> vVal, vPtr, vMem

Atomic store of vVal to pointer vPtr with the specified memory ordering. Takes a mem token as input (MemoryDef) and produces a new mem token.

rmw

vM, vN = rmw.<op>.<ordering> vPtr, vVal, vMem

Atomic read-modify-write. Atomically reads the value at vPtr, applies <op> with vVal, writes the result back. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the original value.

cmpxchg

vM, vN = cmpxchg.<success_ord>.<failure_ord> vPtr, vExpected, vDesired, vMem

Atomic compare-and-exchange. Atomically compares the value at vPtr with vExpected; if equal, writes vDesired. Takes a mem token as input (MemoryDef) and produces two results: a new mem token and the old value. The caller uses icmp to determine whether the exchange succeeded. <success_ord> applies on successful exchange; <failure_ord> applies on failure.
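Since the formal semantics are sequential-only for now, rmw and cmpxchg can be sketched as plain read-then-write steps in Python, ignoring ordering entirely; memory is a dict and a new dict stands in for the new mem token (illustrative names):

```python
import operator

# Sequential sketch of rmw/cmpxchg; ordering annotations are ignored.
RMW_OPS = {'xchg': lambda old, v: v, 'add': operator.add, 'sub': operator.sub,
           'and': operator.and_, 'or': operator.or_, 'xor': operator.xor}

def rmw(mem, addr, op, val):
    old = mem[addr]
    new = dict(mem)
    new[addr] = RMW_OPS[op](old, val)
    return new, old                    # (new mem token, original value)

def cmpxchg(mem, addr, expected, desired):
    old = mem[addr]
    new = dict(mem)
    if old == expected:
        new[addr] = desired
    return new, old                    # caller compares old vs expected

m, old = rmw({0: 5}, 0, 'add', 3)
assert (m[0], old) == (8, 5)
m, old = cmpxchg(m, 0, 8, 42)          # succeeds: old == expected
assert (m[0], old) == (42, 8)
m, old = cmpxchg(m, 0, 0, 99)          # fails: memory unchanged
assert (m[0], old) == (42, 42)
```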

fence

vN = fence.<ordering> vMem

Memory fence. Establishes ordering constraints without accessing memory. Takes a mem token as input (MemoryDef) and produces a new mem token.

Type Conversion

bitcast

vN = bitcast vA

Reinterpret the bits of vA as a different type. The result type is determined by the instruction’s type annotation.

sext

vN = sext vA, <bits>

Sign-extend vA to bits bits. Used during lowering to make bit widths explicit for instruction selection.

zext

vN = zext vA, <bits>

Zero-extend vA to bits bits. Used during lowering to make bit widths explicit for instruction selection.

Pointer Operations

Pointer operations manipulate pointers with explicit provenance tracking. The formal semantics are defined in TuffyLean.IR.Semantics.

ptradd

vN = ptradd vPtr, vOffset

Pointer addition. Offsets the pointer vPtr by vOffset bytes. The result preserves the provenance of the base pointer. Semantics: evalPtrAdd(base, offset) = ptr { allocId = base.allocId, offset = base.offset + offset }

ptrdiff

vN = ptrdiff vA, vB

Pointer difference. Computes the byte offset between two pointers. Both pointers must belong to the same allocation; otherwise the result is poison. Semantics: evalPtrDiff(a, b) = if a.allocId = b.allocId then a.offset - b.offset else poison
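Following the stated semantics, provenance-carrying pointers can be modeled as (allocation ID, offset) pairs; this Python sketch (None as poison, names illustrative) shows how ptradd preserves provenance while ptrdiff demands matching allocations:

```python
from dataclasses import dataclass

# Pointer = allocation ID + byte offset (provenance is the alloc_id).
@dataclass(frozen=True)
class Ptr:
    alloc_id: int
    offset: int

def ptradd(p, off):
    # offset moves; provenance (alloc_id) is preserved
    return Ptr(p.alloc_id, p.offset + off)

def ptrdiff(a, b):
    if a.alloc_id != b.alloc_id:
        return None  # cross-allocation difference → poison
    return a.offset - b.offset

p = Ptr(alloc_id=7, offset=0)
q = ptradd(p, 16)
assert ptrdiff(q, p) == 16
assert ptrdiff(q, Ptr(alloc_id=8, offset=16)) is None
```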

ptrtoint

vN = ptrtoint vPtr

Convert a pointer to an integer. The provenance is captured — the resulting integer retains knowledge that it came from a pointer. Semantics: evalPtrToInt(p) = p.offset

ptrtoaddr

vN = ptrtoaddr vPtr

Extract the address from a pointer, discarding provenance. Returns a plain integer with no provenance information. Semantics: evalPtrToAddr(p) = p.offset

inttoptr

vN = inttoptr vAddr

Create a pointer from an integer address. The resulting pointer has no valid provenance (wildcard provenance). Semantics: evalIntToPtr(addr, allocId) = ptr { allocId, offset = addr }

Function Calls

symbol_addr

vN = symbol_addr @name

Load the address of a named symbol (function or static data). The @name refers to an entry in the module’s symbol table. The result type is ptr.

This instruction makes function calls and static data references self-contained in the IR — the callee of a call is a value produced by symbol_addr rather than a side-channel mapping.

Semantics: Produces a pointer to the symbol identified by @name.

call

vM = call vCallee(vArg0, vArg1, ...), vMem
vM, vN = call vCallee(vArg0, vArg1, ...), vMem -> <ret_type>

Call the function pointed to by vCallee with the given arguments. Takes a mem token as input (MemoryDef). For void calls, produces a single mem token result. For non-void calls, produces two results: a new mem token and the return value.

Terminators

Terminators are instructions that end a basic block by transferring control flow. By convention, they are the last instruction in a block.

ret

ret vA, vMem
ret vMem

Return from the function. Takes a mem token as the final memory state operand. Optionally returns a data value. A function with a return type must return a value; a void function uses ret vMem with only the mem token.

br

br bbN(vA, vB, ...)
br bbN

Unconditional branch to block bbN, passing values as block arguments. If the target block has no arguments, the argument list is omitted.

brif

brif vCond, bbThen(args...), bbElse(args...)
brif vCond, bbThen, bbElse

Conditional branch. The condition vCond must be of type bool. If true, branches to bbThen; otherwise branches to bbElse. Each target may receive block arguments.

continue

continue vA, vB, ...
continue

Loop backedge. Transfers control back to the entry block of the enclosing loop region, passing values as the header block’s arguments. Only valid inside a loop region.

region_yield

region_yield vA, vB, ...
region_yield

Exit the enclosing region, yielding values to the parent region. Used for structured control flow where a region produces result values.

unreachable

unreachable

Indicates that control flow should never reach this point. If executed, the behavior is undefined (the optimizer may assume this path is dead).

trap

trap

Unconditionally abort execution. Used for runtime checks such as failed assertions (e.g., division-by-zero guards). Unlike unreachable, reaching a trap is not UB — it is a well-defined program abort.

Text Format

Tuffy IR uses a Cranelift-inspired text format for human-readable output.

Naming Conventions

Entity     Format         Example
Functions  @name          @add, @factorial
Values     vN             v0, v1, v2
Blocks     bbN            bb0, bb1
Regions    region <kind>  region loop, region plain

Values are numbered sequentially (v0, v1, v2, …) in region tree order. Within each block, block arguments are numbered before instruction results.

Grammar

function     ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list   ::= (type annotation (',' type annotation)*)?
body         ::= (block | region)*
region       ::= 'region' region_kind '{' region_body '}'
region_kind  ::= 'loop' | 'plain'
region_body  ::= (block | region)*
block        ::= block_header instruction*
block_header ::= 'bb' N block_args? ':'
block_args   ::= '(' block_arg (',' block_arg)* ')'
block_arg    ::= value ':' type
instruction  ::= (value annotation '=')? opcode operands
operand      ::= value annotation
annotation   ::= (':' range_ann)*
range_ann    ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary      ::= [01?x_]+
value        ::= 'v' N
type         ::= 'int' | 'byte' | 'ptr'

Complete Example

A factorial function demonstrating nested regions, block arguments, and control flow:

func @factorial(int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1 = iconst 1
    v2 = iconst 1
    br bb1(v2, v1)

  region loop {
    bb1(v4: int, v5: int):
      v6 = icmp.le v5:s32, v0:s32
      brif v6, bb2, bb3

    bb2:
      v8 = mul v4, v5
      v9 = add v5, v1    // increment the counter toward v0
      continue v8, v9
  }

  bb3:
    ret v4:s32
}

Planned Features

The following features are described in the IR Design RFC but are not yet implemented in the Rust codebase.

Floating Point Semantics

The IR adopts IEEE 754-2008 as the floating-point semantics standard. Basic operations (fadd, fsub, fmul, fdiv, fneg, fabs, copysign) are implemented. Per-instruction rewrite flags (reassoc, contract) and float value class constraints (nofpclass) are defined.

IEEE 754 Lean library

The current Lean model uses Lean 4’s native Float type (IEEE 754 binary64) as a placeholder. Float is opaque — it maps directly to hardware double and cannot be unfolded in proofs. Properties that follow from the IEEE 754 standard (e.g., “fadd produces -0.0 only when both operands are -0.0”) must be axiomatized (see IR/FloatAxioms.lean). This is unsatisfactory for a formal-first project.

We need a Lean 4 library that provides a transparent, proof-friendly model of IEEE 754 floating-point arithmetic covering all widths (f16, bf16, f32, f64), rounding modes, NaN payloads, subnormals, and signed zeros. Existing efforts:

  • HOLFloat-Lean — formalizes FP semantics in Lean 4, building on prior HOL formalizations.
  • Flean — individual effort for IEEE 754 in Lean; requires rewrite.
  • Flocq port — the Coq Flocq library is the most mature FP formalization; porting it to Lean 4 is a community proposal.

Until a suitable library is available, tuffy will continue axiomatizing IEEE 754 properties as needed. The axiom set should be kept minimal and well-documented so that it can be discharged once a real library is adopted.

Scalable Vector Types

Vector types parameterized by total bit-width, not element count. Element count is derived: count = total_bits / element_bits. Scalable vectors use vscale × base_width, where vscale is a runtime constant determined by hardware (cf. SVE, RVV). Well-formedness: total_bits % element_bits == 0 and lane count must be a power of two. See the scalable vector RFC (planned).
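The lane-count derivation and well-formedness rule can be sketched as follows (None marks an ill-formed type; the function name is illustrative):

```python
# Derive the lane count of a vector type from total and element bit-widths.
def lane_count(total_bits, element_bits):
    if total_bits % element_bits != 0:
        return None  # ill-formed: elements don't tile the vector exactly
    n = total_bits // element_bits
    if n & (n - 1) != 0:
        return None  # ill-formed: lane count must be a power of two
    return n

assert lane_count(128, 32) == 4   # e.g., 4 lanes of f32 in a 128-bit vector
assert lane_count(128, 24) is None
assert lane_count(96, 32) is None # 3 lanes: not a power of two
```

For scalable vectors the same rule would apply to the base width, with the runtime vscale multiplying the lane count uniformly.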

Memory SSA

Memory dependencies encoded directly into the IR as memory version tokens on load/store operations, enabling progressive refinement as alias analysis improves.

Atomic Operations

Atomic operations (load.atomic, store.atomic, rmw, cmpxchg, fence) are implemented with MemoryOrdering and AtomicRmwOp enumerations. See Atomic Operations. The formal concurrency model is TBD — current Lean semantics define sequential behavior only.

Control Flow Structuralization

Translation from rustc MIR to tuffy IR with automatic structuralization of loops and branches into SESE regions. Irreducible control flow remains as flat CFG within a region.

Initial Design Discussion

  • Created: 2026-02-07
  • Status: Frozen — This document is a historical record. Do not modify.

Background

Tuffy is an experimental optimizing compiler written in Rust with LLM assistance. This document records the initial design discussion and key decisions made before implementation begins.

Long-term Vision

Tuffy aims to be a production-grade, language-agnostic optimizing compiler backend:

  • Multi-language backend — Serve as a codegen backend for multiple programming languages, not just Rust
  • Custom IR — Design and implement a custom intermediate representation with optimization passes
  • Multi-platform — Support multiple target architectures from the ground up
  • Balanced optimization — Pursue both fast compilation and competitive runtime performance
  • Formal verification — Provide formal correctness guarantees for optimization passes

Short-term Milestone

The first milestone is rustc_codegen_tuffy, a codegen backend for rustc as an alternative to LLVM.

Key Design Decisions

  1. Target platform: Multi-platform architecture from the start, with x86-64 as the initial implementation target
  2. Optimization direction: Balance between compilation speed and runtime performance (not exclusively pursuing either)
  3. IR design: Custom IR — build our own intermediate representation, perform optimizations on it, then lower to machine code (rather than translating directly from rustc MIR)
  4. IR interpreter: Implement a tuffy IR interpreter for testing and validation, similar to Miri
  5. Backend strategy: Implement rustc_codegen_ssa traits to reuse existing MIR lowering code (~10,000 lines). Builder generates tuffy IR instead of LLVM IR. Same approach as rustc_codegen_cranelift. May switch to direct MIR traversal later if more control is needed.

IR Design

Structure:

  • Single data structure for all stages — no separate SDAG/MIR like LLVM. Different stages have different normalization (legalization) constraints on the same IR, progressively lowering toward machine code.

UB model (following Alive2):

  • Only poison — no undef. Simplifies semantics and avoids the well-known undef/poison confusion in LLVM.
  • Uninitialized memory is allowed at the memory level, but values are either concrete or poison.

Byte type (b<N>):

  • Introduced from the start, based on the LLVM byte type RFC. Represents raw memory data distinct from integers, with per-byte poison tracking. Enables sound memcpy lowering and load merging.

Integer type:

  • A single int type with infinite precision — no fixed bitwidth (no i8/i32/i64).
  • No overflow: arithmetic operations are mathematical (e.g., add is true addition).
  • Range constraints via assert nodes: e.g., assertsext(val, 32) constrains a value to the 32-bit signed range, producing poison if violated.
  • zext/sext/trunc instructions are eliminated from the IR.
  • Bitwise operations (and/or/xor/shift) are also defined on infinite precision integers.

At-use bit-level analysis (encoded in IR):

  • Each use edge carries bit-level annotations: which bits are demanded, and which bits are known to be 0 or 1.
  • UB constraints (from assert nodes, memory operations, ABI boundaries) propagate backward through the IR, refining known bits on upstream values.
  • This is analogous to LLVM’s KnownBits and DemandedBits analyses, but promoted to first-class IR constructs rather than side analyses.
  • Follows the “analysis is also a transformation” principle: analysis results are encoded directly into the IR, not stored in external tables.
  • Instruction selection in the final stage derives concrete machine types from these at-use annotations.

Representation form:

  • Hierarchical CFG with nested SESE regions. Loops and branches are explicitly represented as regions, not inferred from CFG back-edges.
  • Region internals use traditional SSA basic blocks with fixed instruction ordering.
  • Translation from rustc MIR requires control flow structuralization to recover nested regions from the flat CFG.
  • Enables early loop optimizations (vectorization, LICM, unrolling) by preserving loop structure in the IR.
  • Inspired by LLVM VPlan’s hierarchical CFG and recipe-based transformation model.

Memory model:

  • Abstract pointer model: pointer = allocation ID + offset, preserving full provenance.
  • Two ptr-to-int instructions with distinct semantics:
    • ptrtoint: converts pointer to integer, capturing provenance. inttoptr(ptrtoint(p)) is a safe roundtrip.
    • ptrtoaddr: extracts only the address portion, discarding provenance.
  • Four-state byte representation: Bits(u8) | PtrFragment(Pointer, n) | Uninit | Poison. Each byte in memory is one of these states.
  • Uninitialized memory (Uninit) is a distinct abstract state, not “random bits”. Any operation on uninit values is UB (produces poison).
  • Poison can exist at the byte level in memory, aligned with the per-byte poison tracking of the byte type.

Testing Strategy

Validation and benchmarking will use the following targets (in progressive order):

  1. Internal unit tests — small test cases for basic codegen correctness (arithmetic, control flow, function calls)
  2. no_std programs — minimal programs to validate the backend without standard library dependencies
  3. serde — widely used crate with heavy generics and macros, tests monomorphization and trait dispatch
  4. rustc test suite — rustc’s ui tests for broad correctness coverage
  5. ripgrep — real-world CLI project for end-to-end correctness and performance comparison against LLVM backend
  6. clippy — later-stage validation target due to tight coupling with rustc internals

Notes

  • Self-hosting is not a separate goal since tuffy is written in Rust — once the backend can compile Rust, self-hosting follows naturally
  • JIT support and outperforming LLVM are not current goals but may be revisited later

Target Abstraction

  • Triple-based target identification, following LLVM’s convention.
  • TargetLowering design follows LLVM’s approach (legal types, operation lowering, register classes).
  • Sub-method override via command-line: individual TargetLowering properties (pointer size, legal integer widths, max vector width, etc.) can be overridden via --target-override key=value flags. This enables testing IR-level passes against specific target properties without implementing a full architecture backend.
  • Cost model: deferred, not yet designed.

IR Memory Layout

  • Encapsulation: memory layout is an implementation detail, hidden behind a stable API. The rest of the compiler operates on opaque handles, not raw pointers. Layout can be changed without affecting pass code.
  • Arena + index: IR nodes are allocated in arenas. References are u32 indices (opaque handles), not pointers. Reduces reference size and enables serialization.
  • Contiguous instruction storage: instructions within the same basic block are stored contiguously in the arena. Iterating a BB’s instructions is a linear memory scan, not pointer chasing.
  • Tombstone + periodic compaction: deleted instructions are marked as tombstones. Periodic compaction restores contiguity.
  • No SOA: struct-of-arrays layout rejected; standard AOS with contiguous storage is sufficient.
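The arena, opaque-handle, and tombstone ideas above can be sketched as follows. All names here are illustrative stand-ins, and handle remapping after compaction is omitted; tuffy's real data structures will differ.

```rust
// Minimal sketch of the arena-plus-index layout described above. Instructions
// live contiguously in a Vec, references are opaque u32 handles rather than
// pointers, and deletion leaves a tombstone for a later compaction pass.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct InstId(u32); // opaque handle, not a pointer

#[derive(Clone, Debug, PartialEq)]
enum Slot {
    Live(String), // stand-in for a real instruction payload
    Tombstone,    // deleted instruction awaiting compaction
}

struct Arena {
    slots: Vec<Slot>,
}

impl Arena {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    fn alloc(&mut self, inst: String) -> InstId {
        self.slots.push(Slot::Live(inst));
        InstId((self.slots.len() - 1) as u32)
    }

    fn delete(&mut self, id: InstId) {
        self.slots[id.0 as usize] = Slot::Tombstone;
    }

    /// Periodic compaction: drop tombstones to restore contiguity.
    /// (Remapping surviving handles is omitted in this sketch.)
    fn compact(&mut self) {
        self.slots.retain(|s| *s != Slot::Tombstone);
    }

    fn live_count(&self) -> usize {
        self.slots.iter().filter(|s| matches!(s, Slot::Live(_))).count()
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.alloc("add".into());
    let _b = arena.alloc("mul".into());
    arena.delete(a); // marks a tombstone; storage stays contiguous
    assert_eq!(arena.live_count(), 1);
    arena.compact(); // reclaims the tombstone
    assert_eq!(arena.slots.len(), 1);
    println!("ok");
}
```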

Crate Structure

Tuffy is a language-agnostic compiler backend. rustc_codegen_tuffy is one frontend adapter; future language frontends reuse the entire optimization and backend stack.

tuffy/
├── tuffy_ir/              — IR definition: types, instructions, CFG, memory model
├── tuffy_ir_interp/       — IR interpreter (miri-like, UB-aware)
├── tuffy_opt/             — optimization passes (generated from declarative rewrite rules)
├── tuffy_target/          — target abstraction layer (trait + CLI override)
├── tuffy_target_x86/      — x86 backend (i386 + x86-64)
├── tuffy_codegen/         — instruction selection, register allocation, machine code emission
├── tuffy_trace/           — rewrite path tracing
├── tuffy_tv/              — translation validation (Alive2-style SMT verifier)
├── rustc_codegen_tuffy/   — rustc frontend adapter (MIR → tuffy IR)

Dependencies: tuffy_ir is the bottom layer with no internal dependencies. All other crates depend on it directly or transitively.

Object file emission: use the object crate (same as Cranelift and rustc) for writing ELF, Mach-O, COFF object files. No custom object file handling.

References

A curated list of reference documents and resources for the tuffy project.

Compiler Optimization

  • Future Directions for Optimizing Compilers — Nuno P. Lopes, John Regehr, 2018. Discusses challenges in making optimizing compilers faster, less buggy, and more capable.
  • Cranelift: Using E-Graphs for Verified, Cooperating Middle-End Optimizations — Chris Fallin, Bytecode Alliance. Proposes acyclic e-graphs (aegraphs) as a unified middle-end optimization framework for Cranelift, replacing separate GVN/LICM/constant-folding passes. Key ideas: pure nodes float above a side-effect skeleton (the original CFG), scoped elaboration naturally subsumes GVN and LICM, and ISLE-based declarative rewrite rules enable verifiable optimizations. Directly relevant to tuffy’s declarative rewrite rules and “analysis is also a transformation” design.
  • optir — Jamey Sharp. Research prototype combining e-graph equality saturation (via egg) with RVSDG (Regionalized Value State Dependence Graph). Control flow is encoded structurally via switch/loop/func operators rather than explicit branches, enabling optimizations across complex control flow including irreducible CFGs. Relevant to tuffy’s hierarchical CFG regions and declarative rewrite approach.
  • egg: Fast and Extensible Equality Saturation — Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, Pavel Panchekha, POPL 2021. Introduces the egg Rust library for e-graphs and equality saturation with novel techniques: rebuilding for amortized invariant maintenance, and e-class analyses for domain-specific data attached to equivalence classes. Relevant to tuffy’s declarative rewrite rules — equality saturation is a candidate execution model for applying verified rewrite rules without phase-ordering concerns.

Memory Analysis

  • MemorySSA — LLVM. Virtual SSA form for memory operations using a single memory partition. Three node types: MemoryDef (stores, calls, fences, atomics), MemoryUse (loads), MemoryPhi (CFG joins). Each MemoryDef creates a new version of “all memory”; precision is recovered via a clobber walker backed by alias analysis rather than upfront partitioning. Key design insight: precise memory partitioning is impractical and not worth the cost — a single chain with on-demand disambiguation is more efficient. Directly informs tuffy’s MemSSA design: single memory token chain encoded in IR, with progressive refinement as alias analysis improves.

IR Design and Semantics

  • RVSDG: An Intermediate Representation for Optimizing Compilers — Nico Reissmann, Jan Christian Meyer, Helge Bahmann, Magnus Själander, 2019. Formalizes the Regionalized Value State Dependence Graph, a data-flow-centric IR where nodes represent computations, edges represent dependencies, and regions capture hierarchical program structure. Demonstrates Dead Node and Common Node Elimination on the representation. Directly relevant to tuffy’s hierarchical CFG with SESE regions and the optir approach built on top of RVSDG.

  • RFC: Add a New Byte Type to LLVM IR — Juneyoung Lee, Pedro Lobo, Nuno Lopes, George Mitenkov. Proposes b<N> byte type to represent raw memory data, enabling per-byte poison tracking and paving the way for undef removal.

  • Leaving the Sea of Nodes — V8 team. Post-mortem on 10 years of Sea of Nodes in TurboFan: effect chain complexity, poor cache locality, limited floating benefits for effectful operations, compilation speed issues. Replaced with CFG-based Turboshaft, halving compile time.

  • A Simple Reply — Cliff Click. Rebuttal to V8’s SoN criticism: argues problems stem from JS’s lack of strong typing, not SoN itself. Strong typing (Java, Rust) provides ECA naturally, simplifying effect management. Worklist algorithms solve visitation order. Dead code elimination is near-free with reference counting.

  • VPlan: Vectorization Plan — LLVM. Hierarchical CFG with nested SESE regions and recipe-based transformations for vectorization. Key ideas: plan-then-execute (analysis doesn’t modify IR), multiple candidates with cost estimation, recipes naturally record transformation provenance. Relevant to tuffy’s “analysis is also a transformation” and rewrite path tracing goals.

Memory Model and Pointer Semantics

  • Pointers Are Complicated, or: What’s in a Byte? — Ralf Jung, 2018. Argues pointers are not integers; proposes abstract pointer model (allocation ID + offset) preserving provenance. Defines bytes as Bits(u8) | PtrFragment(Pointer, n) | Uninit. Directly relevant to tuffy’s byte type and uninitialized memory model.
  • Pointers Are Complicated II, or: We Need Better Language Specs — Ralf Jung, 2020. Shows three individually-correct LLVM optimizations that produce wrong results when combined, due to unspecified pointer provenance. Concludes: pointers have provenance, integers do not; ptr-to-int-to-ptr roundtrips are not no-ops.
  • Pointers Are Complicated III, or: Pointer-Integer Casts Exposed — Ralf Jung, 2022. Proposes “provenance exposure” model: ptr-to-int casts have side effects (exposing provenance), int-to-ptr casts are non-deterministic. Rust’s Strict Provenance API (ptr.addr(), with_addr()) offers a cleaner alternative. Relevant to tuffy’s pointer representation design.
  • Uninit — Ralf Jung, 2019. Uninitialized memory is not “random bits” but a distinct abstract state (None). Any operation on uninit values is UB. Validates tuffy’s decision to allow uninitialized memory at the memory level while keeping values as concrete-or-poison.
  • Do We Really Need Undefined Behavior? — Ralf Jung, 2021. Argues unrestricted UB is necessary for practical optimization (register allocation, constant folding). Platform-specific UB would make even basic codegen impossible. Supports tuffy’s poison-based UB model as a principled middle ground.
  • Clarifying the Semantics of ptrtoint — LLVM discussion. Proposes separating ptrtoint (full bitwise representation) from ptrtoaddr (address only). Highlights that compiler-level provenance and hardware provenance (CHERI) are distinct. Relevant to tuffy’s pointer representation: need to decide whether ptr-to-int exposes provenance or just address.
  • RFC: Replacing getelementptr with ptradd — nikic (Nikita Popov). Proposes replacing type-based GEP with offset-based ptradd instruction. Eliminates redundant encodings, makes address arithmetic explicit and visible to CSE/LICM, simplifies analysis by removing expensive GEP decomposition. Tuffy adopts ptradd directly instead of GEP.

ABI and Calling Conventions

  • RFC: An ABI Lowering Library for LLVM — nikic (Nikita Popov). Proposes a standalone LLVMABI library with its own type system dedicated to ABI lowering, extracting logic currently duplicated across frontends (Clang, Rust, etc.). Key design: independent type system capturing only ABI-relevant information, per-target calling convention implementations, ABIArgInfo results guiding IR generation. Layered to depend only on LLVM Support, with a C API for non-C++ frontends. Directly relevant to tuffy’s plan for a dedicated ABI library that maps int parameters/returns to concrete register classes and calling conventions.

Testing and Validation

  • RFC: Upstreaming LLVM UB-Aware Interpreter — dtcxzyw. Proposes upstreaming llubi, a UB-aware LLVM IR interpreter that detects undefined behavior during execution and properly propagates poison values. Key motivation: reducing miscompilation reproducers directly on LLVM IR (minutes vs hours with C-level tools). Separate implementation from legacy ExecutionEngine because GenericValue cannot represent poison or pointer provenance. Relevant to tuffy’s IR interpreter design.
  • llvm-ub-aware-interpreter (llubi) — dtcxzyw. Implementation of UB-aware LLVM IR interpreter. Detects guardable UBs at execution time, tracks poison propagation, integrates with llvm-reduce for test case minimization and Csmith/Rustlantis for fuzzing. Treats undef as zero (practical simplification). Does not model pointer aliasing or full provenance. Directly informs tuffy’s interpreter design, though tuffy’s poison-only model (no undef) and four-state bytes enable a cleaner implementation.

Rustc Backend Integration

  • Lowering MIR — rustc dev guide. MIR lowering flow: codegen_crate → codegen_mir per function, with modules for blocks, statements, operands, places, rvalues. SSA analysis runs before translation. One MIR basic block generally maps to one backend basic block.
  • Backend Agnostic Codegen — rustc dev guide. Trait-based backend interface: CodegenBackend, ExtraBackendMethods, CodegenMethods (CodegenCx), BuilderMethods (Builder). Associated types: Value, BasicBlock, Type, Function. ~10,000 lines of reusable backend-agnostic code in rustc_codegen_ssa. Custom backends implement these traits to plug in.

Feature Name: tuffy_ir

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the intermediate representation (IR) for the tuffy compiler. The IR uses a single data structure across all compilation stages with stage-dependent normalization, infinite precision integers with at-use bit-level analysis, a four-state byte memory model, and a hierarchical CFG with nested SESE regions.

Motivation

Existing compiler IRs (LLVM IR, Cranelift CLIF) carry historical design baggage that limits optimization potential and complicates formal verification:

  • Fixed-width integers force premature commitment to bit widths, introducing zext/sext/trunc noise that obscures mathematical equivalences.
  • undef/poison duality in LLVM creates well-documented semantic confusion and miscompilation bugs. Approximately 18% of detected LLVM miscompilations stem from undef alone.
  • Flat CFGs lose structural information (loops, branches) that must be expensively recovered for high-level optimizations.
  • Side-table analyses (KnownBits, DemandedBits) are disconnected from the IR, creating phase ordering problems and stale analysis results.
  • Multiple IR layers (LLVM IR → SelectionDAG → MachineIR) require costly translations with potential information loss at each boundary.

Tuffy’s IR addresses these issues from first principles, designed for formal verification, efficient compilation, and clean semantics.

Guide-level explanation

Type system

Tuffy IR has a minimal type system:

  • int — Infinite precision integer. No bit width, no signedness. Arithmetic is mathematical: add(3, 5) is 8, period. There is no overflow.
  • b<N> — Byte type representing raw memory data, where N is a multiple of 8. Distinct from integers; preserves pointer fragments and per-byte poison tracking.
  • ptr<AS> — Pointer type carrying provenance (allocation ID + offset) in address space AS. Default address space is 0. Different address spaces may have different pointer sizes and semantics.
  • float / double — IEEE 754 floating point types.
  • vec<vscale x N x T> — Scalable vector type. vscale is a runtime constant determined by hardware; N is the minimum number of elements; T is the element type (int, float, double, b<N>, ptr<AS>). Fixed-width vectors are represented as vscale=1 (e.g., vec<1 x 4 x b32> is a 128-bit fixed vector of 4 x 32-bit elements). When the element type is int, the actual vector width is unknown until instruction selection, since int has no fixed bit width; the element count is determined by the IR, but the physical vector register mapping is deferred.

Infinite precision integers

Instead of i32 or i64, all integer values are simply int. Range constraints are expressed via at-use annotations on value definitions and use edges (see RFC: at-use-annotations):

%raw = load b32, ptr %p         // load 32 bits of raw bytes from memory
%x:s32 = bitcast %raw to int   // cast to int, result annotated as signed 32-bit
%y:s32 = add %x:s32, 1         // mathematical addition, no overflow; result in s32 range
%out = bitcast %y:s32 to b32   // cast back to bytes for storing
store b32 %out, ptr %q

Loads always operate on byte types (b<N>). The byte type carries the bit width; integers do not. The decision of sign-extension vs zero-extension, and whether to use a GPR or FPR, is deferred entirely to instruction selection based on at-use annotations.

There are no zext, sext, or trunc instructions. Bitwise operations (and, or, xor, shift) are defined on infinite precision integers.

At-use annotations

Range constraints and bit-level analysis results are encoded directly into the IR as annotations on value definitions (result-side) and use edges (use-side). See RFC: at-use-annotations for the full design.

Each annotation carries a four-state per-bit encoding:

  • 0: bit is known to be 0
  • 1: bit is known to be 1
  • ?: bit is demanded but value unknown
  • x: bit is don’t-care (not demanded)

Shorthand annotations :s32, :u8 etc. are sugar over the full known-bits representation.

Result-side annotations are generated by the frontend from source language type information. Use-side annotations may be added by optimization passes when they discover stronger constraints at specific use sites. This follows the “analysis is also a transformation” principle — analysis results are encoded directly into the IR, not stored in side tables.
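The shorthand desugaring can be sketched concretely. This sketch uses u128 in place of an arbitrary-precision integer and hypothetical names; for :u<N>, the low N bits are demanded but unknown (?), and every higher bit is known to be 0.

```rust
// Illustrative desugaring of the :u<N> shorthand into the per-bit form
// described above (u128 stands in for a bignum; not tuffy's actual types).
#[derive(Debug, PartialEq)]
struct Annotation {
    known_ones: u128,  // bits known to be 1
    known_zeros: u128, // bits known to be 0
    demanded: u128,    // bits that are demanded
}

fn desugar_unsigned(n: u32) -> Annotation {
    let low = (1u128 << n) - 1; // mask of the low n bits
    Annotation {
        known_ones: 0,
        known_zeros: !low, // everything above bit n-1 is known zero
        demanded: low,     // the low n bits are demanded but unknown ('?')
    }
}

fn main() {
    let u8_ann = desugar_unsigned(8); // :u8
    assert_eq!(u8_ann.demanded, 0xFF);
    assert_eq!(u8_ann.known_zeros, !0xFFu128);
    // No bit may be simultaneously known-1 and known-0.
    assert_eq!(u8_ann.known_ones & u8_ann.known_zeros, 0);
    println!("ok");
}
```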

Hierarchical CFG

The IR uses a hierarchical control flow graph:

func @example(%arg: int:s32) -> int:s32 {
  region loop {
    bb0(%i: int):
      %cond = cmp.lt %i:s32, %arg:s32
      br %cond, bb1, exit

    bb1:
      %i.next = add %i, 1
      br bb0(%i.next)
  }
  exit:
    ret %i:s32
}

Loops and branches are explicitly represented as nested SESE (Single-Entry Single-Exit) regions. Inside each region, instructions are organized in traditional SSA basic blocks with fixed ordering.

Memory model

Pointers are abstract values with provenance:

%p = alloca int              // %p has provenance tied to this allocation
%addr = ptrtoaddr %p         // extracts address only, discards provenance
%bits = ptrtoint %p          // converts to integer, captures provenance
%q = inttoptr %bits          // safe roundtrip: recovers provenance

Each byte in memory is one of four states:

  • Bits(u8): a concrete byte value (0–255)
  • PtrFragment(Pointer, n): the nth byte of an abstract pointer
  • Uninit: uninitialized memory (not “random bits”; a distinct abstract state)
  • Poison: a poisoned byte

Memory SSA

Memory dependencies are encoded directly into the IR as Memory SSA (MemSSA). Each memory operation (load, store, call) carries a memory version token representing the memory state it depends on or produces.

%mem0 = entry_mem                    // initial memory state
%mem1 = store b32 %v, ptr %p, %mem0 // store produces new memory version
%raw = load b32, ptr %q, %mem1      // load depends on memory version

MemSSA supports progressive refinement: initially a conservative single memory chain, then split into finer-grained versions as alias analysis improves. This follows the “analysis is also a transformation” principle — memory dependency information lives in the IR, not in side tables.

Atomic operations

Atomic memory operations are regular load/store/rmw/cmpxchg instructions annotated with a memory ordering:

%raw = load.atomic b32, ptr %p, ordering=acquire, %mem0
%mem1 = store.atomic b32 %v, ptr %p, ordering=release, %mem0
%old = rmw.add b32, ptr %p, %delta, ordering=seqcst, %mem0
%res, %ok = cmpxchg b32, ptr %p, %expected, %desired, success=acqrel, failure=acquire, %mem0

Supported orderings: relaxed, acquire, release, acqrel, seqcst. Fence instructions are also supported as standalone operations.

Atomic operations participate in MemSSA like any other memory operation, with ordering constraints restricting reordering during optimization.

UB model

The IR uses only poison for undefined behavior. There is no undef.

  • Operations with UB (e.g., division by zero) produce poison.
  • Annotation violations produce poison: result-side violations at the defining instruction, use-side violations at the consuming instruction.
  • Any operation on Uninit memory produces poison.
  • poison propagates through computations but is only UB when “observed” (e.g., used as a branch condition, stored to observable memory).

Single data structure, multiple stages

The IR uses one data structure throughout compilation. Different stages impose different normalization constraints:

  1. High-level: close to MIR translation, high-level operations allowed (e.g., aggregate operations, generic intrinsics).
  2. Optimized: after optimization passes, IR is normalized but still target-independent.
  3. Legalized: operations are lowered to target-legal forms (e.g., 64-bit multiply on 32-bit targets becomes a sequence of 32-bit operations).
  4. Pre-emission: fully scheduled, register-allocated, ready for machine code emission.

No separate SDAG or MachineIR. The same graph structure is progressively constrained.
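The legalization example above (a 64-bit multiply lowered for a 32-bit target) can be sketched as schoolbook multiplication on 32-bit halves. This is an illustration of the decomposition only, not tuffy's actual legalizer output; Rust's u64 arithmetic stands in for the widening 32×32→64 multiplies a 32-bit target would provide.

```rust
// 64-bit multiply via 32-bit halves: a*b mod 2^64 needs only three partial
// products, since a_hi*b_hi contributes solely to bits >= 64.
fn mul64_via_32(a: u64, b: u64) -> u64 {
    let a_lo = a & 0xFFFF_FFFF;
    let a_hi = a >> 32;
    let b_lo = b & 0xFFFF_FFFF;
    let b_hi = b >> 32;
    // Cross terms; only their low 32 bits survive the shift into bits 32..64.
    let cross = a_lo.wrapping_mul(b_hi).wrapping_add(a_hi.wrapping_mul(b_lo));
    a_lo.wrapping_mul(b_lo).wrapping_add(cross.wrapping_shl(32))
}

fn main() {
    for &(a, b) in &[(3u64, 5u64), (0xDEAD_BEEF_CAFE_BABE, 0x1234_5678_9ABC_DEF0)] {
        // Matches native wrapping 64-bit multiplication.
        assert_eq!(mul64_via_32(a, b), a.wrapping_mul(b));
    }
    println!("ok");
}
```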

MIR translation

Translation from rustc MIR to tuffy IR involves:

  • Control flow structuralization: attempt to normalize to natural loops and promote to SESE regions. Irreducible control flow is left as flat CFG within a region.
  • mem2reg: MIR uses mutable Place values (not pure SSA). A mem2reg pass promotes stack allocations to SSA values where possible.
  • Drop/unwind: Rust’s drop glue and panic unwind edges are preserved in CFG form, not promoted to SESE regions. Unwind paths remain as explicit CFG edges.
  • Aggregate scalarization: structs, enums, arrays, and tuples are fully decomposed into scalar operations (analogous to LLVM’s SROA). No aggregate types exist in tuffy IR.
  • Type translation: MIR fixed-width integers (u8, i32, u64, etc.) are translated to int with corresponding :u<N>/:s<N> result-side annotations at definition points.

Machine code lowering

Lowering from tuffy IR to machine code:

  • ABI: a dedicated ABI library maps int parameters/returns to concrete register classes and calling conventions based on the target triple.
  • Stack frame layout: derived from the target triple; alloca sizes are resolved after instruction selection determines concrete bit widths.
  • Instruction selection: at-use bit-level annotations determine concrete types; the instruction selector chooses sext/zext and GPR/FPR placement.
  • Debug info: DWARF generation follows LLVM’s approach.
  • Unwind tables: exception handling table generation follows LLVM’s approach.

Reference-level explanation

Type system details

  • int — Infinite precision integer, no fixed bit width
  • b<N> — Byte type, N must be a multiple of 8
  • ptr — Pointer with provenance (allocation ID + offset)
  • float — IEEE 754 single precision
  • double — IEEE 754 double precision
  • vec<vscale x N x T> — Scalable vector; vscale is a runtime constant, N is the minimum element count, T is the element type

Integer operations

All integer arithmetic operates on mathematical integers:

  • add %a, %b — Mathematical addition
  • sub %a, %b — Mathematical subtraction
  • mul %a, %b — Mathematical multiplication
  • div %a, %b — Integer division (poison if %b == 0)
  • rem %a, %b — Integer remainder (poison if %b == 0)
  • and %a, %b — Bitwise AND on infinite precision
  • or %a, %b — Bitwise OR on infinite precision
  • xor %a, %b — Bitwise XOR on infinite precision
  • shl %a, %b — Left shift
  • shr %a, %b — Right shift (signedness from annotation)

Value annotations

Annotations on value definitions and use edges:

  • :s<N> — Poison if value outside signed N-bit range [-2^(N-1), 2^(N-1)-1]
  • :u<N> — Poison if value outside unsigned N-bit range [0, 2^N-1]
  • :known(<ternary>) — Per-bit four-state constraint (0, 1, ?, x)

Result-side violation → instruction produces poison. Use-side violation → consuming instruction produces poison.
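The range checks above can be modeled directly. This sketch uses i128 in place of tuffy's infinite precision int and hypothetical names; it shows a value becoming poison when an annotation is violated, and poison propagating through subsequent checks.

```rust
// Illustrative model of :s<N> / :u<N> annotation checks: a value is either a
// concrete integer or poison (i128 stands in for the infinite precision int).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Int(i128),
    Poison,
}

/// :s<N> — poison if the value lies outside [-2^(N-1), 2^(N-1) - 1].
fn check_signed(v: Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            let min = -(1i128 << (n - 1));
            let max = (1i128 << (n - 1)) - 1;
            if x < min || x > max { Value::Poison } else { v }
        }
        Value::Poison => Value::Poison, // poison propagates
    }
}

/// :u<N> — poison if the value lies outside [0, 2^N - 1].
fn check_unsigned(v: Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            if x < 0 || x > (1i128 << n) - 1 { Value::Poison } else { v }
        }
        Value::Poison => Value::Poison,
    }
}

fn main() {
    assert_eq!(check_signed(Value::Int(127), 8), Value::Int(127));
    assert_eq!(check_signed(Value::Int(128), 8), Value::Poison); // out of s8 range
    assert_eq!(check_unsigned(Value::Int(255), 8), Value::Int(255));
    assert_eq!(check_unsigned(Value::Int(-1), 8), Value::Poison); // negative → poison
    println!("ok");
}
```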

Pointer operations

  • ptrtoint %p — Convert pointer to integer, capturing provenance
  • ptrtoaddr %p — Extract address only, discard provenance
  • inttoptr %i — Convert integer to pointer, recovering captured provenance
  • ptradd %p, %offset — Pointer arithmetic within an allocation (provenance-preserving offset addition)
  • ptrdiff %p, %q — Pointer difference; returns %p.offset - %q.offset as int; poison if %p and %q do not belong to the same allocation
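The ptradd and ptrdiff semantics follow directly from the abstract pointer model (allocation ID + offset). A minimal sketch, with illustrative types rather than tuffy's:

```rust
// Abstract pointer: allocation ID (provenance) plus byte offset.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ptr {
    alloc: u32,  // allocation ID carrying provenance
    offset: i64, // byte offset within the allocation
}

/// ptradd: offset addition that preserves provenance.
fn ptradd(p: Ptr, off: i64) -> Ptr {
    Ptr { alloc: p.alloc, offset: p.offset + off }
}

/// ptrdiff: Some(offset difference) within one allocation,
/// None (modeling poison) across distinct allocations.
fn ptrdiff(p: Ptr, q: Ptr) -> Option<i64> {
    if p.alloc == q.alloc { Some(p.offset - q.offset) } else { None }
}

fn main() {
    let p = Ptr { alloc: 1, offset: 16 };
    let q = ptradd(p, 8); // same provenance, offset 24
    assert_eq!(ptrdiff(q, p), Some(8));
    let r = Ptr { alloc: 2, offset: 0 };
    assert_eq!(ptrdiff(p, r), None); // different allocations → poison
    println!("ok");
}
```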

Byte type operations

  • bitcast %v to b<N> — Type-punning conversion between byte type and other types
  • load b<N>, ptr %p — Load raw bytes from memory
  • store b<N> %v, ptr %p — Store raw bytes to memory

Vector operations

All vector operations work on vec<vscale x N x T> types. A mask operand (vec<vscale x N x int>) is optional on most operations to support predicated execution (SVE/RVV style).

Elementwise arithmetic:

  • vec.add %a, %b — Elementwise addition
  • vec.sub %a, %b — Elementwise subtraction
  • vec.mul %a, %b — Elementwise multiplication
  • vec.div %a, %b — Elementwise division
  • vec.rem %a, %b — Elementwise remainder
  • vec.and %a, %b — Elementwise bitwise AND
  • vec.or %a, %b — Elementwise bitwise OR
  • vec.xor %a, %b — Elementwise bitwise XOR
  • vec.shl %a, %b — Elementwise left shift
  • vec.shr %a, %b — Elementwise right shift (signedness from annotation)
  • vec.fadd %a, %b — Elementwise floating point addition
  • vec.fmul %a, %b — Elementwise floating point multiplication

Horizontal reductions:

  • vec.reduce.add %v — Sum all elements
  • vec.reduce.mul %v — Multiply all elements
  • vec.reduce.smin %v — Signed minimum across elements
  • vec.reduce.smax %v — Signed maximum across elements
  • vec.reduce.umin %v — Unsigned minimum across elements
  • vec.reduce.umax %v — Unsigned maximum across elements
  • vec.reduce.and %v — Bitwise AND across elements
  • vec.reduce.or %v — Bitwise OR across elements
  • vec.reduce.xor %v — Bitwise XOR across elements
  • vec.reduce.fadd %v — Floating point sum (ordered or unordered via flag)
  • vec.reduce.fmin %v — Floating point minimum across elements
  • vec.reduce.fmax %v — Floating point maximum across elements

Scatter/gather (indexed vector memory access):

  • vec.gather %base, %indices, %mask, %mem — Load elements from base + indices[i] for each lane where mask is true; produces vector + memory token
  • vec.scatter %v, %base, %indices, %mask, %mem — Store elements to base + indices[i] for each lane where mask is true; produces memory token

Masking/predication:

  • vec.select %mask, %a, %b — Per-lane select: mask[i] ? a[i] : b[i]
  • vec.compress %v, %mask — Pack active lanes (mask=true) to front
  • vec.expand %v, %mask — Scatter contiguous elements to active lanes
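Scalar models of the compress and expand operations may make their lane behavior concrete. This is a sketch assuming the usual SVE/RVV convention of packing active lanes in order; what tuffy places in inactive lanes is not specified here, so the sketch takes an explicit fill value.

```rust
/// vec.compress model: pack lanes where the mask is true to the front,
/// preserving their original order.
fn compress(v: &[i64], mask: &[bool]) -> Vec<i64> {
    v.iter().zip(mask).filter(|(_, &m)| m).map(|(&x, _)| x).collect()
}

/// vec.expand model: place consecutive source elements into the active
/// lanes; inactive lanes receive `fill` (an assumption of this sketch).
fn expand(src: &[i64], mask: &[bool], fill: i64) -> Vec<i64> {
    let mut it = src.iter();
    mask.iter()
        .map(|&m| if m { *it.next().unwrap() } else { fill })
        .collect()
}

fn main() {
    let mask = [true, false, true, false];
    assert_eq!(compress(&[10, 20, 30, 40], &mask), vec![10, 30]);
    assert_eq!(expand(&[10, 30], &mask, 0), vec![10, 0, 30, 0]);
    println!("ok");
}
```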

Lane manipulation:

  • vec.splat %scalar — Broadcast scalar to all lanes
  • vec.extract %v, %idx — Extract scalar from lane
  • vec.insert %v, %scalar, %idx — Insert scalar into lane
  • vec.shuffle %a, %b, %indices — Permute lanes from two vectors

Contiguous vector memory access:

  • vec.load %ptr, %mask, %mem — Masked contiguous load
  • vec.store %v, %ptr, %mask, %mem — Masked contiguous store

Platform intrinsics:

Platform-specific intrinsics (e.g., std::arch::x86_64::*) support two modes, selectable via compiler flag or function-level attribute:

  • Map mode (default): recognized intrinsics are mapped to tuffy vector operations, enabling further optimization by the compiler.
  • Passthrough mode: intrinsics are preserved as opaque target-specific calls (target.intrinsic "x86.avx2.add.ps" %a, %b), emitted as-is during instruction selection. Suitable for hand-tuned SIMD code where the user wants exact instruction control.

Unrecognized intrinsics always use passthrough mode.

Memory byte states

Each byte in the abstract memory is represented as:

enum AbstractByte {
    Bits(u8),                    // concrete value 0-255
    PtrFragment(Pointer, usize), // nth byte of an abstract pointer (allocation ID + offset)
    Uninit,                      // uninitialized
    Poison,                      // poisoned
}

Hierarchical CFG structure

enum CfgNode {
    BasicBlock(Vec<Instruction>, Terminator),
    Region(RegionKind, Vec<CfgNode>),
}

enum RegionKind {
    Loop { header: BasicBlockId },
    Branch { condition: Value, then_region: RegionId, else_region: RegionId },
    Sequence,
}

Regions are always SESE. Translation from rustc MIR requires control flow structuralization.

At-use annotation representation

Each use edge and each instruction result carries an optional annotation:

struct Annotation {
    known_ones: BigUint,     // bits known to be 1
    known_zeros: BigUint,    // bits known to be 0
    demanded: BigUint,       // bits that are demanded
}

Invariant: known_ones & known_zeros == 0.

Shorthand :s<N> and :u<N> are sugar over this representation. No annotation means unconstrained int; the verifier reports an error if instruction selection cannot determine a concrete machine type.
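As a sketch of the desugaring, the :u<N> shorthand can be expanded into an explicit annotation as follows. For brevity this uses u128 masks in place of the spec's BigUint (an assumption); note the invariant known_ones & known_zeros == 0 holds by construction:

```rust
// Sketch: an unsigned N-bit value has all bits >= N known to be zero, and
// only the low N bits demanded. No bit is known to be one.
#[derive(Debug, PartialEq)]
pub struct Annotation {
    pub known_ones: u128,  // bits known to be 1
    pub known_zeros: u128, // bits known to be 0
    pub demanded: u128,    // bits that are demanded
}

pub fn unsigned_shorthand(n: u32) -> Annotation {
    let low = (1u128 << n) - 1; // mask of the low N bits
    Annotation {
        known_ones: 0,
        known_zeros: !low,
        demanded: low,
    }
}
```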

Drawbacks

  • Infinite precision integers add complexity to instruction selection. The final stage must derive concrete bit widths from at-use annotations, which is a non-trivial analysis.
  • Hierarchical CFG requires control flow structuralization from MIR, which may introduce extra blocks for irreducible control flow.
  • At-use annotations on every edge increase memory footprint of the IR.
  • Single data structure means late-stage IR carries fields irrelevant to early stages and vice versa, potentially wasting memory.
  • Novel design means less existing tooling and literature to draw from compared to traditional SSA CFG.

Rationale and alternatives

Why infinite precision integers?

Two equally important motivations:

  1. Formal verification: Infinite precision integers are mathematical integers — exactly what theorem provers (Lean 4, Mathlib) natively support. Rewrite rule proofs operate on standard integer arithmetic with induction, algebraic properties, and equivalences available out of the box. Fixed-width integers would require every proof to reason about overflow, wrapping semantics, and sign extension, dramatically increasing proof complexity.

  2. Pattern recognition: Source languages (especially C) introduce redundant integer conversions due to implicit promotion rules (e.g., C’s integer promotion widens char and short to int). In LLVM IR, these become chains of zext/sext/trunc that obscure the underlying mathematical equivalence. Pattern matching in optimization passes fails to see through the conversion noise, causing missed optimizations. Infinite precision integers eliminate this noise entirely — the mathematical structure is directly visible in the IR.

Together, these make both the proof side and the optimization side simpler and more effective.

Alternative: arbitrary-width integers (LLVM-style). Rejected because they still require zext/sext/trunc and overflow reasoning in every pass and every proof.

Alternative: fixed set of widths (Cranelift-style). Simpler but still carries the same fundamental problems.

Why hierarchical CFG?

Flat CFGs lose loop structure, requiring expensive loop detection passes. Since tuffy aims to perform loop optimizations early, preserving this structure in the IR is essential.

Alternative: flat SSA CFG. Rejected because loop detection is expensive and error-prone.

Alternative: Sea of Nodes. Considered seriously. It offers advantages for strongly-typed languages (Cliff Click’s arguments), but the engineering complexity (scheduling, debugging, cache locality) outweighs the benefits. V8’s experience is cautionary even if their problems were partly JS-specific.

Alternative: E-graph. Attractive for formal verification but immature for handling side effects and control flow at scale.

Why four-state bytes?

The Uninit state (separate from Poison) correctly models uninitialized memory as an abstract concept rather than “random bits.” The PtrFragment state preserves pointer provenance through byte-level memory operations like memcpy. Poison at the byte level enables per-byte poison tracking aligned with the byte type design.

Why ptrtoint captures provenance?

This enables safe inttoptr(ptrtoint(p)) roundtrips while ptrtoaddr provides a provenance-stripping alternative. This avoids the LLVM problem where three individually-correct optimizations produce wrong results when combined (Ralf Jung, 2020).

Prior art

  • LLVM IR: SSA CFG with fixed-width integers, undef/poison duality, flat CFG. Tuffy’s IR addresses known LLVM pain points (undef removal, byte type, phase ordering).
  • Cranelift CLIF: Single IR with stage-dependent legalization. Tuffy adopts this “single data structure” approach but adds hierarchical CFG and infinite precision integers.
  • LLVM VPlan: Hierarchical CFG with SESE regions for vectorization. Inspires tuffy’s region-based structure.
  • V8 TurboFan / Turboshaft: Sea of Nodes was abandoned in favor of a CFG after ten years. Informs tuffy’s decision to use SSA basic blocks inside regions.
  • Cliff Click’s Sea of Nodes: Strong typing (Java, Rust) mitigates many of V8’s problems. Tuffy acknowledges this but prioritizes engineering simplicity.
  • Alive2: Poison-only UB model, byte type proposal. Directly informs tuffy’s UB and type design.
  • Miri: Abstract pointer model (allocation ID + offset), abstract byte representation. Informs tuffy’s memory model.

Unresolved questions

  • Floating point semantics: How should NaN payloads and denormals be handled? Are there opportunities for infinite-precision-like simplification?
  • SIMD/vector types: How do vector operations fit into the type system and hierarchical CFG?
  • Aggregate types: Should structs/arrays exist in the IR or be fully decomposed into scalar operations?
  • Exception handling: How do Rust panics and unwinding interact with SESE regions?
  • Concurrency model: How are atomic operations and memory ordering represented?
  • Stack slot lifetime and coloring: stack_slot currently takes only a size parameter. To support stack coloring (overlapping non-live slots), lifetime information is needed. The preferred direction is region-based: stack slots belong to a SESE region and are automatically dead when control exits the region. Open question: after lowering to a flat CFG, how is region-based lifetime information preserved? Options include (a) materializing lifetime markers at region boundaries during lowering, or (b) maintaining a region tree alongside the flat CFG. The lifetime.start/end intrinsic approach (LLVM-style) is not preferred.

Future possibilities

  • E-graph integration: Equality saturation could be applied as an optimization strategy on subgraphs within the hierarchical CFG, without replacing the core IR representation.
  • Superoptimizer: The formal-first design and rewrite path tracing make it natural to integrate solver-based optimization discovery (Souper-style).
  • Multi-language frontend: The IR is designed to be language-agnostic. Future frontends for other languages can target the same IR.
  • JIT compilation: The single data structure with stage-dependent normalization could support a JIT mode that skips late-stage optimizations for faster compilation.
  • Translation validation: Rewrite path traces provide the foundation for end-to-end translation validation of the compilation pipeline.

Feature Name: tuffy_pipeline

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the compilation pipeline design for the tuffy compiler, focusing on four core aspects: value-level analysis encoded as IR transformations, automatic debug info preservation via origin chains, automatic profiling info preservation with merging semantics, and a hybrid pass ordering strategy with fine-grained user control.

Motivation

Compiler pipelines face recurring engineering problems that erode correctness and developer productivity:

  • Phase ordering: In LLVM, analyses like KnownBits and DemandedBits live in side tables disconnected from the IR. When a transform modifies the IR without updating these tables, analysis results become stale. Subsequent passes either use stale data (miscompilation) or must conservatively discard and recompute (lost optimization opportunities). This is the root cause of many phase ordering bugs.
  • Debug info loss: LLVM requires each optimization pass to manually preserve debug info. In practice, many passes silently drop it. The result is poor debugging experience at higher optimization levels — a well-known pain point for all LLVM-based compilers.
  • Profiling info loss: Profile-guided optimization data (branch weights, execution counts) suffers the same manual-maintenance problem. Transforms that restructure control flow or merge/split blocks often discard profile annotations.

Tuffy’s pipeline addresses these by making preservation automatic at the infrastructure level, rather than relying on per-pass discipline.

Guide-level explanation

Pass classification

Tuffy distinguishes two categories of analysis:

Structural analyses (side table, on-demand):

  • Dominator tree
  • Loop tree
  • Post-dominator tree

These are cheaply recomputable from the CFG and do not carry semantic information that could be irreversibly lost by transforms. They live in traditional side tables and are invalidated/recomputed as needed.

Value-level analyses (encoded in IR, transform passes):

  • KnownBits — which bits of a value are provably 0 or 1
  • DemandedBits — which bits of a value are actually needed at each use site
  • MemSSA refinement — splitting memory version chains as alias analysis improves

These carry semantic facts about values that, if lost, cannot be recovered without re-deriving from first principles. In tuffy, these are transform passes — they read the IR, compute results, and write those results back into the IR (on use-edge annotations or memory tokens). They are scheduled in the pipeline alongside optimization passes, not invoked as side analyses.

This follows the “analysis is also a transformation” principle: there is no distinction between “analysis pass” and “optimization pass” for value-level information. Both modify the IR.

Origin chain

Every instruction in the IR carries an origin — a reference to the instruction(s) it was derived from. When a transform creates a new instruction, the builder API requires specifying an origin. This is a mandatory parameter, not optional metadata.

// Builder API enforces origin
let new_add = builder.build_add(lhs, rhs, old_mul.origin());

The origin chain serves two purposes:

  1. Debug info derivation: Debug information (source locations, variable mappings) is not stored on every instruction. Instead, it is computed from the origin chain and current IR values. As long as the chain is unbroken, debug info can be derived for any instruction by tracing back to its origin.

  2. Profiling info association: Profiling data (execution counts, branch weights) is associated with instructions via origin. When transforms merge or split instructions, profile data is merged accordingly — this is valid because profile data ultimately serves code generation decisions, not source-level tracing.

Origin in practice

Most transforms have a natural origin for new instructions:

Transform                 New instruction                          Origin source
Strength reduction        shl %x, 1 replacing mul %x, 2            The original mul
Loop canonicalization     Preheader branch                         Latch branch
Instruction combining     add %a, %c from add(add(%a, %b), %c)     Both original add instructions
Dead code elimination     (no new instructions)                    N/A
Loop unrolling            Duplicated loop body                     Original loop body instructions

When multiple origins contribute, the origin is a set. The debug info derivation picks the most specific source location from the set.
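As a sketch of the multi-origin case, "most specific" could be approximated as the narrowest source span; this heuristic and the SourceLocation fields are assumptions for illustration, not the documented rule:

```rust
// Hypothetical sketch: from an origin set's locations, pick the one with
// the narrowest column span as the "most specific" location.
#[derive(Clone, Debug, PartialEq)]
pub struct SourceLocation {
    pub line: u32,
    pub col_start: u32,
    pub col_end: u32,
}

pub fn most_specific(locs: &[SourceLocation]) -> Option<SourceLocation> {
    locs.iter()
        .min_by_key(|l| l.col_end - l.col_start) // narrowest span wins
        .cloned()
}
```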

Pipeline structure

A tuffy compilation pipeline is a sequence of passes:

MIR translation
  → [value-level analysis: initial KnownBits/DemandedBits]
  → [optimization passes interleaved with value-level analysis updates]
  → legalization
  → instruction selection
  → register allocation
  → machine code emission

Value-level analysis passes can be scheduled at any point. Since they are transforms (they modify the IR), the pass manager treats them identically to optimization passes.

Pass ordering

Tuffy uses a hybrid pass ordering strategy: a fixed high-level framework with local iteration, combined with fine-grained user control to prevent performance regressions.

High-level framework: The pipeline has a fixed sequence of major phases (MIR translation → optimization → legalization → instruction selection → register allocation → emission). Within the optimization phase, passes are grouped into stages (e.g., early simplification, loop optimization, late cleanup), each stage iterating locally until the IR stabilizes or an iteration limit is reached.

User control granularity:

Level               Mechanism                                     Example
Version lock        Pin entire pipeline to a versioned strategy   --opt-policy=v1.2.3
Preset profile      Select a predefined pipeline                  -O2, -Os, -Ofast
Pass-level          Enable/disable individual passes              --disable-pass=loop-unroll
Iteration control   Cap local iteration counts                    --max-instcombine-iterations=3
Function-level      Per-function attribute overrides              #[tuffy::opt_policy("v1.2.3")]

Function-level control enables mixed strategies within a single compilation unit:

#[tuffy::opt_policy("v1.2.3")]      // lock to specific version
fn stable_hot_path() { ... }

#[tuffy::opt_level("aggressive")]   // aggressive optimization
fn compute_kernel() { ... }

#[tuffy::opt_level("fast-compile")] // minimal passes, fast compilation
fn cold_error_handler() { ... }

Dirty bit mechanism

Value-level analyses (KnownBits, DemandedBits, MemSSA) are scheduled on demand using a dirty bit protocol:

  1. When an optimization pass modifies an instruction, the infrastructure automatically marks affected use edges as dirty.
  2. When an optimization pass performs a lossy transform (refinement, not equivalence), downstream annotations are also marked dirty — the existing annotations may be overly precise and no longer sound.
  3. A subsequent pass that reads a dirty annotation triggers lazy recomputation on that annotation only. Clean annotations are never recomputed.

This is incremental lazy evaluation: mark dirty on write, recompute on read.
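The read path of this protocol can be sketched for a single annotation as follows; the closure stands in for the real KnownBits computation, and the field layout is simplified:

```rust
// Sketch: writes set the dirty flag (elsewhere); reads recompute only when
// the flag is set. Clean annotations are returned without recomputation.
pub struct UseAnnotation {
    pub known_zeros: u64,
    pub dirty: bool,
}

pub fn read_known_zeros(a: &mut UseAnnotation, recompute: impl Fn() -> u64) -> u64 {
    if a.dirty {
        a.known_zeros = recompute(); // lazy recomputation on read
        a.dirty = false;
    }
    a.known_zeros
}
```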

Transform equivalence declaration

Each transform pass declares whether it performs equivalence-preserving or lossy (refinement) transformations:

  • Equivalence: the transform preserves all semantic properties (e.g., strength reduction mul %x, 2 → shl %x, 1). Existing value-level annotations remain sound; no dirty marking needed beyond modified instructions.
  • Lossy refinement: the transform narrows behavior (e.g., range narrowing, speculative execution). Downstream annotations may be invalidated and must be marked dirty.

This declaration is manual, but the formal verification / proof framework can verify its correctness.

Other pipeline aspects

The following aspects follow LLVM’s established approach and are not novel to tuffy:

  • Legalization: progressive lowering of target-illegal operations to target-legal forms, following LLVM’s legalization model.
  • Pass manager: pass registration, scheduling, and region-aware execution follow LLVM’s new pass manager design.
  • Interprocedural optimization: inlining, cross-function analysis, and call graph management follow LLVM’s approach.
  • Verification passes: IR invariant checks inserted between passes in debug builds, following LLVM’s verifier model.

Reference-level explanation

Origin representation

struct Origin {
    sources: SmallVec<[InstructionId; 1]>,
}

Most instructions have a single origin (1-element SmallVec). Merged instructions carry multiple origins. The SmallVec<[_; 1]> avoids heap allocation in the common case.

Debug info derivation

Debug info is stored in a separate table indexed by InstructionId:

struct DebugInfoTable {
    /// Maps original (pre-optimization) instruction IDs to source locations
    source_locations: HashMap<InstructionId, SourceLocation>,
    /// Maps original instruction IDs to variable bindings
    variable_bindings: HashMap<InstructionId, Vec<VariableBinding>>,
}

To query the source location of a current instruction:

  1. Follow its origin chain to find original instruction IDs
  2. Look up those IDs in the debug info table
  3. If multiple origins, pick the most specific (innermost) source location

This design means optimization passes never touch debug info directly. The table is write-once (populated during MIR translation) and read-only thereafter.
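The query above can be sketched as a walk up the origin chain until an ID present in the write-once table is found. This sketch handles single origins only (the multi-origin case would apply the most-specific rule) and uses plain u32 IDs and strings as stand-ins:

```rust
use std::collections::HashMap;

// Sketch: follow the origin chain back to an original instruction ID and
// look it up in the debug info table. None if the chain is broken.
pub fn source_location(
    mut inst: u32,
    origins: &HashMap<u32, u32>,  // instruction -> its origin
    table: &HashMap<u32, String>, // original instruction -> source location
) -> Option<String> {
    loop {
        if let Some(loc) = table.get(&inst) {
            return Some(loc.clone());
        }
        inst = *origins.get(&inst)?; // no further origin: no debug info
    }
}
```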

Profiling info

Profile data is stored similarly:

struct ProfileData {
    /// Execution counts per original basic block
    block_counts: HashMap<BasicBlockId, u64>,
    /// Branch weights per original terminator
    branch_weights: HashMap<InstructionId, Vec<(BasicBlockId, u32)>>,
}

When a transform merges blocks or restructures control flow, the profile data for new blocks is computed by summing or averaging the profile data of their origins. This is done automatically by the infrastructure when blocks are created via the builder API.
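One plausible merging rule can be sketched as summing the origins' execution counts; the exact policy is listed later as an unresolved question, so summing here is an assumption:

```rust
use std::collections::HashMap;

// Sketch: a block created from several origin blocks takes the sum of their
// counts; origins with no recorded profile contribute zero.
pub fn merged_count(origin_blocks: &[u64], counts: &HashMap<u64, u64>) -> u64 {
    origin_blocks
        .iter()
        .map(|b| counts.get(b).copied().unwrap_or(0))
        .sum()
}
```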

Builder API enforcement

The instruction builder requires origin at construction:

impl Builder {
    fn build_add(&mut self, lhs: Value, rhs: Value, origin: Origin) -> Value;
    fn build_load(&mut self, ty: ByteType, ptr: Value, mem: MemToken, origin: Origin) -> Value;
    // ... all build methods require origin
}

There is no way to create an instruction without specifying an origin. This is a compile-time guarantee, not a runtime check.

Value-level analysis as transform

A KnownBits analysis pass operates as:

fn run_known_bits_analysis(func: &mut Function) {
    for inst in func.instructions_reverse_postorder() {
        let known = compute_known_bits(inst, func);
        for use_edge in func.uses_of(inst) {
            use_edge.update_known_bits(known);  // modifies IR in-place
        }
    }
}

This is registered in the pass manager as a regular transform pass. The pass manager does not distinguish it from an optimization pass.

Dirty bit protocol

Each use-edge annotation carries a dirty flag:

struct UseAnnotation {
    demanded_bits: BitSet,
    known_zeros: BitSet,
    known_ones: BitSet,
    dirty: bool,
}

The infrastructure manages dirty flags automatically:

impl Function {
    /// Called by infrastructure when an instruction is modified
    fn mark_uses_dirty(&mut self, inst: InstructionId) {
        for use_edge in self.uses_of(inst) {
            use_edge.annotation.dirty = true;
        }
    }

    /// Called by infrastructure for lossy transforms
    fn mark_downstream_dirty(&mut self, inst: InstructionId) {
        // Transitively marks all downstream use edges dirty
        let mut worklist = vec![inst];
        while let Some(id) = worklist.pop() {
            for user in self.users_of(id) {
                for use_edge in self.uses_of(user) {
                    if !use_edge.annotation.dirty {
                        use_edge.annotation.dirty = true;
                        worklist.push(user);
                    }
                }
            }
        }
    }
}

Transform equivalence declaration

Each pass declares its transform category via a trait:

enum TransformKind {
    Equivalence,
    Refinement,
}

trait Pass {
    fn name(&self) -> &str;
    fn transform_kind(&self) -> TransformKind;
    fn run(&self, func: &mut Function);
}

The pass manager uses transform_kind() to determine dirty propagation scope after each pass runs. The proof framework can verify that a pass declared as Equivalence does not perform lossy transforms.
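A pass implementing this trait might look as follows; the StrengthReduce pass and its empty body are invented for illustration:

```rust
// Hypothetical pass declaring itself equivalence-preserving via the trait.
#[derive(Debug, PartialEq)]
pub enum TransformKind {
    Equivalence,
    Refinement,
}

pub struct Function; // stand-in for the real IR function type

pub trait Pass {
    fn name(&self) -> &str;
    fn transform_kind(&self) -> TransformKind;
    fn run(&self, func: &mut Function);
}

pub struct StrengthReduce;

impl Pass for StrengthReduce {
    fn name(&self) -> &str { "strength-reduce" }
    // Rewriting mul-by-power-of-two to shl preserves semantics exactly,
    // so existing annotations stay sound.
    fn transform_kind(&self) -> TransformKind { TransformKind::Equivalence }
    fn run(&self, _func: &mut Function) { /* rewrite rules would run here */ }
}
```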

Pass ordering configuration

Pipeline configuration is a serializable data structure:

struct PipelineConfig {
    version: String,
    stages: Vec<Stage>,
}

struct Stage {
    name: String,
    passes: Vec<PassConfig>,
    max_iterations: u32,
}

struct PassConfig {
    name: String,
    enabled: bool,
    params: HashMap<String, String>,
}

Users can export, modify, and lock pipeline configurations. Function-level attributes override the global config for specific functions.
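A locked single-stage configuration might be constructed like this; the pass name, stage name, and iteration cap are invented for illustration:

```rust
use std::collections::HashMap;

// Mirrors the PipelineConfig/Stage/PassConfig structs defined above.
struct PassConfig { name: String, enabled: bool, params: HashMap<String, String> }
struct Stage { name: String, passes: Vec<PassConfig>, max_iterations: u32 }
struct PipelineConfig { version: String, stages: Vec<Stage> }

fn example_config() -> PipelineConfig {
    PipelineConfig {
        version: "v1.2.3".to_string(), // locked strategy version
        stages: vec![Stage {
            name: "early-simplify".to_string(),
            max_iterations: 3, // local iteration cap for this stage
            passes: vec![PassConfig {
                name: "instcombine".to_string(),
                enabled: true,
                params: HashMap::new(),
            }],
        }],
    }
}
```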

Drawbacks

  • Origin overhead: Every instruction carries an origin reference. For single-origin instructions this is one InstructionId (typically 4 bytes), but merged instructions allocate. In practice, most instructions have single origins, so the overhead is modest.
  • Debug info derivation cost: Computing debug info on demand (rather than storing it per-instruction) adds cost at debug info emission time. This is acceptable because debug info emission happens once at the end of compilation.
  • Profile merging heuristics: Automatically merging profile data during transforms may produce inaccurate estimates. However, this is no worse than the status quo (where profile data is often silently dropped), and the merged data is still useful for code generation heuristics.
  • Dirty bit propagation cost: Lossy transforms trigger transitive downstream dirty marking, which in the worst case touches a large portion of the IR. In practice, most transforms are equivalence-preserving, limiting the frequency of transitive propagation.
  • Transform declaration burden: Pass authors must correctly declare equivalence vs refinement. Incorrect declarations compromise soundness (under-marking) or efficiency (over-marking). The proof framework mitigates this but adds verification overhead.

Rationale and alternatives

Why encode value-level analysis in IR?

The alternative is LLVM’s approach: side-table analyses invalidated by transforms. This creates phase ordering problems — the order in which analyses and transforms run affects the final result. By encoding analysis results in the IR, every pass sees the most recent information, and transforms that don’t affect a particular annotation leave it intact.

Why origin chains instead of per-instruction debug info?

Alternative: LLVM’s DebugLoc approach. Each instruction carries a DebugLoc that passes must manually preserve. Rejected because it relies on per-pass discipline and fails in practice — many LLVM passes silently drop debug locations.

Alternative: Debug info as IR instructions (LLVM’s dbg.value). Intrinsic calls represent debug info in the IR. Rejected because these pseudo-instructions complicate every transform (must be moved, cloned, or deleted alongside real instructions) and are still frequently mishandled.

Origin chains move the burden from pass authors to the infrastructure. Pass authors specify “where did this instruction come from” (which they always know), and the infrastructure handles debug/profile info automatically.

Why allow profile merging?

Profile data is inherently approximate (sampled, potentially stale). Merging during transforms preserves the relative hotness information that matters for code generation, even if absolute counts shift. The alternative — dropping profile data when transforms can’t preserve it exactly — is strictly worse.

Prior art

  • LLVM DebugLoc: Per-instruction debug location, manually maintained by passes. Known to be fragile; significant ongoing effort to improve preservation.
  • LLVM dbg.value/dbg.declare: Debug info as intrinsic calls in IR. Adds complexity to every transform.
  • GCC tree-ssa: Similar side-table analysis approach with phase ordering issues.
  • MLIR: Operation-level location tracking with fusion semantics for merged operations. Closer to tuffy’s origin chain concept.

Unresolved questions

  • Origin chain compression: Long optimization pipelines may create deep origin chains. Should origins be transitively compressed (point directly to the original MIR instruction) or kept as a full chain?
  • Profile data merging policy: What specific merging rules apply for different transform types (block merging, loop unrolling, function inlining)?
  • Dirty bit granularity: Should dirty marking be per-use-edge (current design) or per-instruction? Per-instruction is coarser but cheaper to maintain.
  • Pipeline config format: What serialization format for pipeline configurations? TOML, JSON, or a custom DSL?

Future possibilities

  • Differential debugging: Origin chains enable automatic bisection — given a miscompiled instruction, trace its origin chain to identify which transform introduced the bug.
  • Optimization reports: Origin chains combined with rewrite path traces provide detailed optimization reports showing what happened to each source-level operation.
  • Selective optimization: Profile data preservation enables fine-grained PGO where hot and cold paths receive different optimization strategies throughout the pipeline.

Feature Name: tuffy_interpreter

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the design of the tuffy IR interpreter, a miri-like execution engine that directly interprets tuffy IR with full UB detection, poison tracking, pointer provenance enforcement, and four-state byte memory semantics. The interpreter serves as the primary testing and validation tool for transform correctness.

Motivation

Compiler testing traditionally relies on end-to-end execution: compile a program, run it, check the output. This approach has two fundamental limitations:

  • Coarse granularity: end-to-end tests only detect bugs that manifest as observable output differences. Silent UB, provenance violations, and poison mishandling may not produce wrong outputs on any particular test input.
  • Slow reduction: when a miscompilation is found, reducing the reproducer from a full program to a minimal IR fragment requires expensive tools (creduce, llvm-reduce) operating at the source or LLVM IR level.

The LLVM UB-aware interpreter (llubi) demonstrated that an IR-level interpreter with UB detection can reduce miscompilation reproducers from hours to minutes. Tuffy’s IR design — poison-only UB model, four-state bytes, abstract pointer provenance — is inherently more amenable to precise interpretation than LLVM IR.

Guide-level explanation

Core capabilities

The tuffy interpreter executes IR functions directly, maintaining a concrete machine state:

  • Infinite precision integers: implemented via arbitrary-precision integer library (e.g., num-bigint). No overflow, no wrapping — arithmetic is mathematical.
  • Four-state byte memory: each byte in the interpreter’s memory is one of Bits(u8), PtrFragment(AllocId, usize), Uninit, or Poison.
  • Pointer provenance tracking: every pointer carries an (AllocId, offset) pair. Out-of-bounds access, use-after-free, and provenance violations are detected.
  • Poison propagation: poison values propagate through computations. Using poison as a branch condition, storing poison to observable memory, or violating an assert node is flagged as UB.
  • Assert node enforcement: assertsext(v, N) and assertzext(v, N) are checked at runtime; violations produce poison (and are reported as UB).

Execution modes

The interpreter supports multiple execution modes:

Mode            Purpose
Normal          Execute and produce output, report UB as warnings
Strict          Abort on first UB detection
Differential    Execute two IR versions (pre/post optimization), compare results
Reduce          Deterministic execution for use with IR reducer tools

Differential testing

The primary use case for transform validation:

tuffy-interp --differential --before=pre_opt.tir --after=post_opt.tir --input=test_case

The interpreter runs both IR versions on the same input and reports any behavioral divergence. Combined with a fuzzer, this provides high-confidence testing of individual transforms.
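The core of the differential check can be sketched as follows, with closures standing in for the interpreter executing each IR version:

```rust
// Sketch: run the pre- and post-optimization versions on the same input;
// any behavioral difference indicates a transform bug.
pub fn diverges<F, G>(before: F, after: G, input: i64) -> bool
where
    F: Fn(i64) -> i64,
    G: Fn(i64) -> i64,
{
    before(input) != after(input)
}
```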

UB detection

The interpreter detects and reports:

UB category             Detection mechanism
Assert node violation   Runtime range check on assertsext/assertzext
Uninit read             Memory byte is Uninit at load time
Poison observation      Poison used as branch condition or stored to observable memory
Out-of-bounds access    Pointer offset exceeds allocation bounds
Use-after-free          Access to freed allocation
Provenance violation    Pointer used with wrong AllocId
Division by zero        Divisor is zero
Null dereference        Pointer with null address

Integration with fuzzing

The interpreter integrates with fuzzing tools:

  • Rustlantis: generates random Rust programs, compiles with tuffy, runs through interpreter
  • IR-level fuzzer: generates random tuffy IR fragments, runs through optimization pipeline, differential-tests with interpreter
  • Reducer: when a bug is found, an IR reducer (similar to llvm-reduce) minimizes the reproducer using the interpreter as the oracle

Reference-level explanation

Value representation

enum Value {
    Int(BigInt),
    Bytes(Vec<AbstractByte>),
    Ptr(Pointer),
    Float(f32),
    Double(f64),
    Vec(Vec<Value>),
    Poison,
}

enum AbstractByte {
    Bits(u8),
    PtrFragment(AllocId, usize),
    Uninit,
    Poison,
}

struct Pointer {
    alloc_id: AllocId,
    offset: usize,
    address_space: u32,
}

Memory model

struct Memory {
    allocations: HashMap<AllocId, Allocation>,
    next_alloc_id: AllocId,
}

struct Allocation {
    bytes: Vec<AbstractByte>,
    size: usize,
    alive: bool,
}

impl Memory {
    fn alloc(&mut self, size: usize) -> Pointer;
    fn free(&mut self, ptr: Pointer) -> Result<(), UBError>;
    fn load(&self, ptr: Pointer, size: usize) -> Result<Vec<AbstractByte>, UBError>;
    fn store(&mut self, ptr: Pointer, bytes: Vec<AbstractByte>) -> Result<(), UBError>;
}

Every memory operation checks:

  1. ptr.alloc_id refers to a live allocation
  2. ptr.offset + size <= allocation.size
  3. Provenance matches (the pointer was derived from this allocation)
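Checks 1 and 2 can be sketched directly against the Allocation fields above; check 3 (provenance) is the alloc_id lookup itself and is omitted here:

```rust
// Sketch: liveness and bounds checks performed before every load/store.
pub struct Allocation {
    pub size: usize,
    pub alive: bool,
}

pub fn check_access(a: &Allocation, offset: usize, size: usize) -> Result<(), &'static str> {
    if !a.alive {
        return Err("use-after-free"); // allocation already freed
    }
    if offset.checked_add(size).map_or(true, |end| end > a.size) {
        return Err("out-of-bounds"); // access exceeds allocation bounds
    }
    Ok(())
}
```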

Instruction execution

Each instruction maps directly to a Rust function:

fn exec_add(a: &Value, b: &Value) -> Value {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Value::Int(x + y),
        (Value::Poison, _) | (_, Value::Poison) => Value::Poison,
        _ => panic!("type error"),
    }
}

fn exec_assertsext(v: &Value, n: u32) -> Value {
    match v {
        Value::Int(x) => {
            let min = -(BigInt::one() << (n - 1));
            let max = (BigInt::one() << (n - 1)) - 1;
            if *x >= min && *x <= max {
                Value::Int(x.clone())
            } else {
                Value::Poison  // UB: report to user
            }
        }
        Value::Poison => Value::Poison,
        _ => panic!("type error"),
    }
}

Hierarchical CFG execution

The interpreter walks the hierarchical CFG:

  • Region.Loop: execute the loop body repeatedly until the exit branch is taken
  • Region.Branch: evaluate the condition, execute the corresponding sub-region
  • Region.Sequence: execute sub-regions in order
  • BasicBlock: execute instructions sequentially, then the terminator
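The walk can be sketched with a simplified region tree. Here a block is reduced to a function that mutates state and returns true to request loop exit; real execution follows terminators rather than this boolean:

```rust
// Minimal sketch of the region walk: Sequence runs sub-regions in order,
// Loop repeats its body until the body signals exit.
pub enum Region {
    Sequence(Vec<Region>),
    Loop(Box<Region>),
    Block(fn(&mut i64) -> bool), // stand-in block; true = exit enclosing loop
}

pub fn exec(r: &Region, state: &mut i64) -> bool {
    match r {
        Region::Block(f) => f(state),
        // Propagate an exit request up to the nearest enclosing loop.
        Region::Sequence(subs) => subs.iter().any(|s| exec(s, state)),
        Region::Loop(body) => {
            while !exec(body, state) {} // repeat until exit branch taken
            false                       // exit is consumed by the loop
        }
    }
}
```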

MemSSA handling

The interpreter ignores MemSSA tokens — they are compile-time analysis artifacts. The interpreter uses its own concrete memory state for all load/store operations.

Drawbacks

  • Performance: arbitrary-precision integers and per-byte tracking are slower than native execution. The interpreter is not suitable for running large programs at scale.
  • Incomplete semantics: some IR features (atomics, concurrency) are difficult to interpret faithfully. Single-threaded sequential execution is a simplification.
  • Maintenance burden: the interpreter must be kept in sync with IR semantics changes. Every new instruction or semantic change requires interpreter updates.

Rationale and alternatives

Why a custom interpreter instead of lowering to LLVM and using lli?

Lowering to LLVM IR loses tuffy-specific semantics (infinite precision integers, four-state bytes, provenance). The whole point of the interpreter is to validate tuffy IR semantics directly, not after translation.

Why arbitrary-precision integers in the interpreter?

The IR defines int as infinite precision. The interpreter must faithfully implement this — using fixed-width integers would mask overflow bugs that the IR semantics are designed to prevent.

Why not use Miri directly?

Miri interprets Rust MIR, not tuffy IR. The tuffy interpreter operates on tuffy IR after optimization, catching bugs introduced by transforms. Miri and the tuffy interpreter serve complementary roles: Miri validates Rust semantics, the tuffy interpreter validates tuffy IR semantics.

Prior art

  • Miri: Rust MIR interpreter with UB detection, pointer provenance tracking, and abstract memory model. Primary inspiration for tuffy’s interpreter design.
  • llubi (LLVM UB-aware interpreter): LLVM IR interpreter with poison tracking and UB detection. Demonstrated the value of IR-level interpretation for miscompilation debugging. Limited by LLVM’s undef/poison duality and lack of provenance tracking.
  • lli (LLVM interpreter): basic LLVM IR interpreter without UB awareness. Insufficient for correctness validation.
  • KLEE: symbolic execution engine for LLVM IR. More powerful but much slower; tuffy’s interpreter focuses on concrete execution for speed.

Unresolved questions

  • Atomic semantics: how should the interpreter handle atomic operations? Sequential consistency by default, or model relaxed memory?
  • External function calls: how to handle calls to external functions (libc, system calls)? Stub library, or FFI to native implementations?
  • Vector operations: should the interpreter support scalable vectors, or only fixed-width vectors with a concrete vscale?
  • Performance optimization: are there opportunities to fast-path common cases (e.g., when all bytes are Bits, skip per-byte tracking)?

Future possibilities

  • Symbolic execution: extend the interpreter to support symbolic values, enabling exhaustive exploration of execution paths for bounded verification.
  • Interactive debugging: step-through execution with inspection of IR values, memory state, and provenance at each instruction.
  • Coverage-guided fuzzing: instrument the interpreter to report coverage, guiding the fuzzer toward unexplored paths.
  • Regression test generation: automatically generate minimal test cases from fuzzer-discovered bugs, adding them to the test suite.

Feature Name: tuffy_rewrite_trace

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the rewrite path tracing infrastructure for the tuffy compiler. Every transform emits a machine-readable trace entry recording the rewrite rule, local context, before/after IR fragments, and decision rationale. A replay tool can re-execute the optimization sequence from a trace, enabling regression bisection, policy locking, and transform auditing.

Motivation

When a compiler optimization causes a performance regression or miscompilation, diagnosing the root cause is notoriously difficult:

  • Black-box pipeline: the optimization pipeline is opaque — developers see input IR and output IR, but not the chain of decisions that produced the result.
  • Non-reproducible decisions: heuristic decisions (cost model thresholds, unrolling factors) depend on analysis state that is not recorded. Changing the compiler version may silently change decisions.
  • Manual bisection: finding which transform caused a regression requires manually enabling/disabling passes, a tedious and error-prone process.

Rewrite path tracing makes the optimization pipeline fully transparent and reproducible. Every decision is recorded, and the entire optimization sequence can be replayed mechanically.

Guide-level explanation

Trace entries

Each time a transform fires, it emits a trace entry containing:

  1. Rule identity: which rewrite rule fired, and where in the pipeline
  2. Local context: the preconditions that enabled the transform (known bits, ranges, constants)
  3. Before/after IR: the local IR fragment before and after the rewrite
  4. Decision rationale: for equivalence transforms, the proof reference; for heuristic transforms, the cost model inputs and outputs

Trace format

Traces are structured, machine-readable data. Example (equivalence transform):

{
  "seq": 42,
  "rule": "strength_reduce_mul_pow2",
  "stage": "early_simplification",
  "iteration": 2,
  "function": "@compute",
  "region": "loop.0",
  "context": {
    "%C": {"kind": "const", "value": 8, "properties": ["is_power_of_2"]},
    "%x": {"kind": "known_bits", "known_zeros": "0xFFFFFFFF00000000"}
  },
  "before": {"instructions": ["%r = mul %x, %C"]},
  "after": {"instructions": ["%r = shl %x, 3"]},
  "decision": {
    "kind": "equivalence",
    "proof_ref": "Tuffy.Rewrites.StrengthReduce.mul_pow2"
  }
}

Example (heuristic transform):

{
  "seq": 87,
  "rule": "loop_unroll",
  "stage": "loop_optimization",
  "function": "@process",
  "region": "loop.2",
  "context": {
    "trip_count": 4,
    "body_instruction_count": 12,
    "body_has_calls": false
  },
  "before": {"instructions": ["region.loop { ... }"]},
  "after": {"instructions": ["region.sequence { ... }"]},
  "decision": {
    "kind": "cost_model",
    "before_cost": {"estimated_cycles": 48, "code_size": 24},
    "after_cost": {"estimated_cycles": 16, "code_size": 96},
    "threshold": "unroll if speedup > 2x",
    "speedup": 3.0
  }
}
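The threshold check recorded in this entry can be sketched as a small helper. This is illustrative only; `Cost` and `should_unroll` are stand-ins, not the compiler's real cost-model types.

```rust
// Hypothetical sketch of the "unroll if speedup > 2x" gate; field names
// mirror the trace entry above but the helper itself is illustrative.
#[derive(Debug)]
struct Cost {
    estimated_cycles: u64,
    code_size: u64, // recorded in the trace, not consulted by this gate
}

/// Returns Some(speedup) if unrolling clears the speedup threshold.
fn should_unroll(before: &Cost, after: &Cost) -> Option<f64> {
    let speedup = before.estimated_cycles as f64 / after.estimated_cycles as f64;
    if speedup > 2.0 { Some(speedup) } else { None }
}

fn main() {
    let before = Cost { estimated_cycles: 48, code_size: 24 };
    let after = Cost { estimated_cycles: 16, code_size: 96 };
    // 48 / 16 = 3.0 > 2.0, so the transform fires with speedup 3.0
    println!("{:?}", should_unroll(&before, &after)); // prints Some(3.0)
}
```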

Replay

A replay tool reads a trace and re-executes the same transforms on a given IR:

tuffy-replay --trace=opt_trace.json --input=before.tir --output=after.tir

At each step, the replay tool:

  1. Matches the pattern from the trace entry against the current IR
  2. Verifies the preconditions still hold
  3. Applies the rewrite
  4. If a precondition fails or the pattern doesn’t match, reports a divergence

Divergences indicate that the IR or compiler has changed in a way that invalidates the recorded optimization sequence.

Use cases

  • Regression bisect: replay the trace step by step against a new compiler version, compare decisions, and find the first divergence.
  • Policy locking: save the trace as a “policy file”; replay forces the same decisions on recompilation.
  • Transform audit: review each trace entry to verify the optimization rationale.
  • Optimization report: generate a human-readable report from the trace showing what happened to each hot function.
  • Debugging: when a miscompilation is found, the trace pinpoints which transform introduced the bug.

Overhead control

Tracing is optional, controlled by a compiler flag:

  • --emit-trace=none (default): no tracing, zero overhead
  • --emit-trace=decisions: record rule + decision only (lightweight)
  • --emit-trace=full: record rule + context + before/after IR + decision (detailed)

When disabled, trace emission code is compiled out via conditional compilation, ensuring zero runtime cost.
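A minimal sketch of the conditional-compilation approach, assuming a hypothetical `trace` Cargo feature; the real flag plumbing is not specified here.

```rust
// Hypothetical sketch: the "trace" feature name is an assumption,
// not the real build flag.
#[cfg(feature = "trace")]
fn emit_trace(entry: &str) {
    eprintln!("trace: {entry}");
}

#[cfg(not(feature = "trace"))]
#[inline(always)]
fn emit_trace(_entry: &str) {
    // Compiled to nothing: zero runtime cost when tracing is disabled.
}

fn main() {
    emit_trace("rule=strength_reduce_mul_pow2");
}
```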

Relationship to other features

  • Origin chain: tracks instruction-level lineage (where did this instruction come from). Trace tracks transform-level decisions (why was this rewrite applied). Complementary, not overlapping.
  • Policy-mechanism separation: trace is the execution record of a policy. Replaying a trace = enforcing a policy. Locking to a versioned policy = saving and replaying a trace.
  • Transform equivalence declaration: trace entries reference the declaration (equivalence or refinement) and, for verified rules, the Lean 4 proof.

Reference-level explanation

Trace data structures

struct Trace {
    metadata: TraceMetadata,
    entries: Vec<TraceEntry>,
}

struct TraceMetadata {
    compiler_version: String,
    pipeline_config: PipelineConfig,
    target_triple: String,
    timestamp: String,
}

struct TraceEntry {
    seq: u64,
    rule: String,
    stage: String,
    iteration: u32,
    function: String,
    region: Option<String>,
    context: Vec<ContextFact>,
    before: IrFragment,
    after: IrFragment,
    decision: Decision,
}

enum ContextFact {
    Const { name: String, value: BigInt, properties: Vec<String> },
    KnownBits { name: String, known_zeros: BitSet, known_ones: BitSet },
    Range { name: String, min: BigInt, max: BigInt },
    AssertSext { name: String, width: u32 },
    AssertZext { name: String, width: u32 },
}

struct IrFragment {
    instructions: Vec<String>,
}

enum Decision {
    Equivalence {
        proof_ref: Option<String>,
    },
    CostModel {
        before_cost: CostEstimate,
        after_cost: CostEstimate,
        threshold: String,
    },
}

struct CostEstimate {
    estimated_cycles: Option<u64>,
    code_size: Option<u64>,
    custom: HashMap<String, String>,
}

Trace emission

The pass infrastructure provides a trace emitter that transforms call to record entries:

impl TraceEmitter {
    fn emit(&mut self, entry: TraceEntry);
    fn is_enabled(&self) -> bool;
}

In declarative rewrite rules, trace emission is automatic — the codegen generator produces trace emission code alongside the transform code. The context is extracted from the pattern match bindings, and the decision is derived from the rule’s TransformKind.

For the lightweight --emit-trace=decisions mode, before/after IR fragments and detailed context are omitted.
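A sketch of how generated transform code might use the emitter. The struct fields and the string entry are simplifications; the real emitter takes a `TraceEntry` as declared above.

```rust
// Illustrative stand-in for the pass-infrastructure emitter; the real
// emit() takes a structured TraceEntry, reduced here to a String.
struct TraceEmitter {
    enabled: bool,
    entries: Vec<String>,
}

impl TraceEmitter {
    fn is_enabled(&self) -> bool {
        self.enabled
    }

    fn emit(&mut self, entry: String) {
        self.entries.push(entry);
    }
}

fn apply_strength_reduce(emitter: &mut TraceEmitter) {
    // ... pattern match and rewrite would happen here ...
    if emitter.is_enabled() {
        // Building the entry only when enabled avoids paying the
        // construction cost in untraced runs.
        emitter.emit("rule=strength_reduce_mul_pow2 seq=42".to_string());
    }
}

fn main() {
    let mut emitter = TraceEmitter { enabled: true, entries: Vec::new() };
    apply_strength_reduce(&mut emitter);
    assert_eq!(emitter.entries.len(), 1);
}
```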

Replay engine

struct ReplayEngine {
    trace: Trace,
    current_step: usize,
}

enum ReplayResult {
    Success,
    Divergence {
        step: u64,
        expected_rule: String,
        reason: DivergenceReason,
    },
}

enum DivergenceReason {
    PatternNotFound,
    PreconditionFailed { fact: String, expected: String, actual: String },
    DifferentDecision { expected: Decision, actual: Decision },
}

The replay engine processes trace entries sequentially. At each step, it attempts to apply the recorded transform. If the pattern matches and preconditions hold, the transform is applied. Otherwise, a divergence is reported with diagnostic information.
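The sequential loop can be sketched as follows. `Step` and its `applies` flag are placeholders for the real pattern-matching and precondition machinery described above.

```rust
// Hedged sketch of the replay loop; a real step would carry the pattern,
// context facts, and rewrite rather than a precomputed `applies` flag.
struct Step {
    rule: &'static str,
    applies: bool, // stand-in for "pattern matched and preconditions held"
}

/// Applies steps in order; on failure, reports the diverging step and rule.
fn replay(steps: &[Step]) -> Result<(), (usize, &'static str)> {
    for (i, step) in steps.iter().enumerate() {
        if !step.applies {
            // Divergence: the recorded transform no longer applies here.
            return Err((i, step.rule));
        }
        // ... apply the rewrite to the IR ...
    }
    Ok(())
}

fn main() {
    let trace = [
        Step { rule: "strength_reduce_mul_pow2", applies: true },
        Step { rule: "loop_unroll", applies: false },
    ];
    // Diverges at step 1 on "loop_unroll"
    assert_eq!(replay(&trace), Err((1, "loop_unroll")));
}
```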

Serialization

Traces are serialized as JSON for interoperability. For large traces, streaming JSON (one entry per line, JSONL format) avoids loading the entire trace into memory.

tuffy --emit-trace=full -o output program.tir > trace.jsonl
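Streaming emission can be sketched as below; the serialization is hand-rolled to stay dependency-free, where the real compiler would presumably use a JSON library.

```rust
use std::io::Write;

// Sketch of streaming JSONL emission: one trace entry per line, so a
// reader never needs the whole trace in memory. Only two fields are
// shown; the real entry carries context, IR fragments, and decision.
fn write_entry(out: &mut impl Write, seq: u64, rule: &str) -> std::io::Result<()> {
    writeln!(out, r#"{{"seq": {seq}, "rule": "{rule}"}}"#)
}

fn main() -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    write_entry(&mut buf, 42, "strength_reduce_mul_pow2")?;
    write_entry(&mut buf, 87, "loop_unroll")?;
    // Two lines, each an independent JSON object
    print!("{}", String::from_utf8(buf).unwrap());
    Ok(())
}
```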

Drawbacks

  • Trace size: full traces with before/after IR fragments can be large for complex programs. The decisions mode mitigates this.
  • Replay brittleness: replaying a trace on a different compiler version may diverge early if IR normalization changes, even if the final result is equivalent.
  • Code complexity: even with conditional compilation, the trace data structures and emission paths add code complexity.

Rationale and alternatives

Why machine-readable structured traces?

Alternative: human-readable logs (like LLVM’s -pass-remarks). Rejected because logs cannot be replayed. They are useful for human inspection but do not enable mechanical regression bisection or policy locking.

Alternative: IR snapshots at each stage. Recording full IR after each pass is expensive and doesn’t capture the decision rationale. Traces are more compact and more informative.

Why JSON/JSONL?

Widely supported, human-inspectable, easy to process with standard tools. Binary formats would be more compact but harder to debug and less interoperable.

Why optional with zero overhead when disabled?

Tracing is a development and debugging tool. Production builds should not pay any cost for it. Conditional compilation ensures this.

Prior art

  • LLVM -pass-remarks: human-readable optimization remarks. Useful but not machine-replayable.
  • LLVM opt-bisect-limit: binary search for which pass causes a bug. Coarser than trace-based bisection (pass-level vs transform-level).
  • GCC -fdump-tree-*: dumps IR at various stages. Snapshots, not decision traces.
  • Alive2: extracts local function-to-function rewrites for verification. Inspires tuffy’s local context extraction in trace entries.

Unresolved questions

  • Trace stability: how to handle trace replay across compiler versions when IR normalization changes? Should traces include a format version for compatibility?
  • Partial replay: should the replay tool support replaying only a subset of trace entries (e.g., only transforms in a specific function)?
  • Trace diffing: should there be a tool to diff two traces (e.g., from two compiler versions) and highlight decision changes?

Future possibilities

  • Trace-guided optimization: use traces from previous compilations to warm-start the optimizer, skipping exploration of decisions already known to be good.
  • Trace visualization: graphical tool showing the optimization pipeline as a sequence of local rewrites, with drill-down into context and decisions.
  • Community trace sharing: share traces for known-good optimization sequences on specific workloads, enabling reproducible performance.

Feature Name: tuffy_verification

  • Status: Draft
  • Created: 2026-02-07
  • Completed: N/A

Summary

This RFC defines the correctness guarantee framework for the tuffy compiler. All optimization transforms are declarative rewrite rules with machine-checked correctness proofs in Lean 4. Instruction selection (isel) — the lowering from tuffy IR to machine instructions — is also formally verified in Lean 4, proving that the generated machine code faithfully implements the IR semantics. An Alive2-style automatic verifier and traditional testing (interpreter + fuzzer) serve as auxiliary discovery tools, not as correctness foundations.

Motivation

Compiler optimization bugs are a persistent source of real-world software defects. LLVM’s InstCombine alone has accumulated hundreds of miscompilation bugs, many stemming from subtle semantic errors in hand-written transforms. The root causes are:

  • Imperative transforms: hand-written code that directly manipulates IR is error-prone. Each transform must correctly handle edge cases (poison, overflow, pointer provenance) that are easy to overlook.
  • Informal correctness arguments: most transforms are justified by informal reasoning or review, not machine-checked proofs. Subtle errors survive review.
  • Disconnected verification: tools like Alive2 verify transforms after the fact, but are not integrated into the development workflow as a hard gate.

Instruction selection has similar risks. The lowering from IR to machine instructions must preserve semantics — incorrect register sizing, wrong sign/zero extension, or mishandled non-standard bit-widths (e.g., u17 annotations) can silently produce wrong code. Traditional compilers rely on testing to catch these bugs; tuffy formally verifies isel correctness.

Tuffy takes a formal-first approach: every rewrite rule and every isel lowering rule must have a Lean 4 proof before it enters the production pipeline.

Guide-level explanation

Three-layer verification

Tuffy uses three layers of verification, with clear trust hierarchy:

  • Lean 4 proofs (authoritative): gold standard. Machine-checked correctness of every rewrite rule, MemSSA refinement rule, annotation propagation rule, isel lowering rule, and MIR translation rule.
  • Alive2-style verifier (auxiliary): SMT-based quick check for discovering missed optimizations and regressions.
  • Interpreter + fuzzer + test suite (auxiliary): end-to-end behavioral testing.

Only Lean 4 proofs gate production inclusion. The other two layers are discovery tools — they find bugs and missed optimizations, but do not constitute correctness evidence.

Verified optimization transforms

Transforms are not hand-written imperative IR manipulation. They are declarative rewrite rules:

rule strength_reduce_mul_pow2 {
    pattern:  mul %x, const(C) where is_power_of_2(C)
    replace:  shl %x, log2(C)
    kind:     equivalence
}

rule narrow_add_known_bits {
    pattern:  add %x, %y where known_zeros(%x, [32:63]) && known_zeros(%y, [32:63])
    replace:  add %x, %y  // same operation, but refine known_zeros on result
    refine:   known_zeros(result, [32:63])
    kind:     refinement
}

Each rule specifies:

  • pattern: what IR fragment to match
  • replace: what to produce
  • kind: equivalence or refinement (used by dirty bit mechanism)
  • preconditions: constraints that must hold for the rule to apply
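For a rule as simple as strength_reduce_mul_pow2, the Lean 4 obligation is nearly definitional on mathematical integers. A hypothetical sketch, where `shl` is a stand-in for the formal semantics' definition, not the real Tuffy development:

```lean
-- Hypothetical sketch; the real rule and semantics live in the Tuffy
-- Lean development. On infinite precision integers, shift-left is
-- multiplication by 2^k, so the strength reduction is definitional.
def shl (x : Int) (k : Nat) : Int := x * 2 ^ k

theorem strength_reduce_mul_pow2 (x : Int) (k : Nat) :
    x * 2 ^ k = shl x k := rfl
```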

Workflow

The development workflow for a new optimization:

  1. Define the rewrite rule and prove correctness in Lean 4 — the rule definition and its proof live together in the same Lean file, against the formal IR semantics. This ensures the proof and the rule are always consistent.
  2. Lean exports declarative transform description — Lean generates a machine-readable description (JSON) of the verified rule, including pattern, replacement, preconditions, and transform kind.
  3. Codegen generator produces Rust code from the exported description
  4. Generated code uses the builder API (origin, dirty bit handled automatically by Rust type system)
  5. Alive2-style verifier cross-checks the rule for additional confidence
  6. Fuzzer exercises the rule on diverse inputs

Steps 1–3 are mandatory. Steps 5–6 are recommended but not gating.

The key insight: by having Lean own both the rule definition and the proof, there is no possibility of the proof and the implementation diverging. The codegen generator only reads what Lean exports.
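The exact schema of the exported description is not fixed by this RFC; a plausible shape for the strength-reduction rule above might be:

```json
{
  "rule": "strength_reduce_mul_pow2",
  "kind": "equivalence",
  "pattern": "mul %x, const(C)",
  "preconditions": ["is_power_of_2(C)"],
  "replace": "shl %x, log2(C)",
  "proof_ref": "Tuffy.Rewrites.StrengthReduce.mul_pow2"
}
```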

Verified MemSSA refinement

MemSSA refinement is a transform that rewrites a load’s mem operand to point to an earlier MemoryDef when alias analysis proves that intervening defs do not clobber the load’s memory location. This follows the “analysis is also a transformation” principle — refinement modifies the IR, not a side table.

Each refinement step is a rewrite rule with a proof obligation:

Given a load from location L depending on mem token %memN, and an intervening MemoryDef %memK → %memN that does not alias L, the load’s mem operand can be rewritten from %memN to %memK without changing observable behavior.

These rules are proven in Lean 4 against the formal memory semantics (evalLoad, evalStore, and the four-state byte model). The proof must account for pointer provenance — two pointers with different AllocIds never alias, but pointers derived via inttoptr may alias any exposed allocation.

Verified annotation propagation

Value-level analyses (KnownBits, DemandedBits) are transforms that write analysis results into IR annotations. Each propagation rule must be proven sound: the annotation produced must be a valid over-approximation of the actual value set.

Examples of proof obligations:

  • KnownBits through add: if %x has known_zeros at bits [32:63] and %y has known_zeros at bits [32:63], then add %x, %y has known_zeros at bits [33:63] (not [32:63] — carry may set bit 32).
  • DemandedBits through shift: if only bits [0:7] of shl %x, 8 are demanded, then no bits of %x are demanded (the result is always zero in those positions).
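The carry subtlety in the add example can be checked concretely. This is an illustrative counterexample, not the Lean proof:

```rust
// Both operands have bits [32:63] known-zero (each fits in 32 bits),
// yet their sum sets bit 32 via the carry, so the result only has
// bits [33:63] known-zero.
fn main() {
    let x: u64 = 0xFFFF_FFFF; // bits [32:63] are zero
    let y: u64 = 1;           // bits [32:63] are zero
    let sum = x + y;
    assert_eq!(sum, 0x1_0000_0000); // carry sets bit 32
    assert_ne!(sum >> 32, 0); // claiming [32:63] known-zero would be unsound
    assert_eq!(sum >> 33, 0); // bits [33:63] are still zero
}
```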

Unsound annotations cause silent miscompilation: downstream passes and instruction selection trust annotations to select machine operations. An incorrect :s32 annotation on a value that exceeds 32-bit signed range produces wrong code with no runtime error.

Annotation propagation rules follow the same workflow as optimization rewrite rules: defined and proven in Lean 4, exported as declarative descriptions, codegen’d to Rust.

Verified instruction selection

Instruction selection lowers tuffy IR (infinite precision integers with annotations) to target machine instructions (fixed-width registers). This lowering must preserve semantics: the machine code must produce the same observable behavior as the IR.

Key properties to verify:

  • Annotation-driven sizing: an :s32 annotation lowers to 32-bit machine operations; non-standard widths (e.g., :u17) lower to the correct shift/mask sequences.
  • Signedness correctness: signed annotations select signed machine operations (SAR, IDIV, signed condition codes) and unsigned annotations select unsigned ones (SHR, DIV, unsigned condition codes).
  • Extension correctness: sext/zext from annotation width N produces the correct result, including the no-op cases (zext from unsigned, sext from signed) and the cases requiring explicit shift/mask sequences.
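The shift/mask sequences can be sketched for a hypothetical :u17 add and :s17 sign extension on 32-bit registers. This is illustrative, not the real isel output:

```rust
// Lowering a :u17 add on a 32-bit register: the machine add may carry
// past bit 16, so a mask (AND with (1 << 17) - 1) re-establishes the
// unsigned 17-bit range.
fn add_u17(a: u32, b: u32) -> u32 {
    a.wrapping_add(b) & 0x1FFFF
}

// Sign-extending a 17-bit payload held in a 32-bit register: shift the
// payload to the top (SHL), then arithmetic-shift back down (SAR).
// Inputs are assumed to occupy only bits [0:16].
fn sext_s17(x: u32) -> i32 {
    ((x << 15) as i32) >> 15
}

fn main() {
    assert_eq!(add_u17(0x1FFFF, 1), 0); // wraps in the 17-bit domain
    assert_eq!(sext_s17(0x1FFFF), -1);  // all 17 bits set means -1
    assert_eq!(sext_s17(0x0FFFF), 0xFFFF); // bit 16 clear: positive
}
```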

Each isel lowering rule is defined in Lean 4 as a relation between IR instruction semantics and machine instruction semantics. The proof obligation is:

For all inputs satisfying the annotation constraints, the machine instruction sequence produces the same result as the IR instruction under the formal semantics.

The Lean formalization requires a machine-level semantics model (x86-64 integer operations on fixed-width registers) in addition to the existing IR semantics.

Verified MIR translation

MIR translation converts rustc MIR to tuffy IR. If this translation is wrong, all downstream verification is meaningless. Key non-trivial translation steps:

  • Integer type translation: MIR fixed-width integers (u8, i32, u64) → int with :s<N> / :u<N> annotations. Incorrect annotations cause silent miscompilation.
  • Control flow structuralization: flat MIR CFG → hierarchical SESE regions.
  • Aggregate scalarization: structs, enums, arrays → scalar operations.
  • mem2reg: MIR mutable Places → SSA values.

Approach: define a MIR semantic subset in Lean 4 that covers the MIR constructs tuffy consumes. This is not a full MIR formalization — it models only the subset relevant to translation, independent of rustc’s internal MIR definition. Each translation rule is proven correct against this subset model and the tuffy IR formal semantics.

When rustc’s MIR semantics change, the Lean MIR subset model is updated and affected translation proofs are re-verified. The interpreter + fuzzer layer provides additional confidence that the subset model faithfully reflects actual MIR behavior.

Trust boundary

┌──────────────────────────────────────────┐
│           Trusted components             │
│                                          │
│  ┌───────────────┐  ┌────────────────┐   │
│  │ Lean 4 kernel │  │ IR formal      │   │
│  │               │  │ semantics      │   │
│  └───────────────┘  └────────────────┘   │
│  ┌────────────────┐  ┌───────────────┐   │
│  │ x86-64 machine │  │ MIR semantic  │   │
│  │ semantics      │  │ subset model  │   │
│  └────────────────┘  └───────────────┘   │
│  ┌───────────────┐                       │
│  │ Codegen       │                       │
│  │ generator     │                       │
│  └───────────────┘                       │
├──────────────────────────────────────────┤
│           Verified components            │
│                                          │
│  ┌────────────────────────────────────┐  │
│  │ Optimization rewrite rules         │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ MemSSA refinement rules            │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ Annotation propagation rules       │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ Isel lowering rules                │  │
│  │ (proven correct in Lean 4)         │  │
│  └────────────────────────────────────┘  │
│  ┌────────────────────────────────────┐  │
│  │ MIR translation rules              │  │
│  │ (proven against MIR subset model)  │  │
│  └────────────────────────────────────┘  │
├──────────────────────────────────────────┤
│           Constrained components         │
│                                          │
│  ┌────────────────────────────────────┐  │
│  │ Generated Rust transform code      │  │
│  │ (type system enforces invariants)  │  │
│  └────────────────────────────────────┘  │
└──────────────────────────────────────────┘

IR formal semantics in Lean 4

The foundation of all proofs is a formal semantics of tuffy IR defined in Lean 4. This includes:

  • Type system semantics (infinite precision integers, byte types, pointers, mem tokens, vectors)
  • Instruction semantics (each operation’s mathematical definition)
  • Poison propagation rules
  • Memory model (four-state bytes, provenance, MemSSA token semantics)
  • Annotation semantics (range constraints, KnownBits, DemandedBits soundness conditions)
  • Assert node semantics (assertsext, assertzext)

A separate MIR semantic subset model defines the MIR constructs that tuffy consumes, serving as the source semantics for MIR translation proofs.

This formalization is the single source of truth for what the IR means. All rewrite rule, MemSSA refinement, annotation propagation, isel, and MIR translation proofs reference these definitions.

Alive2-style verifier

A separate SMT-based tool that can:

  • Automatically verify simple rewrite rules (bounded bit widths, no loops)
  • Discover missed optimizations by enumerating candidate rewrites
  • Regression-test existing rules against IR semantics changes
  • Run in CI as a fast feedback loop

This tool is valuable for development velocity but is not a substitute for Lean 4 proofs. It may miss edge cases that the theorem prover catches, and it cannot handle unbounded reasoning (infinite precision integers require induction).

Interpreter and fuzzer

The tuffy IR interpreter (miri-like) combined with fuzzing provides end-to-end validation:

  • Compile Rust programs with tuffy, run the output, compare against reference (rustc + LLVM)
  • Fuzz IR fragments through the optimization pipeline, check that interpreter results are preserved
  • Exercise edge cases in poison propagation, pointer provenance, and memory model

Reference-level explanation

Rewrite rule representation

struct RewriteRule {
    name: String,
    pattern: Pattern,
    replacement: Replacement,
    preconditions: Vec<Precondition>,
    kind: TransformKind,
    proof_ref: LeanProofRef,
}

struct LeanProofRef {
    module: String,      // Lean 4 module path
    theorem: String,     // theorem name
    verified: bool,      // checked by build system
}

Codegen pipeline

Lean 4 rewrite rule definition + proof
        │
        ▼
  Codegen generator (trusted)
        │
        ▼
  Generated Rust code:
    - Pattern matching function
    - Replacement construction (via Builder API)
    - TransformKind declaration
    - Precondition checks

The generator produces Rust code that:

  1. Implements the Pass trait
  2. Uses the builder API (origin is mandatory, dirty bit is automatic)
  3. Declares TransformKind matching the Lean 4 proof (equivalence or refinement)

Build system integration

The build system verifies that:

  1. Every rewrite rule in the production pipeline has a corresponding Lean 4 proof
  2. The proof references a theorem that is successfully checked by Lean 4
  3. The generated Rust code is up-to-date with the rule definition

Rules without proofs can exist in an experimental pipeline for development, but cannot be included in release builds.

Drawbacks

  • Lean 4 learning curve: contributors must learn Lean 4 to add new optimizations. This raises the barrier to contribution.
  • Proof effort: proving correctness of complex transforms (loop optimizations, vectorization) requires significant effort. Some transforms may be difficult to formalize.
  • Formalization maintenance: changes to IR semantics require updating the Lean 4 formalization and potentially re-proving existing theorems.
  • Codegen generator trust: the generator is in the trusted computing base. Bugs in the generator could produce incorrect Rust code from correct proofs.
  • Development velocity: requiring proofs for every optimization may slow down the pace of adding new optimizations compared to traditional compilers.

Rationale and alternatives

Why Lean 4?

  • Active ecosystem with Mathlib providing extensive mathematical foundations
  • Good metaprogramming support (tactics, macros) for proof automation
  • Overlap with Rust community (shared interest in formal methods)
  • Compiled language with reasonable performance for proof checking

Alternative: Coq. Mature ecosystem but less active development. Lean 4’s metaprogramming is more ergonomic.

Alternative: Isabelle. Strong automation (Sledgehammer) but less mainstream adoption and weaker programming language integration.

Why declarative transforms?

Alternative: imperative transforms with post-hoc verification. Write transforms in Rust, then verify them with Alive2 or similar. Rejected because imperative code has a large surface area for bugs that post-hoc tools may miss, and it’s harder to establish a formal correspondence between the Rust code and the verified property.

Declarative rules have a narrow semantic gap between the rule definition and the Lean 4 proof. The codegen generator bridges this gap with minimal trusted code.

Why not rely solely on Alive2-style verification?

SMT-based verification is bounded — it checks finite bit widths and cannot perform induction. Tuffy’s infinite precision integers require inductive proofs that only an interactive theorem prover can provide. Additionally, SMT solvers may time out on complex transforms, giving no answer rather than a proof.

Prior art

  • CompCert: fully verified C compiler in Coq. Proves correctness of the entire compilation pipeline, not just individual transforms. Tuffy takes a similar approach for isel (verified lowering) while using declarative rules for optimizations.
  • Alive / Alive2: SMT-based verification of LLVM InstCombine transforms. Discovered numerous bugs. Tuffy uses this as an auxiliary tool, not the primary correctness mechanism.
  • CakeML: verified ML compiler in HOL4. Similar full-pipeline verification to CompCert.
  • Lean 4 + Mathlib: growing library of formalized mathematics. Provides foundations for reasoning about integers, bit operations, and algebraic properties.
  • Souper: superoptimizer for LLVM that discovers optimizations via SMT. Could complement tuffy’s approach by suggesting candidate rewrite rules for human proof.

Unresolved questions

  • Declarative rule expressiveness: can all useful transforms be expressed declaratively, or will some require escape hatches for imperative logic (e.g., complex loop transforms)?
  • Proof automation: how much of the proof burden can be automated via Lean 4 tactics? Custom tactics for common proof patterns (bit manipulation, poison propagation) would significantly reduce effort.
  • Codegen generator verification: should the generator itself be verified (in Lean 4), or is manual audit sufficient given its narrow scope?
  • Incremental proof checking: how to efficiently re-check proofs when IR semantics change? Full re-verification may be expensive.

Future possibilities

  • Proof-carrying optimizations: distribute verified rewrite rules as packages, enabling a community-driven optimization library with machine-checked correctness.
  • Superoptimizer integration: use SMT-based search (Souper-style) to discover candidate rewrites, then prove them in Lean 4 for inclusion in the production pipeline.
  • End-to-end verification: with optimizations and isel now verified, extend verification to the remaining pipeline stages (MIR translation, register allocation, machine code emission) to close the full compilation gap.
  • LLM-assisted proofs: use language models to draft Lean 4 proofs for new rewrite rules, with human review and machine checking.

Feature Name: at-use-annotations

  • Status: Draft
  • Created: 2026-02-08
  • Completed: N/A

Summary

Replace standalone assert_sext/assert_zext instructions with annotations on value definitions (result-side) and use edges (use-side). Range constraints, known bits, and demanded bits are encoded directly on the IR edges rather than as separate instructions. Additionally, simplify the text format by eliding the implicit top-level region function.

Motivation

The current IR uses assert_sext and assert_zext as standalone instructions to express range constraints on infinite precision integers:

v0 = param 0
v1 = assert_sext v0, 32
v2 = param 1
v3 = assert_sext v2, 32
v4 = add v1, v3
v5 = assert_sext v4, 32
ret v5

This has several problems:

  1. Instruction bloat — Every constrained value requires a separate assert instruction, roughly doubling the instruction count for typical integer-heavy code.
  2. Artificial data dependencies — Assert nodes create new SSA values (v1 instead of v0), forcing all downstream uses to reference the asserted value. This complicates pattern matching in optimization passes.
  3. No place for analysis results — The planned at-use bit-level analysis (known bits, demanded bits) has no natural home. Assert nodes only express range constraints, not arbitrary bit-level facts.
  4. Phase ordering — Analysis results stored in side tables become stale after transformations. Encoding them in the IR itself keeps them synchronized.

Additionally, the text format wraps every function body in region function { ... }, which is redundant since every function has exactly one top-level function region.

Guide-level explanation

Annotations on values

Every value reference in tuffy IR can carry an optional annotation describing constraints on that value. Annotations appear in two positions:

  • Result-side: on the value defined by an instruction, declaring the range of the output.
  • Use-side: on an operand of an instruction, declaring a constraint the consumer requires.

Annotation types

| Annotation | Meaning |
|---|---|
| :s<N> | Value is in signed N-bit range [-2^(N-1), 2^(N-1)-1] |
| :u<N> | Value is in unsigned N-bit range [0, 2^N-1] |
| :known(<ternary>) | Per-bit four-state constraint (see below) |
| (none) | No constraint; value is an unconstrained int |

Annotations are composable: :s32:known(...) applies both constraints simultaneously.

Known bits ternary encoding

Each bit in a known annotation is one of four states:

| Symbol | Meaning |
|---|---|
| 0 | Bit is known to be 0 |
| 1 | Bit is known to be 1 |
| ? | Unknown — bit is demanded but its value has not been determined |
| x | Don’t-care — bit is not demanded by the consumer |

Example: :known(xxxx_??11_1111_0000) means (bit 0 is the rightmost symbol):

  • Bits [0:3] are known-0
  • Bits [4:9] are known-1
  • Bits [10:11] are demanded but unknown
  • Bits [12:] are don’t-care (not demanded)

The distinction between ? and x matters for optimization:

  • ? (unknown): the consumer needs this bit, so the producer must supply a well-defined value. An analysis pass may later refine ? to 0 or 1.
  • x (don’t-care): the consumer ignores this bit entirely. The producer is free to supply any value, enabling dead-bit elimination and narrowing optimizations.
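The four symbols map naturally onto bitmask triples. As an illustration only — parse_ternary is a hypothetical helper, and this sketch uses a fixed 64-bit universe where the real IR would use arbitrary precision — the example above decodes as:

```rust
/// Decode a ternary string like "xxxx_??11_1111_0000" into
/// (ones, zeros, demanded) bitmasks. Hypothetical helper; fixed-width
/// u64 sketch of what an arbitrary-precision version would do.
fn parse_ternary(s: &str) -> (u64, u64, u64) {
    let (mut ones, mut zeros, mut demanded) = (0u64, 0u64, 0u64);
    // Rightmost character is bit 0; '_' separators are ignored.
    for (bit, c) in s.chars().rev().filter(|&c| c != '_').enumerate() {
        match c {
            '1' => { ones |= 1 << bit; demanded |= 1 << bit; }
            '0' => { zeros |= 1 << bit; demanded |= 1 << bit; }
            '?' => { demanded |= 1 << bit; }
            'x' => {} // don't-care: neither known nor demanded
            _ => panic!("invalid ternary symbol: {c}"),
        }
    }
    (ones, zeros, demanded)
}

fn main() {
    let (ones, zeros, demanded) = parse_ternary("xxxx_??11_1111_0000");
    assert_eq!(zeros, 0x00F);    // bits [0:3] known-0
    assert_eq!(ones, 0x3F0);     // bits [4:9] known-1
    assert_eq!(demanded, 0xFFF); // bits [0:11] demanded
}
```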

Example: i32 a + i32 b (C signed addition)

Before (assert nodes):

func @add_i32(int, int) -> int {
  region function {
    bb0:
      v0 = param 0
      v1 = assert_sext v0, 32
      v2 = param 1
      v3 = assert_sext v2, 32
      v4 = add v1, v3
      v5 = assert_sext v4, 32
      ret v5
  }
}

After (at-use annotations, top-level region elided):

func @add_i32(int:s32, int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1:s32 = param 1
    v2:s32 = add v0:s32, v1:s32
    ret v2:s32
}

7 instructions reduced to 4. No artificial SSA values from assert nodes.

Example: optimization strengthening a use-side annotation

After an analysis pass discovers that v1 is always in [0, 100] at a particular use, the use can be annotated with the enclosing unsigned 7-bit range :u7 ([0, 127]):

func @example(int:s32) -> int:s32 {
  bb0:
    v0:s32 = param 0
    v1:s32 = add v0:s32, v0:s32
    v2:s32 = foo v1:u7              // use-side: optimizer proved v1 in [0, 127] here
    ret v2:s32
}

The :u7 on the use of v1 is stronger than the :s32 on its definition. This is valid because the use-side annotation is a local fact about this specific use site.

Example: known bits after optimization

v3:s32:known(xxxx_xxxx_xxxx_xxxx_????_????_????_???0) = and v1:s32, v2:s32

The and result has bit 0 known to be 0 (discovered by analysis), bits [1:15] are demanded but unknown, and bits [16:31] are don’t-care.
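The bit-0 fact follows from the standard known-bits transfer rule for and: a result bit is known-1 only when both operand bits are known-1, and known-0 when either operand bit is known-0. A minimal fixed-width sketch (the implementation would operate on BigUint masks, not u64):

```rust
/// Known-bits transfer for `and` over (ones, zeros) mask pairs.
/// Fixed-width u64 sketch of the arbitrary-precision rule.
fn and_known(a: (u64, u64), b: (u64, u64)) -> (u64, u64) {
    let ones = a.0 & b.0;  // known-1 only if both inputs are known-1
    let zeros = a.1 | b.1; // known-0 if either input is known-0
    (ones, zeros)
}

fn main() {
    // One operand has bit 0 known-0; the other knows nothing about bit 0.
    let v1 = (0b0000, 0b0001); // (ones, zeros)
    let v2 = (0b0100, 0b0000);
    let (ones, zeros) = and_known(v1, v2);
    assert_eq!(zeros & 1, 1); // result bit 0 is known-0, as in the example
    assert_eq!(ones, 0);      // no bit is known-1 in both operands
}
```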

Semantics

Result-side annotation violation: the instruction itself produces poison.

v0:s8 = add v1, v2    // if the true result is 200, v0 becomes poison

Use-side annotation violation: the consuming instruction produces poison.

v1 = foo v0:u4        // if v0 is 20, then foo produces poison

Constraint strength: a use-side annotation may be stronger (narrower) than the result-side annotation on the referenced value. It must not be weaker — a weaker use-side annotation is redundant and should be omitted.
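These poison rules amount to a range check at each annotated edge. A sketch, assuming a hypothetical check_signed helper with i128 standing in for the infinite precision integers of the real semantics:

```rust
/// Result of evaluating an annotated value: a concrete integer or poison.
#[derive(Debug, PartialEq)]
enum Val {
    Int(i128), // i128 stands in for infinite precision integers
    Poison,
}

/// Check a value against an `:s<N>` annotation; out-of-range values
/// become poison. Hypothetical sketch, not the compiler's API.
fn check_signed(v: i128, n: u32) -> Val {
    let half = 1i128 << (n - 1);
    if (-half..half).contains(&v) { Val::Int(v) } else { Val::Poison }
}

fn main() {
    assert_eq!(check_signed(100, 8), Val::Int(100)); // within [-128, 127]
    assert_eq!(check_signed(200, 8), Val::Poison);   // as in the v0:s8 example
}
```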

No annotation: the value is an unconstrained infinite precision int. During instruction selection, the verifier reports an error if it cannot determine a concrete machine type for an unconstrained value.

Function signatures

Function signatures carry annotations on parameter types and the return type. These are ABI-level contracts between caller and callee:

func @add_i32(int:s32, int:s32) -> int:s32 { ... }

  • Parameter annotation: the caller guarantees the argument satisfies the constraint. Inside the function body, param result-side annotations match the signature.
  • Return annotation: the callee guarantees the return value satisfies the constraint. The ret use-side annotation must be at least as strong as the signature’s return annotation.

Text format: elide top-level region

Every function has exactly one top-level region function. The text format elides it:

// Before:
func @f(int) -> int {
  region function {
    bb0:
      ...
  }
}

// After:
func @f(int) -> int {
  bb0:
    ...
}

Inner regions (loop, plain) are still written explicitly.

Reference-level explanation

Annotation representation

In the Rust implementation, annotations are stored alongside each operand (use-side) and each instruction result (result-side). A compact representation:

/// Per-bit four-state annotation.
struct KnownBits {
    /// Bits known to be 1.
    ones: BigUint,
    /// Bits known to be 0.
    zeros: BigUint,
    /// Bits that are demanded by the consumer.
    demanded: BigUint,
}

Invariant: ones & zeros == 0 (a bit cannot be simultaneously known-0 and known-1).

The :s<N> and :u<N> shorthand annotations are sugar over KnownBits:

  • :s32 implies demanded bits [0:31], with the sign bit propagation constraint (bits [32:] conceptually equal bit 31).
  • :u32 implies demanded bits [0:31], with bits [32:] known-0.
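Concretely, the :u<N> desugaring might look like the following over a fixed 64-bit universe (a sketch with a hypothetical helper name; with BigUint the known-0 region above bit N is conceptually unbounded):

```rust
/// Desugar `:u<N>` (N < 64) into (zeros, demanded) masks.
/// Fixed-width sketch; hypothetical helper, not the real API.
fn unsigned_annotation(n: u32) -> (u64, u64) {
    assert!(n < 64, "sketch limited to a 64-bit universe");
    let demanded = (1u64 << n) - 1; // bits [0 : N-1] are demanded
    let zeros = !demanded;          // bits [N :] are known-0
    (zeros, demanded)
}

fn main() {
    let (zeros, demanded) = unsigned_annotation(8);
    assert_eq!(demanded, 0xFF);
    assert_eq!(zeros, !0xFFu64);
}
```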

Interaction with other features

Block arguments: block arguments carry result-side annotations. Branch instructions carry use-side annotations on the passed values:

bb0:
  v0:s32 = iconst 42
  br bb1(v0:s32)

bb1(v1:s32: int):
  ret v1:s32

Terminators: ret, br, brif, continue, region_yield all carry use-side annotations on their operands.

Memory operations: load result carries a result-side annotation. store value operand carries a use-side annotation. Pointer operands are not annotated (pointers have concrete types).
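For example (illustrative IR text following the annotation syntax above, assuming a `v = load <ptr>` / `store <ptr>, <value>` operand order):

```
bb0:
  v1:s32 = load v0       // result-side annotation on the loaded value
  store v2, v1:s32       // use-side annotation on the stored value; v0, v2 are unannotated pointers
```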

Instruction selection: the verifier walks all values and checks that every int value has sufficient annotation (result-side or use-side) to determine a concrete machine type. Missing annotations are reported as errors.

Grammar changes

annotation   ::= (':' range_ann)*
range_ann    ::= 's' N | 'u' N | 'known' '(' ternary ')'
ternary      ::= [01?x_]+

function     ::= 'func' '@' name '(' param_list ')' ('->' type annotation)? '{' body '}'
param_list   ::= (type annotation (',' type annotation)*)?
body         ::= (block | region)*
instruction  ::= (value annotation '=')? opcode operands
operand      ::= value annotation

The top-level region function { ... } wrapper is removed from the grammar. The function body directly contains blocks and regions.

Drawbacks

  • Annotation storage overhead — Every use edge now carries optional annotation data. For dense integer code this may increase memory usage compared to standalone assert instructions (one annotation per use vs one instruction per value).
  • Complexity in the builder API — The builder must accept annotations on every operand, making the API surface larger.
  • Parsing complexity — The text format parser must handle annotations on every value reference, increasing grammar complexity.

Rationale and alternatives

Why annotations on edges instead of standalone instructions?

The core insight is that range constraints and bit-level facts are properties of how a value is used, not properties of the value itself. A value v0 may be in the full int range, but a particular consumer may only need 8 bits of it. Encoding this on the use edge is the natural representation.

Alternative: keep assert nodes, add side-table analyses

This is the LLVM approach — KnownBits and DemandedBits live in analysis caches outside the IR. The problem is phase ordering: after a transformation, the side tables are stale and must be recomputed. Tuffy’s “analysis is also a transformation” principle rejects this approach.

Alternative: assert nodes with use-side annotations (hybrid)

Keep assert nodes for result-side constraints, add use-side annotations only for analysis results. This avoids changing the instruction set but retains the instruction bloat problem and the artificial SSA value issue.

Why signed/unsigned is a binary choice

A natural question is whether annotations should support expressing both signed and unsigned constraints simultaneously — e.g., a value that is both :s32 and :u16. The answer is no, because the intersection of any two range annotations is always expressible as a single annotation:

| Intersection | Result | Reasoning |
|---|---|---|
| :s<M> ∩ :s<N> | :s<min(M,N)> | Smaller signed range is a subset of the larger |
| :u<M> ∩ :u<N> | :u<min(M,N)> | Smaller unsigned range is a subset of the larger |
| :s<M> ∩ :u<N> | :u<min(M-1, N)> | The non-negative portion of :s<M> is [0, 2^(M-1)-1], which equals :u<M-1> |

The third case is the key insight. A signed N-bit value is in [-2^(N-1), 2^(N-1)-1]. An unsigned M-bit value is in [0, 2^M-1]. Their intersection is [0, min(2^(N-1)-1, 2^M-1)], which is always an unsigned range expressible as :u<min(N-1, M)>.

Since any combination of signed and unsigned constraints collapses to a single annotation, supporting both simultaneously adds no expressive power — it only adds implementation complexity.
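The collapse rules can be written down directly. A sketch (Range is a hypothetical type for illustration, not the compiler's representation):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Range {
    S(u32), // `:s<N>`
    U(u32), // `:u<N>`
}

/// Intersect two range annotations into a single equivalent annotation,
/// following the collapse rules above (assumes N >= 1).
fn intersect(a: Range, b: Range) -> Range {
    use Range::*;
    match (a, b) {
        (S(m), S(n)) => S(m.min(n)),
        (U(m), U(n)) => U(m.min(n)),
        // Mixed case: the non-negative part of :s<M> is :u<M-1>.
        (S(m), U(n)) | (U(n), S(m)) => U((m - 1).min(n)),
    }
}

fn main() {
    assert_eq!(intersect(Range::S(32), Range::S(16)), Range::S(16));
    assert_eq!(intersect(Range::S(8), Range::U(32)), Range::U(7));
    assert_eq!(intersect(Range::U(16), Range::S(32)), Range::U(16));
}
```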

For constraints that range annotations genuinely cannot express (e.g., “bit 3 is known to be 1” or “bits above 8 are don’t-care”), the KnownBits extension is the correct generalization. KnownBits subsumes range annotations entirely: :s32 and :u32 are sugar over specific KnownBits patterns.

Impact of not doing this

Without this change, the IR would need both assert nodes (for range constraints) and a separate mechanism (side tables or new annotation syntax) for known/demanded bits. Two parallel systems for the same concept.

Prior art

  • LLVM KnownBits / DemandedBits — Side-table analyses that track per-bit information. Effective but disconnected from the IR, causing phase ordering issues and stale results after transformations. Tuffy’s approach promotes these to first-class IR constructs.
  • Cranelift aegraph annotations — Cranelift’s e-graph framework attaches facts to values (range constraints for proof-carrying code). Similar in spirit to result-side annotations, but not on use edges.
  • MLIR type refinement — MLIR dialects can carry refined type information on values. However, MLIR’s approach is type-based (the type itself changes), not annotation-based (the type stays int, annotations are metadata).

Unresolved questions

  • Compact storage for common cases — Most annotations will be simple :s32 or :u64. Should the implementation use a small-enum optimization to avoid allocating BigUint for these common cases?
  • Annotation canonicalization — Should redundant annotations (e.g., :s32 on a use when the result already has :s32) be stripped automatically, or preserved for readability?
  • Lean formalization — How should annotations be represented in the Lean formal model? As metadata on the operational semantics, or as part of the value type?

Future possibilities

  • Annotation inference pass — A dedicated pass that propagates known bits forward and demanded bits backward, filling in ? bits with 0/1 and narrowing ? to x where possible.
  • Annotation-driven instruction selection — The instruction selector reads annotations directly to choose machine instructions, register classes, and sign/zero extension without a separate legalization phase.
  • Annotation-aware rewrite rules — Declarative rewrite rules that match on annotation patterns, e.g., “if all bits above bit 7 are don’t-care, narrow to 8-bit operation”.
  • Floating point annotations — Extend the annotation system to floating point values (e.g., known-NaN, known-finite, known-positive).

SOP: Modifying the Tuffy IR

  • Created: 2026-02-08
  • Status: Active

Overview

This document describes the standard operating procedure for making changes to the Tuffy IR. Because the IR is the core data structure shared across the entire compiler, changes must be propagated to multiple components in a specific order.

Guiding Principle

The Lean 4 formal definitions are the source of truth. All other artifacts (Rust code, spec document) must conform to the Lean definitions. When in doubt, the Lean code is authoritative.

Checklist

1. Lean 4 Formal Definitions (Source of Truth)

Update the formal model first. Depending on the nature of the change:

| What changed | File to update |
|---|---|
| Types, values, memory model | lean/TuffyLean/IR/Types.lean |
| Instruction semantics | lean/TuffyLean/IR/Semantics.lean |

After editing, verify with:

cd lean && lake build

If the change affects existing proofs, fix all proof breakages before proceeding.

2. Rust Implementation (tuffy_ir)

Translate the Lean changes into the Rust IR crate. Files to consider:

| What changed | File to update |
|---|---|
| Type system (new type or type change) | tuffy_ir/src/types.rs |
| New/modified instruction or opcode | tuffy_ir/src/instruction.rs |
| Reference handles (ValueRef, BlockRef, etc.) | tuffy_ir/src/value.rs |
| Function/block/region structure | tuffy_ir/src/function.rs |
| Builder API for constructing IR | tuffy_ir/src/builder.rs |
| Textual format / display | tuffy_ir/src/display.rs |

Run after changes:

cargo test -p tuffy_ir
cargo clippy -p tuffy_ir

3. Downstream Crates

IR changes typically break downstream consumers. Update in dependency order:

  1. Interpreter (tuffy_ir_interp/src/lib.rs) — Add/update evaluation logic for new or changed instructions.

  2. Optimizer (tuffy_opt/) — Update any passes that pattern-match on changed instructions.

  3. Code Generation (tuffy_codegen/) — Update instruction selection (src/isel.rs), encoding (src/encode.rs), or emission (src/emit.rs) as needed.

  4. Trace / TV (tuffy_trace/, tuffy_tv/) — Update if the change affects tracing or visualization.

4. Spec Document

Update the human-readable reference:

  • docs/spec/ (split into multiple files; see docs/spec/index.md)

Ensure the spec accurately reflects the new Lean definitions and Rust implementation. The spec is not authoritative — it is documentation derived from the Lean model.

5. Tests

Add or update tests at every layer:

| Layer | Location |
|---|---|
| Rust unit tests | tuffy_ir/src/tests.rs |
| Interpreter tests | tuffy_ir_interp/tests/ |
| Codegen tests | tuffy_codegen/src/tests.rs |
| UI tests | rustc_codegen_tuffy/tests/ (run via rustc_codegen_tuffy/tests/run-ui-tests.sh) |

6. Final Verification

Before committing, all of the following must pass:

cd lean && lake build
cargo test
cargo clippy
rustc_codegen_tuffy/tests/run-ui-tests.sh

Commit Strategy

Each component should be committed separately per the project’s commit convention. A typical IR change produces commits in this order:

  1. feat(lean): ... — Lean formal definitions
  2. feat(ir): ... — Rust tuffy_ir crate
  3. feat(interp): ... / feat(codegen): ... — downstream crates
  4. docs: ... — spec update

Every individual commit must pass all tests.

SOP: Updating Documentation

  • Created: 2026-02-09
  • Status: Active

Overview

This document describes the checks required whenever documentation files under docs/ are added or modified. Skipping these checks can cause broken links, missing pages on the published site, or stale content that contradicts the source of truth.

Checklist

1. SUMMARY.md

Every Markdown file that should appear on the published site must be listed in docs/SUMMARY.md. mdBook only builds pages present in the summary — unlisted files will result in 404 errors.

When adding a new page:

  • Add an entry to docs/SUMMARY.md in the correct section and nesting level.
  • If the page belongs to an existing group (e.g., spec/), add it as a nested entry under the group’s parent.

2. Spec Index

If the change adds or removes a page under docs/spec/, update the Table of Contents in docs/spec/index.md to match.

3. Source of Truth Conformance

The spec documents (docs/spec/) describe the IR but are not authoritative. The Lean 4 definitions (lean/TuffyLean/IR/) are the source of truth.

When updating spec docs, verify that:

  • Descriptions match the current Lean definitions.
  • Semantics formulas reference the correct eval* function names from TuffyLean.IR.Semantics.
  • Type names and instruction opcodes match the Rust implementation in tuffy_ir/.

4. Cross-References

Check that all cross-references between documentation pages are valid:

  • Relative links (e.g., [types](types.md)) point to files that exist.
  • Anchor links (e.g., instructions.md#comparison) reference headings that exist in the target file.

5. Build Verification

After making changes, build the documentation locally and verify:

mdbook build

The build must complete without errors. Optionally preview with:

mdbook serve

6. Commit

Documentation-only changes use the docs commit type:

docs: <description>
docs(spec): <description>    # for spec-specific changes

If the documentation change accompanies a code change (e.g., adding a new instruction), the docs update should be a separate commit from the code change, per the project’s commit convention.