- Feature Name: tuffy_ir
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the intermediate representation (IR) for the tuffy compiler. The IR uses a single data structure across all compilation stages with stage-dependent normalization, infinite precision integers with at-use bit-level analysis, a four-state byte memory model, and a hierarchical CFG with nested SESE regions.
Motivation
Existing compiler IRs (LLVM IR, Cranelift CLIF) carry historical design baggage that limits optimization potential and complicates formal verification:
- Fixed-width integers force premature commitment to bit widths, introducing zext/sext/trunc noise that obscures mathematical equivalences.
- undef/poison duality in LLVM creates well-documented semantic confusion and miscompilation bugs. Approximately 18% of detected LLVM miscompilations stem from undef alone.
- Flat CFGs lose structural information (loops, branches) that must be expensively recovered for high-level optimizations.
- Side-table analyses (KnownBits, DemandedBits) are disconnected from the IR, creating phase ordering problems and stale analysis results.
- Multiple IR layers (LLVM IR → SelectionDAG → MachineIR) require costly translations with potential information loss at each boundary.
Tuffy’s IR addresses these issues from first principles, designed for formal verification, efficient compilation, and clean semantics.
Guide-level explanation
Type system
Tuffy IR has a minimal type system:
- int — Infinite precision integer. No bit width, no signedness. Arithmetic is mathematical: add(3, 5) is 8, period. There is no overflow.
- b<N> — Byte type representing raw memory data, where N is a multiple of 8. Distinct from integers; preserves pointer fragments and per-byte poison tracking.
- ptr<AS> — Pointer type carrying provenance (allocation ID + offset) in address space AS. Default address space is 0. Different address spaces may have different pointer sizes and semantics.
- float / double — IEEE 754 floating point types.
- vec<vscale x N x T> — Scalable vector type. vscale is a runtime constant determined by hardware; N is the minimum number of elements; T is the element type (int, float, double, b<N>, ptr<AS>). Fixed-width vectors are represented as vscale=1 (e.g., vec<1 x 4 x b32> is a 128-bit fixed vector of 4 x 32-bit elements). When the element type is int, the actual vector width is unknown until instruction selection, since int has no fixed bit width; the element count is determined by the IR, but the physical vector register mapping is deferred.
Infinite precision integers
Instead of i32 or i64, all integer values are simply int. Range constraints are expressed via at-use annotations on value definitions and use edges (see RFC: at-use-annotations):
%raw = load b32, ptr %p // load 32 bits of raw bytes from memory
%x:s32 = bitcast %raw to int // cast to int, result annotated as signed 32-bit
%y:s32 = add %x:s32, 1 // mathematical addition, no overflow; result in s32 range
%out = bitcast %y:s32 to b32 // cast back to bytes for storing
store b32 %out, ptr %q
Loads always operate on byte types (b<N>). The byte type carries the bit width; integers do not. The decision of sign-extension vs zero-extension, and whether to use a GPR or FPR, is deferred entirely to instruction selection based on at-use annotations.
There are no zext, sext, or trunc instructions. Bitwise operations (and, or, xor, shift) are defined on infinite precision integers.
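To sketch what infinite-precision bitwise semantics mean in practice (this model is not part of the RFC): a negative int behaves like an infinite run of sign bits, so extracting the low bits of a value needs only a mask, with no trunc/zext/sext chain. Here i128 stands in for a true arbitrary-precision integer:

```rust
// Infinite-precision bitwise ops modeled as two's complement with infinite
// sign extension; i128 is a stand-in for a real bignum.
fn mask_low_bits(v: i128, n: u32) -> i128 {
    // and %v, (2^n - 1): a negative v acts as an infinite run of 1-bits,
    // so masking extracts its low n bits directly, no extension needed.
    v & ((1i128 << n) - 1)
}

fn main() {
    assert_eq!(mask_low_bits(-1, 8), 0xFF); // ...111111111 & 0xFF
    assert_eq!(mask_low_bits(300, 8), 44);  // 300 mod 256
    println!("ok");
}
```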
At-use annotations
Range constraints and bit-level analysis results are encoded directly into the IR as annotations on value definitions (result-side) and use edges (use-side). See RFC: at-use-annotations for the full design.
Each annotation carries a four-state per-bit encoding:
- 0: bit is known to be 0
- 1: bit is known to be 1
- ?: bit is demanded but value unknown
- x: bit is don’t-care (not demanded)
Shorthand annotations :s32, :u8 etc. are sugar over the full known-bits representation.
Result-side annotations are generated by the frontend from source language type information. Use-side annotations may be added by optimization passes when they discover stronger constraints at specific use sites. This follows the “analysis is also a transformation” principle — analysis results are encoded directly into the IR, not stored in side tables.
Hierarchical CFG
The IR uses a hierarchical control flow graph:
func @example(%arg: int:s32) -> int:s32 {
region loop {
bb0(%i: int):
%cond = cmp.lt %i:s32, %arg:s32
br %cond, bb1, exit
bb1:
%i.next = add %i, 1
br bb0(%i.next)
}
exit:
ret %i:s32
}
Loops and branches are explicitly represented as nested SESE (Single-Entry Single-Exit) regions. Inside each region, instructions are organized in traditional SSA basic blocks with fixed ordering.
Memory model
Pointers are abstract values with provenance:
%p = alloca int // %p has provenance tied to this allocation
%addr = ptrtoaddr %p // extracts address only, discards provenance
%bits = ptrtoint %p // converts to integer, captures provenance
%q = inttoptr %bits // safe roundtrip: recovers provenance
Each byte in memory is one of four states:
- Bits(u8): a concrete byte value (0–255)
- PtrFragment(Pointer, n): the nth byte of an abstract pointer
- Uninit: uninitialized memory (not “random bits”; a distinct abstract state)
- Poison: a poisoned byte
Memory SSA
Memory dependencies are encoded directly into the IR as Memory SSA (MemSSA). Each memory operation (load, store, call) carries a memory version token representing the memory state it depends on or produces.
%mem0 = entry_mem // initial memory state
%mem1 = store b32 %v, ptr %p, %mem0 // store produces new memory version
%raw = load b32, ptr %q, %mem1 // load depends on memory version
MemSSA supports progressive refinement: initially a conservative single memory chain, then split into finer-grained versions as alias analysis improves. This follows the “analysis is also a transformation” principle — memory dependency information lives in the IR, not in side tables.
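A minimal model of this token threading, with illustrative names (MemToken and Builder are not the RFC's actual data structures): stores consume one memory version and produce a fresh one, while loads merely name the version they depend on, making the store→load ordering explicit in the value graph.

```rust
// MemSSA sketch: memory versions as SSA values threaded through operations.
#[derive(Clone, Copy, PartialEq, Debug)]
struct MemToken(u32);

struct Builder { next: u32 }

impl Builder {
    fn fresh(&mut self) -> MemToken { let t = MemToken(self.next); self.next += 1; t }
    fn entry_mem(&mut self) -> MemToken { self.fresh() }          // %mem0 = entry_mem
    fn store(&mut self, _dep: MemToken) -> MemToken { self.fresh() } // store produces a new version
    fn load(&self, dep: MemToken) -> MemToken { dep }             // load depends on, does not advance
}

fn main() {
    let mut b = Builder { next: 0 };
    let mem0 = b.entry_mem();   // %mem0 = entry_mem
    let mem1 = b.store(mem0);   // %mem1 = store ..., %mem0
    let dep = b.load(mem1);     // %raw  = load ..., %mem1
    assert_ne!(mem0, mem1);     // the store created a new memory version
    assert_eq!(dep, mem1);      // the load is ordered after the store
    println!("ok");
}
```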
Atomic operations
Atomic memory operations are regular load/store/rmw/cmpxchg instructions annotated with a memory ordering:
%raw = load.atomic b32, ptr %p, ordering=acquire, %mem0
%mem1 = store.atomic b32 %v, ptr %p, ordering=release, %mem0
%old = rmw.add b32, ptr %p, %delta, ordering=seqcst, %mem0
%res, %ok = cmpxchg b32, ptr %p, %expected, %desired, success=acqrel, failure=acquire, %mem0
Supported orderings: relaxed, acquire, release, acqrel, seqcst. Fence instructions are also supported as standalone operations.
Atomic operations participate in MemSSA like any other memory operation, with ordering constraints restricting reordering during optimization.
UB model
The IR uses only poison for undefined behavior. There is no undef.
- Operations with UB (e.g., division by zero) produce poison.
- Annotation violations produce poison: result-side violations at the defining instruction, use-side violations at the consuming instruction.
- Any operation on Uninit memory produces poison.
- poison propagates through computations but is only UB when “observed” (e.g., used as a branch condition, stored to observable memory).
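The propagate-then-observe rule can be sketched by modeling poison as Option::None (an informal model, not the RFC's implementation): arithmetic carries poison forward without triggering UB, and only an observing operation would turn it into undefined behavior.

```rust
// Poison modeled as Option::None; arithmetic propagates it silently.
type Val = Option<i128>;

fn div(a: Val, b: Val) -> Val {
    match (a, b) {
        (Some(a), Some(b)) if b != 0 => Some(a / b),
        _ => None, // division by zero (or a poison input) yields poison
    }
}

fn add(a: Val, b: Val) -> Val {
    Some(a? + b?) // poison propagates through computation, no UB yet
}

fn main() {
    let p = div(Some(1), Some(0));     // poison produced
    assert_eq!(add(p, Some(5)), None); // still just poison, not yet UB
    assert_eq!(add(Some(2), Some(3)), Some(5));
    println!("ok");
}
```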
Single data structure, multiple stages
The IR uses one data structure throughout compilation. Different stages impose different normalization constraints:
- High-level: close to MIR translation, high-level operations allowed (e.g., aggregate operations, generic intrinsics).
- Optimized: after optimization passes, IR is normalized but still target-independent.
- Legalized: operations are lowered to target-legal forms (e.g., 64-bit multiply on 32-bit targets becomes a sequence of 32-bit operations).
- Pre-emission: fully scheduled, register-allocated, ready for machine code emission.
No separate SDAG or MachineIR. The same graph structure is progressively constrained.
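As a concrete sketch of the legalization example above: the low 64 bits of a 64-bit multiply can be computed with only 32x32→64 multiplies, as a 32-bit target would require (the function name is illustrative, not an IR operation):

```rust
// Legalization sketch: 64-bit multiply lowered to 32-bit limb multiplies.
fn mul64_via_32(a: u64, b: u64) -> u64 {
    let (a0, a1) = (a as u32 as u64, a >> 32); // split into 32-bit limbs
    let (b0, b1) = (b as u32 as u64, b >> 32);
    let lo = a0 * b0;                          // 32x32 -> 64, cannot overflow
    let mid = (a0 * b1).wrapping_add(a1 * b0); // cross terms; high bits discarded
    lo.wrapping_add(mid << 32)                 // a1*b1 contributes only above bit 64
}

fn main() {
    for &(a, b) in &[(0xdead_beef_1234_5678u64, 0x9abc_def0_0fed_cba9u64), (u64::MAX, 3)] {
        assert_eq!(mul64_via_32(a, b), a.wrapping_mul(b));
    }
    println!("ok");
}
```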
MIR translation
Translation from rustc MIR to tuffy IR involves:
- Control flow structuralization: attempt to normalize to natural loops and promote to SESE regions. Irreducible control flow is left as flat CFG within a region.
- mem2reg: MIR uses mutable Place values (not pure SSA). A mem2reg pass promotes stack allocations to SSA values where possible.
- Drop/unwind: Rust’s drop glue and panic unwind edges are preserved in CFG form, not promoted to SESE regions. Unwind paths remain as explicit CFG edges.
- Aggregate scalarization: structs, enums, arrays, and tuples are fully decomposed into scalar operations (analogous to LLVM’s SROA). No aggregate types exist in tuffy IR.
- Type translation: MIR fixed-width integers (u8, i32, u64, etc.) are translated to int with corresponding :u<N>/:s<N> result-side annotations at definition points.
Machine code lowering
Lowering from tuffy IR to machine code:
- ABI: a dedicated ABI library maps int parameters/returns to concrete register classes and calling conventions based on the target triple.
- Stack frame layout: derived from the target triple; alloca sizes are resolved after instruction selection determines concrete bit widths.
- Instruction selection: at-use bit-level annotations determine concrete types; the instruction selector chooses sext/zext and GPR/FPR placement.
- Debug info: DWARF generation follows LLVM’s approach.
- Unwind tables: exception handling table generation follows LLVM’s approach.
Reference-level explanation
Type system details
| Type | Description |
|---|---|
| int | Infinite precision integer, no fixed bit width |
| b<N> | Byte type, N must be a multiple of 8 |
| ptr<AS> | Pointer with provenance (allocation ID + offset) in address space AS |
| float | IEEE 754 single precision |
| double | IEEE 754 double precision |
| vec<vscale x N x T> | Scalable vector; vscale is a runtime constant, N is min element count, T is element type |
Integer operations
All integer arithmetic operates on mathematical integers:
| Operation | Semantics |
|---|---|
| add %a, %b | Mathematical addition |
| sub %a, %b | Mathematical subtraction |
| mul %a, %b | Mathematical multiplication |
| div %a, %b | Integer division (poison if %b == 0) |
| rem %a, %b | Integer remainder (poison if %b == 0) |
| and %a, %b | Bitwise AND on infinite precision |
| or %a, %b | Bitwise OR on infinite precision |
| xor %a, %b | Bitwise XOR on infinite precision |
| shl %a, %b | Left shift |
| shr %a, %b | Right shift (signedness from annotation) |
Value annotations
Annotations on value definitions and use edges:
| Annotation | Semantics |
|---|---|
| :s<N> | Poison if value outside signed N-bit range [-2^(N-1), 2^(N-1)-1] |
| :u<N> | Poison if value outside unsigned N-bit range [0, 2^N-1] |
| :known(<ternary>) | Per-bit four-state constraint (0, 1, ?, x) |
Result-side violation → instruction produces poison. Use-side violation → consuming instruction produces poison.
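The :s<N> and :u<N> checks from the table can be sketched directly, with poison modeled as Option::None and i128 standing in for the mathematical int (function names are illustrative):

```rust
// Annotation range checks; a violation yields poison (None).
fn check_s(n: u32, v: i128) -> Option<i128> {
    let lo = -(1i128 << (n - 1));           // -2^(N-1)
    let hi = (1i128 << (n - 1)) - 1;        //  2^(N-1) - 1
    if (lo..=hi).contains(&v) { Some(v) } else { None }
}

fn check_u(n: u32, v: i128) -> Option<i128> {
    if (0..(1i128 << n)).contains(&v) { Some(v) } else { None }
}

fn main() {
    assert_eq!(check_s(8, 127), Some(127));
    assert_eq!(check_s(8, 128), None); // outside [-128, 127] -> poison
    assert_eq!(check_u(8, 255), Some(255));
    assert_eq!(check_u(8, -1), None);
    println!("ok");
}
```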
Pointer operations
| Operation | Semantics |
|---|---|
| ptrtoint %p | Convert pointer to integer, capturing provenance |
| ptrtoaddr %p | Extract address only, discard provenance |
| inttoptr %i | Convert integer to pointer, recovering captured provenance |
| ptradd %p, %offset | Pointer arithmetic within an allocation (provenance-preserving offset addition) |
| ptrdiff %p, %q | Pointer difference; returns %p.offset - %q.offset as int; poison if %p and %q do not belong to the same allocation |
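The provenance-dependent operations above can be sketched with a pointer represented as an (allocation ID, offset) pair. Names here are illustrative; ptrtoaddr would additionally need a concrete base address per allocation, which is omitted:

```rust
// Abstract pointers: allocation ID + offset; poison modeled as None.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Ptr { alloc: u64, offset: i64 }

fn ptradd(p: Ptr, off: i64) -> Ptr {
    Ptr { offset: p.offset + off, ..p } // provenance (alloc) is preserved
}

fn ptrdiff(p: Ptr, q: Ptr) -> Option<i64> {
    if p.alloc == q.alloc { Some(p.offset - q.offset) } else { None } // else poison
}

fn main() {
    let p = Ptr { alloc: 1, offset: 0 };
    let q = ptradd(p, 16);
    assert_eq!(ptrdiff(q, p), Some(16));
    let r = Ptr { alloc: 2, offset: 16 };
    assert_eq!(ptrdiff(r, p), None); // different allocations -> poison
    println!("ok");
}
```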
Byte type operations
| Operation | Semantics |
|---|---|
| bitcast %v to b<N> | Type-punning conversion between byte type and other types |
| load b<N>, ptr %p | Load raw bytes from memory |
| store b<N> %v, ptr %p | Store raw bytes to memory |
Vector operations
All vector operations work on vec<vscale x N x T> types. A mask operand (vec<vscale x N x int>) is optional on most operations to support predicated execution (SVE/RVV style).
Elementwise arithmetic:
| Operation | Semantics |
|---|---|
| vec.add %a, %b | Elementwise addition |
| vec.sub %a, %b | Elementwise subtraction |
| vec.mul %a, %b | Elementwise multiplication |
| vec.div %a, %b | Elementwise division |
| vec.rem %a, %b | Elementwise remainder |
| vec.and %a, %b | Elementwise bitwise AND |
| vec.or %a, %b | Elementwise bitwise OR |
| vec.xor %a, %b | Elementwise bitwise XOR |
| vec.shl %a, %b | Elementwise left shift |
| vec.shr %a, %b | Elementwise right shift (signedness from annotation) |
| vec.fadd %a, %b | Elementwise floating point addition |
| vec.fmul %a, %b | Elementwise floating point multiplication |
Horizontal reductions:
| Operation | Semantics |
|---|---|
| vec.reduce.add %v | Sum all elements |
| vec.reduce.mul %v | Multiply all elements |
| vec.reduce.smin %v | Signed minimum across elements |
| vec.reduce.smax %v | Signed maximum across elements |
| vec.reduce.umin %v | Unsigned minimum across elements |
| vec.reduce.umax %v | Unsigned maximum across elements |
| vec.reduce.and %v | Bitwise AND across elements |
| vec.reduce.or %v | Bitwise OR across elements |
| vec.reduce.xor %v | Bitwise XOR across elements |
| vec.reduce.fadd %v | Floating point sum (ordered or unordered via flag) |
| vec.reduce.fmin %v | Floating point minimum across elements |
| vec.reduce.fmax %v | Floating point maximum across elements |
Scatter/gather (indexed vector memory access):
| Operation | Semantics |
|---|---|
| vec.gather %base, %indices, %mask, %mem | Load elements from base + indices[i] for each lane where mask is true; produces vector + memory token |
| vec.scatter %v, %base, %indices, %mask, %mem | Store elements to base + indices[i] for each lane where mask is true; produces memory token |
Masking/predication:
| Operation | Semantics |
|---|---|
| vec.select %mask, %a, %b | Per-lane select: mask[i] ? a[i] : b[i] |
| vec.compress %v, %mask | Pack active lanes (mask=true) to front |
| vec.expand %v, %mask | Scatter contiguous elements to active lanes |
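The lane semantics of vec.select and vec.compress can be sketched on plain slices (scalar code standing in for vector lanes; what vec.compress leaves in the trailing inactive lanes is not modeled here):

```rust
// Per-lane select: mask[i] ? a[i] : b[i].
fn vselect(mask: &[bool], a: &[i64], b: &[i64]) -> Vec<i64> {
    mask.iter()
        .zip(a.iter().zip(b))
        .map(|(&m, (&x, &y))| if m { x } else { y })
        .collect()
}

// Pack active lanes to the front; only the packed prefix is returned.
fn vcompress(v: &[i64], mask: &[bool]) -> Vec<i64> {
    v.iter().zip(mask).filter(|&(_, &m)| m).map(|(&x, _)| x).collect()
}

fn main() {
    let m = [true, false, true, false];
    assert_eq!(vselect(&m, &[1, 2, 3, 4], &[9, 9, 9, 9]), vec![1, 9, 3, 9]);
    assert_eq!(vcompress(&[1, 2, 3, 4], &m), vec![1, 3]);
    println!("ok");
}
```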
Lane manipulation:
| Operation | Semantics |
|---|---|
vec.splat %scalar | Broadcast scalar to all lanes |
vec.extract %v, %idx | Extract scalar from lane |
vec.insert %v, %scalar, %idx | Insert scalar into lane |
vec.shuffle %a, %b, %indices | Permute lanes from two vectors |
Contiguous vector memory access:
| Operation | Semantics |
|---|---|
| vec.load %ptr, %mask, %mem | Masked contiguous load |
| vec.store %v, %ptr, %mask, %mem | Masked contiguous store |
Platform intrinsics:
Platform-specific intrinsics (e.g., std::arch::x86_64::*) support two modes, selectable via compiler flag or function-level attribute:
- Map mode (default): recognized intrinsics are mapped to tuffy vector operations, enabling further optimization by the compiler.
- Passthrough mode: intrinsics are preserved as opaque target-specific calls (target.intrinsic "x86.avx2.add.ps" %a, %b), emitted as-is during instruction selection. Suitable for hand-tuned SIMD code where the user wants exact instruction control.
Unrecognized intrinsics always use passthrough mode.
Memory byte states
Each byte in the abstract memory is represented as:
enum AbstractByte {
Bits(u8), // concrete value 0-255
PtrFragment(AllocId, usize), // nth byte of a pointer
Uninit, // uninitialized
Poison, // poisoned
}
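To illustrate how the four states interact with an integer load (a simplified model, not the RFC's evaluator): a run of pure Bits bytes yields a value, while any Uninit or Poison byte poisons the result, as would reading a PtrFragment as an integer.

```rust
// Reading bytes as a little-endian integer under the four-state model;
// poison is modeled as None.
#[derive(Clone, Copy)]
enum AbstractByte {
    Bits(u8),                 // concrete value 0-255
    PtrFragment(u64, usize),  // nth byte of a pointer (alloc ID, index)
    Uninit,                   // uninitialized
    Poison,                   // poisoned
}

fn read_int(bytes: &[AbstractByte]) -> Option<u64> {
    let mut v = 0u64;
    for (i, b) in bytes.iter().enumerate() {
        match b {
            AbstractByte::Bits(x) => v |= (*x as u64) << (8 * i), // little-endian
            _ => return None, // Uninit / Poison / PtrFragment -> poison
        }
    }
    Some(v)
}

fn main() {
    use AbstractByte::*;
    assert_eq!(read_int(&[Bits(0x34), Bits(0x12)]), Some(0x1234));
    assert_eq!(read_int(&[Bits(0), Uninit]), None);
    println!("ok");
}
```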
Hierarchical CFG structure
enum CfgNode {
BasicBlock(Vec<Instruction>, Terminator),
Region(RegionKind, Vec<CfgNode>),
}
enum RegionKind {
Loop { header: BasicBlockId },
Branch { condition: Value, then_region: RegionId, else_region: RegionId },
Sequence,
}
Regions are always SESE. Translation from rustc MIR requires control flow structuralization.
At-use annotation representation
Each use edge and each instruction result carries an optional annotation:
struct Annotation {
known_ones: BigUint, // bits known to be 1
known_zeros: BigUint, // bits known to be 0
demanded: BigUint, // bits that are demanded
}
Invariant: known_ones & known_zeros == 0.
Shorthand :s<N> and :u<N> are sugar over this representation. No annotation means unconstrained int; the verifier reports an error if instruction selection cannot determine a concrete machine type.
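A small sketch of how such annotations compose during analysis, using u128 in place of BigUint and omitting the demanded field for brevity: the standard known-bits transfer function for a bitwise AND combines the operand annotations while preserving the invariant above.

```rust
// Known-bits pair with the invariant known_ones & known_zeros == 0.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Annotation {
    known_ones: u128,
    known_zeros: u128,
}

impl Annotation {
    fn valid(&self) -> bool {
        self.known_ones & self.known_zeros == 0
    }

    // Transfer function for `and %a, %b`:
    fn and(self, rhs: Self) -> Self {
        Annotation {
            known_ones: self.known_ones & rhs.known_ones,    // 1 only if both known 1
            known_zeros: self.known_zeros | rhs.known_zeros, // 0 if either known 0
        }
    }
}

fn main() {
    let a = Annotation { known_ones: 0b1010, known_zeros: 0b0100 };
    let b = Annotation { known_ones: 0b1000, known_zeros: 0b0011 };
    let r = a.and(b);
    assert!(r.valid()); // the invariant is preserved
    assert_eq!(r, Annotation { known_ones: 0b1000, known_zeros: 0b0111 });
    println!("ok");
}
```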
Drawbacks
- Infinite precision integers add complexity to instruction selection. The final stage must derive concrete bit widths from at-use annotations, which is a non-trivial analysis.
- Hierarchical CFG requires control flow structuralization from MIR, which may introduce extra blocks for irreducible control flow.
- At-use annotations on every edge increase memory footprint of the IR.
- Single data structure means late-stage IR carries fields irrelevant to early stages and vice versa, potentially wasting memory.
- Novel design means less existing tooling and literature to draw from compared to traditional SSA CFG.
Rationale and alternatives
Why infinite precision integers?
Two equally important motivations:
- Formal verification: Infinite precision integers are mathematical integers — exactly what theorem provers (Lean 4, Mathlib) natively support. Rewrite rule proofs operate on standard integer arithmetic with induction, algebraic properties, and equivalences available out of the box. Fixed-width integers would require every proof to reason about overflow, wrapping semantics, and sign extension, dramatically increasing proof complexity.
- Pattern recognition: Source languages (especially C) introduce redundant integer conversions due to implicit promotion rules (e.g., C’s integer promotion widens char and short to int). In LLVM IR, these become chains of zext/sext/trunc that obscure the underlying mathematical equivalence. Pattern matching in optimization passes fails to see through the conversion noise, causing missed optimizations. Infinite precision integers eliminate this noise entirely — the mathematical structure is directly visible in the IR.
Together, these make both the proof side and the optimization side simpler and more effective.
Alternative: arbitrary-width integers (LLVM-style). Rejected because they still require zext/sext/trunc and overflow reasoning in every pass and every proof.
Alternative: fixed set of widths (Cranelift-style). Simpler but still carries the same fundamental problems.
Why hierarchical CFG?
Flat CFGs lose loop structure, requiring expensive loop detection passes. Since tuffy aims to perform loop optimizations early, preserving this structure in the IR is essential.
Alternative: flat SSA CFG. Rejected because loop detection is expensive and error-prone.
Alternative: Sea of Nodes. Considered seriously. Advantages for strongly-typed languages (Cliff Click’s arguments), but engineering complexity (scheduling, debugging, cache locality) outweighs benefits. V8’s experience is cautionary even if their problems were partly JS-specific.
Alternative: E-graph. Attractive for formal verification but immature for handling side effects and control flow at scale.
Why four-state bytes?
The Uninit state (separate from Poison) correctly models uninitialized memory as an abstract concept rather than “random bits.” The PtrFragment state preserves pointer provenance through byte-level memory operations like memcpy. Poison at the byte level enables per-byte poison tracking aligned with the byte type design.
Why does ptrtoint capture provenance?
This enables safe inttoptr(ptrtoint(p)) roundtrips while ptrtoaddr provides a provenance-stripping alternative. This avoids the LLVM problem where three individually-correct optimizations produce wrong results when combined (Ralf Jung, 2020).
Prior art
- LLVM IR: SSA CFG with fixed-width integers, undef/poison duality, flat CFG. Tuffy’s IR addresses known LLVM pain points (undef removal, byte type, phase ordering).
- Cranelift CLIF: Single IR with stage-dependent legalization. Tuffy adopts this “single data structure” approach but adds hierarchical CFG and infinite precision integers.
- LLVM VPlan: Hierarchical CFG with SESE regions for vectorization. Inspires tuffy’s region-based structure.
- V8 TurboFan / Turboshaft: Sea of Nodes abandoned in favor of CFG after 10 years. Informs tuffy’s decision to use SSA basic blocks inside regions.
- Cliff Click’s Sea of Nodes: Strong typing (Java, Rust) mitigates many of V8’s problems. Tuffy acknowledges this but prioritizes engineering simplicity.
- Alive2: Poison-only UB model, byte type proposal. Directly informs tuffy’s UB and type design.
- Miri: Abstract pointer model (allocation ID + offset), abstract byte representation. Informs tuffy’s memory model.
Unresolved questions
- Floating point semantics: How should NaN payloads and denormals be handled? Are there opportunities for infinite-precision-like simplification?
- Exception handling: How do Rust panics and unwinding interact with SESE regions?
- Stack slot lifetime and coloring: stack_slot currently takes only a size parameter. To support stack coloring (overlapping non-live slots), lifetime information is needed. The preferred direction is region-based: stack slots belong to a SESE region and are automatically dead when control exits the region. Open question: after lowering to a flat CFG, how is region-based lifetime information preserved? Options include (a) materializing lifetime markers at region boundaries during lowering, or (b) maintaining a region tree alongside the flat CFG. The lifetime.start/end intrinsic approach (LLVM-style) is not preferred.
Future possibilities
- E-graph integration: Equality saturation could be applied as an optimization strategy on subgraphs within the hierarchical CFG, without replacing the core IR representation.
- Superoptimizer: The formal-first design and rewrite path tracing make it natural to integrate solver-based optimization discovery (Souper-style).
- Multi-language frontend: The IR is designed to be language-agnostic. Future frontends for other languages can target the same IR.
- JIT compilation: The single data structure with stage-dependent normalization could support a JIT mode that skips late-stage optimizations for faster compilation.
- Translation validation: Rewrite path traces provide the foundation for end-to-end translation validation of the compilation pipeline.