- Feature Name: tuffy_ir
- Status: Draft
- Created: 2026-02-07
- Completed: N/A
Summary
This RFC defines the intermediate representation (IR) for the tuffy compiler. The IR uses a single data structure across all compilation stages with stage-dependent normalization, infinite precision integers with at-use bit-level analysis, a four-state byte memory model, and a hierarchical CFG with nested SESE regions.
Motivation
Existing compiler IRs (LLVM IR, Cranelift CLIF) carry historical design baggage that limits optimization potential and complicates formal verification:
- Fixed-width integers force premature commitment to bit widths, introducing zext/sext/trunc noise that obscures mathematical equivalences.
- undef/poison duality in LLVM creates well-documented semantic confusion and miscompilation bugs. Approximately 18% of detected LLVM miscompilations stem from undef alone.
- Flat CFGs lose structural information (loops, branches) that must be expensively recovered for high-level optimizations.
- Side-table analyses (KnownBits, DemandedBits) are disconnected from the IR, creating phase ordering problems and stale analysis results.
- Multiple IR layers (LLVM IR → SelectionDAG → MachineIR) require costly translations with potential information loss at each boundary.
Tuffy’s IR addresses these issues from first principles, designed for formal verification, efficient compilation, and clean semantics.
Guide-level explanation
Type system
Tuffy IR has a minimal type system:
- int — Infinite precision integer. No bit width, no signedness. Arithmetic is mathematical: add(3, 5) is 8, period. There is no overflow.
- b<N> — Byte type representing raw memory data, where N is a multiple of 8. Distinct from integers; preserves pointer fragments and per-byte poison tracking.
- ptr<AS> — Pointer type carrying provenance (allocation ID + offset) in address space AS. Default address space is 0. Different address spaces may have different pointer sizes and semantics.
- float / double — IEEE 754 floating point types.
- vec<vscale x N x T> — Scalable vector type. vscale is a runtime constant determined by hardware; N is the minimum number of elements; T is the element type (int, float, double, b<N>, ptr<AS>). Fixed-width vectors are represented as vscale=1 (e.g., vec<1 x 4 x b32> is a 128-bit fixed vector of 4 x 32-bit elements). When the element type is int, the actual vector width is unknown until instruction selection, since int has no fixed bit width; the element count is determined by the IR, but the physical vector register mapping is deferred.
Infinite precision integers
Instead of i32 or i64, all integer values are simply int. Range constraints are expressed via at-use annotations on value definitions and use edges (see RFC: at-use-annotations):
%raw = load b32, ptr %p // load 32 bits of raw bytes from memory
%x:s32 = bitcast %raw to int // cast to int, result annotated as signed 32-bit
%y:s32 = add %x:s32, 1 // mathematical addition, no overflow; result in s32 range
%out = bitcast %y:s32 to b32 // cast back to bytes for storing
store b32 %out, ptr %q
Loads always operate on byte types (b<N>). The byte type carries the bit width; integers do not. The decision of sign-extension vs zero-extension, and whether to use a GPR or FPR, is deferred entirely to instruction selection based on at-use annotations.
There are no zext, sext, or trunc instructions. Bitwise operations (and, or, xor, shift) are defined on infinite precision integers.
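To sketch what infinite-precision bitwise semantics mean in practice (this model is not part of the RFC): a negative int behaves like an infinite run of sign bits, so extracting the low bits of a value needs only a mask, with no trunc/zext/sext chain. Here i128 stands in for a true arbitrary-precision integer:

```rust
// Infinite-precision bitwise ops modeled as two's complement with infinite
// sign extension; i128 is a stand-in for a real bignum.
fn mask_low_bits(v: i128, n: u32) -> i128 {
    // and %v, (2^n - 1): a negative v acts as an infinite run of 1-bits,
    // so masking extracts its low n bits directly, no extension needed.
    v & ((1i128 << n) - 1)
}

fn main() {
    assert_eq!(mask_low_bits(-1, 8), 0xFF); // ...111111111 & 0xFF
    assert_eq!(mask_low_bits(300, 8), 44);  // 300 mod 256
    println!("ok");
}
```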
At-use annotations
Range constraints and bit-level analysis results are encoded directly into the IR as annotations on value definitions (result-side) and use edges (use-side). See RFC: at-use-annotations for the full design.
Each annotation carries a four-state per-bit encoding:
- 0: bit is known to be 0
- 1: bit is known to be 1
- ?: bit is demanded but value unknown
- x: bit is don’t-care (not demanded)
Shorthand annotations :s32, :u8 etc. are sugar over the full known-bits representation.
Result-side annotations are generated by the frontend from source language type information. Use-side annotations may be added by optimization passes when they discover stronger constraints at specific use sites. This follows the “analysis is also a transformation” principle — analysis results are encoded directly into the IR, not stored in side tables.
Hierarchical CFG
The IR uses a hierarchical control flow graph:
func @example(%arg: int:s32) -> int:s32 {
region loop {
bb0(%i: int):
%cond = cmp.lt %i:s32, %arg:s32
br %cond, bb1, exit
bb1:
%i.next = add %i, 1
br bb0(%i.next)
}
exit:
ret %i:s32
}
Loops and branches are explicitly represented as nested SESE (Single-Entry Single-Exit) regions. Inside each region, instructions are organized in traditional SSA basic blocks with fixed ordering.
Memory model
Pointers are abstract values with provenance:
%p = alloca int // %p has provenance tied to this allocation
%addr = ptrtoaddr %p // extracts address only, discards provenance
%bits = ptrtoint %p // converts to integer, captures provenance
%q = inttoptr %bits // safe roundtrip: recovers provenance
Each byte in memory is one of four states:
- Bits(u8): a concrete byte value (0–255)
- PtrFragment(Pointer, n): the nth byte of an abstract pointer
- Uninit: uninitialized memory (not “random bits”; a distinct abstract state)
- Poison: a poisoned byte
Memory SSA
Memory dependencies are encoded directly into the IR as Memory SSA (MemSSA). Each memory operation (load, store, call) carries a memory version token representing the memory state it depends on or produces.
%mem0 = entry_mem // initial memory state
%mem1 = store b32 %v, ptr %p, %mem0 // store produces new memory version
%raw = load b32, ptr %q, %mem1 // load depends on memory version
MemSSA supports progressive refinement: initially a conservative single memory chain, then split into finer-grained versions as alias analysis improves. This follows the “analysis is also a transformation” principle — memory dependency information lives in the IR, not in side tables.
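A minimal model of this token threading, with illustrative names (MemToken and Builder are not the RFC's actual data structures): stores consume one memory version and produce a fresh one, while loads merely name the version they depend on, making the store→load ordering explicit in the value graph.

```rust
// MemSSA sketch: memory versions as SSA values threaded through operations.
#[derive(Clone, Copy, PartialEq, Debug)]
struct MemToken(u32);

struct Builder { next: u32 }

impl Builder {
    fn fresh(&mut self) -> MemToken { let t = MemToken(self.next); self.next += 1; t }
    fn entry_mem(&mut self) -> MemToken { self.fresh() }          // %mem0 = entry_mem
    fn store(&mut self, _dep: MemToken) -> MemToken { self.fresh() } // store produces a new version
    fn load(&self, dep: MemToken) -> MemToken { dep }             // load depends on, does not advance
}

fn main() {
    let mut b = Builder { next: 0 };
    let mem0 = b.entry_mem();   // %mem0 = entry_mem
    let mem1 = b.store(mem0);   // %mem1 = store ..., %mem0
    let dep = b.load(mem1);     // %raw  = load ..., %mem1
    assert_ne!(mem0, mem1);     // the store created a new memory version
    assert_eq!(dep, mem1);      // the load is ordered after the store
    println!("ok");
}
```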
Atomic operations
Atomic memory operations are regular load/store/rmw/cmpxchg instructions annotated with a memory ordering:
%raw = load.atomic b32, ptr %p, ordering=acquire, %mem0
%mem1 = store.atomic b32 %v, ptr %p, ordering=release, %mem0
%old = rmw.add b32, ptr %p, %delta, ordering=seqcst, %mem0
%res, %ok = cmpxchg b32, ptr %p, %expected, %desired, success=acqrel, failure=acquire, %mem0
Supported orderings: relaxed, acquire, release, acqrel, seqcst. Fence instructions are also supported as standalone operations.
Atomic operations participate in MemSSA like any other memory operation, with ordering constraints restricting reordering during optimization.
UB model
The IR uses only poison for undefined behavior. There is no undef.
- Operations with UB (e.g., division by zero) produce poison.
- Annotation violations produce poison: result-side violations at the defining instruction, use-side violations at the consuming instruction.
- Any operation on Uninit memory produces poison.
- poison propagates through computations but is only UB when “observed” (e.g., used as a branch condition, stored to observable memory).
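The propagate-then-observe rule can be sketched by modeling poison as Option::None (an informal model, not the RFC's implementation): arithmetic carries poison forward without triggering UB, and only an observing operation would turn it into undefined behavior.

```rust
// Poison modeled as Option::None; arithmetic propagates it silently.
type Val = Option<i128>;

fn div(a: Val, b: Val) -> Val {
    match (a, b) {
        (Some(a), Some(b)) if b != 0 => Some(a / b),
        _ => None, // division by zero (or a poison input) yields poison
    }
}

fn add(a: Val, b: Val) -> Val {
    Some(a? + b?) // poison propagates through computation, no UB yet
}

fn main() {
    let p = div(Some(1), Some(0));     // poison produced
    assert_eq!(add(p, Some(5)), None); // still just poison, not yet UB
    assert_eq!(add(Some(2), Some(3)), Some(5));
    println!("ok");
}
```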
Single data structure, multiple stages
The IR uses one data structure throughout compilation. Different stages impose different normalization constraints:
- High-level: close to MIR translation, high-level operations allowed (e.g., aggregate operations, generic intrinsics).
- Optimized: after optimization passes, IR is normalized but still target-independent.
- Legalized: operations are lowered to target-legal forms (e.g., 64-bit multiply on 32-bit targets becomes a sequence of 32-bit operations).
- Pre-emission: fully scheduled, register-allocated, ready for machine code emission.
No separate SDAG or MachineIR. The same graph structure is progressively constrained.
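As a concrete sketch of the legalization example above: the low 64 bits of a 64-bit multiply can be computed with only 32x32→64 multiplies, as a 32-bit target would require (the function name is illustrative, not an IR operation):

```rust
// Legalization sketch: 64-bit multiply lowered to 32-bit limb multiplies.
fn mul64_via_32(a: u64, b: u64) -> u64 {
    let (a0, a1) = (a as u32 as u64, a >> 32); // split into 32-bit limbs
    let (b0, b1) = (b as u32 as u64, b >> 32);
    let lo = a0 * b0;                          // 32x32 -> 64, cannot overflow
    let mid = (a0 * b1).wrapping_add(a1 * b0); // cross terms; high bits discarded
    lo.wrapping_add(mid << 32)                 // a1*b1 contributes only above bit 64
}

fn main() {
    for &(a, b) in &[(0xdead_beef_1234_5678u64, 0x9abc_def0_0fed_cba9u64), (u64::MAX, 3)] {
        assert_eq!(mul64_via_32(a, b), a.wrapping_mul(b));
    }
    println!("ok");
}
```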
MIR translation
Translation from rustc MIR to tuffy IR involves:
- Control flow structuralization: attempt to normalize to natural loops and promote to SESE regions. Irreducible control flow is left as flat CFG within a region.
- mem2reg: MIR uses mutable Place values (not pure SSA). A mem2reg pass promotes stack allocations to SSA values where possible.
- Drop/unwind: Rust’s drop glue and panic unwind edges are preserved in CFG form, not promoted to SESE regions. Unwind paths remain as explicit CFG edges.
- Aggregate scalarization: structs, enums, arrays, and tuples are fully decomposed into scalar operations (analogous to LLVM’s SROA). No aggregate types exist in tuffy IR.
- Type translation: MIR fixed-width integers (u8, i32, u64, etc.) are translated to int with corresponding :u<N>/:s<N> result-side annotations at definition points.
Machine code lowering
Lowering from tuffy IR to machine code:
- ABI: a dedicated ABI library maps int parameters/returns to concrete register classes and calling conventions based on the target triple.
- Stack frame layout: derived from the target triple; alloca sizes are resolved after instruction selection determines concrete bit widths.
- Instruction selection: at-use bit-level annotations determine concrete types; the instruction selector chooses sext/zext and GPR/FPR placement.
- Debug info: DWARF generation follows LLVM’s approach.
- Unwind tables: exception handling table generation follows LLVM’s approach.
Reference-level explanation
Type system details
| Type | Description |
|---|---|
| int | Infinite precision integer, no fixed bit width |
| b<N> | Byte type, N must be a multiple of 8 |
| ptr<AS> | Pointer with provenance (allocation ID + offset) in address space AS |
| float | IEEE 754 single precision |
| double | IEEE 754 double precision |
| vec<vscale x N x T> | Scalable vector; vscale is a runtime constant, N is min element count, T is element type |
Integer operations
All integer arithmetic operates on mathematical integers:
| Operation | Semantics |
|---|---|
| add %a, %b | Mathematical addition |
| sub %a, %b | Mathematical subtraction |
| mul %a, %b | Mathematical multiplication |
| div %a, %b | Integer division (poison if %b == 0) |
| rem %a, %b | Integer remainder (poison if %b == 0) |
| and %a, %b | Bitwise AND on infinite precision |
| or %a, %b | Bitwise OR on infinite precision |
| xor %a, %b | Bitwise XOR on infinite precision |
| shl %a, %b | Left shift |
| shr %a, %b | Right shift (signedness from annotation) |
Value annotations
Annotations on value definitions and use edges:
| Annotation | Semantics |
|---|---|
| :s<N> | Poison if value outside signed N-bit range [-2^(N-1), 2^(N-1)-1] |
| :u<N> | Poison if value outside unsigned N-bit range [0, 2^N-1] |
| :known(<ternary>) | Per-bit four-state constraint (0, 1, ?, x) |
Result-side violation → instruction produces poison. Use-side violation → consuming instruction produces poison.
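The :s<N> and :u<N> checks from the table can be sketched directly, with poison modeled as Option::None and i128 standing in for the mathematical int (function names are illustrative):

```rust
// Annotation range checks; a violation yields poison (None).
fn check_s(n: u32, v: i128) -> Option<i128> {
    let lo = -(1i128 << (n - 1));           // -2^(N-1)
    let hi = (1i128 << (n - 1)) - 1;        //  2^(N-1) - 1
    if (lo..=hi).contains(&v) { Some(v) } else { None }
}

fn check_u(n: u32, v: i128) -> Option<i128> {
    if (0..(1i128 << n)).contains(&v) { Some(v) } else { None }
}

fn main() {
    assert_eq!(check_s(8, 127), Some(127));
    assert_eq!(check_s(8, 128), None); // outside [-128, 127] -> poison
    assert_eq!(check_u(8, 255), Some(255));
    assert_eq!(check_u(8, -1), None);
    println!("ok");
}
```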
Pointer operations
| Operation | Semantics |
|---|---|
| ptrtoint %p | Convert pointer to integer, capturing provenance |
| ptrtoaddr %p | Extract address only, discard provenance |
| inttoptr %i | Convert integer to pointer, recovering captured provenance |
| ptradd %p, %offset | Pointer arithmetic within an allocation (provenance-preserving offset addition) |
| ptrdiff %p, %q | Pointer difference; returns %p.offset - %q.offset as int; poison if %p and %q do not belong to the same allocation |
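The provenance-dependent operations above can be sketched with a pointer represented as an (allocation ID, offset) pair. Names here are illustrative; ptrtoaddr would additionally need a concrete base address per allocation, which is omitted:

```rust
// Abstract pointers: allocation ID + offset; poison modeled as None.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Ptr { alloc: u64, offset: i64 }

fn ptradd(p: Ptr, off: i64) -> Ptr {
    Ptr { offset: p.offset + off, ..p } // provenance (alloc) is preserved
}

fn ptrdiff(p: Ptr, q: Ptr) -> Option<i64> {
    if p.alloc == q.alloc { Some(p.offset - q.offset) } else { None } // else poison
}

fn main() {
    let p = Ptr { alloc: 1, offset: 0 };
    let q = ptradd(p, 16);
    assert_eq!(ptrdiff(q, p), Some(16));
    let r = Ptr { alloc: 2, offset: 16 };
    assert_eq!(ptrdiff(r, p), None); // different allocations -> poison
    println!("ok");
}
```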
Byte type operations
| Operation | Semantics |
|---|---|
| bitcast %v to b<N> | Type-punning conversion between byte type and other types |
| load b<N>, ptr %p | Load raw bytes from memory |
| store b<N> %v, ptr %p | Store raw bytes to memory |
Vector operations
All vector operations work on vec<vscale x N x T> types. A mask operand (vec<vscale x N x int>) is optional on most operations to support predicated execution (SVE/RVV style).
Elementwise arithmetic:
| Operation | Semantics |
|---|---|
| vec.add %a, %b | Elementwise addition |
| vec.sub %a, %b | Elementwise subtraction |
| vec.mul %a, %b | Elementwise multiplication |
| vec.div %a, %b | Elementwise division |
| vec.rem %a, %b | Elementwise remainder |
| vec.and %a, %b | Elementwise bitwise AND |
| vec.or %a, %b | Elementwise bitwise OR |
| vec.xor %a, %b | Elementwise bitwise XOR |
| vec.shl %a, %b | Elementwise left shift |
| vec.shr %a, %b | Elementwise right shift (signedness from annotation) |
| vec.fadd %a, %b | Elementwise floating point addition |
| vec.fmul %a, %b | Elementwise floating point multiplication |
Horizontal reductions:
| Operation | Semantics |
|---|---|
| vec.reduce.add %v | Sum all elements |
| vec.reduce.mul %v | Multiply all elements |
| vec.reduce.smin %v | Signed minimum across elements |
| vec.reduce.smax %v | Signed maximum across elements |
| vec.reduce.umin %v | Unsigned minimum across elements |
| vec.reduce.umax %v | Unsigned maximum across elements |
| vec.reduce.and %v | Bitwise AND across elements |
| vec.reduce.or %v | Bitwise OR across elements |
| vec.reduce.xor %v | Bitwise XOR across elements |
| vec.reduce.fadd %v | Floating point sum (ordered or unordered via flag) |
| vec.reduce.fmin %v | Floating point minimum across elements |
| vec.reduce.fmax %v | Floating point maximum across elements |
Scatter/gather (indexed vector memory access):
| Operation | Semantics |
|---|---|
| vec.gather %base, %indices, %mask, %mem | Load elements from base + indices[i] for each lane where mask is true; produces vector + memory token |
| vec.scatter %v, %base, %indices, %mask, %mem | Store elements to base + indices[i] for each lane where mask is true; produces memory token |
Masking/predication:
| Operation | Semantics |
|---|---|
| vec.select %mask, %a, %b | Per-lane select: mask[i] ? a[i] : b[i] |
| vec.compress %v, %mask | Pack active lanes (mask=true) to front |
| vec.expand %v, %mask | Scatter contiguous elements to active lanes |
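The lane semantics of vec.select and vec.compress can be sketched on plain slices (scalar code standing in for vector lanes; what vec.compress leaves in the trailing inactive lanes is not modeled here):

```rust
// Per-lane select: mask[i] ? a[i] : b[i].
fn vselect(mask: &[bool], a: &[i64], b: &[i64]) -> Vec<i64> {
    mask.iter()
        .zip(a.iter().zip(b))
        .map(|(&m, (&x, &y))| if m { x } else { y })
        .collect()
}

// Pack active lanes to the front; only the packed prefix is returned.
fn vcompress(v: &[i64], mask: &[bool]) -> Vec<i64> {
    v.iter().zip(mask).filter(|&(_, &m)| m).map(|(&x, _)| x).collect()
}

fn main() {
    let m = [true, false, true, false];
    assert_eq!(vselect(&m, &[1, 2, 3, 4], &[9, 9, 9, 9]), vec![1, 9, 3, 9]);
    assert_eq!(vcompress(&[1, 2, 3, 4], &m), vec![1, 3]);
    println!("ok");
}
```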
Lane manipulation:
| Operation | Semantics |
|---|---|
vec.splat %scalar | Broadcast scalar to all lanes |
vec.extract %v, %idx | Extract scalar from lane |
vec.insert %v, %scalar, %idx | Insert scalar into lane |
vec.shuffle %a, %b, %indices | Permute lanes from two vectors |
Contiguous vector memory access:
| Operation | Semantics |
|---|---|
| vec.load %ptr, %mask, %mem | Masked contiguous load |
| vec.store %v, %ptr, %mask, %mem | Masked contiguous store |
Platform intrinsics:
Platform-specific intrinsics (e.g., std::arch::x86_64::*) support two modes, selectable via compiler flag or function-level attribute:
- Map mode (default): recognized intrinsics are mapped to tuffy vector operations, enabling further optimization by the compiler.
- Passthrough mode: intrinsics are preserved as opaque target-specific calls (target.intrinsic "x86.avx2.add.ps" %a, %b), emitted as-is during instruction selection. Suitable for hand-tuned SIMD code where the user wants exact instruction control.
Unrecognized intrinsics always use passthrough mode.
Memory byte states
Each byte in the abstract memory is represented as:
enum AbstractByte {
Bits(u8), // concrete value 0-255
PtrFragment(AllocId, usize), // nth byte of a pointer
Uninit, // uninitialized
Poison, // poisoned
}
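To illustrate how the four states interact with an integer load (a simplified model, not the RFC's evaluator): a run of pure Bits bytes yields a value, while any Uninit or Poison byte poisons the result, as would reading a PtrFragment as an integer.

```rust
// Reading bytes as a little-endian integer under the four-state model;
// poison is modeled as None.
#[derive(Clone, Copy)]
enum AbstractByte {
    Bits(u8),                 // concrete value 0-255
    PtrFragment(u64, usize),  // nth byte of a pointer (alloc ID, index)
    Uninit,                   // uninitialized
    Poison,                   // poisoned
}

fn read_int(bytes: &[AbstractByte]) -> Option<u64> {
    let mut v = 0u64;
    for (i, b) in bytes.iter().enumerate() {
        match b {
            AbstractByte::Bits(x) => v |= (*x as u64) << (8 * i), // little-endian
            _ => return None, // Uninit / Poison / PtrFragment -> poison
        }
    }
    Some(v)
}

fn main() {
    use AbstractByte::*;
    assert_eq!(read_int(&[Bits(0x34), Bits(0x12)]), Some(0x1234));
    assert_eq!(read_int(&[Bits(0), Uninit]), None);
    println!("ok");
}
```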
Hierarchical CFG structure
enum CfgNode {
BasicBlock(Vec<Instruction>, Terminator),
Region(RegionKind, Vec<CfgNode>),
}
enum RegionKind {
Loop { header: BasicBlockId },
Branch { condition: Value, then_region: RegionId, else_region: RegionId },
Sequence,
}
Regions are always SESE. Translation from rustc MIR requires control flow structuralization.
At-use annotation representation
Each use edge and each instruction result carries an optional annotation:
struct Annotation {
known_ones: BigUint, // bits known to be 1
known_zeros: BigUint, // bits known to be 0
demanded: BigUint, // bits that are demanded
}
Invariant: known_ones & known_zeros == 0.
Shorthand :s<N> and :u<N> are sugar over this representation. No annotation means unconstrained int; the verifier reports an error if instruction selection cannot determine a concrete machine type.
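A small sketch of how such annotations compose during analysis, using u128 in place of BigUint and omitting the demanded field for brevity: the standard known-bits transfer function for a bitwise AND combines the operand annotations while preserving the invariant above.

```rust
// Known-bits pair with the invariant known_ones & known_zeros == 0.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Annotation {
    known_ones: u128,
    known_zeros: u128,
}

impl Annotation {
    fn valid(&self) -> bool {
        self.known_ones & self.known_zeros == 0
    }

    // Transfer function for `and %a, %b`:
    fn and(self, rhs: Self) -> Self {
        Annotation {
            known_ones: self.known_ones & rhs.known_ones,    // 1 only if both known 1
            known_zeros: self.known_zeros | rhs.known_zeros, // 0 if either known 0
        }
    }
}

fn main() {
    let a = Annotation { known_ones: 0b1010, known_zeros: 0b0100 };
    let b = Annotation { known_ones: 0b1000, known_zeros: 0b0011 };
    let r = a.and(b);
    assert!(r.valid()); // the invariant is preserved
    assert_eq!(r, Annotation { known_ones: 0b1000, known_zeros: 0b0111 });
    println!("ok");
}
```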
Drawbacks
- Infinite precision integers add complexity to instruction selection. The final stage must derive concrete bit widths from at-use annotations, which is a non-trivial analysis.
- Hierarchical CFG requires control flow structuralization from MIR, which may introduce extra blocks for irreducible control flow.
- At-use annotations on every edge increase memory footprint of the IR.
- Single data structure means late-stage IR carries fields irrelevant to early stages and vice versa, potentially wasting memory.
- Novel design means less existing tooling and literature to draw from compared to traditional SSA CFG.
Rationale and alternatives
Why infinite precision integers?
Two equally important motivations:
- Formal verification: Infinite precision integers are mathematical integers — exactly what theorem provers (Lean 4, Mathlib) natively support. Rewrite rule proofs operate on standard integer arithmetic with induction, algebraic properties, and equivalences available out of the box. Fixed-width integers would require every proof to reason about overflow, wrapping semantics, and sign extension, dramatically increasing proof complexity.
- Pattern recognition: Source languages (especially C) introduce redundant integer conversions due to implicit promotion rules (e.g., C’s integer promotion widens char and short to int). In LLVM IR, these become chains of zext/sext/trunc that obscure the underlying mathematical equivalence. Pattern matching in optimization passes fails to see through the conversion noise, causing missed optimizations. Infinite precision integers eliminate this noise entirely — the mathematical structure is directly visible in the IR.
Together, these make both the proof side and the optimization side simpler and more effective.
Alternative: arbitrary-width integers (LLVM-style). Rejected because they still require zext/sext/trunc and overflow reasoning in every pass and every proof.
Alternative: fixed set of widths (Cranelift-style). Simpler but still carries the same fundamental problems.
Why hierarchical CFG?
Flat CFGs lose loop structure, requiring expensive loop detection passes. Since tuffy aims to perform loop optimizations early, preserving this structure in the IR is essential.
Alternative: flat SSA CFG. Rejected because loop detection is expensive and error-prone.
Alternative: Sea of Nodes. Considered seriously. Advantages for strongly-typed languages (Cliff Click’s arguments), but engineering complexity (scheduling, debugging, cache locality) outweighs benefits. V8’s experience is cautionary even if their problems were partly JS-specific.
Alternative: E-graph. Attractive for formal verification but immature for handling side effects and control flow at scale.
Why four-state bytes?
The Uninit state (separate from Poison) correctly models uninitialized memory as an abstract concept rather than “random bits.” The PtrFragment state preserves pointer provenance through byte-level memory operations like memcpy. Poison at the byte level enables per-byte poison tracking aligned with the byte type design.
Why does ptrtoint capture provenance?
This enables safe inttoptr(ptrtoint(p)) roundtrips while ptrtoaddr provides a provenance-stripping alternative. This avoids the LLVM problem where three individually-correct optimizations produce wrong results when combined (Ralf Jung, 2020).
Prior art
- LLVM IR: SSA CFG with fixed-width integers, undef/poison duality, flat CFG. Tuffy’s IR addresses known LLVM pain points (undef removal, byte type, phase ordering).
- Cranelift CLIF: Single IR with stage-dependent legalization. Tuffy adopts this “single data structure” approach but adds hierarchical CFG and infinite precision integers.
- LLVM VPlan: Hierarchical CFG with SESE regions for vectorization. Inspires tuffy’s region-based structure.
- V8 TurboFan / Turboshaft: Sea of Nodes abandoned in favor of CFG after 10 years. Informs tuffy’s decision to use SSA basic blocks inside regions.
- Cliff Click’s Sea of Nodes: Strong typing (Java, Rust) mitigates many of V8’s problems. Tuffy acknowledges this but prioritizes engineering simplicity.
- Alive2: Poison-only UB model, byte type proposal. Directly informs tuffy’s UB and type design.
- Miri: Abstract pointer model (allocation ID + offset), abstract byte representation. Informs tuffy’s memory model.
Unresolved questions
- Floating point semantics: How should NaN payloads and denormals be handled? Are there opportunities for infinite-precision-like simplification?
- Exception handling: How do Rust panics and unwinding interact with SESE regions?
- Stack slot lifetime and coloring: stack_slot currently takes only a size parameter. To support stack coloring (overlapping non-live slots), lifetime information is needed. The preferred direction is region-based: stack slots belong to a SESE region and are automatically dead when control exits the region. Open question: after lowering to a flat CFG, how is region-based lifetime information preserved? Options include (a) materializing lifetime markers at region boundaries during lowering, or (b) maintaining a region tree alongside the flat CFG. The lifetime.start/end intrinsic approach (LLVM-style) is not preferred.
Future possibilities
- E-graph integration: Equality saturation could be applied as an optimization strategy on subgraphs within the hierarchical CFG, without replacing the core IR representation.
- Superoptimizer: The formal-first design and rewrite path tracing make it natural to integrate solver-based optimization discovery (Souper-style).
- Multi-language frontend: The IR is designed to be language-agnostic. Future frontends for other languages can target the same IR.
- JIT compilation: The single data structure with stage-dependent normalization could support a JIT mode that skips late-stage optimizations for faster compilation.
- Translation validation: Rewrite path traces provide the foundation for end-to-end translation validation of the compilation pipeline.