Virtual Machine¶
This document describes the Lona Virtual Machine (Lona VM), the bytecode execution engine that runs Lonala code within realms. It covers the register-based architecture, bytecode format, execution model, and design decisions that enable future optimizations like JIT compilation.
Scope¶
This document defines the foundational VM architecture. The implementation is incremental:
- Initial types: nil, bool, int, string, symbol, pair (see Value Tags)
- Future types: Added as needed (keywords, floats, collections, capabilities, etc.)
- Future features: Pattern matching, closures, message passing, GC — built on these foundations
The architecture is designed to support future features without breaking changes. The implementation source code is the authoritative reference for what is currently implemented.
Overview¶
The Lona VM is a register-based bytecode interpreter embedded in every realm. It executes compiled Lonala code, schedules lightweight processes, and manages per-process memory.
| Property | Description |
|---|---|
| Architecture | Register-based (not stack-based) |
| Instruction size | Fixed 32-bit (4 bytes) |
| Value size | 16-byte tagged enum (see Value Representation) |
| Registers | X registers (temporaries), Y registers (locals) |
| Dispatch | Direct threading (function pointers) |
| Scheduling | Reduction-based cooperative preemption |
Why a Register-Based VM?¶
Modern high-performance VMs (BEAM, Lua 5.0+, Dalvik/ART) use register-based architectures rather than stack-based ones. The Lona VM follows this approach for several reasons:
| Aspect | Stack-Based | Register-Based |
|---|---|---|
| Instructions executed | More (push/pop overhead) | ~46% fewer |
| Code density | Smaller bytecode | ~26% larger bytecode |
| Dispatch overhead | Higher (more instructions) | Lower (fewer dispatches) |
| JIT compilation | Harder to optimize | Maps naturally to hardware registers |
| Compiler complexity | Simple | Moderate (register allocation) |
Key insight: The reduction in dispatch overhead outweighs the increase in bytecode size. Each instruction fetch, decode, and dispatch has fixed cost; register machines amortize this cost over more work per instruction.
Design Philosophy¶
Principles¶
- Simplicity over cleverness: Clear, understandable design that can be reasoned about
- Performance without sacrificing correctness: Optimize the common case, but never at the cost of correctness
- JIT-friendly from the start: Design decisions should not preclude future JIT compilation
- BEAM-inspired scheduling: Reduction counting for fair, preemptive-cooperative scheduling
- Fixed-width instructions: Predictable memory access patterns, branch-predictor friendly
What We Take From Each System¶
| Source | What We Adopt |
|---|---|
| BEAM | X/Y register split, reduction counting, BIF (intrinsic) system |
| Lua 5.0+ | Fixed 32-bit instruction format, compact operand encoding |
| JVM/Dalvik | Constant pool design, method structure |
| CPython | Simple dispatch loop as baseline, computed gotos for optimization |
Value Representation¶
Values in the VM are represented as 16-byte tagged enums. The current implementation uses a Rust #[repr(u8)] enum with an 8-byte payload, resulting in 16 bytes total (1-byte tag + 7 bytes padding + 8-byte payload).
Future Optimization: A packed 64-bit representation (tag in low bits, 60-bit payload) could reduce memory usage and improve cache performance. This optimization can be added later without changing the VM's semantics.
Current Value Layout¶
16-byte Value (Rust enum)
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ┌────────┬─────────────────────────┬───────────────────────────┐ │
│ │ Tag(1) │ Padding (7 bytes) │ Payload (8 bytes) │ │
│ └────────┴─────────────────────────┴───────────────────────────┘ │
│ │
│ Tag: u8 discriminant (0x0-0xF) │
│ Payload: i64 for integers, Vaddr (u64) for heap pointers │
│ │
└─────────────────────────────────────────────────────────────────────┘
Value Tags¶
The 4-bit tag space supports up to 16 primary types:
| Tag | Type | Description |
|---|---|---|
0x0 |
Nil | The nil value (immediate) |
0x1 |
Bool | Boolean (immediate, payload: 0=false, 1=true) |
0x2 |
Int | Small integer (immediate, 60-bit signed) |
0x3 |
String | UTF-8 string (pointer to heap) |
0x4 |
Symbol | Interned symbol (pointer to heap) |
0x5 |
Pair | Linked list pair with head and rest (pointer to heap) |
0x6 |
Keyword | Interned keyword (pointer to heap) |
0x7 |
Tuple | Fixed-size tuple (pointer to heap) |
0x8 |
Map | Association list map (pointer to heap) |
0x9 |
CompiledFn | Pure function without captures (pointer to heap) |
0xA |
Closure | Function with captured values (pointer to heap) |
0xB |
NativeFn | Intrinsic ID (immediate, 16-bit payload) |
0xC |
Var | VarSlot reference (pointer to code region) |
0xD |
Namespace | Namespace reference (pointer to code region) |
0xE |
Unbound | Sentinel for uninitialized vars (immediate) |
0xF |
Reserved | Future types |
Callable types:
- CompiledFn points to a pure function (no captures) - result of (fn* [x] ...)
- Closure points to a function paired with captured values - result of (fn* [x] (fn* [y] (+ x y)))
- NativeFn is an immediate value containing an intrinsic dispatch ID (0-65535)
- All three return true for the fn? predicate
Note: The complete type system includes additional types (floats, vectors, sets, capabilities, etc.) documented in docs/lonala/data-types.md. Types are added to the VM as needed. The implementation source code (crates/lona-vm/src/value/) is the authoritative reference for currently supported types.
Integer Representation¶
Integers use the full 64-bit i64 range and are stored inline in the Value enum's payload:
Integer Value (inline in 16-byte Value)
┌─────────────────────────────────────────────────────────────────────┐
│ Range: -2^63 to 2^63 - 1 (full i64 range) │
│ │
│ ┌────────┬─────────────────────────┬───────────────────────────┐ │
│ │ 0x2 │ Padding (7 bytes) │ i64 value (8 bytes) │ │
│ └────────┴─────────────────────────┴───────────────────────────┘ │
│ │
│ Operations: Add, subtract, multiply check for overflow. │
│ On overflow: Promote to BigInt (heap-allocated, future). │
└─────────────────────────────────────────────────────────────────────┘
Heap Pointers¶
Heap-allocated values store a Vaddr (virtual address) in the payload. The tag identifies the heap object type:
Pointer Value (in 16-byte Value)
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ┌────────┬─────────────────────────┬───────────────────────────┐ │
│ │ Tag │ Padding (7 bytes) │ Vaddr (8 bytes) │ │
│ └────────┴─────────────────────────┴───────────────────────────┘ │
│ │
│ Heap objects are 8-byte aligned. │
│ Tag determines how to interpret the pointed-to object. │
└─────────────────────────────────────────────────────────────────────┘
Register Architecture¶
The VM uses two classes of registers, inspired by BEAM's design.
X Registers (Temporaries)¶
X registers are general-purpose registers used for: - Function arguments (X0, X1, X2, ...) - Return values (X0) - Intermediate computation results - Temporary storage during expression evaluation
X REGISTERS (Temporaries)
════════════════════════════════════════════════════════════════════════
┌────┬────┬────┬────┬────┬────┬────┬────┬─────────┐
│ X0 │ X1 │ X2 │ X3 │ X4 │ X5 │ X6 │ X7 │ ... X255│
└────┴────┴────┴────┴────┴────┴────┴────┴─────────┘
│ │ │
│ │ └── Third argument / temp
│ └─────── Second argument / temp
└──────────── First argument / return value
Properties:
- NOT preserved across function calls
- Caller-save semantics
- 256 registers available (8-bit addressing)
- Stored in process state, not on stack
Y Registers (Locals)¶
Y registers are stack-frame-local variables that persist across function calls:
Y REGISTERS (Stack Frame Locals)
════════════════════════════════════════════════════════════════════════
Stack frame for function with 3 locals:
┌─────────────────────────────────────────────────────────────────────┐
│ Return Address (continuation pointer) │
├─────────────────────────────────────────────────────────────────────┤
│ Y0: Local variable 0 │
├─────────────────────────────────────────────────────────────────────┤
│ Y1: Local variable 1 │
├─────────────────────────────────────────────────────────────────────┤
│ Y2: Local variable 2 │
└─────────────────────────────────────────────────────────────────────┘
Properties:
- Preserved across function calls (callee-save semantics)
- Allocated per stack frame
- Y register count known at compile time for each function
- Addressed relative to frame pointer
Register Allocation Strategy¶
The compiler allocates registers as follows:
- Arguments: Passed in X0, X1, X2, ... (up to limit, then spill)
- Return value: Always in X0
- Temporaries: Allocated from available X registers during expression evaluation
- Local variables: Assigned to Y registers if they survive function calls
- Let bindings: Use X registers for short-lived bindings, Y for long-lived
EXAMPLE: Compiling (let [a (+ 1 2)] (foo a a))
════════════════════════════════════════════════════════════════════════
; 'a' is used twice after a call, so it needs Y register
ALLOCATE 1 ; Reserve Y0 for 'a'
ADD X0, 1, 2 ; X0 = 1 + 2
MOVE Y0, X0 ; Y0 = X0 (save 'a' in local)
MOVE X0, Y0 ; First arg = a
MOVE X1, Y0 ; Second arg = a
CALL foo/2 ; Call foo with 2 args
DEALLOCATE 1 ; Release stack frame
RETURN ; Return X0
Bytecode Format¶
Instructions are fixed 32-bit words with several encoding formats.
Instruction Encoding¶
32-BIT INSTRUCTION FORMATS
════════════════════════════════════════════════════════════════════════
Format A: Three operands (arithmetic, comparisons)
┌────────┬────────┬─────────┬─────────┐
│ Opcode │ A │ B │ C │
│ 6 bits │ 8 bits │ 9 bits │ 9 bits │
└────────┴────────┴─────────┴─────────┘
Format B: Two operands + unsigned immediate (loads, jumps)
┌────────┬────────┬───────────────────┐
│ Opcode │ A │ Bx │
│ 6 bits │ 8 bits │ 18 bits │
└────────┴────────┴───────────────────┘
Format C: Two operands + signed immediate (relative jumps)
┌────────┬────────┬───────────────────┐
│ Opcode │ A │ sBx │
│ 6 bits │ 8 bits │ 18 bits (signed) │
└────────┴────────┴───────────────────┘
Format D: One operand + large immediate (extended)
┌────────┬────────────────────────────┐
│ Opcode │ Ax │
│ 6 bits │ 26 bits │
└────────┴────────────────────────────┘
Operand Encoding¶
The B and C fields use 9 bits each, allowing: - Register reference (bit 8 = 0): Register index 0-255 - Constant reference (bit 8 = 1): Constant pool index 0-255
OPERAND ENCODING (9 bits)
════════════════════════════════════════════════════════════════════════
┌───────────┬──────────────────────────────────────────────────────────┐
│ Bit 8 │ Interpretation │
├───────────┼──────────────────────────────────────────────────────────┤
│ 0 │ Register: bits 0-7 = register index (0-255) │
│ 1 │ Constant: bits 0-7 = constant pool index (0-255) │
└───────────┴──────────────────────────────────────────────────────────┘
This encoding is called RK (Register or Konstant) in Lua terminology.
Example:
ADD X0, X1, K5 ; X0 = X1 + constants[5]
Encodes as:
- Opcode: ADD
- A: 0 (X0)
- B: 1 (X1, bit 8 = 0)
- C: 261 (5 + 256, bit 8 = 1)
X vs Y Register Encoding¶
The VM uses context-dependent register class encoding (like BEAM). The opcode determines which register class is used, not the operand encoding:
CONTEXT-DEPENDENT REGISTER CLASSES
════════════════════════════════════════════════════════════════════════
X registers (temporaries):
- Used by: arithmetic, comparison, function arguments, return values
- Instructions: ADD, SUB, MUL, MOVE, CALL, RETURN, INTRINSIC, etc.
- All general computation uses X registers
Y registers (locals):
- Used by: stack frame allocation, local variable storage
- Instructions: ALLOCATE, DEALLOCATE, and dedicated move instructions
Transfer between classes:
MOVE_XY A, B Y(A) := X(B) ; Save X to Y (before call)
MOVE_YX A, B X(A) := Y(B) ; Restore Y to X (after call)
This approach:
- Matches BEAM's semantic model (X for temps, Y for preserved locals)
- No wasted bits in instruction encoding
- Semantically clear: arithmetic never touches Y directly
- Full 256 registers available in each class
Why not encode register class in the operand? Encoding both class and index in 8 bits would limit each class to 128 registers. The context-dependent approach provides 256 X registers (ample for expression evaluation) and 256 Y registers (ample for local variables) without encoding overhead.
Why Fixed 32-bit Instructions?¶
| Benefit | Description |
|---|---|
| Predictable fetch | Always read 4 bytes, no length decoding |
| Aligned access | No unaligned memory reads |
| Branch prediction | Fixed stride aids CPU prediction |
| JIT friendly | Easy to map to native instructions |
| Simplicity | No variable-length encoding complexity |
The trade-off is slightly larger bytecode (~26% vs variable-length), but the performance benefits from predictable memory access patterns outweigh this cost.
Instruction Set¶
Instruction Categories¶
| Category | Purpose | Examples |
|---|---|---|
| Data Movement | Move values between registers/constants | MOVE, LOADK, LOADNIL |
| Arithmetic | Integer and float operations | ADD, SUB, MUL, DIV, MOD |
| Comparison | Value comparison, type checking | EQ, LT, LE, TYPE |
| Logic | Boolean operations | NOT, AND, OR |
| Control Flow | Branching and loops | JMP, JMPIF, JMPIFNOT |
| Function Calls | Calling and returning | CALL, TAILCALL, RETURN |
| Intrinsics | Built-in function dispatch | INTRINSIC |
| Stack Frame | Local variable management | ALLOCATE, DEALLOCATE |
| Collections | Tuple/vector/map operations | TUPLE, GET, PUT |
| Process | Spawn, send, receive | SPAWN, SEND, RECV |
Core Instructions¶
CORE INSTRUCTION SET
════════════════════════════════════════════════════════════════════════
Data Movement:
MOVE A, B R(A) := R(B)
LOADK A, Bx R(A) := K(Bx)
LOADNIL A, B R(A), R(A+1), ..., R(A+B) := nil
LOADBOOL A, B, C R(A) := (B != 0); if C, skip next
Arithmetic (all support RK operands):
ADD A, B, C R(A) := RK(B) + RK(C)
SUB A, B, C R(A) := RK(B) - RK(C)
MUL A, B, C R(A) := RK(B) * RK(C)
DIV A, B, C R(A) := RK(B) / RK(C)
MOD A, B, C R(A) := RK(B) % RK(C)
NEG A, B R(A) := -R(B)
Comparison (result in condition flag, used by following JMP):
EQ A, B, C if (RK(B) == RK(C)) != A then skip next
LT A, B, C if (RK(B) < RK(C)) != A then skip next
LE A, B, C if (RK(B) <= RK(C)) != A then skip next
Control Flow:
JMP sBx PC += sBx
JMPIF A, sBx if R(A) is truthy then PC += sBx
JMPIFNOT A, sBx if R(A) is falsy then PC += sBx
Function Calls:
CALL A, B, C R(A), ..., R(A+C-2) := R(A)(R(A+1), ..., R(A+B-1))
TAILCALL A, B return R(A)(R(A+1), ..., R(A+B-1))
RETURN A, B return R(A), ..., R(A+B-2)
Stack Frame:
ALLOCATE A Allocate A stack slots for Y registers
DEALLOCATE A Deallocate A stack slots
Y Register Transfer:
MOVE_XY A, B Y(A) := X(B) ; Save to local
MOVE_YX A, B X(A) := Y(B) ; Restore from local
Intrinsics:
INTRINSIC A, B X0 := intrinsic(A)(X1, ..., X(B))
Instruction Execution Cost¶
Each instruction has an associated reduction cost for scheduling fairness:
| Instruction Type | Reductions | Rationale |
|---|---|---|
| Simple (MOVE, LOADK) | 1 | Basic data movement |
| Arithmetic | 1 | Single operation |
| Comparison | 1 | Single comparison |
| Jump | 1 | Branch |
| Function call (CALL) | 1 | CALL instruction only (per BEAM) |
| Intrinsic | 1-10 | Depends on complexity (see below) |
| Collection build | 1 + size/8 | Scales with element count |
| Namespace/var ops | 10 | Heap and registry access |
Intrinsic costs by category:
| Category | Cost | Examples |
|---|---|---|
| Predicates, arithmetic, comparison | 1 | +, -, =, nil?, integer? |
| Simple sequence ops | 2 | count, first, empty? |
| Collection access, strings | 3 | get, nth, rest, str, name |
| Collection mutation | 5 | put |
| Namespace/var operations | 10 | intern, var-get, def-root |
Constant Pool¶
Each compiled function (chunk) has a constant pool storing values that don't fit in instruction immediates.
Constant Pool Contents¶
| Entry Type | Description |
|---|---|
| Large integers | Integers > 60 bits |
| Floats | All floating-point numbers |
| Strings | String literals |
| Symbols | Symbol literals (keywords, identifiers) |
| Function prototypes | Nested function definitions |
Constant Pool Structure¶
CONSTANT POOL
════════════════════════════════════════════════════════════════════════
Chunk {
code: [u32; N], // Instruction array
constants: [Value; M], // Constant pool
...
}
Accessing constant K5:
LOADK X0, 5 ; X0 = constants[5]
Constants are pre-allocated on the heap when the chunk is loaded.
References from instructions are indices into this array.
String Interning¶
Identical string literals within a chunk share the same constant pool entry. Symbol literals are globally interned across the entire realm for fast equality comparison.
Execution Model¶
The Dispatch Loop¶
The VM executes instructions in a tight loop:
DISPATCH LOOP (Conceptual)
════════════════════════════════════════════════════════════════════════
fn execute(process: &mut Process) -> RunResult {
loop {
// Check reduction budget
if process.reductions == 0 {
return RunResult::Yielded;
}
// Fetch instruction
let instruction = process.chunk.code[process.pc];
process.pc += 1;
// Decode and execute, getting reduction cost
let cost = match decode_opcode(instruction) {
OP_MOVE => {
process.reg[decode_a(instruction)] = process.reg[decode_b(instruction)];
1 // Simple ops cost 1
}
OP_INTRINSIC => {
let id = decode_a(instruction);
call_intrinsic(id, process)?;
intrinsic_cost(id) // Variable cost: 1-10
}
OP_CALL => {
// Push frame, set up callee
push_call_frame(process);
1 // CALL itself costs 1 (per BEAM)
}
OP_BUILD_TUPLE => {
let count = decode_c(instruction);
build_tuple(process, count)?;
1 + count / 8 // Scales with size
}
OP_RETURN => {
if !pop_call_frame(process) {
return RunResult::Completed(process.reg[0]);
}
1
}
// ... other opcodes
};
// Consume reductions
process.reductions = process.reductions.saturating_sub(cost);
}
}
Dispatch Optimization: Direct Threading¶
For better performance, the VM uses direct threading where each opcode is replaced with a pointer to its handler:
DIRECT THREADING
════════════════════════════════════════════════════════════════════════
Traditional switch dispatch:
fetch → decode → switch → execute → loop back
Direct threading:
fetch → jump to handler → execute → jump to next handler
Implementation (using computed goto):
// Handler table
static HANDLERS: [fn(); 64] = [handle_move, handle_add, ...];
// Dispatch
let handler = HANDLERS[opcode];
handler(); // Handler ends with: goto *HANDLERS[next_opcode]
Benefits:
- Eliminates switch overhead
- Better branch prediction
- ~15-25% faster than switch dispatch
Reduction Counting¶
Each instruction consumes reductions based on its complexity (see Instruction Execution Cost). When the budget is exhausted, the process yields:
REDUCTION COUNTING
════════════════════════════════════════════════════════════════════════
const MAX_REDUCTIONS: u32 = 2000; // ~500µs time slice
Process execution:
1. Set reductions = MAX_REDUCTIONS
2. Execute instruction, consume its cost from reductions
3. When reductions == 0, yield to scheduler
4. Scheduler picks next process, resets its budget
5. Repeat
This ensures:
- Fair scheduling among processes
- No process can monopolize CPU
- Bounded latency for other processes
- Expensive operations (intrinsics, collection builds) cost more
- Works with cooperative yielding (receive, etc.)
Intrinsics (Built-in Functions)¶
Intrinsics are built-in operations implemented in the VM runtime, not in bytecode.
Intrinsic Design¶
INTRINSIC SYSTEM
════════════════════════════════════════════════════════════════════════
Intrinsics provide:
- Performance-critical operations (arithmetic on heap values)
- Operations that can't be expressed in Lonala (I/O, process ops)
- Type-specific optimized paths
Intrinsic invocation uses a fixed calling convention:
INTRINSIC A, B ; X0 := intrinsic(A)(X1, X2, ..., X(B))
Where:
- A = intrinsic ID (0-255)
- B = argument count
Arguments: X1, X2, ..., X(B)
Result: always in X0
This fixed convention:
- Matches function call semantics (args in X regs, result in X0)
- Simplifies dispatch (no destination register to decode)
- Compiler places args in X1+, result is always in X0
Intrinsic Categories¶
| Category | Intrinsics | Description |
|---|---|---|
| Arithmetic | +, -, *, /, mod |
Numeric operations |
| Comparison | =, <, >, <=, >= |
Value comparison |
| Type | nil?, int?, str?, type |
Type predicates |
| String | str, str-len, str-concat |
String operations |
| Collection | get, put, count, conj |
Collection ops |
| Process | spawn, send, self, exit |
Process management |
| I/O | read, write, open, close |
Input/output |
Intrinsic Dispatch¶
Intrinsics are dispatched through a function pointer table for O(1) lookup:
INTRINSIC DISPATCH
════════════════════════════════════════════════════════════════════════
type IntrinsicFn = fn(&mut Process, argc: u8) -> Result<Value, Error>;
static INTRINSICS: [IntrinsicFn; 256] = [
intrinsic_add, // ID 0
intrinsic_sub, // ID 1
intrinsic_mul, // ID 2
// ...
];
fn dispatch_intrinsic(process: &mut Process, id: u8, argc: u8) -> Result<(), Error> {
let handler = INTRINSICS[id as usize];
// Args are in X1, X2, ..., X(argc) - intrinsic reads them directly
let result = handler(process, argc)?;
// Result always goes to X0 (fixed calling convention)
process.x_regs[0] = result;
Ok(())
}
Example compilation of (+ a b):
MOVE X1, <a> ; First arg
MOVE X2, <b> ; Second arg
INTRINSIC ADD, 2 ; X0 = intrinsic_add(X1, X2)
; Result now in X0
Stack Frames and Calling Convention¶
Stack frames are stored in the process's stack region (grows down from hend). Each frame has a fixed-size header followed by Y registers for local variables. See crates/lona-vm/src/process/mod.rs for the detailed implementation.
Stack Layout¶
STACK FRAME LAYOUT
════════════════════════════════════════════════════════════════════════
Stack grows downward (high to low addresses)
┌─────────────────────────────────────────────────────────────────────┐
│ Caller's Frame │
├─────────────────────────────────────────────────────────────────────┤
│ Return Address (PC to resume after call) │
├─────────────────────────────────────────────────────────────────────┤
│ Previous Frame Pointer │
├─────────────────────────────────────────────────────────────────────┤
│ Y0 (first local) │
├─────────────────────────────────────────────────────────────────────┤
│ Y1 (second local) │
├─────────────────────────────────────────────────────────────────────┤
│ Y2 (third local) │
├─────────────────────────────────────────────────────────────────────┤
│ ... more locals ... │
├─────────────────────────────────────────────────────────────────────┤
│ ← Frame Pointer (FP) points here │
│ │
│ ← Stack Pointer (SP) points to next free slot │
└─────────────────────────────────────────────────────────────────────┘
Calling Convention¶
CALLING CONVENTION
════════════════════════════════════════════════════════════════════════
1. CALLER prepares call:
- Place arguments in X0, X1, X2, ...
- Execute CALL instruction
2. CALL instruction:
- Push return address
- Push frame pointer
- Set new frame pointer
- Jump to callee
3. CALLEE entry:
- ALLOCATE N (reserve N slots for Y registers)
- Execute function body
4. CALLEE exit:
- Place return value in X0
- DEALLOCATE N (release Y register slots)
- RETURN (restore FP, jump to return address)
5. CALLER resumes:
- Return value is in X0
- Continue execution
Tail Call Optimization¶
Tail calls reuse the current stack frame:
TAIL CALL OPTIMIZATION
════════════════════════════════════════════════════════════════════════
Regular call:
CALL f ; Pushes new frame, call f, returns here
RETURN ; Returns to our caller
Tail call:
TAILCALL f ; Reuses current frame, jumps to f
; f's RETURN goes directly to our caller
The TAILCALL instruction:
1. Deallocates current frame
2. Moves arguments to correct positions
3. Jumps to target function
4. Target's RETURN returns to original caller
This enables efficient recursion without stack growth.
Memory Model¶
For the complete process memory model including garbage collection, heap fragments, and per-worker allocators, see Process Model.
Process Memory Overview¶
Each process has two memory blocks following the BEAM model:
PROCESS MEMORY MODEL
════════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────────┐
│ X Registers (in process struct, not on heap) │
│ ┌────┬────┬────┬────┬────┬────────────────┐ │
│ │ X0 │ X1 │ X2 │ X3 │ X4 │ ... X255 │ │
│ └────┴────┴────┴────┴────┴────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ Young Heap (stack + young objects, single contiguous block) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ STACK (grows down) │ FREE │ YOUNG HEAP (grows up) │ │
│ │ [Frame2][Frame1][F0]◄─┼────────┼─►[string][tuple][pair] │ │
│ │ stop │ │ htop │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ Out of memory when htop >= stop → triggers Minor GC │
├─────────────────────────────────────────────────────────────────────┤
│ Old Heap (promoted objects, separate contiguous block) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ [promoted][promoted][promoted] │ FREE │ │
│ │ Objects that survived Minor GC │◄─ old_htop │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ Collected only during Major GC (fullsweep) │
├─────────────────────────────────────────────────────────────────────┤
│ Mailbox (MPSC queue, messages on receiver's heap) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ [Msg1] → [Msg2] → [Msg3] → nil │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Heap Allocation¶
The young heap uses a bump allocator. Allocation is O(1):
HEAP ALLOCATION (Young Heap)
════════════════════════════════════════════════════════════════════════
fn alloc(heap, size, align) -> Vaddr {
let new_htop = align_up(heap.htop + size, align);
if new_htop >= heap.stop {
trigger_minor_gc();
// Retry after GC - young heap is now empty
new_htop = align_up(heap.htop + size, align);
}
let ptr = heap.htop;
heap.htop = new_htop;
ptr
}
Generational GC:
- Minor GC: Promotes live young objects to old heap, resets young heap
- Major GC: Collects both heaps, compacts all live data
Compiled Code Structure¶
Chunk (Compiled Function)¶
CHUNK STRUCTURE
════════════════════════════════════════════════════════════════════════
struct Chunk {
// Metadata
name: Symbol, // Function name (for debugging)
arity: u8, // Number of parameters
num_locals: u8, // Number of Y registers needed
is_variadic: bool, // Accepts variable arguments?
// Code
code: Vec<u32>, // Instruction array
constants: Vec<Value>, // Constant pool
// Debug info (optional)
line_numbers: Vec<u32>, // Source line for each instruction
local_names: Vec<Symbol>, // Names of local variables
}
Module (Compiled Namespace)¶
MODULE STRUCTURE
════════════════════════════════════════════════════════════════════════
struct Module {
name: Symbol, // Namespace name (e.g., 'lona.core)
chunks: Vec<Chunk>, // Compiled functions
vars: HashMap<Symbol, Value>, // Module-level var bindings
imports: Vec<Symbol>, // Required namespaces
}
Future Considerations¶
JIT Compilation Path¶
The VM architecture supports future JIT compilation:
JIT COMPILATION PATH
════════════════════════════════════════════════════════════════════════
Tiered compilation strategy:
Tier 0: Interpreter
- All code starts here
- Profile execution (hot paths, type information)
- Low startup cost
Tier 1: Baseline JIT
- Compile hot functions to native code
- No optimization, fast compilation
- ~5-10x interpreter speedup
Tier 2: Optimizing JIT
- Compile very hot functions with optimizations
- Type specialization, inlining, dead code elimination
- ~50-100x interpreter speedup
Register-based design benefits:
- X/Y registers map to hardware registers
- No stack manipulation overhead
- Straightforward instruction selection
Speculative Optimization¶
Type information gathered during interpretation enables speculative optimization:
SPECULATIVE OPTIMIZATION
════════════════════════════════════════════════════════════════════════
Type profiling during interpretation:
- Track types seen for each operation
- Record branch taken frequencies
- Identify monomorphic call sites
Optimization opportunities:
- Inline small integer arithmetic (skip overflow check)
- Specialize polymorphic calls to monomorphic
- Eliminate redundant type checks
- Inline frequently called functions
Deoptimization:
- Guard fails → return to interpreter
- Recompile with updated type information
On-Stack Replacement (OSR)¶
OSR enables switching from interpreted to compiled code mid-execution:
ON-STACK REPLACEMENT
════════════════════════════════════════════════════════════════════════
Hot loop detection:
- Count loop iterations
- When threshold exceeded, compile loop
OSR entry:
1. Compile loop with current variable values
2. At next loop iteration check:
- If compiled version ready, transfer execution
- Map interpreter state to compiled state
- Continue in compiled code
OSR exit (deoptimization):
1. Guard fails in compiled code
2. Map compiled state back to interpreter state
3. Continue in interpreter
Summary¶
| Aspect | Design Choice | Rationale |
|---|---|---|
| Architecture | Register-based | Fewer instructions, JIT-friendly |
| Instruction size | Fixed 32-bit | Predictable fetch, aligned access |
| Value representation | Tagged 64-bit | Inline small integers, fast type checks |
| Register classes | X (temp) + Y (local) | Efficient calling convention |
| Dispatch | Direct threading | Low overhead, good branch prediction |
| Scheduling | Reduction counting | Fair, bounded latency |
| Intrinsics | Function pointer table | O(1) dispatch, extensible |
| Memory | Per-process heap | Independent GC, no shared state |
The Lona VM is designed to be simple enough to implement correctly, yet structured to allow future optimizations including JIT compilation. Every design decision considers both immediate implementation needs and long-term performance goals.