Architecture¶

This is the architecture reference for Lona, a capability-secure operating system built on the seL4 microkernel. It covers the design philosophy, security model, memory architecture, and system structure.

Terminology¶

Before diving into the architecture, here are the core concepts:

Term	Definition
Lona	The operating system. Combines seL4's security with BEAM-style concurrency.
Lonala	Lona's LISP dialect. Clojure-inspired syntax with BEAM-style concurrency semantics.
Realm	The primary security isolation boundary. Each realm is a completely isolated execution environment that can safely run untrusted code without compromising the rest of the system. Realms form a tree hierarchy, and each realm can host millions of lightweight processes. Typical examples: a driver runs in its own realm, an application runs in its own realm, a plugin runs in its own realm.
Process	A lightweight execution unit within a realm, similar to Erlang/BEAM processes. Processes have their own heap and mailbox. Millions can exist per realm. NOT a security boundary - processes share trust with their realm.
Lona Memory Manager	The minimal privileged component that manages system resources (memory, CPU budgets, realm lifecycle). Contains no Lonala code - just resource management and fault handling.
Lona VM	The virtual machine that runs Lonala code, mapped into every realm. Interprets bytecode, schedules processes, handles garbage collection.
seL4	The microkernel underlying Lona. Provides the low-level mechanisms (capabilities, address spaces, scheduling) that make realm isolation possible.

Key insight: Realms provide security isolation (hardware-enforced, cannot be bypassed). Processes provide concurrency and fault tolerance within a realm. Running untrusted code requires a dedicated realm, not just a separate process.

Design Philosophy¶

Core Design Principles¶

1. Untrusted by Default¶

All realms are treated as potentially compromised. Resource limits are kernel-enforced, requiring no cooperation from untrusted code. The system must:

Prevent resource exhaustion attacks (CPU, memory)
Prevent capability leakage
Isolate faults to single realms
Enforce access control on shared resources

2. Cheap Processes, Expensive Realms¶

Process creation is microseconds (pure userspace). Realm creation is milliseconds (kernel objects, page tables). Use realms for security boundaries, processes for concurrency.

Operation	Cost	Use Case
Process spawn	~1-10 µs	Workers, handlers, concurrent tasks
Realm creation	~1-10 ms	Security isolation, untrusted code, drivers

3. BEAM-Style Message Passing¶

Messages between processes are always deep-copied (BEAM semantics). This enables independent per-process garbage collection and prevents shared-state bugs. Large binaries (≥64 bytes) use reference counting via a realm-wide binary pool to avoid copying bulk data. Between realms, messages are serialized for IPC.

4. Late Binding for Live Updates¶

Clojure-style vars enable code updates that propagate automatically to child realms without restart. Parent updates a var binding, children see the new value immediately through shared memory mappings.

5. Policy Compiled to Kernel Mechanisms¶

Resource policies are "compiled" into seL4 scheduling contexts and capabilities. No userspace code in the enforcement hot path - the kernel enforces CPU budgets and memory limits directly.

What We Take From Each System¶

Source	What We Adopt
seL4	Capabilities, address space isolation, MCS scheduling, security-focused microkernel design
BEAM	Lightweight processes, per-process heaps, reduction-based scheduling, mailboxes, immutable/persistent data structures
Clojure	Vars, namespaces, atomic updates, rich literal syntax (tuples `[]`, vectors `{}`, sets `#{}`, maps `%{}`), using data literals for function parameters
LISP	Homoiconicity, REPL-driven development, runtime code loading

Design Rationale¶

Why seL4?

Capability security: No ambient authority - all access must be explicitly granted
Minimal TCB: Small kernel, most code in userspace, reduced attack surface
Security-focused design: Developed with formal verification methodology, resulting in high code quality
Performance: Fast IPC and context switching

Why BEAM-style Processes?

Isolation: Per-process heaps eliminate shared-state bugs
Scalability: Millions of lightweight processes
Fault tolerance: Process crashes don't affect others
Soft real-time: Per-process GC, no stop-the-world pauses

Why Clojure-style Vars?

Live updates: Var indirection enables code changes without restart
Consistency: Atomic namespace updates prevent partial states
Hierarchy: Natural code sharing from parent to child realms
Rich literals: Expressive data notation for configuration and messages

Why Hierarchical Realms?

Least authority: Each realm has minimal capabilities
Defense in depth: Multiple isolation boundaries
Resource accounting: Hierarchical budgets (children share parent's allocation)
Composability: Build systems from isolated components

Note on seL4 Formal Verification¶

seL4 is formally verified only in specific configurations (e.g., single-processor, specific platforms). Multi-processor configurations with MCS scheduling, which Lona targets, are NOT formally verified.

We choose seL4 for its strong security foundations and the code quality that formal verification methodology brings, but formal verification does not apply to our configuration and is not a goal for this system.

Hardware Requirements¶

Lona requires specific hardware capabilities for operation and security. See Supported Hardware for the full platform support matrix.

Requirement	x86_64	aarch64	Mandatory	Purpose
64-bit CPU	Yes	Yes	Yes	Address space layout
MMU	Yes	Yes	Yes	VSpace isolation
IOMMU	Intel VT-d	ARM SMMU	No*	DMA isolation
Timer	APIC	GIC	Yes	MCS scheduling

IOMMU and Driver Security¶

IOMMU is required for full security isolation of device drivers.

Without IOMMU, a device can DMA to arbitrary physical memory, bypassing all VSpace/CSpace isolation. This means:

With IOMMU: Driver realms are fully isolated. A compromised driver cannot access memory outside its allocated DMA regions. Drivers can be untrusted.
Without IOMMU: Driver realms must be trusted. They become part of the Trusted Computing Base (TCB). A compromised driver can read or write any physical memory.

At boot, Lona detects IOMMU availability:

If present: IOMMU is configured to restrict each device to its allocated DMA regions. Log: IOMMU enabled, DMA isolation active
If absent: Warning logged: WARNING: No IOMMU detected. Driver realms are TRUSTED. DMA isolation disabled.

See DMA Isolation for technical details.

Security Model¶

Threat Model¶

We assume any realm may be compromised. A compromised realm may:

Attempt to exhaust CPU or memory
Try to access memory it shouldn't have
Send malformed or excessive IPC messages
Attempt to forge capabilities or identities
Crash repeatedly to disrupt the system

The system must remain stable and protect other realms even when one realm is actively malicious.

Defense Mechanisms¶

Threat	Defense	Enforcement
CPU exhaustion	MCS scheduling with per-realm budgets	seL4 kernel
Memory exhaustion	Memory quotas per realm	Lona Memory Manager*
Unauthorized memory access	Separate address spaces per realm	seL4 kernel + MMU
Capability theft/forgery	Capability system, per-realm endpoints	seL4 kernel
IPC flooding	Fault rate limiting	Lona Memory Manager
Crash loops	Supervisor policies, restart limits	Lona VM
Code injection	W^X memory mappings (no RWX)	seL4 kernel
DMA attacks	IOMMU restricts device memory access	Hardware IOMMU**

*Memory quotas are enforced by the Memory Manager through capability partitioning, not directly by the kernel.

**IOMMU required. Without IOMMU, driver realms must be trusted. See Hardware Requirements.

Trust and Authority Direction¶

Realms form a tree hierarchy. Authority flows downward from parent to child:

Lona Memory Manager (trusted, manages all realms)
    │
    └── Init Realm (first realm, high trust)
            │
            ├── App Realm A (trusted)
            │       │
            │       └── Plugin Realm (untrusted) ← OK: untrusted child
            │
            └── App Realm B (trusted)

Parent authority over children:

Parents can update inherited code (children see changes via live sharing)
Parents can revoke child capabilities
Parents control child resource budgets

Trust implications:

Untrusted child under trusted parent: SAFE. Parent controls child; child cannot harm parent.
Trusted child under untrusted parent: UNSAFE. Untrusted parent can compromise trusted child through inheritance.

Rule: Encapsulate untrusted code in dedicated child realms. Never run trusted code as a child of untrusted code.

Process Isolation vs Realm Isolation¶

Aspect	Process Isolation	Realm Isolation
Enforced by	Lona VM (userspace)	seL4 kernel (hardware)
Address space	Shared within realm	Separate per realm
Purpose	Concurrency, fault tolerance	Security
Trust level	Same trust as realm	Independent trust
Bypass possible?	Yes (bugs, malicious code)	No (kernel + MMU enforced)

Processes within a realm share an address space. A bug or exploit in one process could theoretically access another process's memory. Realm isolation is hardware-enforced and cannot be bypassed by userspace code.

Architecture Summary¶

┌─────────────────────────────────────────────────────────────────────┐
│                           seL4 Kernel                               │
│                    (capabilities, scheduling, IPC)                  │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                    ┌─────────────┴─────────────┐
                    ▼                           │
       ┌──────────────────────┐                 │
       │ Lona Memory Manager  │                 │
       │                      │                 │
       │   Resource mgmt      │                 │ schedules
       │   Fault handling     │                 │
       │   Realm lifecycle    │                 │
       └──────────┬───────────┘                 │
                  │                             │
                  │ creates realms              │
                  ▼                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            REALMS                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │   Init Realm    │  │   App Realm     │  │  Driver Realm   │      │
│  │                 │  │                 │  │                 │      │
│  │  Lona VM        │  │  Lona VM        │  │  Lona VM        │      │
│  │       │         │  │       │         │  │       │         │      │
│  │       ▼         │  │       ▼         │  │       ▼         │      │
│  │  ┌─┬─┬─┐        │  │  ┌─┬─┬─┐        │  │  ┌─┬─┬─┐        │      │
│  │  │P│P│P│        │  │  │P│P│P│        │  │  │P│P│P│        │      │
│  │  └─┴─┴─┘        │  │  └─┴─┴─┘        │  │  └─┴─┴─┘        │      │
│  │  Processes      │  │  Processes      │  │  Processes      │      │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘      │
└─────────────────────────────────────────────────────────────────────┘

Document Overview¶

Document	Description
Memory Fundamentals	Physical memory, MMU, address translation, and seL4's memory model
System Architecture	Lona Memory Manager, realms, threads, bootstrapping, IPC, and resource management
Realm Memory Layout	Address space structure, inherited regions, vars, and code compilation
Term Representation	Tagged words, immediate values, heap object headers, and memory layout
Process Model	Processes, scheduling, message passing, and memory layout
Garbage Collection	Per-process generational GC, heap growth, large binaries, and differences from BEAM
Device Drivers	Driver isolation, zero-copy I/O, DMA, and interrupt handling

Key Concepts¶

Detailed definitions for reference:

Term	Definition
Lona Memory Manager	Minimal privileged component managing resources. No Lonala code.
Lona VM	Virtual machine running Lonala code, mapped into all realms
Realm	Security boundary with its own address space, capability space, and CPU budget
Worker	Kernel thread (TCB) running the Lona VM within a realm
Process	Lightweight Lonala execution unit, multiplexed by the VM
Inherited Region	Parent realm's code/data mapped read-only into child at fixed address
Realm Endpoint	Per-realm IPC endpoint for communication with Lona Memory Manager (unforgeable identity)

seL4-specific terms:

Term	Definition
VSpace	seL4's virtual address space object - each realm has its own
CSpace	seL4's capability space - stores capabilities a realm holds
Untyped	Raw physical memory that can be retyped into kernel objects
MCS	Mixed Criticality Scheduling - seL4's scheduling model with CPU budgets
TCB	Thread Control Block - seL4's kernel object representing a thread

Scope and Precision¶

This documentation captures architectural decisions and design discussions. Some aspects are precisely defined, others are deliberately left open for future refinement:

Precisely Defined:

Separation between Lona Memory Manager and Lona VM binaries
Two-level memory management (realms via seL4, processes via VM)
Inherited code regions with fixed virtual addresses
Process memory: single contiguous block per process (stack + heap growing toward each other, BEAM-style)
Per-worker allocator instances for lock-free allocation
Heap growth via reallocation (Fibonacci then 20% increments)
Per-process garbage collection (no global pauses)
Per-realm endpoints for IPC identity (not badges)
Fixed virtual addresses for shared code (same address in all realms)
IPC buffer location within worker stacks region
Live sharing semantics for inherited regions (not snapshots)
Var shadowing: local definitions override inherited
Per-realm fault rate limiting for DoS protection
Region table with strict permission enforcement
Lazy frame mapping for inherited regions

Deliberately Open:

Exact virtual address assignments (specific numbers are illustrative)
Specific segment sizes and limits
Multi-worker (multi-TCB) scheduling within realms
Detailed IPC protocol message formats
Fault rate limit thresholds (configurable)

Known Limitations (Future Work):

Code region GC: append-only model causes unbounded growth, GC deferred to later
Single-threaded fault handler: no thread pool, rate limiting only