Skip to content

Architecture

This is the architecture reference for Lona, a capability-secure operating system built on the seL4 microkernel. It covers the design philosophy, security model, memory architecture, and system structure.


Terminology

Before diving into the architecture, here are the core concepts:

Term Definition
Lona The operating system. Combines seL4's security with BEAM-style concurrency.
Lonala Lona's LISP dialect. Clojure-inspired syntax with BEAM-style concurrency semantics.
Realm The primary security isolation boundary. Each realm is a completely isolated execution environment that can safely run untrusted code without compromising the rest of the system. Realms form a tree hierarchy, and each realm can host millions of lightweight processes. Typical examples: a driver runs in its own realm, an application runs in its own realm, a plugin runs in its own realm.
Process A lightweight execution unit within a realm, similar to Erlang/BEAM processes. Processes have their own heap and mailbox. Millions can exist per realm. NOT a security boundary - processes share trust with their realm.
Lona Memory Manager The minimal privileged component that manages system resources (memory, CPU budgets, realm lifecycle). Contains no Lonala code - just resource management and fault handling.
Lona VM The virtual machine that runs Lonala code, mapped into every realm. Interprets bytecode, schedules processes, handles garbage collection.
seL4 The microkernel underlying Lona. Provides the low-level mechanisms (capabilities, address spaces, scheduling) that make realm isolation possible.

Key insight: Realms provide security isolation (hardware-enforced, cannot be bypassed). Processes provide concurrency and fault tolerance within a realm. Running untrusted code requires a dedicated realm, not just a separate process.


Design Philosophy

Core Design Principles

1. Untrusted by Default

All realms are treated as potentially compromised. Resource limits are kernel-enforced, requiring no cooperation from untrusted code. The system must:

  • Prevent resource exhaustion attacks (CPU, memory)
  • Prevent capability leakage
  • Isolate faults to single realms
  • Enforce access control on shared resources

2. Cheap Processes, Expensive Realms

Process creation is microseconds (pure userspace). Realm creation is milliseconds (kernel objects, page tables). Use realms for security boundaries, processes for concurrency.

Operation Cost Use Case
Process spawn ~1-10 µs Workers, handlers, concurrent tasks
Realm creation ~1-10 ms Security isolation, untrusted code, drivers

3. BEAM-Style Message Passing

Messages between processes are always deep-copied (BEAM semantics). This enables independent per-process garbage collection and prevents shared-state bugs. Large binaries (≥64 bytes) use reference counting via a realm-wide binary pool to avoid copying bulk data. Between realms, messages are serialized for IPC.

4. Late Binding for Live Updates

Clojure-style vars enable code updates that propagate automatically to child realms without restart. Parent updates a var binding, children see the new value immediately through shared memory mappings.

5. Policy Compiled to Kernel Mechanisms

Resource policies are "compiled" into seL4 scheduling contexts and capabilities. No userspace code in the enforcement hot path - the kernel enforces CPU budgets and memory limits directly.

What We Take From Each System

Source What We Adopt
seL4 Capabilities, address space isolation, MCS scheduling, security-focused microkernel design
BEAM Lightweight processes, per-process heaps, reduction-based scheduling, mailboxes, immutable/persistent data structures
Clojure Vars, namespaces, atomic updates, rich literal syntax (tuples [], vectors {}, sets #{}, maps %{}), using data literals for function parameters
LISP Homoiconicity, REPL-driven development, runtime code loading

Design Rationale

Why seL4?

  • Capability security: No ambient authority - all access must be explicitly granted
  • Minimal TCB: Small kernel, most code in userspace, reduced attack surface
  • Security-focused design: Developed with formal verification methodology, resulting in high code quality
  • Performance: Fast IPC and context switching

Why BEAM-style Processes?

  • Isolation: Per-process heaps eliminate shared-state bugs
  • Scalability: Millions of lightweight processes
  • Fault tolerance: Process crashes don't affect others
  • Soft real-time: Per-process GC, no stop-the-world pauses

Why Clojure-style Vars?

  • Live updates: Var indirection enables code changes without restart
  • Consistency: Atomic namespace updates prevent partial states
  • Hierarchy: Natural code sharing from parent to child realms
  • Rich literals: Expressive data notation for configuration and messages

Why Hierarchical Realms?

  • Least authority: Each realm has minimal capabilities
  • Defense in depth: Multiple isolation boundaries
  • Resource accounting: Hierarchical budgets (children share parent's allocation)
  • Composability: Build systems from isolated components

Note on seL4 Formal Verification

seL4 is formally verified only in specific configurations (e.g., single-processor, specific platforms). Multi-processor configurations with MCS scheduling, which Lona targets, are NOT formally verified.

We choose seL4 for its strong security foundations and the code quality that formal verification methodology brings, but formal verification does not apply to our configuration and is not a goal for this system.

Hardware Requirements

Lona requires specific hardware capabilities for operation and security. See Supported Hardware for the full platform support matrix.

Requirement x86_64 aarch64 Mandatory Purpose
64-bit CPU Yes Yes Yes Address space layout
MMU Yes Yes Yes VSpace isolation
IOMMU Intel VT-d ARM SMMU No* DMA isolation
Timer APIC GIC Yes MCS scheduling

IOMMU and Driver Security

IOMMU is required for full security isolation of device drivers.

Without IOMMU, a device can DMA to arbitrary physical memory, bypassing all VSpace/CSpace isolation. This means:

  • With IOMMU: Driver realms are fully isolated. A compromised driver cannot access memory outside its allocated DMA regions. Drivers can be untrusted.
  • Without IOMMU: Driver realms must be trusted. They become part of the Trusted Computing Base (TCB). A compromised driver can read or write any physical memory.

At boot, Lona detects IOMMU availability:

  • If present: IOMMU is configured to restrict each device to its allocated DMA regions. Log: IOMMU enabled, DMA isolation active
  • If absent: Warning logged: WARNING: No IOMMU detected. Driver realms are TRUSTED. DMA isolation disabled.

See DMA Isolation for technical details.


Security Model

Threat Model

We assume any realm may be compromised. A compromised realm may:

  • Attempt to exhaust CPU or memory
  • Try to access memory it shouldn't have
  • Send malformed or excessive IPC messages
  • Attempt to forge capabilities or identities
  • Crash repeatedly to disrupt the system

The system must remain stable and protect other realms even when one realm is actively malicious.

Defense Mechanisms

Threat Defense Enforcement
CPU exhaustion MCS scheduling with per-realm budgets seL4 kernel
Memory exhaustion Memory quotas per realm Lona Memory Manager*
Unauthorized memory access Separate address spaces per realm seL4 kernel + MMU
Capability theft/forgery Capability system, per-realm endpoints seL4 kernel
IPC flooding Fault rate limiting Lona Memory Manager
Crash loops Supervisor policies, restart limits Lona VM
Code injection W^X memory mappings (no RWX) seL4 kernel
DMA attacks IOMMU restricts device memory access Hardware IOMMU**

*Memory quotas are enforced by the Memory Manager through capability partitioning, not directly by the kernel.

**IOMMU required. Without IOMMU, driver realms must be trusted. See Hardware Requirements.

Trust and Authority Direction

Realms form a tree hierarchy. Authority flows downward from parent to child:

Lona Memory Manager (trusted, manages all realms)
    └── Init Realm (first realm, high trust)
            ├── App Realm A (trusted)
            │       │
            │       └── Plugin Realm (untrusted) ← OK: untrusted child
            └── App Realm B (trusted)

Parent authority over children:

  • Parents can update inherited code (children see changes via live sharing)
  • Parents can revoke child capabilities
  • Parents control child resource budgets

Trust implications:

  • Untrusted child under trusted parent: SAFE. Parent controls child; child cannot harm parent.
  • Trusted child under untrusted parent: UNSAFE. Untrusted parent can compromise trusted child through inheritance.

Rule: Encapsulate untrusted code in dedicated child realms. Never run trusted code as a child of untrusted code.

Process Isolation vs Realm Isolation

Aspect Process Isolation Realm Isolation
Enforced by Lona VM (userspace) seL4 kernel (hardware)
Address space Shared within realm Separate per realm
Purpose Concurrency, fault tolerance Security
Trust level Same trust as realm Independent trust
Bypass possible? Yes (bugs, malicious code) No (kernel + MMU enforced)

Processes within a realm share an address space. A bug or exploit in one process could theoretically access another process's memory. Realm isolation is hardware-enforced and cannot be bypassed by userspace code.


Architecture Summary

┌─────────────────────────────────────────────────────────────────────┐
│                           seL4 Kernel                               │
│                    (capabilities, scheduling, IPC)                  │
└─────────────────────────────────┬───────────────────────────────────┘
                    ┌─────────────┴─────────────┐
                    ▼                           │
       ┌──────────────────────┐                 │
       │ Lona Memory Manager  │                 │
       │                      │                 │
       │   Resource mgmt      │                 │ schedules
       │   Fault handling     │                 │
       │   Realm lifecycle    │                 │
       └──────────┬───────────┘                 │
                  │                             │
                  │ creates realms              │
                  ▼                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            REALMS                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │   Init Realm    │  │   App Realm     │  │  Driver Realm   │      │
│  │                 │  │                 │  │                 │      │
│  │  Lona VM        │  │  Lona VM        │  │  Lona VM        │      │
│  │       │         │  │       │         │  │       │         │      │
│  │       ▼         │  │       ▼         │  │       ▼         │      │
│  │  ┌─┬─┬─┐        │  │  ┌─┬─┬─┐        │  │  ┌─┬─┬─┐        │      │
│  │  │P│P│P│        │  │  │P│P│P│        │  │  │P│P│P│        │      │
│  │  └─┴─┴─┘        │  │  └─┴─┴─┘        │  │  └─┴─┴─┘        │      │
│  │  Processes      │  │  Processes      │  │  Processes      │      │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘      │
└─────────────────────────────────────────────────────────────────────┘

Document Overview

Document Description
Memory Fundamentals Physical memory, MMU, address translation, and seL4's memory model
System Architecture Lona Memory Manager, realms, threads, bootstrapping, IPC, and resource management
Realm Memory Layout Address space structure, inherited regions, vars, and code compilation
Term Representation Tagged words, immediate values, heap object headers, and memory layout
Process Model Processes, scheduling, message passing, and memory layout
Garbage Collection Per-process generational GC, heap growth, large binaries, and differences from BEAM
Device Drivers Driver isolation, zero-copy I/O, DMA, and interrupt handling

Key Concepts

Detailed definitions for reference:

Term Definition
Lona Memory Manager Minimal privileged component managing resources. No Lonala code.
Lona VM Virtual machine running Lonala code, mapped into all realms
Realm Security boundary with its own address space, capability space, and CPU budget
Worker Kernel thread (TCB) running the Lona VM within a realm
Process Lightweight Lonala execution unit, multiplexed by the VM
Inherited Region Parent realm's code/data mapped read-only into child at fixed address
Realm Endpoint Per-realm IPC endpoint for communication with Lona Memory Manager (unforgeable identity)

seL4-specific terms:

Term Definition
VSpace seL4's virtual address space object - each realm has its own
CSpace seL4's capability space - stores capabilities a realm holds
Untyped Raw physical memory that can be retyped into kernel objects
MCS Mixed Criticality Scheduling - seL4's scheduling model with CPU budgets
TCB Thread Control Block - seL4's kernel object representing a thread

Scope and Precision

This documentation captures architectural decisions and design discussions. Some aspects are precisely defined, others are deliberately left open for future refinement:

Precisely Defined:

  • Separation between Lona Memory Manager and Lona VM binaries
  • Two-level memory management (realms via seL4, processes via VM)
  • Inherited code regions with fixed virtual addresses
  • Process memory: single contiguous block per process (stack + heap growing toward each other, BEAM-style)
  • Per-worker allocator instances for lock-free allocation
  • Heap growth via reallocation (Fibonacci then 20% increments)
  • Per-process garbage collection (no global pauses)
  • Per-realm endpoints for IPC identity (not badges)
  • Fixed virtual addresses for shared code (same address in all realms)
  • IPC buffer location within worker stacks region
  • Live sharing semantics for inherited regions (not snapshots)
  • Var shadowing: local definitions override inherited
  • Per-realm fault rate limiting for DoS protection
  • Region table with strict permission enforcement
  • Lazy frame mapping for inherited regions

Deliberately Open:

  • Exact virtual address assignments (specific numbers are illustrative)
  • Specific segment sizes and limits
  • Multi-worker (multi-TCB) scheduling within realms
  • Detailed IPC protocol message formats
  • Fault rate limit thresholds (configurable)

Known Limitations (Future Work):

  • Code region GC: append-only model causes unbounded growth, GC deferred to later
  • Single-threaded fault handler: no thread pool, rate limiting only