Multi-tenant GPU workloads today run on hypervisors whose TCB is orders of magnitude too large to verify, and whose device-facing drivers are proprietary kernel code running at ring 0 (§). Every verified-core project — the seL4 [1] and NOVA [2] microkernels, and the tailored hypervisors seKVM [3] and AWS’s Nitro Isolation Engine [4] — deliberately pushes device drivers out of the trusted core to keep the verified code small. The driver then runs deprivileged, in user mode, which is the right architecture but relocates the problem rather than removing it: tenant isolation now rests on the correctness of the user-mode driver and of the code that configures the kernel’s protection policy.
The second dependency — the correctness of the code that configures the kernel’s protection policy — is the capability-versus-policy trade-off taken up by the enabler problem at §, where seKVM and Nitro bake policy into the verified core while seL4 and NOVA push it into configuration that must itself be verified. This widget targets the first dependency: the driver.
The research methodology for verifying a single driver — separation-logic proofs against an abstract hardware model, demonstrated for a ZynqMP DMA engine in concurrent separation logic [5], [6], [7] — is settled. The real blocker is that the abstract hardware model does not exist for the devices that matter. The CPU side is modeled — Arm and RISC-V ship machine-readable ISA specifications, and REMS reaches relaxed virtual memory and the Arm MMU [8], [9] — but no GPU vendor publishes a machine-checkable specification of its command processor, on-GPU memory-management and IOMMU interfaces, or DMA engines. AMD’s machine-readable GPU ISA is the shader instruction set, not that control surface. So the tractable problem is two-sided: produce a formal model of a device interface at a fidelity that supports proof, and prove a driver against it. The longer prize behind it is a verifiable feature set — enough of a hypervisor (dynamic provisioning, VM migration) and its user-mode drivers to run real workloads, kept small enough to stay in reach of current proof technology. Neither seL4 nor NOVA can run a virtual machine in isolation today; identifying that minimal feature set is the precondition for everything above it.
Start with the smallest useful surface: the command-submission path of a single open-source GPU stack (NVK/Nouveau on NVIDIA, or an AMDGPU subset). Specify the command-ring state machine in Rocq or Lean at a level of detail that admits noninterference claims across tenant contexts — ring-buffer consistency, IOMMU mapping integrity, context-switch scrubbing. Prove a reference driver (runnable under either seL4′s sDDF or NOVA’s userspace driver model) against that spec, with the security property being no sequence of guest-supplied command packets causes the driver to program an IOMMU mapping or issue a DMA outside the guest’s declared memory region. Two natural stopping points: a verified command-submission module with a stubbed-in hardware model, which is shippable on its own as a reference; and the same driver proved against a model co-developed with the vendor or with REMS, which is the research contribution.
// stack context
[firmware-lowlevel] Firmware & Low-Level Systems
Firmware and Low-Level Systems Below the orchestration layer and above the silicon sits the firmware: hypervisors, device drivers, and boot chains. Code here runs at the highest privilege levels on both CPU and GPU. A bug is not a container escape — it is a host compromise, often with no log entry a…
[hardware-supply-chain] Hardware & Physical Supply Chain
Hardware and Physical Supply Chain At the bottom of the stack is the silicon itself: the GPU, TPU, and NPU dies that run every training and inference workload, and the global supply chain that produces them. Compromise here is durable: a hardware trojan survives every software update, every OS reins…