GPU Drivers for Verified Kernels

Multi-tenant GPU workloads today run on hypervisors whose TCB is orders of magnitude too large to verify, and whose GPU-facing driver is proprietary kernel code running at ring 0 (§). Two verified microkernels are candidates for hosting an ML-grade GPU stack: seL4 [1], whose functional-correctness proof is the most complete but which has no GPU driver support at all, and NOVA [2], whose partial proof covers concurrency and weak memory and whose microhypervisor architecture is explicitly designed for the host/guest split a GPU-passthrough workload needs.

The research methodology for verifying a driver — separation-logic proofs against an abstract hardware model — is settled [3], [4], [5]. The real blocker is that the abstract hardware model does not exist. No GPU vendor publishes a machine-checkable specification of their command processor, MMU, or DMA engine; Peter Sewell’s REMS Group has the most accurate public modeling work but is scoped to the CPU ISA [6]. The tractable problem is accordingly two-sided: produce a formal model of a GPU command-submission interface at a fidelity that supports proof, and prove a driver against it.

Solution/project Sketch

Start with the smallest useful surface: the command-submission path of a single open-source GPU stack (NVK/Nouveau on NVIDIA, or an AMDGPU subset). Specify the command-ring state machine in Rocq or Lean at a level of detail that admits noninterference claims across tenant contexts — ring-buffer consistency, IOMMU mapping integrity, context-switch scrubbing. Prove a reference driver (runnable under either seL4′s sDDF or NOVA’s userspace driver model) against that spec, with the security property being no sequence of guest-supplied command packets causes the driver to program an IOMMU mapping or issue a DMA outside the guest’s declared memory region. Two natural stopping points: a verified command-submission module with a stubbed-in hardware model, which is shippable on its own as a reference; and the same driver proved against a model co-developed with the vendor or with REMS, which is the research contribution.

Bibliography

// stack context

[firmware-lowlevel] Firmware & Low-Level Systems

Firmware and Low-Level Systems Below the orchestration layer and above the silicon sits the firmware: hypervisors, device drivers, and boot chains. Code here runs at the highest privilege levels on both CPU and GPU. A bug is not a container escape — it is a host compromise, often with no log entry a

[hardware-supply-chain] Hardware & Physical Supply Chain

Hardware and Physical Supply Chain At the bottom of the stack is the silicon itself: the GPU, TPU, and NPU dies that run every training and inference workload, and the global supply chain that produces them. Compromise here is durable: a hardware trojan survives every software update, every OS reins

← all problems