Below the orchestration layer and above the silicon sits the firmware: hypervisors, device drivers, and boot chains. Code here runs at the highest privilege levels on both CPU and GPU. A bug is not a container escape — it is a host compromise, often with no log entry at all.
Multi-tenant GPU isolation today relies on hypervisors. KVM’s trusted computing base is roughly ten million lines of code; Xen is smaller but still far too large for exhaustive verification. Both have had VM-escape vulnerabilities — Google Project Zero’s 2021 KVM breakout via AMD SVM nested virtualization [1] is a representative example — and the PCI passthrough path that GPU workloads require widens the attack surface further, since IOMMU misconfigurations or missing PCI Access Control Services (ACS) can allow peer-to-peer DMA between devices assigned to different tenants. NVIDIA’s Multi-Instance GPU (MIG) partitioning provides hardware-level memory isolation within a single GPU, but the partitioning is managed by the host driver, which runs inside the hypervisor’s TCB. seL4, the only general-purpose microkernel with a complete functional-correctness proof, has no GPU driver support at all. Its new device driver framework (sDDF) handles network and block devices, but nothing with the complexity of a modern GPU. NOVA [2] is the other candidate worth naming here: it is open-source (GPL-2) with an open-source proof, and though the verification is not as far along as seL4′s, the portions that are complete already account for concurrency and weak memory — both of which fundamentally change the verification challenge rather than merely extending it. Closing the GPU gap — giving either kernel a verified GPU driver — is the subject of §.
Even where MIG is deployed, GPU memory is not always scrubbed between context switches. Trail of Bits demonstrated this with LeftoverLocals [3]: on AMD, Apple, and Qualcomm GPUs, a co-resident process could read local memory left behind by a previous kernel, recovering roughly 5.5 MB per GPU invocation — enough to reconstruct an LLM’s token-by-token output with high fidelity. NVIDIA hardware was not affected in that specific case, but the deeper issue is architectural. GPUs were designed for throughput, not isolation, and the memory hierarchy reflects that priority. Shared L2 caches, shared interconnects, and performance counters that leak timing information all create side channels across tenant boundaries. The NVBleed attack [4] showed that contention on NVLink interconnects between GPUs lets an attacker fingerprint which deep-learning model a co-tenant is running with 97.8% accuracy, and the attack works across VM boundaries on Google Cloud — even after NVIDIA patched performance-counter access, the timing channel alone still yields F1 scores above 83%.
Confidential VMs compound the problem. CPU-side TEEs like AMD SEV-SNP encrypt guest memory so the hypervisor cannot read it, but extending that guarantee to a GPU is an open engineering problem. SEV-SNP requires the IOMMU in non-passthrough mode to prevent peripherals from reaching encrypted memory, which directly conflicts with the PCIe passthrough that GPU workloads need. AMD’s SEV-TIO extension [5] is meant to bridge this gap using PCI-SIG’s TDISP protocol, but it is not yet widely deployed, and the attestation story for a GPU behind a TDISP-secured link is still being worked out. In the meantime, any “confidential AI” deployment that claims SEV-SNP protection while using GPU passthrough has a gap in its threat model that the marketing materials do not mention.
The GPU driver is the most privileged code that touches the accelerator. NVIDIA’s proprietary CUDA driver stack, AMD’s ROCm, and Intel’s oneAPI are all closed-source, kernel-mode codebases that handle memory mapping, command submission, and context switching for every GPU workload on the machine. NVIDIA ships quarterly security bulletins; 2024 alone included [6] (privilege escalation in the display driver) and [7] (out-of-bounds read leading to code execution), plus nine vulnerabilities in the CUDA toolkit found by Palo Alto’s Unit 42 [8]. These are not exotic attacks — they are standard memory-safety bugs in C code running at ring 0. The driver is a single point of compromise: an attacker who controls it can read any tenant’s GPU memory, modify in-flight computations, or pivot to the host kernel. Because the driver source is proprietary, the only available mitigations are patching and hoping. A verified, open driver for at least the GPU command-submission path would change the calculus (§).
A compromised or buggy GPU driver also opens the door to DMA attacks. The GPU sits on the PCIe bus and performs DMA to host memory; the IOMMU is supposed to restrict which physical pages the device can touch, but if the driver programs the IOMMU mappings incorrectly — or if an attacker can influence them — the GPU becomes a tool for reading or writing arbitrary host memory. [9] demonstrated exactly this on NVIDIA’s Jetson platform, where a PCIe DMA attack bypassed secure boot. The IOMMU is a necessary defense, but it is only as good as the code that configures it, and that code is part of the same proprietary driver stack.
The CUDA compatibility layer adds its own surface area. NVIDIA maintains forward and backward compatibility across major CUDA versions, which means the driver must support multiple ABIs simultaneously and preserve behavior across a sprawling matrix of toolkit-to-driver version combinations. Operators running multi-framework training pipelines frequently pin older driver versions to avoid breaking their stack, which means known-patched vulnerabilities persist in production for months or years. The compatibility shims themselves are additional code paths that rarely get the same scrutiny as the hot path, and any bug in ABI translation is a bug at ring 0.
On the open-source side, NVIDIA released its kernel-mode GPU modules as open source in 2022, and the Mesa project’s NVK driver now provides Vulkan support for Kepler through Blackwell. These are steps forward for auditability — community review can find classes of bugs that internal QA misses. But neither effort covers the full compute path that ML workloads use: the user-mode CUDA runtime, the compiler backend, and the firmware blobs that run on the GPU’s internal microcontrollers all remain closed. An open kernel module with a closed firmware blob is better than nothing, but it is not a verifiable system. The gap between “source available” and “formally verified” is where the interesting work lies.
Worth naming what the open problem is and is not. Driver verification as a research methodology — separation-logic proofs of a driver against an abstract hardware model — is a demonstrated pattern [10], [11], [12]. The unsolved half is the abstract hardware model itself. Vendors do not ship machine-checkable specifications of their devices; the most accurate public modeling effort, Peter Sewell’s REMS Group under a long-running ARM collaboration [13], is scoped to the CPU ISA and does not cover the MMU, IOMMU, or anything resembling a GPU command processor. The pivot from such a model to a software proof is a further gap; sketches exist [12] but no one has closed it at scale. The tractable problem at § is accordingly as much about producing a formal model of a GPU command-submission interface as it is about proving a driver against one.
None of the isolation above matters if the firmware itself has been tampered with. Secure Boot and measured boot chains establish trust from the hardware root of trust (RoT) through each firmware stage to the OS kernel: each stage cryptographically verifies the next before handing off execution. NVIDIA’s H100 extends this model to the GPU, with a per-device ECC keypair, on-die RoT, and a measured boot sequence that produces an attestation report — a signed manifest of every firmware component loaded. Combined with CPU-side TEEs (Intel TDX, AMD SEV-SNP), this enables composite remote attestation: a verifier can check that both CPU and GPU booted clean firmware before releasing model weights or training data to a node.
There is a real architectural choice here, not just a terminology one, between enforcing boot locally via signature check (verified boot) and enforcing it externally via an attestation-gated control plane (measured boot). Verified boot halts on bad signatures, but it trusts whatever is in the signature database, and that database has been bypassed in the wild: the BlackLotus bootkit (2023) exploited [14] to defeat UEFI Secure Boot on fully patched Windows systems by bringing its own copy of a legitimately signed but vulnerable boot manager. Measured boot does not prevent anything from loading — the TPM faithfully records whatever hash is presented, clean or not — but in exchange it is more compositional: the hardware only measures, and policy (what to do with a given measurement) lives outside the boot path, in an operator-owned plane that can release secrets only to nodes that pass attestation, hard reset machines that fail it, or quarantine them from the management network for analysis. NOVA deliberately went this direction [2], on the argument that post-facto measurability with an external control plane is a better composition point than bundling enforcement into the boot ROM. Either story depends on continuous attestation and a fresh revocation list; a stale verifier leaves a tampered node operating undetected between checks.
The Baseboard Management Controller (BMC) is a separate, quieter threat. Every server in a training cluster has a BMC — a small system-on-chip running its own OS (often Linux), connected to its own network interface, with out-of-band access to the host’s power, console, and firmware update mechanisms. BMCs are managed via IPMI or Redfish, and they are rarely patched with the same urgency as the host OS. In 2023, Binarly disclosed seven vulnerabilities in Supermicro BMC firmware [15] that gave unauthenticated attackers root access to the BMC. From there, an attacker can read cleartext credentials off the BMC filesystem, flash malicious UEFI firmware to the host, or pivot to every other BMC on the management VLAN. The persistence is the real problem: a BMC implant survives host OS reinstalls, disk wipes, and even GPU firmware updates, because it lives on a separate flash chip that the host never touches.
Firmware signing for GPUs is better than it was — NVIDIA’s secure firmware update mechanism verifies digital signatures and enforces version anti-rollback — but the trust anchor is only as strong as the key management around it. The 2022 LAPSUS$ breach of NVIDIA [16] resulted in the theft of two code-signing certificates. Although the certificates were expired, Windows still accepts expired certificates for kernel drivers, and malware signed with the stolen keys appeared in the wild within a day. The broader lesson: a single key compromise turns the entire firmware-signing infrastructure from a defense into a distribution channel. What formal methods can contribute here is verifying the attestation protocol itself: proving that the chain of measurements is unforgeable, that a compromised node cannot replay a clean attestation, that revocation logic has no time-of-check/time-of-use gaps, and that the BMC’s update path cannot be used to write outside its intended flash region. The machinery exists, but deploying it across a 10,000-GPU training cluster with continuous attestation, key rotation, and revocation checking is an engineering problem that remains largely unsolved at scale.
// related tractable problems