The orchestration layer manages the compute resources that ML workloads run on: scheduling jobs across GPU clusters, routing traffic between them, and controlling who can access what. It sits below the frameworks and above the firmware — a layer of distributed systems software that most ML engineers interact with only through configuration files, but whose security properties determine whether isolation between tenants, jobs, and data pipelines actually holds. Following § we treat this layer as a stack of mechanisms (runtime confinement, distributed-protocol correctness, network-fabric isolation, IAM logic) and not as a question about scheduler optimization; placement and load-balancing decisions are downstream of these mechanisms and out of scope.
GPU training and inference jobs run inside containers managed by Kubernetes, Slurm, or Ray. The container is the unit of isolation: it is supposed to confine a workload to its own filesystem, network namespace, and device access. In practice, GPU workloads erode this confinement. NVIDIA’s container toolkit — the standard mechanism for exposing GPUs inside containers — had a critical TOCTOU vulnerability [1] that allowed a malicious container image to escape to the host, affecting 33% of cloud environments per Wiz’s estimate. The underlying container runtime, runc, had its own escape [2] via leaked file descriptors. These are not exotic attacks; they are the kind of bugs that container runtimes accumulate routinely, and they interact badly with the privileged device access that GPU workloads require. The OCI runtime hardening problem (§) addresses the gap between what orchestrators assume containers guarantee and what the runtimes actually enforce.
Slurm, the dominant scheduler for HPC and large training runs, has a worse security record than most ML engineers realize. It has historically run its daemons as root, and its authentication rests on MUNGE — a shared-secret credential system that was never designed for adversarial multi-tenant environments. [3] demonstrated this: attackers could bypass MUNGE’s hash-based message integrity protections in the slurmd process to replay root-level authentication tokens, gaining privileged access to compute nodes over the network. Earlier vulnerabilities were blunter — [4] let an unprivileged user send data to arbitrary Unix sockets on the host as root through Slurm’s PMI2/PMIx RPC handler, and [5] allowed privilege escalation to root via SPANK environment variable injection during Prolog/Epilog execution. Slurm clusters at national labs and cloud GPU providers routinely run versions months or years behind patches, because upgrading the scheduler means draining the entire cluster.
Ray’s security posture is, by its maintainers’ admission, a design choice rather than an oversight. Until recently, Ray had no authentication on its Jobs API or Dashboard — any process that could reach the network port could submit arbitrary code for execution on the cluster. Anyscale’s position was that Ray should only run in isolated networks and act upon trusted code. The result was predictable: the ShadowRay campaign [6] compromised hundreds of thousands of exposed Ray clusters for cryptocurrency mining, data exfiltration, and eventually self-propagating botnets. Attackers extracted production database credentials, cloud environment tokens, and model weights from compromised AI workloads. The vulnerability remained unpatched for over two years because the vendor classified it as intended behavior.
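Since the vendor position makes network isolation the only control, one basic check is whether the dashboard port answers from a vantage point that should not be able to reach it. The following is a minimal sketch, assuming Ray's default dashboard port of 8265; the helper names are illustrative, and a successful TCP connect only shows reachability, not that the Jobs API is unauthenticated.

```python
import socket

RAY_DASHBOARD_PORT = 8265  # Ray's default dashboard / Jobs API port


def port_reachable(host: str, port: int = RAY_DASHBOARD_PORT,
                   timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds from here."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def exposed_hosts(hosts, port: int = RAY_DASHBOARD_PORT,
                  timeout: float = 1.0) -> list[str]:
    """Hosts from the list whose port answers; run from an untrusted network segment."""
    return [h for h in hosts if port_reachable(h, port, timeout)]
```

Run from outside the cluster's trust boundary, an empty result is the expected outcome; any host returned is reachable by whoever shares that vantage point.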
Kubernetes introduces its own GPU-specific attack surface through device plugins. The NVIDIA device plugin runs as a privileged DaemonSet with access to the host’s /var/lib/kubelet/device-plugins socket and device files. A compromised plugin can register fake GPU devices, intercept kubelet communications, or escalate to full node compromise. Kubernetes also lacks native GPU resource throttling — unlike CPU and memory, there are no cgroup controls for GPU compute time, so a malicious pod can monopolize GPU cycles and starve other tenants’ inference workloads with no scheduler-level mitigation. In multi-tenant clusters, GPU memory is not isolated by default; processes on the same GPU can read each other’s memory regions unless MIG (Multi-Instance GPU) or MPS partitioning is explicitly configured, and even then the isolation guarantees are weaker than what hypervisors provide for CPU and RAM.
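To make the device-plugin exposure concrete, here is a rough sketch of a pod-spec audit. It assumes the spec has already been parsed into a plain dictionary (for example from `kubectl get pod -o json`); the function name and the set of capabilities treated as risky are illustrative choices, not a complete policy. A legitimate device plugin will trip these checks too, so the output is a review list rather than a verdict.

```python
DEVICE_PLUGIN_DIR = "/var/lib/kubelet/device-plugins"
RISKY_CAPS = {"SYS_ADMIN", "ALL"}  # illustrative choice, not exhaustive


def audit_pod_spec(spec: dict) -> list[str]:
    """Return human-readable findings for a parsed Kubernetes pod spec."""
    findings = []
    # Pod-level flags that share host namespaces with the workload.
    for flag in ("hostPID", "hostNetwork", "hostIPC"):
        if spec.get(flag):
            findings.append(f"pod sets {flag}")
    # hostPath volumes that expose the kubelet device-plugin socket directory.
    for vol in spec.get("volumes") or []:
        path = (vol.get("hostPath") or {}).get("path", "")
        if path.startswith(DEVICE_PLUGIN_DIR):
            findings.append(f"volume {vol.get('name')!r} mounts {DEVICE_PLUGIN_DIR}")
    # Container-level privilege settings.
    for c in spec.get("containers") or []:
        sc = c.get("securityContext") or {}
        if sc.get("privileged"):
            findings.append(f"container {c.get('name')!r} is privileged")
        added = set((sc.get("capabilities") or {}).get("add") or [])
        for cap in sorted(added & RISKY_CAPS):
            findings.append(f"container {c.get('name')!r} adds capability {cap}")
    return findings
```

Running this across all namespaces gives a list of which pods hold the access that a compromised plugin would need, which is a smaller set to monitor than the whole cluster.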
Large training runs communicate over InfiniBand or RoCE interconnects using RDMA — remote direct memory access that bypasses the kernel to move data between GPU memory regions at line rate. This is a performance architecture, not a security architecture. RDMA’s kernel bypass means that the standard OS-level network monitoring and access control stack is not in the path. InfiniBand partition keys (P_Keys) provide coarse tenant isolation at the hardware level, but verifying that traffic isolation policies are correctly enforced across a fabric of thousands of ports is a manual, configuration-driven process with no runtime audit trail. If you want to verify what is actually on the wire between nodes in a GPU cluster — not what the SDN controller says is on the wire — you need an independent observation point, which is the problem the verified network tap (§) is designed to solve.
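The manual verification step can at least be expressed as a diff between intended and observed partition membership. The sketch below assumes both tables are already available as dictionaries keyed by a port identifier; how the observed table is collected from the fabric is out of scope here, and the port names are hypothetical. It relies on the InfiniBand convention that the top bit of a 16-bit P_Key encodes membership type, so comparison uses the low 15 bits.

```python
def pkey_base(pkey: int) -> int:
    """Strip the membership bit (MSB) and keep the 15-bit partition number."""
    return pkey & 0x7FFF


def audit_partitions(policy: dict, observed: dict) -> list[tuple]:
    """Return (port, partition) pairs present on the fabric but not in policy.

    policy:   port id -> set of P_Keys the port is supposed to hold
    observed: port id -> set of P_Keys actually programmed on the port
    """
    violations = []
    for port, seen in sorted(observed.items()):
        allowed = {pkey_base(p) for p in policy.get(port, set())}
        actual = {pkey_base(p) for p in seen}
        for extra in sorted(actual - allowed):
            violations.append((port, extra))
    return violations
```

A port that appears in `observed` but not in `policy` has every one of its partitions reported, which is the conservative default for an unknown endpoint. This checks configuration state only; it does not observe what is actually on the wire.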
The SDN controllers that manage GPU cluster fabrics are themselves a concentrated target. An SDN controller is a single logical authority over all traffic routing decisions in the network; compromise it and you can redirect, mirror, or drop any flow. The OpenFlow protocol, which most SDN deployments use for controller-to-switch communication, did not require authentication of switches in early versions, and even post-1.2 versions with TLS support are frequently deployed without it. An attacker who gains access to the control plane can add a second malicious controller (a feature available in OpenFlow 1.2+) and persistently reroute traffic without touching the data plane hardware. The centralization that makes SDN manageable is exactly what makes it a single point of failure for traffic integrity.
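Both failure modes described here, an unexpected controller and a control channel without TLS, can be checked from the switch side. The following is a minimal sketch that assumes each switch's configured controllers have been exported as strings of the form `scheme:host:port` (the `ssl:` and `tcp:` prefixes follow Open vSwitch's target syntax); the data layout and function name are assumptions for illustration.

```python
def audit_controllers(switch_controllers: dict, approved: set) -> list[tuple]:
    """Flag controller targets that are unapproved or not using TLS.

    switch_controllers: switch name -> list of "scheme:host:port" targets
    approved:           set of "host:port" endpoints that are legitimate
    """
    findings = []
    for switch, targets in sorted(switch_controllers.items()):
        for target in targets:
            scheme, _, endpoint = target.partition(":")
            if endpoint not in approved:
                findings.append((switch, target, "unapproved controller"))
            elif scheme != "ssl":
                findings.append((switch, target, "controller channel not using TLS"))
    return findings
```

Because the check reads the controller list from each switch rather than from the controller, a rogue second controller shows up even if the primary controller's own view has been tampered with.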
Figure: NVLink/NVSwitch topology inside a typical eight-GPU node. The fabric manager that programs NVSwitch routing tables runs as a privileged host process; the resulting tenant boundary is software-configured rather than hardware-enforced, and the traffic that crosses it never touches the CPU, the PCIe bus, or the OS network stack.
Within a single node, GPUs communicate over NVLink and NVSwitch at bandwidths up to 900 GB/s — bypassing the host CPU, PCIe bus, and OS network stack. NVIDIA’s fabric manager controls NVSwitch routing tables and restricts applications to designated address ranges, but this is a software-configured trust boundary, not a hardware-enforced one. There is no equivalent of IOMMU page-table isolation for NVLink traffic; the protection rests on correct configuration of the fabric manager, which runs as a privileged host process. A compromised fabric manager, or a bug in NVSwitch routing table setup, could allow one GPU’s process to read or write another’s memory region across the switch. For multi-tenant GPU servers using NVLink-connected GPUs — the standard configuration in DGX systems and cloud GPU instances — this is a lateral movement path that never touches the network and never appears in host OS logs.
Even when interconnect traffic is encrypted, traffic analysis remains viable. The volume and timing of gradient synchronization traffic are highly regular and predictable — all-reduce operations produce characteristic burst patterns that correlate with batch boundaries, model architecture, and even convergence state. An observer on the fabric (or with access to switch port counters) can infer training progress, detect when checkpointing occurs, and distinguish between different model architectures without decrypting a single packet.
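To illustrate how little an observer needs, the sketch below recovers the burst period from nothing but a series of per-interval byte counts, such as deltas of a switch port counter sampled at a fixed rate. It uses a plain autocorrelation over candidate lags; the function name and the sampling setup are assumptions, and real traffic would need noise handling that this omits.

```python
def dominant_period(samples: list, max_lag: int) -> int:
    """Return the lag (in samples) with the highest autocorrelation.

    samples: bytes transferred per sampling interval
    max_lag: largest lag to test; must be smaller than len(samples)
    Ties go to the shortest lag, so harmonics do not displace the base period.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    best_lag, best_score = 0, float("-inf")
    for lag in range(1, max_lag + 1):
        # Average product of the series with itself shifted by `lag`.
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Multiplying the returned lag by the sampling interval gives an estimate of step time, and a shift in that estimate over the course of a run is the kind of signal that reveals a change in batch size, a checkpoint pause, or a different job on the same ports.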
Distributed training coordinates gradient synchronization across hundreds or thousands of GPUs using collective operations — all-reduce, all-gather, reduce-scatter — implemented by libraries like NCCL and Gloo. These protocols run over the RDMA fabric with no authentication or encryption; any process that can reach the network can inject or modify gradient data in transit. In federated and multi-node settings, this is the mechanism by which model poisoning attacks operate: a compromised node contributes manipulated gradients that shift the model’s behavior while remaining within the statistical noise of normal training variance. The protocol’s correctness properties — that all-reduce actually computes the sum of all participants’ contributions, that no participant can observe another’s individual gradient — are distributed-system invariants that formal methods can specify and verify. Proof-carrying code (§) is one path to making these guarantees checkable at deployment time rather than assumed.
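The sum invariant is easy to state in code. The following is a single-process simulation of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase) with one scalar per chunk, together with a check that every rank ends up holding the element-wise sum. It is a teaching sketch, not how NCCL or Gloo are implemented, and the function names are illustrative.

```python
def ring_all_reduce(grads: list) -> list:
    """Simulate ring all-reduce for n ranks, each holding a vector of n chunks."""
    n = len(grads)
    bufs = [list(g) for g in grads]
    # Reduce-scatter: at each step rank r sends chunk (r - step) to rank r + 1,
    # which accumulates it. Afterwards rank r holds the full sum of chunk (r + 1).
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in msgs:
            bufs[(r + 1) % n][chunk] += value
    # All-gather: completed chunks travel around the ring and overwrite stale values.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in msgs:
            bufs[(r + 1) % n][chunk] = value
    return bufs


def sum_invariant_holds(inputs: list, outputs: list) -> bool:
    """True if every rank's output equals the element-wise sum of all inputs."""
    expected = [sum(column) for column in zip(*inputs)]
    return all(out == expected for out in outputs)
```

Note what the invariant does and does not cover: it holds for whatever values the ranks submitted, so a poisoned contribution passes the check. Protocol correctness and input honesty are separate properties, and only the first is addressed here.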
Checkpoint integrity is the other open wound. Large training runs write distributed checkpoints — sharded model state spread across dozens or hundreds of files on networked storage — every few thousand steps. These checkpoints are the recovery mechanism when nodes fail, which happens constantly at scale. They are also the artifact that gets promoted to production: the “final model” is just the last checkpoint that passed evaluation. Yet checkpoints are typically stored as serialized tensors with no cryptographic signature, no content-addressable hash chain linking them to the training run that produced them, and no tamper-evident log of which processes wrote which shards. An attacker with write access to the shared filesystem — or to the object store behind it — can modify checkpoint shards to inject backdoor behavior, and the training job will resume from the poisoned state without any indication that the checkpoint was altered. The same exposure applies when checkpoints are copied between clusters, uploaded to model registries, or handed off from a training team to a deployment team. The integrity gap between “this file exists on S3” and “this file is the authentic output of training run X at step Y” is currently bridged by convention, not cryptography.
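One way to picture the missing binding between a checkpoint and its training run is a signed manifest of shard digests. The sketch below uses only the standard library and an HMAC as a stand-in for a real signature; that is an assumption made to keep the example self-contained, and a production design would more likely use asymmetric signatures so that verifiers cannot forge manifests. Field names such as `run_id` and `step` are hypothetical, and shards are held in memory rather than streamed from disk.

```python
import hashlib
import hmac
import json


def shard_digest(data: bytes) -> str:
    """SHA-256 of one checkpoint shard."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(run_id: str, step: int, shards: dict) -> dict:
    """Bind shard digests to the run and step that produced them."""
    return {
        "run_id": run_id,
        "step": step,
        "shards": {name: shard_digest(blob) for name, blob in sorted(shards.items())},
    }


def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_checkpoint(manifest: dict, signature: str, key: bytes, shards: dict) -> bool:
    """Check the manifest signature, the shard set, and every shard digest."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False
    if set(shards) != set(manifest["shards"]):
        return False  # a shard was added or removed
    return all(
        hmac.compare_digest(shard_digest(blob), manifest["shards"][name])
        for name, blob in shards.items()
    )
```

A resume path that calls `verify_checkpoint` before loading would refuse a modified shard, a swapped step number, or a shard set that does not match what the writer recorded. The scheme is only as strong as the protection of the signing key, which must not live on the same filesystem the attacker can write to.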
Data pipeline integrity compounds the problem. Training data flows through distributed storage systems — HDFS, cloud object stores, data lakes — through preprocessing jobs, shuffling, and batching before it reaches the GPUs. Each stage is a point where data can be modified, filtered, or augmented by an attacker with access to the storage layer. Unlike gradient poisoning, which requires a compromised training node, data poisoning can happen long before training begins, in ETL jobs or data labeling pipelines that are often run with broader permissions and less monitoring than the training jobs themselves.
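Because the exposure spans many stages, a per-stage record linked into a hash chain is one way to make silent modification detectable after the fact. This is a minimal sketch under the assumption that each stage can report a digest of its output; the record fields and function names are illustrative.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _entry_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_record(log: list, stage: str, output_digest: str) -> list:
    """Append a stage record whose hash covers the previous record's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    body = {"stage": stage, "output_digest": output_digest, "prev": prev}
    log.append({**body, "entry_hash": _entry_hash(body)})
    return log


def verify_log(log: list) -> bool:
    """Recompute every link; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("stage", "output_digest", "prev")}
        if entry["prev"] != prev or entry["entry_hash"] != _entry_hash(body):
            return False
        prev = entry["entry_hash"]
    return True
```

A chain like this is tamper-evident only if the latest hash is anchored somewhere the attacker cannot rewrite, since anyone with write access to the whole log can recompute it from the point of modification onward.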
The IAM layer controls which principals — human users, CI pipelines, serving infrastructure, and increasingly autonomous agents — can read model weights, write to training data stores, modify deployment configurations, or invoke models. The failure mode is over-permissioning: ML pipelines tend to accumulate broad service account credentials because the alternative is debugging permission errors during a training run. In 2025, Sysdig’s threat-research team documented an attack chain [7] where compromised credentials reached cloud administrator privileges in eight minutes across 19 distinct AWS principals — LLM-assisted reconnaissance followed by Lambda-based privilege escalation, with invocation logging disabled along the way. When the principal is an AI agent operating with long-lived API tokens, the IAM boundary is the control boundary. A token compromise gives the attacker everything the agent could do, with access patterns indistinguishable from legitimate automation. Proof-carrying code (§) offers a structural alternative: rather than trusting that IAM policies are correctly configured and that tokens are uncompromised, require that actions carry machine-checkable proofs of authorization.
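Over-permissioning of the kind described here is often visible in the policy document itself. The sketch below scans an AWS-style JSON policy, parsed into a dictionary, for `Allow` statements that pair a wildcard action with a wildcard resource. It is deliberately narrow: it ignores conditions, `NotAction`, and permission boundaries, so a clean result should not be read as evidence of least privilege.

```python
def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def overbroad_statements(policy: dict) -> list[tuple]:
    """Return (statement index, wildcard actions) for Allow-on-everything statements."""
    findings = []
    for idx, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        wild_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        if wild_actions and "*" in resources:
            findings.append((idx, wild_actions))
    return findings
```

Applied to the service accounts attached to training and data pipelines, a check like this identifies the credentials whose compromise would give an attacker the widest reach.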
Model registries are an under-secured chokepoint. A model registry — Hugging Face Hub, a private MLflow instance, a cloud model catalog — is where trained models are stored, versioned, and promoted to production. Whoever can push a new version to the registry controls what runs in production. The PoisonGPT demonstration [8] showed that a surgically modified open-source model could be uploaded to Hugging Face and spread targeted misinformation while retaining normal benchmark performance. ReversingLabs found malicious code embedded in pickle-serialized models on Hugging Face that evaded the platform’s safety scanning [9]. Model confusion attacks — the AI supply-chain analog of dependency confusion — exploit namespace reuse in registries to trick downstream pipelines into pulling attacker-controlled models. Access control on model registries is often weaker than on code repositories: push permissions are granted to broad service accounts, there is no equivalent of signed commits or required code review, and the model artifact format (often pickle or safetensors) has no built-in integrity or provenance chain.
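The reason pickle-serialized models can carry code is visible at the opcode level: a pickle stream can name an importable callable and invoke it during loading. The sketch below uses the standard library's `pickletools.genops` to list the opcodes that make this possible, without ever unpickling the data. It is a coarse illustration rather than a scanner; legitimate framework checkpoints use these same opcodes to rebuild tensors, so practical tools compare the referenced names against an allowlist, and, as the finding above shows, scanning of this kind can be evaded.

```python
import pickletools

# Opcodes that import a name or call/instantiate an object during unpickling.
CODE_CAPABLE_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def code_capable_opcodes(blob: bytes) -> set[str]:
    """Return the code-capable opcodes present in a pickle byte string.

    The stream is only disassembled, never loaded, so nothing is executed.
    """
    return {op.name for op, _arg, _pos in pickletools.genops(blob)} & CODE_CAPABLE_OPCODES
```

A pickle containing only plain containers and numbers yields an empty set, while any object with a custom reduction yields at least one import opcode plus `REDUCE`.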
Secret sprawl is the background condition that makes all of the above worse. ML workflows generate and consume credentials at every stage: cloud storage keys in training scripts, Weights & Biases tokens in experiment configs, Hugging Face tokens in model download scripts, database credentials in data pipeline notebooks. GitGuardian’s 2025 report [10] found 23.8 million secrets leaked in public GitHub repositories, a 25% increase over the prior year — and ML repositories are disproportionately affected because the notebook-driven, experiment-oriented workflow encourages hardcoding credentials for quick iteration. These secrets leak into container images (baked into layers during build), into training logs (printed during debugging and never scrubbed), and into checkpoint metadata (serialized alongside model state). When an agent operates autonomously — writing code, calling APIs, launching training runs — it accumulates credentials in its context window and working files with no human in the loop to notice. The agent-to-agent delegation problem makes this worse: when one agent delegates a subtask to another, what credentials does the delegate inherit, and who revokes them when the task is done? Current IAM systems have no concept of transitive, time-bounded delegation for non-human principals.
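As a rough illustration of how such leaks are detected, the sketch below scans text (a notebook export, a training log, a Dockerfile) for two credential shapes. The AWS access key ID prefix `AKIA` followed by 16 uppercase alphanumerics is a well-known format; the Hugging Face pattern is an approximation based on the `hf_` prefix, with the minimum length chosen here as an assumption. Production scanners use far larger rule sets plus entropy checks.

```python
import re

# Illustrative patterns only; not a complete rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "huggingface_token": re.compile(r"\bhf_[A-Za-z0-9]{30,}\b"),
}


def find_secrets(text: str) -> list[tuple]:
    """Return (line number, pattern label) for each line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

The same function can be pointed at container image layers, log archives, or checkpoint metadata once those are extracted to text, which covers the three leak paths named above.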