Tractable Problems in AI Security via Formal Methods

OCI Runtime Hardening

The standard container runtime, runc, implements the OCI specification atop Linux namespaces and cgroups. It was never designed to resist an adversary inside the container, and the CVE record reflects this. When the workload in question is an AI agent with tool use (i.e. one that can write and execute arbitrary code), the gap between “process isolation” and “security boundary” becomes the whole game.

Recent benchmark work has started to quantify the gap. [1] measures frontier LLM capabilities for container sandbox escape directly, treating the model as a red-team operator. The UK AISI sandboxing toolkit [2] and its Docker Sandbox Breakout CTF [3] provide standardized escape scenarios built on Inspect AI [4]. [5] introduces BashArena, a control setting specifically for highly privileged agents. BoxArena [6] inverts the usual offensive framing: instead of asking “can this agent escape?”, it fixes the attacker model and varies the OCI runtime, producing a defensive leaderboard. Five attack surfaces (filesystem, socket, process, network, syscall) are tested across runtimes like runc, gVisor’s runsc, crun, and Kata Containers.

The pattern across all of these is convergent: empirical red-teaming finds holes, patches get applied, and next quarter a new escape surfaces. A sufficiently capable agent should be assumed to find any flaw in its containment. This is where formal methods enters.

Solution/project Sketch

What would a formally verified OCI runtime look like? The lineage runs through the CLInc stack [7] (end-to-end verified from gate-level hardware through compiler to application) and seL4 [8] (functional correctness of a microkernel, machine-checked in Isabelle/HOL). A verified OCI runtime would prove that the namespace and capability isolation enforced by the runtime matches a formal policy – that no sequence of syscalls from inside the container can violate the confinement invariant.

This is a large artifact, but it decomposes into tractable pieces: a formal model of the relevant Linux ABI surface (the 80 syscalls a hardened container actually needs), a specification of the confinement policy (drop all capabilities, read-only rootfs, seccomp allowlist, no-new-privileges), and a proof that the runtime’s state machine preserves the policy across all reachable states.

The AI control literature [9], [10] adds another dimension: even with a verified runtime, you want a monitor – a second model or policy that reviews agent actions before they reach the kernel. The combination of a verified isolation boundary and a scalable monitor protocol is stronger than either alone, and the monitor’s job gets easier when the boundary it sits behind has a proof.

Bibliography

[1] R. Marchand et al., “Quantifying Frontier LLM Capabilities for Container Sandbox Escape,” arXiv preprint arXiv:2603.02277, 2026.
[2] UK AI Safety Institute, “AISI Sandboxing Toolkit.” 2025.
[3] UK AI Safety Institute, “Docker Sandbox Breakout.” 2026.
[4] UK AI Safety Institute, “Inspect AI.” 2025.
[5] A. Kaufman, J. Lucassen, T. Tracy, C. Rushing, and A. Bhatt, “BashArena: A Control Setting for Highly Privileged AI Agents,” arXiv preprint arXiv:2512.15688, 2025.
[6] Q. Dougherty, A. Singh, and M. von Hippel, “BoxArena: An OCI Runtime Escape Benchmark.” 2026.
[7] W. R. Bevier, W. A. Hunt, J. S. Moore, and W. D. Young, “An Approach to Systems Verification,” Journal of Automated Reasoning, vol. 5, no. 4, pp. 411–428, 1989.
[8] G. Klein et al., “seL4: Formal Verification of an OS Kernel,” in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP), 2009, pp. 207–220.
[9] R. Greenblatt, B. Shlegeris, K. Sachan, and F. Roger, “AI Control: Improving Safety Despite Intentional Subversion,” in International Conference on Machine Learning (ICML), 2024.
[10] A. Bhatt et al., “Ctrl-Z: Controlling AI Agents via Resampling,” arXiv preprint arXiv:2504.10374, 2025.