Tractable Problems in AI Security via Formal Methods

Adversarial Robustness of Formal Methods

Agda’s issue label “false” on GitHub sits at the time of this commit at 10 open and 76 closed. Agda’s issue label “false” tracks the proofs of false that Agda allows or has allowed. Rocq runs the same play under a different name: the critical label is reserved for proofs of false. One asks, “isn’t the whole point of a type theory that it be sound?”

So you see we have a problem. If ITP and other FM tools are not adversarially robust, scheming or reward hacking AIs will readily leverage novel zerodays to violate security properties. ARIA’s Safeguarded AI: Cybersecurity solicitation [1] names a “verified proof-checking kernel” robust to adversarial AI proof generation as one of its example targets, framing the proof checker as the single point of trust if AI agents are to produce verified artefacts at scale — the same enabler, arrived at independently.

[2] discusses some of this to set up the Lean kernel arena. TODO: elaborate.

Broadly, soundness issues in proof assistants arise from the tension between expressivity and ease of use on one side and consistency on the other. Every concession to ergonomics — definitional extensions, universe polymorphism, coinduction, native compilation of the reduction machine, opaque conversion shortcuts — is a place where the kernel can drift away from the logic it was supposed to implement.

TODO: zero in on Lean specifically, (James Henson?), discuss governance challenges and the slippery definition of the problem.

Solution/project Sketch

Soundness bugs often arise due to the extreme complexity of a dependent type theory, who’s proof checker we call a kernel ( 5k LoC in Lean, 10-30k LoC in Rocq). One idea would be to express the language of the ITP, even the dependently typed one, in a simpler theory with a smaller kernel. So you could, in principle, simply write a model of your complicated dependent ITP in a simple, well-trusted target and check proofs there.

What you want is a system in the LCF tradition (Milner’s Logic of Computable Functions), whose defining discipline is that theorems are an abstract datatype and the only way to manufacture one is through a small fixed set of inference rules. That discipline is what makes the trusted core small, and it’s preserved across systems with otherwise quite different logics: HOL Light [3] and HOL Zero [4] (classical higher-order logic), Candle [5] (HOL Light verified down through CakeML), and Milawa (a bootstrapped first-order prover in the ACL2 lineage [6]). HOL Zero gets the trusted core down to a few hundred lines. Push further — verify the logic down to a large cardinal axiom, plus a model of the hardware, plus a small wrapper for actually running things — and rough estimates put the state of the art around 10k lines of spec.

One approach, then, is to pick something simple and well-trusted — ZFC or a variant, or a minimal LCF-style higher order logic, augmented with a large cardinal / large universe axiom — and reduce a working ITP all the way down to that. Useful prior art and reference points: lean4lean (a verifier for Lean 4 written in Lean 4) and the MetaRocq (formalised metatheory of Rocq inside Rocq).

Bibliography

[1] Advanced Research and Invention Agency, “Safeguarded AI: Cybersecurity — Call for Proposals.” [Online]. Available: https://aria.org.uk/media/1padxpaf/safeguarded-ai_cybersecurity_solicitation.pdf
[2] L. de Moura, “Who Watches the Provers?.” [Online]. Available: https://leodemoura.github.io/blog/2026-3-16-who-watches-the-provers/
[3] J. Harrison, “HOL Light.” [Online]. Available: https://hol-light.github.io/
[4] M. Adams, “HOL Zero.” [Online]. Available: http://www.proof-technologies.com/holzero/
[5] CakeML Project, “Candle: A Self-Compiling Verification of HOL Light.” [Online]. Available: https://cakeml.org/candle/
[6] J. Davis, “The Milawa Rewriter and an ACL2 Proof of its Soundness,” in Proceedings of the Seventh International Workshop on the ACL2 Theorem Prover and Its Applications (ACL2 '07), Austin, Texas, USA, Oct. 2007. [Online]. Available: https://kookamara.com/jared/milawa/Documentation/Rewriter/rewrite.pdf