Context Window Integrity Under Adversarial Length

A model’s context window is a fixed-size resource. The serving harness fills it by concatenating, in priority order, a system prompt, developer instructions, retrieved context (from RAG or tool outputs), and the user message. When the user message or retrieved context is long enough, something has to be truncated — and in most deployed systems, truncation is last-in-wins or proportional, not priority-respecting. An adversary who controls message length (trivially: just send a long message) or who can influence retrieved-context length (less trivially: poison a document store with padding) can force eviction of the system prompt. Once the system prompt is gone, so are the behavioral constraints it encodes.

This is a resource-scheduling problem in disguise. The system prompt and developer instructions are hard reservations; user input and retrieved context are elastic. The invariant is simple to state: the system prompt occupies a reserved prefix of guaranteed minimum size, and truncation of lower-priority sections never reduces the system-prompt allocation. This is a monotonicity property — increasing the length of a lower-priority section can only shrink other lower-priority sections, never the reserved prefix. It is exactly the kind of property that tools like Dafny and Frama-C are built to verify, because it reduces to an invariant over a bounded integer program (token counts, priority levels, allocation sizes).

The reason this has not been done is that context assembly logic is typically embedded in application code — a few dozen lines of Python in a LangChain pipeline or a custom harness — and nobody has extracted a specification from it. But the specification is short, the code is short, and the proof obligation is a loop invariant. This is a low-effort, high-value verification target in the serving stack.

Solution/project Sketch

Model context allocation as a resource-bounded scheduler with a fixed budget of tokens and priority levels. Each level has a minimum reservation with . Surplus tokens are distributed to lower-priority levels in priority order. Specify the allocator in Dafny or Frama-C and prove two properties: (1) reservation guarantee — for all inputs, every priority level receives at least tokens, and (2) monotonicity — increasing the requested length of level does not decrease the allocation of any level with (higher priority).

Extract the verified allocator as a standalone function that existing harnesses can call to compute truncation boundaries before assembling the context. The function takes token counts per section and returns per-section truncation limits. Integration is a one-line change in any harness that currently does its own ad-hoc truncation.

Bibliography

    // stack context

    [execution-harness] Execution Harness

    Execution Harness The execution harness is the outermost software layer of an ML deployment: the wrapper that actually calls the model and handles its inputs and outputs. A user’s prompt enters here, gets tokenized and batched, passes through the model, and returns as a completion. Because every int

    ← all problems