When, Not If: What a Proof Leaves Open
A new NIST result proves that no fixed set of guardrails can hold against every adversarial prompt. It is rigorous, and it is not the end of the argument. It is the beginning of a different one — about what happens after the guardrail fails.
This month, in IEEE Security & Privacy, Apostol Vassilev, a senior scientist at NIST, published a mathematical proof extending Gödel’s incompleteness logic to artificial intelligence. The finding, stated plainly: no finite set of guardrails is universally robust against adversarial prompts. For any defensive system built on a bounded set of rules, there exists a prompt that defeats it. It is only a matter of finding it.
The proof is careful and its author is measured about what follows from it. Vassilev does not claim the problem is hopeless; he proposes a discipline — constant red-teaming to find new exploits first, continuous hardening against the ones discovered, and operational resilience built on the assumption that an exploit will eventually succeed. The phrase that organizes the whole approach is when, not if.
Sit with that phrase, because it carries more than it first appears to.
The line the proof draws
For most of the history of computer security, the implicit goal has been prevention: build the wall high enough and the attacker stays out. The NIST result formalizes what practitioners have long sensed: against a sufficiently rich input space, the wall is never high enough in principle, not merely in practice. There is always a door.
This does not make defense pointless. Hardening still raises the cost of attack, and raising that cost is a legitimate and valuable goal. But the proof relocates the conversation. If failure is not a probability to be minimized toward zero but a certainty to be planned for, then a question that prevention could previously defer becomes unavoidable: when the prompt does succeed — and it will — what independent record exists of what the system actually did?
That is a different question from the one the proof answers, and the proof does not answer it. It was never meant to.
Three jobs, not one
It helps to separate what have quietly become three distinct jobs.
Prevention keeps the attacker out. Guardrails, alignment, input filtering. The NIST proof tells us this job is bounded: necessary, valuable, and never complete.
Containment limits the damage once an attacker is in. Sandboxing, rate limits, blast-radius reduction, quick recovery. This is most of what “operational resilience” refers to, and it is where much of the thoughtful discussion around the proof has landed.
Evidence is the third job, and it is the one with no clear owner. When a system is successfully steered into doing something it should not have, containment may stop the bleeding — but it produces no neutral account of what occurred. The system’s own logs are written by the system under attack, and read, later, by the party defending itself with them. That is not independent evidence. It is the interested party’s version.
Prevention and containment both operate from inside the system being defended. Evidence, to be worth anything when it is finally needed — by a regulator, a counterparty, an insurer, a court — has to come from somewhere the compromised system does not control. It has to be observed from the receiving side, as it happens, and sealed so it cannot be revised after the fact.
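To make "sealed so it cannot be revised after the fact" concrete, here is a minimal sketch of one common technique, a hash chain, in which each record's hash covers the record before it. It is an illustration under stated assumptions, not a description of any particular product or of anything in the NIST paper: the function names (seal, append, verify), the record layout, and the GENESIS placeholder are all hypothetical. It also covers only the sealing half. Independence still depends on who runs the recorder and who holds the latest hash.

```python
import hashlib
import json

# Placeholder "previous hash" for the first record (an assumption of this sketch).
GENESIS = "0" * 64


def seal(prev_hash, observation):
    """Build a record whose hash covers the observation and the previous hash."""
    body = json.dumps(observation, sort_keys=True)  # stable serialization
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    return {"prev": prev_hash, "body": body, "hash": digest}


def append(chain, observation):
    """Append an observation, linking it to the current head of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(seal(prev_hash, observation))
    return chain


def verify(chain):
    """Recompute every hash in order; any edited or reordered record fails."""
    prev_hash = GENESIS
    for record in chain:
        expected = hashlib.sha256(
            (prev_hash + record["body"]).encode("utf-8")
        ).hexdigest()
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True
```

A chain like this is tamper-evident only against someone who cannot rewrite all of it. A party that controls the entire log could recompute every hash after an edit, which is why the latest hash would need to be held or published somewhere the observed system does not control.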
The corollary the proof invites
Here is the connection, stated carefully so as not to claim more than is warranted. The NIST proof is about prevention: it shows the guardrail is incomplete. It says nothing about evidence; that is not its subject. But the two share a structure worth noticing.
The reason a guardrail cannot certify its own universal robustness is, loosely, that a system cannot fully account for itself from within itself — the Gödelian move the proof makes precise. The reason a compromised system’s logs cannot serve as neutral evidence of its own conduct is the same shape of problem, one layer over: the system under attack is not a trustworthy witness to what was done to it. In both cases the missing certainty has to be imported from outside the system in question.
If that reading holds, then the proof’s when, not if has a quiet second half. When the failure comes, the only account that survives scrutiny is the one recorded from outside — independently, at the moment, by a party with no stake in the outcome.
What this note claims, and what it does not
It does not claim NIST’s proof endorses any particular approach to evidence. The proof is about the limits of guardrails; the move from there to receiver-side evidence is an inference drawn here, not a conclusion of the paper, and the distinction matters.
It does not claim that evidence prevents anything, because it does not. Prevention reduces frequency; containment limits damage; evidence does neither. It only ensures that when the inevitable occurs, there is a record of it that can be trusted by someone who was not there and has reason to doubt every interested party.
What it does claim is narrow and, we think, hard to escape: a rigorous proof has just established that for any finite set of guardrails a defeating prompt exists, which makes failure something to plan for rather than a risk to be driven toward zero. The industry has spent its attention on preventing and containing that failure. Almost none of that attention has gone to the question of who holds the neutral record once the failure has occurred. Three jobs. The third has no owner yet.
The wall will be breached; the proof says so. The only open question is whether anyone independent was watching when it happened.
This research note discusses publicly available work: Apostol Vassilev, “Robust AI Security and Alignment: A Sisyphean Endeavor?”, IEEE Security & Privacy, May 2026. The interpretation and the framing of evidence as a distinct discipline are the Observatory’s own and should not be attributed to NIST or the paper’s author.
BotConduct — independent behavioral observatory. Evidence, not enforcement.