AI agents today can browse the web, retrieve information, and act on their own. Useful, yes, but it also opens new windows for attackers to influence what the AI does. What does that mean for you?
What the risk is: not just a malicious string
Have you heard of prompt injection? It refers to malicious instructions hidden in external content, intended to make the model do something the user didn't ask for. At first, attacks were relatively simple: edit a page and drop a direct command to confuse the agent.
But things changed. In the real world these tactics started to look more like social engineering than a single malicious line of text. It's no longer just about spotting a dangerous phrase; the AI navigates contexts where information can be misleading or manipulative.
In practice, defense can't rely only on filtering inputs. You need to design the system to limit damage even if some attacks succeed.
Why the traditional strategy fails
Techniques like an AI "firewall" try to classify inputs as malicious or normal. Sounds logical, right? But detecting a lie or manipulation without context is as hard for a classifier as it is for a person.
Modern attacks use the same trick that works on people: push emotions, invent believable stories, impersonate others. If your defense depends only on spotting a malicious string, it'll fall behind.
How OpenAI proposes to defend: limit the impact
OpenAI suggests thinking of the agent as a customer-service worker: constantly exposed to third parties who might try to trick it. In that scenario, the key isn't to eliminate the chance of being fooled, but to reduce what the agent can do when it is manipulated.
Some measures they describe:
- Impose restrictions on sensitive actions (for example, transmitting data or making payments).
- Analyze which sources influence which "sinks": if external input can trigger a dangerous action, limit that route.
- Be transparent with the user: when an action would share data with a third party, the system asks for confirmation or blocks the transmission.
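The measures above can be sketched as a simple source-to-sink policy: if untrusted external content influenced the current step, any sensitive "sink" requires explicit user confirmation before it runs. This is a minimal illustration, not OpenAI's implementation; the action names and the `ask_user` stub are hypothetical.

```python
# Hypothetical source->sink gate: sensitive actions touched by external
# input must be confirmed by the user before execution.

SENSITIVE_SINKS = {"send_email", "make_payment", "post_request"}

def ask_user(prompt: str) -> bool:
    """Placeholder for a real confirmation UI; here it always denies."""
    print(prompt)
    return False

def execute_action(action: str, args: dict, tainted_by_external_input: bool) -> dict:
    # Deterministic rule: a sensitive sink reachable from external input
    # never runs silently.
    if action in SENSITIVE_SINKS and tainted_by_external_input:
        approved = ask_user(f"Agent wants to run {action} with {args}. Allow?")
        if not approved:
            return {"status": "blocked", "action": action}
    return {"status": "executed", "action": action}

result = execute_action("make_payment", {"amount": 50}, tainted_by_external_input=True)
# With the always-deny stub above, the payment is blocked.
```

The point of the sketch is that the check is structural, not a content filter: it doesn't try to decide whether the input *looked* malicious, only whether an external source can reach a dangerous sink.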
A concrete example: a mitigation called Safe Url detects when the agent might send information learned in the conversation to a third party. If that happens, it either shows the user what would be sent to ask for confirmation, or it blocks the action.
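A rough approximation of that idea (not OpenAI's actual implementation): before the agent requests a URL, scan it for data learned during the conversation, and surface or block the request if any would be transmitted. The secrets and URLs below are illustrative.

```python
# Illustrative Safe-Url-style check: does this outgoing URL carry
# information the agent learned during the conversation?

from urllib.parse import urlparse

def leaks_conversation_data(url: str, conversation_secrets: set) -> list:
    """Return the conversation secrets that would be transmitted via this URL."""
    parsed = urlparse(url)
    payload = parsed.path + "?" + parsed.query
    return [s for s in conversation_secrets if s in payload]

secrets = {"alice@example.com", "4111-1111-1111-1111"}
url = "https://evil.example/collect?email=alice@example.com"
leaked = leaks_conversation_data(url, secrets)
if leaked:
    # Real systems would show the user what would be sent, or block outright.
    print(f"Blocked: URL would transmit {leaked}")
```

A production mitigation would also handle encodings and partial matches; the sketch only shows the shape of the check.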
Practical mindset: copy what works for humans
The proposal isn't magic. It's inspired by how we handle social engineering risk in people:
- Define clear rules for the agent (what it can and cannot do).
- Put deterministic limits on the most dangerous actions.
- Detect signals and request human intervention when risk rises.
That way, even if the AI is very smart, the system limits the potential damage of a successful manipulation.
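Those three human-style controls (clear rules, deterministic limits, escalation on risk signals) can be combined in one small review function. The allowed actions, signal names, and threshold here are made up for illustration.

```python
# Sketch of rules + hard limits + human escalation for an agent action.

ALLOWED_ACTIONS = {"search", "summarize", "send_email"}  # explicit rules;
                                                         # payments are simply
                                                         # not on the list

def review_action(action: str, risk_signals: list) -> str:
    if action not in ALLOWED_ACTIONS:
        return "deny"                  # deterministic limit: hard stop
    if len(risk_signals) >= 2:
        return "escalate_to_human"     # signals piling up: require a person
    return "allow"

print(review_action("make_payment", []))                       # deny
print(review_action("send_email", ["new_domain", "urgency"]))  # escalate_to_human
print(review_action("search", []))                             # allow
```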
What this means for developers and users
If you integrate models into an application, ask yourself: what controls would a human agent have in this situation? If your answer involves reviewing, confirming, or limiting, implement those for the AI.
Not all effort should go into building perfect filters. An important part is designing the architecture so that, even if a malicious input slips through, the agent can't do significant harm without validation.
Research and defenses continue. OpenAI combines this social approach with traditional security practices and model training, so that dangerous decisions don't happen silently or automatically.
A practical look toward the future
The short takeaway: attacks are evolving toward social engineering, and defenses must evolve too. Limiting capabilities, requiring explicit confirmations, and applying structural controls is more effective than relying only on detecting malicious strings.
Worried about how this affects your product? Start today by mapping sensitive actions and deciding what requires human verification. It's a small investment compared to the risk of a leak or a harmful action.
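That mapping can start as something as small as an explicit policy table: each agent action paired with the verification it requires, with unknown actions defaulting to the strictest control. The actions and tiers below are hypothetical.

```python
# Hypothetical starting point: map each agent action to a verification tier.

ACTION_POLICY = {
    "read_docs":    "auto",           # reversible, no third party involved
    "send_email":   "user_confirm",   # shares data with a third party
    "make_payment": "human_review",   # irreversible, highest bar
}

def required_check(action: str) -> str:
    # Anything not explicitly mapped gets the strictest control by default.
    return ACTION_POLICY.get(action, "human_review")
```

Defaulting unknown actions to the strictest tier mirrors the article's point: the architecture, not a filter, decides what can happen without validation.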
Original source
https://openai.com/index/designing-agents-to-resist-prompt-injection
