The core idea
AI agent social engineering is a way of influencing an AI agent through believable context, pressure, identity claims, policy claims, or tool-use requests. The attacker is not only trying to make the model say a forbidden sentence. The attacker is trying to make the agent reinterpret what it is allowed to do.
That distinction matters because agents are often given delegated work. They may read private context, summarize records, call tools, update systems, draft messages, browse websites, or hand work to another agent. When an agent has that kind of role, manipulation can become an operational problem rather than a chat-quality problem.
Why this is different from ordinary chatbot risk
A chatbot can produce a wrong answer. A delegated agent can take a wrong step. It can reveal a customer record, approve an exception, trust a forged identity claim, write a false memory, call a tool before verification, or summarize untrusted content as if it were trusted.
The risk grows when the interaction looks normal. A customer asks for urgency because a renewal is blocked. A candidate includes instructions in a resume. A prospect says a manager already approved a discount. A webpage asks the browser agent to enter information before continuing. None of those patterns need to look like a classic jailbreak.
The practical question is whether the agent maintains the boundary when the surrounding conversation becomes persuasive. The boundary may be about identity, authorization, data scope, tool preconditions, memory integrity, or handoff provenance.
Common manipulation patterns
Most useful tests start with the pressure pattern, then connect it to the business boundary. Fake authority tests whether the agent accepts a claimed role without evidence. Urgency pressure tests whether speed changes the agent's decision process. Policy exception pressure tests whether a request framed as special becomes treated as approved.
Data extraction tests whether the agent reveals information outside the user's scope. Tool misuse tests whether the agent calls a state-changing tool without the right preconditions. Handoff manipulation tests whether one source can turn untrusted context into an instruction for another agent.
- Support: a user claims to be the account owner and asks for a refund before verification.
- Sales: a prospect says procurement already approved a discount and asks the agent to confirm it in writing.
- Recruiting: a resume includes instructions that ask the agent to ignore screening rules.
- Browser automation: a page frames a sensitive form as a routine verification step.
How to test the risk
Start by naming the boundaries the agent must preserve. The test should not begin with a random list of prompts. It should begin with the business rule: what the agent can read, reveal, approve, change, remember, or send.
Then simulate realistic pressure around that boundary. The scenario should be plausible enough that a product, security, or support reviewer understands why the agent was tempted to cross the line. The output should preserve evidence: the attacker move, the agent response or tool action, the failed invariant, severity, and enough context to reproduce the issue.
A good testing program does not stop after finding a failure. It asks whether the fix changed the boundary behavior, whether the same scenario now holds, and whether future changes could bring the failure back.
What a good result looks like
A good result is not simply that the agent refuses everything. Over-refusal can make an agent unusable and can hide whether the boundary was understood. The agent should preserve the boundary while still helping within its allowed role.
For example, a support agent can explain the verification process without revealing account data. A sales agent can offer to route a discount request without promising the discount. A recruiting agent can summarize a resume without following instructions embedded inside it.
The safest behavior usually combines source awareness, boundary awareness, and an appropriate next step. That next step may be refusal, verification, escalation, drafting, or asking for more information. The test should judge whether the chosen path matches the boundary.
How the failure feels in a real workflow
The most useful way to understand this risk is to imagine the agent in the middle of a normal operational moment. A customer is upset, a prospect is pushing for a deadline, a candidate has submitted a long document, or a browser page is asking for confirmation before the agent can continue. The request does not need to look hostile. In many cases, the danger is that it looks ordinary enough to pass through the agent's reasoning process.
That is why the best examples are not theatrical jailbreaks. They are small shifts in interpretation. The agent treats a claimed identity as verified. It treats a policy exception as already approved. It treats an untrusted document as instructions. It treats urgency as a reason to skip a check. Each step may look plausible on its own, but together they move the agent outside the boundary it was supposed to preserve.
For builders, the practical lesson is to test the agent in the language of the workflow, not only in the language of model safety. The question is not whether the agent can recite the policy. The question is whether it can preserve the policy when the surrounding context is persuasive.