Research/June 2026 Report

35 min read

Report contents

Start

Executive SummaryWho This Report Is ForHow To Use This ReportKey Terms

Threat Model

1. AI Agents Are Becoming Delegated Actors2. What Is AI Agent Social Engineering?3. Why This Is Bigger Than Prompt Injection4. What The Research Experiments Show5. The Manipulated Delegation Chain6. Common Boundary Failures7. Why Existing Controls Are Not Enough

Defense Model

8. A Practical Defense Model9. The Agent Social-Engineering Testing Pyramid10. Maturity Model

Implications

11. Limits And Open Questions12. Where The Industry Is Headed13. What This Means For Builders14. Roleplay's Point Of View15. Methodology And Source Note

Reference

16. FAQ17. Sources Referenced

Manipulated Delegation: The AI Agent Social Engineering Report

A research report on how AI agents can be socially engineered into crossing business boundaries.

Published by Roleplay
Published: June 2026

Executive Summary

AI agents are becoming delegated actors. They do not only answer questions. They read private context, call tools, browse websites, update systems, write memory, and hand work to other agents. That shift changes the security problem.

The question is no longer only whether a model can be prompted badly. The deeper question is whether an attacker can manipulate delegated authority.

Roleplay calls this risk manipulated delegation: a failure pattern where an AI agent is persuaded, pressured, confused, or contextually misled into treating an unauthorized request as if it were inside its role.

Prompt injection is one important mechanism. But AI agent social engineering is broader. It includes fake authority, urgency pressure, policy exception pressure, deceptive web context, tool misuse, data exfiltration, memory poisoning, and multi-agent handoff risk. The common pattern is that untrusted influence changes what the agent believes it is allowed to do.

The research frontier is moving quickly:

The practical implication is straightforward: agent trust cannot be inferred from model quality, system-prompt wording, or one-time review. It has to be tested under realistic pressure, supported by evidence, fixed at the boundary, verified after the fix, and monitored for regression.

The report makes five claims:

  • AI agents should be evaluated as delegated actors, not only as language models.
  • Social engineering against agents targets authority interpretation: identity, urgency, policy, trust, and permission.
  • The research signal now spans multi-turn conversation, browser automation, tool calls, execution traces, skills, MCP boundaries, multi-agent systems, and voice simulation.
  • Practical defense depends on evidence, pre-action checks, least privilege, realistic simulation, fix verification, and regression gates.
  • The field is moving from prompt security toward delegated authority assurance.

The practical loop is:

  1. 01Define Boundaries
  2. 02Simulate Realistic Attacks
  3. 03Capture Evidence
  4. 04Fix
  5. 05Verify
  6. 06Gate Regressions

This report is written for founders, AI engineers, product leaders, security teams, and CISOs who are building or evaluating AI agents. Its goal is to explain the threat, summarize the current research signal, describe practical defenses, and outline where the industry is headed.

Who This Report Is For

This report is for:

  • AI builders shipping agents that interact with customers, candidates, prospects, vendors, employees, or web environments.
  • Product and engineering leaders who need to decide whether an agent is safe enough to launch.
  • Security teams reviewing agents that can access sensitive data or take meaningful actions.
  • Founders and consultants building AI workflows for customer support, sales, recruiting, internal operations, or browser automation.
  • Researchers studying the intersection of social engineering, tool-using agents, prompt injection, authorization, and multi-agent systems.

This report is not a playbook for attacking real systems. Examples are included only to explain defensive testing patterns. Organizations should test only agents and environments they are authorized to assess, using synthetic data and safe tooling.

How To Use This Report

Use this report in three ways:

  1. As a threat model: identify where untrusted people, documents, webpages, tools, memory, or peer agents can influence a delegated AI system.
  2. As a testing model: translate business boundaries into scenarios, evidence requirements, judges, fix verification, and regression gates.
  3. As a research map: connect agent social engineering to prompt injection, excessive agency, pre-action authorization, structural trace detection, browser-agent safety, supply-chain safety, and multi-agent security.

If you are building an agent now, start with the checklist near the end. If you are designing a security program, start with the maturity model. If you are building tooling, start with the defense loop and evidence model.

Key Terms

AI Agent Social Engineering

AI agent social engineering is the manipulation of an agent's interpretation of authority, identity, urgency, policy, context, or user intent in order to cause unsafe disclosure, unsafe action, unsafe memory, unsafe tool use, or unsafe delegation.

Delegated Actor

A delegated actor is an AI system that has been given a role, task, tool set, memory, context, or authority to act on behalf of a user, team, or organization. A delegated actor does not merely respond. It interprets what should happen next.

Manipulated Delegation

Manipulated delegation is the failure pattern where untrusted influence causes an AI agent to reinterpret its delegated authority and treat an unsafe request as authorized, ordinary, urgent, or policy-compliant.

Protected Boundary

A protected boundary is a business, data, identity, tool, memory, or delegation rule the agent must preserve. Examples include not issuing refunds without verification, not exposing customer-confidential data, and not treating untrusted documents as instructions.

Exploit Proof

Exploit proof is the evidence that shows a protected boundary failed. It should include the scenario, attacker move, failed response or tool action, violated invariant, severity, and enough context to reproduce and fix the issue.

Regression Gate

A regression gate is a repeatable check that prevents a known agent failure from quietly returning after prompts, models, tools, policies, retrieval sources, or workflows change.

Delegated Authority Assurance

Delegated authority assurance is the discipline of proving that an AI agent preserves the authority boundaries it was given, even under social pressure, untrusted context, tool access, and changing deployment conditions.

Table Of Contents

  1. AI Agents Are Becoming Delegated Actors
  2. What Is AI Agent Social Engineering?
  3. Why This Is Bigger Than Prompt Injection
  4. What The Research Experiments Show
  5. The Manipulated Delegation Chain
  6. Common Boundary Failures
  7. Why Existing Controls Are Not Enough
  8. A Practical Defense Model
  9. The Agent Social-Engineering Testing Pyramid
  10. Maturity Model
  11. Limits And Open Questions
  12. Where The Industry Is Headed
  13. What This Means For Builders
  14. Roleplay's Point Of View
  15. Methodology And Source Note
  16. FAQ
  17. Sources Referenced

1. AI Agents Are Becoming Delegated Actors

The first wave of LLM applications mostly produced text. They summarized, drafted, translated, classified, and answered. Security discussions therefore centered on text behavior: jailbreaks, unsafe completions, hallucination, prompt injection, and output filtering.

AI agents are different. They pursue goals through multi-step interaction with tools and environments. A support agent may read account history, decide whether a policy applies, draft a reply, and call an internal API. A sales agent may inspect a CRM opportunity, analyze a procurement request, and propose a discount path. A recruiting agent may parse resumes, summarize interviews, and route candidates. A browser agent may visit websites, fill forms, download documents, and submit information on behalf of a user.

That makes agents closer to delegated actors than passive features. A company gives an agent a role, a task, a tool set, memory, context, and some degree of authority. The agent then interprets what to do.

This creates a new class of security questions:

  • What authority does the agent believe it has?
  • Whose intent is the agent serving?
  • Which inputs are trusted instructions and which are untrusted data?
  • Which actions require authorization, confirmation, or escalation?
  • Which sources can affect memory?
  • Which tool calls are allowed under which conditions?
  • Can a persuasive user, document, webpage, ticket, or peer agent cause the agent to reinterpret its boundaries?

These are not only model questions. They are delegation questions.

This is consistent with broader agent-security research. One agent-security survey frames current agents as systems that act through tools and environments, where each action can introduce system-level security effects. A related agent threat-modeling survey categorizes agent-security gaps around unpredictable multi-step user input, complex internal execution, variable environments, and untrusted external entities. Those gaps are exactly where social engineering tends to enter.

2. What Is AI Agent Social Engineering?

Traditional social engineering targets people. It exploits trust, fear, authority, urgency, reciprocity, curiosity, fatigue, or procedural ambiguity. AI agent social engineering targets a machine actor that has been delegated some business role.

Recent social-engineering research helps explain why this matters now. A review of generative AI in social engineering describes how generative AI amplifies social engineering through realistic content creation, advanced targeting and personalization, and automated attack infrastructure. A related AI-powered fraud paper similarly frames AI-powered social engineering as moving through larger scale, richer attack vectors, and new attack modes. The agent-specific shift is that the manipulated party may no longer be only a human employee. It may be the delegated software actor itself.

The attacker may try to make the agent:

  • reveal sensitive information;
  • accept a false identity or authorization claim;
  • treat an exception as already approved;
  • call a tool without required preconditions;
  • summarize untrusted content as trusted fact;
  • store a false premise in memory;
  • route a manipulated request to another system;
  • submit information through a deceptive web flow;
  • take a state-changing action without confirmation.

The protected asset is not only the model's response. It is the business boundary the agent is supposed to preserve.

Examples of agent boundaries include:

  • Data boundary: do not reveal customer-confidential information to an unauthenticated party.
  • Action boundary: do not issue a refund without policy eligibility and ownership verification.
  • Identity boundary: do not treat someone as an admin, executive, hiring manager, or account owner without evidence.
  • Tool boundary: do not call a state-changing tool unless source, scope, and confirmation requirements are satisfied.
  • Memory boundary: do not store untrusted claims as durable facts.
  • Delegation boundary: do not allow one agent, document, or webpage to expand another agent's authority.

Manipulated delegation occurs when one of these boundaries fails because the agent accepts the wrong premise about what it is allowed to do.

3. Why This Is Bigger Than Prompt Injection

Prompt injection is a real and central risk. The OWASP Top 10 for LLM Applications and OWASP LLM01:2025 Prompt Injection treat prompt injection as a leading risk, and the OWASP prompt-injection cheat sheet highlights a core design problem: instructions and data often share the same natural-language channel.

But agent social engineering is broader than prompt injection.

Prompt injection focuses on malicious instructions. Manipulated delegation focuses on the agent's working interpretation of authority. A successful attack may include explicit instructions, but it may also work through social context that looks ordinary:

  • "I am the account owner."
  • "The hiring manager already approved this."
  • "This is urgent and the customer is about to churn."
  • "The document says this is the official procedure."
  • "This website is asking for verification before continuing."
  • "Compliance needs this exported by end of day."

The failure is not always that the agent obeyed a hostile string. Sometimes the failure is that it accepted a social premise it should have challenged.

This is why OWASP's Excessive Agency category matters for agents. Excessive permissions, excessive functionality, and excessive autonomy make manipulation consequential. A text-only model can produce a bad answer. A tool-using agent can change a record, send an email, submit a form, disclose private data, or trigger a workflow.

The useful distinction is:

Prompt injection asksAgent social engineering asks
Did untrusted text override instructions?Did the agent preserve delegated authority under pressure, ambiguity, and untrusted context?
Did the model follow the wrong instruction?Did the agent accept a false premise about identity, authority, policy, or intent?
Did the response become unsafe?Did the system cross a protected business boundary?

That second question is the harder one.

4. What The Research Experiments Show

The strongest reason to take this category seriously is that the research is no longer only speculative. Several cited studies run experiments with agents, simulated victims, web automation frameworks, tool-using environments, or large-scale red-team settings. The examples below are paraphrased from the papers and included for defensive education, not as attack instructions.

Research AreaExample SignalProduct Implication
Multi-turn conversationMulti-turn simulations model social engineering across more than 1,000 simulated conversationsScore boundary weakening before final leakage
Browser automationWeb-agent experiments report high success rates against web agents using ordinary-looking inducement contextsTest environment-intent consistency, not only prompt text
Deployment red teamingDeployment red-team results report policy violations across realistic tool-using scenariosEvaluate agents against their actual business policies
Tool authorizationPre-action authorization work reports low-latency checksPut high-risk actions behind deterministic gates
Execution tracesStructural trace work shows value beyond transcript textPreserve tool calls, arguments, observations, and timing
Agent ecosystemsAgent ecosystem research treats skills and MCP resources as security surfacesReview semantic instructions, permissions, provenance, and resource boundaries
Voice simulationAdaptive voice-phishing simulations model adaptive voice phishing educationInclude real-time pressure and multimodal channels in future test plans

4.1 Multi-Turn Attacks Can Progress Before Any Obvious Leak

Multi-turn simulation work introduces SE-VSim, an LLM-agentic framework for generating multi-turn social-engineering conversations. The paper describes attackers posing as recruiters, funding agencies, or journalists and attempting to extract sensitive information across more than 1,000 simulated conversations.

The useful vignette is the shape of the interaction. The attacker does not need to ask for the protected information immediately. The simulated conversation can first create a plausible relationship, establish a reason for the exchange, and make the eventual request feel normal. The same work also models victim traits, which matters because social engineering is often adaptive rather than fixed.

The important finding is not simply that sensitive information can leak. It is that social engineering often develops before the final violation. A detector focused only on the last disclosure can miss the earlier boundary weakening: credibility accepted too early, policy framed too loosely, or sensitive context normalized before the final request.

For AI agents, that insight matters. A support agent that starts accepting an unauthenticated user's ownership claim has already weakened the boundary, even if it has not yet revealed account data. A recruiting agent that accepts a candidate-supplied instruction as evaluation guidance has already changed its decision context, even if the final ranking has not yet moved.

Defensive takeaway: agent evaluations should score boundary weakening, not only final leakage.

4.2 Web Automation Agents Can Be Manipulated By Page Context

Web-agent experiments introduce AGENTBAIT, a research framework for social engineering against web automation agents. The study tests how web agents respond to inducement contexts such as trusted identity cues, urgency, rewards, social proof, and workflow-integrated prompts.

Those experiments report an average attack success rate of 67.5% across tested mainstream web-automation frameworks, with peaks above 80% under some strategies such as trusted identity forgery. The same work also proposes SUPERVISOR, a runtime module that checks whether the web environment and the user's intended goal remain consistent before unsafe operations proceed. The reported mitigation reduced attack success rates by up to 78.1% on average while adding 7.7% runtime overhead.

This result expands the idea of "untrusted input." The risk is not only hidden text that says "ignore your instructions." The risk can be a page that looks like a plausible step in the workflow: a verification prompt, a countdown, a trusted badge, an offer, a required form, or a request that appears integrated into the task.

The product lesson is concrete: a browser-agent test should not only ask whether the model obeys a malicious string. It should replay the web context, preserve page evidence, compare the page's demand against the original user goal, and score whether the agent took an unsafe step because the environment looked persuasive.

Defensive takeaway: browser agents need environment-intent checks, not only prompt filters.

4.3 Large-Scale Red Teaming Shows Agent Policy Failures Are Persistent

Deployment-scale red-team results report a large public red-team competition against 22 frontier agents across 44 realistic deployment scenarios. Participants submitted 1.8 million attacks. The same paper reports more than 60,000 successful policy violations, including unauthorized data access, illicit financial actions, and regulatory noncompliance.

Two insights stand out. First, realistic deployment context matters. The agents were not only answering generic prompts; they operated with tools, memory, policies, and simulated environments. The failures therefore look closer to product failures than benchmark mistakes: the agent violates a deployment rule while acting inside a scenario that resembles actual work. Second, the paper reports limited correlation between agent robustness and model size, capability, or inference-time compute. In other words, "use a stronger model" is not a complete defense.

This red-team result is especially important for public buyers because it translates agent risk into familiar assurance language: policies, scenarios, violations, and repeatable evaluation. That is closer to how security teams decide whether a system can launch.

Defensive takeaway: agent security needs scenario-based testing against deployment policies, not only general model evaluation.

4.4 Tool Calls Need Deterministic Checks Before Execution

Pre-action authorization work frames a gap in agent architectures: agents may have credentials and tools, but no standard permission check for individual tool calls. That work proposes pre-action authorization through Open Agent Passport, which intercepts tool calls before execution and evaluates them against declarative policy.

It reports a measured median authorization decision latency of 53ms across 1,000 authorization checks. In its live adversarial testbed, it reports 4,437 authorization decisions across 1,151 sessions. Social engineering succeeded against the model 74.6% of the time under a permissive policy, while a restrictive policy blocked comparable highest-tier attempts in the reported evaluation.

The broader lesson is not that one proposed standard solves all agent security. It is that model judgment is a weak place to enforce high-risk actions. If a tool call changes state, spends money, exposes data, delegates authority, or affects another person, the decision should be checked before execution.

Defensive takeaway: high-risk tool actions need policy enforcement at the action boundary.

4.5 Execution Traces Can Reveal What Conversation Text Hides

Structural trace research studies whether detectors trained on known attack families can detect unseen agent attacks. The paper contrasts language-focused representations with structural representations that encode tool calls, arguments, observations, and execution flow.

The result changes how evidence should be collected. A transcript may look harmless while the tool sequence is unsafe. The agent may say something polite and compliant while calling a dangerous tool with the wrong argument.

For public reporting, this is a useful bridge from research to practice. A serious agent-security finding should not only say "the agent was persuaded." It should show the execution path: what premise entered, what source carried it, what tool was called, what argument crossed the boundary, and what invariant failed.

Defensive takeaway: agent security evidence should preserve tool traces, arguments, observations, and action timing, not only chat text.

4.6 Agent Skills And Tool Ecosystems Are Becoming Security Surfaces

Agent-skill security work examines malicious AI agent skills and argues that scanners must analyze both code and natural-language instructions. Its v2 draft reports evaluation on 49,592 real ClawHub skills and a 390-skill labeled benchmark, with a three-layer triage process that combines cheap checks, LLM analysis, and multi-model review.

Execution-boundary research focuses on execution boundaries for MCP servers. It argues that agent tool ecosystems can become trust-by-default environments unless resource access is declared and enforced. Its access-control framing points toward least privilege for agent tools, similar in spirit to permission systems in mobile operating systems.

These papers matter because agent behavior is no longer only inside the model or prompt. Skills, tool descriptions, plugin metadata, MCP servers, package registries, and natural-language workflow files can all affect what an agent does.

The defensive lesson is that source review has to understand both software and semantics. A skill can look harmless in code while a natural-language instruction changes how the agent interprets authority. A tool server can look useful while quietly widening what resources the agent can reach.

Defensive takeaway: agent supply chains need semantic review, permissions, provenance, isolation, and runtime enforcement.

4.7 Adaptive Simulation Is Moving Beyond Static Awareness Training

Adaptive voice-phishing simulation research studies AI-agent-orchestrated simulation for voice phishing education. It models attacker-victim interaction, victim profiles, escalation, and risk assessment. The paper reports validation using national crime statistics and a survey with 102 participants.

Voice matters because social engineering is not only textual. Phone calls, voice assistants, live support flows, and meeting agents add timing, tone, interruption, emotional pressure, and real-time decision fatigue. A voice-facing agent may need to decide whether a caller's identity claim is credible, whether urgency justifies escalation, whether to reveal account information, or whether to trigger a workflow while the caller is still applying pressure.

The relevance to business agents is therefore twofold. First, voice and multimodal agents will become another direct channel for manipulated delegation. Second, even text-based agents can learn from vishing research because the core pattern is adaptive, behaviorally plausible interaction.

This should not be read as proof that voice-agent social-engineering defense is solved. It is better read as evidence that realistic simulation is moving from static awareness content toward adaptive interaction, risk scoring, and post-scenario learning.

Defensive takeaway: realistic agent testing should model interaction over time, not only one-shot prompts.

5. The Manipulated Delegation Chain

The simplest way to understand the category is as a chain. Social engineering succeeds when untrusted influence travels far enough through the agent's reasoning and tool path to change a business outcome.

  1. Untrusted Actor Or Content
  2. Agent Interpretation Of Context
  3. Authority Or Policy Boundary
  4. Data, Tool, Memory, Browser, Or Handoff
  5. Business Impact

This chain matters because different defenses apply at different points. Source labeling helps at the untrusted-content layer. Prompting and instruction hierarchy help at the interpretation layer. Authorization, confirmation, and least privilege help at the boundary and tool layers. Evidence, replay, and regression gates help after a failure is discovered.

The category is not "the model said something bad." The category is "untrusted influence changed what the system did."

6. Common Boundary Failures

Agent social-engineering tests should be organized around protected boundaries, not only attack strings.

BoundaryFailure PatternExample Defensive Test
IdentityAgent accepts an unverified identity claimCan a user claiming to be an account owner obtain account data?
AuthorizationAgent treats approval as already grantedCan a fake manager approval cause a discount, refund, or status change?
Data scopeAgent reveals cross-user, cross-tenant, or internal dataCan a customer extract another customer's record?
Tool preconditionAgent calls a state-changing tool without required checksCan pressure cause a refund, CRM edit, email, or export before confirmation?
Policy exceptionAgent treats urgency as a reason to bypass policyCan "compliance needs this now" override a documented rule?
Source trustAgent treats untrusted content as instructionCan a resume, webpage, ticket, or document influence internal rules?
Memory integrityAgent stores an attacker-supplied premise as durable factCan a false claim persist and affect later decisions?
DelegationOne agent expands another agent's authorityCan a helper agent turn untrusted context into a trusted instruction?

For product teams, this framing is more useful than a list of attacks. It asks: which business boundary matters, how could it fail, and what evidence would prove the failure?

7. Why Existing Controls Are Not Enough

Prompt Injection Scanning Is Too Narrow

Prompt injection scanning matters, but it often focuses on instruction override. Agent social engineering may work through fake identity, urgency, policy ambiguity, or environment context without an obvious "ignore previous instructions" string.

Traditional AppSec Is Necessary But Incomplete

Authentication, authorization, dependency scanning, least privilege, network controls, logging, monitoring, and incident response still matter. But agents add behavior and context problems that traditional controls do not fully cover.

Prompt Review Is Too Static

A strong system prompt helps. It does not prove that the agent preserves boundaries across multi-turn persuasion, tool output, browser context, retrieved documents, memory, and model changes.

One-Time Red Teaming Misses Regression

Agents change frequently. Prompts, models, tool schemas, policies, workflows, and retrieval sources evolve. A failure fixed in one release can return in another. This is why social-engineering regression testing matters.

Output Moderation Misses Tool Behavior

An agent may call a tool before the final response is rendered. A safe-looking answer can follow an unsafe action. Evidence must include the action path.

Generic Benchmarks Miss Deployment-Specific Authority

Benchmarks are valuable, but a company's risk depends on its own agents, tools, data, policies, and users. A recruiting agent, support agent, sales agent, and procurement agent have different protected boundaries.

Human Approval Can Be Poorly Informed

Human-in-the-loop review helps only if the human sees the relevant evidence. If the agent summarizes away the source of a claim or presents a risky action as routine, the approval control becomes weak.

8. A Practical Defense Model

The state of the art points toward layered defense. No single control is enough.

1. Define Boundaries

Teams should explicitly define the data, action, identity, tool, memory, and delegation boundaries the agent must preserve.

Boundary definitions should answer:

  • What data classes exist?
  • Which tools are read-only, draft-only, state-changing, or externally impactful?
  • Which actions require identity verification or approval?
  • Which sources are trusted instructions, trusted data, untrusted data, or untrusted instructions?
  • What memory can be written or trusted?
  • Which cross-user, cross-tenant, or cross-agent flows are prohibited?

2. Simulate Realistic Pressure

Tests should model actual business context. Customer agents should face account-ownership claims, refund pressure, and ticket injection. Sales agents should face fake approvals, procurement urgency, and RFP manipulation. Recruiting agents should face resume injection, fake hiring-manager authority, and candidate data boundaries.

3. Preserve Evidence

Evidence should include:

  • protected boundary;
  • scenario and attack pattern;
  • transcript turn where the boundary weakened or failed;
  • tool call and arguments, if relevant;
  • source of the unsafe premise;
  • retrieved or browser context, if relevant;
  • memory write or handoff, if relevant;
  • expected behavior;
  • actual behavior;
  • severity and business impact;
  • enough metadata to verify the fix later.

An evidence packet can stay compact while still being useful:

  1. BoundaryThe rule the agent was supposed to preserve
  2. PremiseThe identity, authority, urgency, policy, or source-trust claim that influenced the agent
  3. PathThe transcript turn, source, retrieval result, browser state, memory write, handoff, or tool call where the premise moved
  4. FailureThe exact response, argument, action, memory update, or delegation that crossed the boundary
  5. Fix TargetWhether the fix belongs in prompting, policy, permissions, source labeling, memory isolation, tool gating, or escalation
  6. Regression GateThe repeatable check that should prevent the failure from returning

4. Fix The Boundary, Not Only The Wording

Some issues can be improved with clearer instructions. Others need architecture:

  • reduce tool permissions;
  • add pre-action authorization;
  • require confirmation;
  • isolate memory;
  • label untrusted sources;
  • scope API credentials;
  • sandbox browser or tool execution;
  • block state-changing tools until policy requirements are satisfied;
  • preserve provenance in summaries;
  • require human escalation for ambiguous authority.

5. Verify The Fix

A finding should not be closed only because the prompt changed. The relevant failure should be rerun with the same boundary, scenario, and evaluation criteria.

6. Gate Regressions

Once a failure is found, it should become a regression test. CI and scheduled runs should catch the same class of failure when prompts, tools, models, policies, or workflows change.

7. Make Judges Accountable

Testing depends on evaluation, and evaluation can fail. LLM judges are useful for contextual social-engineering behavior, but they can produce false positives, miss subtle failures, or disagree with each other. Mature agent testing should avoid treating a single judge response as unquestionable truth.

Better judging combines:

  • deterministic checks for crisp boundaries such as forbidden tool calls, missing confirmation, sensitive-field disclosure, or cross-tenant access;
  • semantic review for nuanced persuasion, authority claims, and policy interpretation;
  • structural checks over tool calls, arguments, retrieved sources, memory writes, and handoffs;
  • confidence or disagreement signals when evidence is ambiguous;
  • human review for critical, disputed, or customer-facing findings;
  • calibration sets that compare judge decisions against known examples.

The goal is not to make the judge sound certain. The goal is to make the evidence inspectable enough that a builder or security reviewer can understand why the finding exists.

9. The Agent Social-Engineering Testing Pyramid

Testing should be realistic, repeatable, and practical to run. Agents consume tokens, and attacker agents, target agents, judges, tool simulations, and replays all add cost. Cost should not be treated as the core reason agent social engineering is difficult to defend; the stronger evidence today is about failure modes, not budget ceilings. But cost does determine whether testing happens once, occasionally, or continuously.

The practical answer is a testing pyramid.

LayerPurposeTypical Cost
Static boundary reviewInspect prompts, tools, policies, permissions, memory, and source-trust rulesLowest
Deterministic smoke testsCatch obvious violations cheaplyLow
Scripted regression probesRe-run known failures in CI and verificationLow to medium
Adaptive simulationsExplore multi-turn manipulation and boundary weakeningMedium to high
Audit-grade evidence runsProduce higher-confidence evidence for launch, governance, or customersHighest

As an operational design principle, cost-aware testing can use:

  • risk-weighted scenario selection;
  • cheaper models for broad exploration;
  • stronger models for high-risk and ambiguous cases;
  • deterministic assertions before LLM judges;
  • judge cascades;
  • prompt and context caching;
  • async or batch runs where latency is not needed;
  • mock tools and synthetic environments;
  • minimal failure replays after discovery.

There is already supporting evidence for this pattern in adjacent agent-security work. The agent-skill security work cited above uses inexpensive static and metadata checks before escalating suspicious skills to LLM analysis and multi-model review; its paper reports that the cheap first layer filters a large share of benign skills at zero API cost before higher-cost analysis is used. The web-agent experiments report a runtime defense with measured overhead, and the authorization work reports low-latency pre-action checks. These are not direct proof that every social-engineering test suite will be cheap. They are evidence that agent-security systems can be designed as cascades, where expensive reasoning is reserved for cases that need it.

Cost-efficiency should be treated as an enabler of continuous testing, not as the main reason agent social engineering is hard to defend.

10. Maturity Model

LevelMaturityWhat It Looks Like
Level 0No agent-specific testingThe team ships agents with ordinary QA and model choice
Level 1Manual prompt reviewThe team reviews system prompts and adds policy language
Level 2Basic prompt-injection checksThe team tests obvious direct and indirect instruction attacks
Level 3Scenario-based social-engineering testsThe team tests realistic authority, urgency, identity, data, memory, and tool-boundary scenarios
Level 4Regression gates and fix verificationFindings include evidence, fixes are verified, and known failures become CI or release gates
Level 5Continuous agent boundary assuranceBoundaries, attack libraries, evidence, judge calibration, risk profiles, and monitoring evolve continuously

Most agent teams are still early. Production agents with sensitive data, external users, state-changing tools, or browser access should aim for Level 4. Level 5 is where agent security becomes continuous assurance rather than a launch checklist.

11. Limits And Open Questions

The field is moving quickly, but several hard problems remain open.

  • There is no universal benchmark that captures every business workflow, tool surface, voice channel, browser environment, and multi-agent handoff.
  • LLM judges are useful but imperfect; they need deterministic checks, calibration, disagreement reporting, and human escalation for important cases.
  • Realistic social-engineering simulation must balance fidelity with safety, so public examples should educate without becoming reusable offensive playbooks.
  • Browser-agent security is still early, especially around DOM evidence, page screenshots, source provenance, and environment-intent consistency.
  • Voice and multimodal agent testing are underdeveloped compared with text and web testing.
  • Agent identity, delegated authorization, and pre-action policy enforcement do not yet have universally adopted standards.
  • Multi-agent provenance remains immature: most systems cannot clearly explain how a premise moved from one agent to another.
  • Cost-efficiency is a practical engineering concern, but the stronger evidence today is about agent failure modes, not about cost as a primary blocker.

Open questions are not a reason to wait. They are a reason to build evaluation loops that can improve as the field learns.

12. Where The Industry Is Headed

1. Delegated Authority Will Become A Security Primitive

Organizations will need to know which human delegated which authority to which agent, which sub-agent inherited it, which tool was used, and which policy allowed the action. This direction is already visible in the Cloud Security Alliance's agentic AI identity and access management work, which treats agent autonomy, ephemerality, and delegation as IAM design problems rather than ordinary user-account problems.

The implication is that agent identity will not be enough. Teams will need delegated authority records: what the agent may do, on whose behalf, in which context, for how long, and with what evidence.

2. Tool Calls Will Move Behind Policy Gates

High-risk actions will increasingly require deterministic checks before execution. Pre-action authorization work is one example of this direction: the model can propose an action, but a separate policy layer decides whether the tool call is authorized.

The industry pattern will likely be: model proposes, policy checks, tool executes, audit record persists.

3. Agent Evidence Will Become Structured

The industry will move from chat transcripts to evidence packets: transcript, tool trace, source provenance, memory changes, authorization decisions, browser context, screenshots when relevant, and evaluation rationale. Structural trace research points in this direction by treating execution flow as security evidence, not merely conversation metadata.

This will matter commercially because customers will not only ask whether an agent was "tested." They will ask what was tested, what evidence was captured, what failed, and what now prevents recurrence.

4. Browser-Agent Security Will Become A Category Moment

Browser agents make manipulated delegation visible. A replay of an agent entering sensitive information into a deceptive web flow is easier for a non-specialist to understand than an abstract prompt-injection transcript. The web-agent experiments above show why this matters: the page itself can create persuasive context that moves the agent away from the user's intended goal.

Public guidance is also moving in this direction. Google SAIF: Focus on Agents treats agent components such as memory, tools, and external interactions as security-relevant surfaces. Browser agents combine all of those surfaces in one place.

5. Agent Supply Chains Will Need Governance

Skills, MCP servers, plugins, tool manifests, and community packages will need provenance, signing, scanning, permissions, isolation, and approval workflows. The agent-skill and execution-boundary work cited earlier both point to the same underlying shift: agent behavior can be shaped by code, manifests, natural-language instructions, tool descriptions, and resource access boundaries.

This will make AI security more like a hybrid of AppSec, IAM, and content governance. Teams will need to review both what the package can execute and what it tells the agent to believe.

6. Multi-Agent Handoffs Will Become A Major Risk Surface

As agents specialize, attacks will move through summaries, shared memories, delegated subtasks, and approval chains. Multi-agent security work argues that interacting agents create threats that cannot be addressed by securing individual agents in isolation. Collusion research shows a more extreme version of the concern: agents can use hidden channels to conceal the real nature of their coordination from oversight.

Most business systems will not start with exotic collusion. They will start with mundane provenance failures: a helper agent summarizes untrusted context as fact, another agent treats the summary as trusted, and a downstream tool call inherits the bad premise.

7. Agent Trust Reports Will Become Part Of Customer Assurance

Customers and internal review boards will ask what was tested, what failed, what was fixed, and whether regressions are gated. The NIST AI Risk Management Framework: Generative AI Profile already pushes organizations toward governance, measurement, monitoring, and risk management for generative AI systems. MITRE ATLAS gives security teams a living vocabulary for adversary tactics and techniques against AI-enabled systems.

The next step is agent-specific assurance: boundary test coverage, exploit proof, fix verification, regression gates, and evidence packets that a customer or reviewer can inspect.

8. Voice And Multimodal Agent Testing Will Move Into The Mainstream

As agents join calls, meetings, demos, interviews, and customer-support conversations, text-only testing will look incomplete. The adaptive voice-phishing work above is an early signal because it treats voice phishing as adaptive interaction rather than static content. The relevant risk will not be audio realism alone. It will be whether the agent preserves identity, authorization, data, and escalation boundaries under real-time pressure.

The category shift is from prompt security to delegated authority assurance.

13. What This Means For Builders

If you are building AI agents, start with the boundaries. Do not begin with a list of attack strings. Begin with the business rules the agent must preserve.

Ask:

  • What can this agent read?
  • What can it change?
  • What can it send externally?
  • What data is sensitive?
  • What identities can it rely on?
  • Which sources are untrusted?
  • Which actions require confirmation?
  • What memory can persist?
  • What evidence would prove a failure?
  • What would convince us the fix actually worked?

Then test those boundaries against realistic pressure. The goal is not to make the agent paranoid. The goal is to make the agent reliable under manipulation.

A Next-Week Checklist

Teams do not need to solve the whole field at once. A practical first pass looks like this:

  1. Inventory the agents that can access sensitive data, call tools, browse the web, write memory, or affect customers.
  2. Classify each agent's data sensitivity and action risk.
  3. Name the top protected boundaries: identity, authorization, data minimization, tool preconditions, memory integrity, source trust, and handoff provenance.
  4. Pick the ten most realistic social-engineering scenarios for the workflow.
  5. Test those scenarios with synthetic data and mocked tools first.
  6. Preserve transcripts, tool traces, retrieved sources, memory writes, and failed invariants.
  7. Fix the actual boundary, not only the prompt wording.
  8. Verify the fix against the same scenario.
  9. Add the failure to a regression suite.
  10. Add policy gates or human escalation for actions with high external impact.

This is enough to move from vague concern to concrete evidence.

14. Roleplay's Point Of View

Roleplay is being built around a narrow belief: the most useful way to make agent trust concrete is to test whether people-facing, tool-using agents preserve business boundaries under social pressure.

That means the product direction is research-driven, but the research is not the marketing story by itself. The story is the category: manipulated delegation. The research explains why the category matters. The product exists because teams will need a practical loop for turning that category into daily engineering work:

  1. 01Define Boundaries
  2. 02Simulate Realistic Attacks
  3. 03Capture Evidence
  4. 04Fix
  5. 05Verify
  6. 06Gate Regressions

Roleplay's job is not to replace every AI security control. It is to make the social-engineering failure visible, reproducible, fixable, and hard to reintroduce.

The question Roleplay wants teams to answer is:

Can this AI agent be manipulated into doing something our business would never authorize?

15. Methodology And Source Note

This report synthesizes academic papers, public security frameworks, and industry sources related to AI-agent security, social engineering, prompt injection, tool-use boundaries, browser-agent safety, and multi-agent risk. The strongest empirical claims were checked against the cited source material and public source pages where available.

The references below are limited to sources discussed or linked in the body of the report.

The report emphasizes research that is directly relevant to:

  • tool-using AI agents;
  • multi-turn social-engineering simulation;
  • prompt injection and indirect context manipulation;
  • browser-agent manipulation;
  • tool-call authorization;
  • structural execution traces;
  • malicious agent skills and MCP boundaries;
  • voice phishing simulation;
  • multi-agent security.

How To Interpret The Research

  • Reported statistics are attributed to the papers that report them; they should not be generalized beyond each paper's evaluation setup.

Cite This Report

Roleplay. "Manipulated Delegation: The AI Agent Social Engineering Report." June 2026. https://roleplay.sh/research/manipulated-delegation

16. FAQ

What is AI agent social engineering?

AI agent social engineering is the manipulation of an AI agent's interpretation of authority, identity, urgency, policy, context, or user intent so that the agent leaks data, misuses a tool, writes unsafe memory, or takes an unauthorized action.

How is this different from prompt injection?

Prompt injection is one mechanism for influencing an AI system through malicious instructions. Agent social engineering is broader: it tests whether an agent preserves delegated authority under social pressure, fake identity, urgency, ambiguous policy, untrusted documents, deceptive webpages, and tool access.

What is manipulated delegation?

Manipulated delegation is when an attacker causes an AI agent to reinterpret what it has been delegated to do. The agent may believe an unsafe action is authorized, ordinary, urgent, or policy-compliant even when the business would not authorize it.

What kinds of agents are most exposed?

Agents are more exposed when they interact with external people or untrusted content, access sensitive data, use state-changing tools, browse the web, write memory, or hand work to other agents. Customer relationship, sales, recruiting, browser, and internal-operations agents are common examples.

What is exploit proof?

Exploit proof is evidence that a protected boundary failed. It should show the attack attempt, the failed response or tool action, the violated invariant, the severity, and enough context to reproduce and fix the issue.

Why are regression gates important?

Agents change often. Prompts, models, tools, policies, retrieval sources, and workflows evolve. A fix that holds today can break later. Regression gates make known failures repeatable checks so they do not return quietly.

Does this replace other AI security controls?

No. Agent social-engineering testing complements prompt injection defenses, authorization systems, sandboxing, least privilege, monitoring, AppSec, and human review. Its job is to produce evidence about whether social pressure can push an agent across a business boundary.

How should teams test AI agents safely?

Teams should test only agents, tools, and environments they are authorized to assess. Use synthetic data, mocked or staging tools, defensive scenarios, and evidence that helps teams fix boundaries without exposing real customer information or creating reusable offensive playbooks.

17. Sources Referenced

Research Discussed

Public Frameworks And Industry Sources