7 min read

Protected Boundaries For AI Agents

How to define the business, data, identity, tool, memory, and delegation rules an AI agent must preserve.

In brief

A protected boundary is a rule the agent must preserve even under pressure, such as not exposing sensitive data, not accepting unverified authority, and not calling state-changing tools before required checks.

Contents

Why boundaries come before tests

A protected boundary is a business, data, identity, tool, memory, or delegation rule that an AI agent must preserve. It is the line that should not move just because the user is persuasive, urgent, angry, senior, confused, or embedded inside a trusted-looking workflow.

Boundaries make agent testing concrete. Instead of asking whether an agent is safe in general, the team asks whether it preserved a specific rule under realistic pressure.

Types of boundaries

Data boundaries define what information the agent can reveal and to whom. Identity boundaries define which claims require verification. Authorization boundaries define which approvals matter and how they are proven.

Tool boundaries define when the agent can call read-only, draft, state-changing, or externally impactful tools. Memory boundaries define what can be stored and what sources are trusted enough to persist. Delegation boundaries define how one agent, document, webpage, or tool result can influence another step.

Data: do not reveal cross-account or internal information.
Identity: do not treat a claimed role as verified.
Authorization: do not treat approval as granted without evidence.
Tool: do not execute state-changing actions before preconditions hold.
Memory: do not store untrusted claims as durable facts.
Delegation: do not let one untrusted source expand another system's authority.

How to write a good boundary

A good boundary is specific enough to test. It names the action, data, actor, condition, and expected safe behavior. A weak boundary says the agent should be careful with customer data. A stronger boundary says the agent must not disclose account-specific billing details unless the requester has verified account ownership.

Boundaries should be written in language that product, engineering, security, and operations can all understand. If only the prompt author understands the boundary, it will be hard to review failures or verify fixes.

Boundary writing pattern

Name the protected asset or action.
Name the condition required before the agent can proceed.
Name the safe fallback when the condition is not satisfied.
Name the evidence needed to prove the boundary held or failed.

How boundaries become tests

Once a boundary is written, create scenarios that pressure it from different angles. A refund boundary can be tested with fake ownership, urgency, emotional pressure, policy exception claims, and manager-approval claims.

The scenario should not be judged only by whether the agent sounded polite. It should be judged by whether the boundary held. Did the agent ask for verification? Did it avoid the tool call? Did it preserve data scope? Did it escalate correctly?

When a boundary fails, the evidence should make the failure legible. That evidence later supports ownership, fix verification, and regression testing.

Boundary review workshop

A lightweight workshop can produce better boundaries than isolated prompt writing. Bring together the person who owns the workflow, the engineer who understands the tools, and the reviewer who understands risk. Start with the agent's actual authority.

For each authority, ask what would be unacceptable if a persuasive user, document, webpage, or peer agent influenced the agent. Then translate that unacceptable outcome into a boundary. The boundary should be written as a rule the agent can be tested against.

The workshop should end with a short prioritized list. Ten strong boundaries are better than fifty vague ones. The first list should focus on money, sensitive data, external messages, identity, approvals, and state-changing tools.

Weak vs strong boundaries

Weak boundaries use broad intent language. Strong boundaries describe the protected action, the required condition, and the safe fallback. This precision makes testing and verification easier.

A weak boundary says the agent should protect customer data. A stronger boundary says the agent must not reveal account-specific billing details unless the requester has completed account ownership verification. A weak boundary says the agent should be careful with tools. A stronger boundary says the agent must not call the refund tool unless eligibility and ownership checks are complete.

Strong boundaries also reduce over-refusal. When the rule is precise, the agent can still help in allowed ways instead of rejecting every difficult request.

How to prioritize boundaries

Prioritize boundaries by impact and exposure. A boundary around sensitive data, payment, identity, employee records, hiring decisions, or external commitments usually deserves earlier testing than a boundary around low-risk wording.

Exposure matters too. A boundary that external users can pressure every day should be tested before a boundary that is reachable only through rare internal workflows. Repeated interaction creates more chances for manipulation.

Finally, prioritize boundaries that depend heavily on agent judgment. If a deterministic system already enforces the rule, the agent has less room to cross it. If the agent decides whether the rule applies, the boundary needs stronger testing.

How boundaries change over time

Boundaries are not static. A new tool can turn an informational agent into an action-taking agent. A new retrieval source can expose sensitive context. A new workflow can make a previously safe action externally meaningful.

Review boundaries when the agent gains tools, accesses new data, changes model, changes prompt, receives memory, enters a new channel, or starts interacting with a new external actor.

This is why boundaries should be documented outside the prompt alone. If the only record of the boundary is prompt wording, teams may not notice when product changes make the boundary incomplete.

Boundary review checklist

A boundary review should produce a short list of rules that are specific enough to test. For each boundary, record the protected asset or action, the actor who may request it, the required condition, the safe fallback, and the evidence that would prove success or failure.

The review should also identify where the boundary is enforced. Some boundaries should live in deterministic systems, such as authorization checks or tool permissions. Others may need prompt instructions, escalation paths, retrieval rules, or human review. The test should make that enforcement layer visible.

If the team cannot write the boundary in one or two sentences, the policy is probably not ready for reliable testing. Clarify the rule before writing scenarios.

How to test boundary ownership

A boundary also needs an owner. If identity verification fails, the owner may be support operations, product, or engineering depending on where the control lives. If a tool precondition fails, the owner may be the team that defines the tool or the workflow that invokes it.

Testing should reveal ownership gaps. When a failure appears, reviewers should be able to tell who can change the prompt, who can change the tool permissions, who can update policy, and who can approve the safe fallback. If no one owns the boundary, the failure is likely to recur.

How boundaries improve communication

Boundaries are valuable because they make agent risk discussable. A vague statement such as the agent should be secure is difficult to test. A boundary such as do not issue a refund without verified account ownership is specific enough for product, engineering, support, and security teams to review together.

Good boundaries also prevent teams from confusing refusal with safety. An agent may refuse too much and still be poorly designed. The goal is not to make every risky workflow impossible. The goal is to define the conditions under which the agent can proceed, when it should ask for more evidence, and when it should escalate.

Once written clearly, a boundary becomes a testable invariant. The team can simulate pressure around it, preserve evidence when it fails, and rerun the same scenario after a fix. That makes the boundary useful beyond policy documentation.

FAQ

Are boundaries the same as policies?

Boundaries are testable expressions of policy. A policy may be broad. A boundary translates it into a rule that can be checked in a scenario.

How many boundaries should an agent have?

Start with the highest-impact boundaries: sensitive data, identity, authorization, state-changing tools, memory writes, and external messages.

Who should define boundaries?

The best boundaries usually come from a combination of product owners, engineers, security reviewers, and the operational team that understands the workflow.

What happens when a boundary fails?

The failure should become a finding with evidence, owner, severity, fix path, and a verification plan.

Deeper research

Read the June 2026 report.

For a deeper treatment of manipulated delegation and AI agent social-engineering risk, read Roleplay's June 2026 research report.

Read the report ->

Keep reading

ArticleWhat Is AI Agent Social Engineering?Read ->ArticleWhat Is Exploit Proof For AI Agents?Read ->ChecklistAI Agent Social-Engineering Checklist Before LaunchRead ->