AI Security · Tool · May 1, 2026 · Yellow — detail controls

secure-agent-ref: Capability-Control Reference Implementation

Quick Answer

secure-agent-ref is a runnable Python codebase that implements every load-bearing control from Sandboxing and Capability Control for Tool-Using Autonomous Agents — a tool broker, capability minter, IFC labels, policy engine, code sandbox, memory guard, and audit log — plus nine demos that exercise the controls end-to-end. Two demos run a real Claude Opus 4.7 planner through the same broker; the other seven use a scripted untrusted planner. Treat it as reference quality, not production quality.

License: Unspecified
Language: Python
Status: Active

secure-agent-ref is the runnable companion to Sandboxing and Capability Control for Tool-Using Autonomous Agents. The repo implements every load-bearing control from the paper as a small Python codebase you can clone, run, and read alongside the text — a tool broker, capability minter, IFC labels, policy engine, code sandbox, memory guard, and audit log — and nine demos that drive the controls end-to-end. Two demos route a real Claude Opus 4.7 planner through the same broker; the other seven use a scripted "untrusted" planner so the broker's behavior can be observed without depending on any specific model's safety training. The artifact is reference quality, not production quality, and the README is explicit about every shortcut.

What it does

A single chokepoint, ToolBroker.invoke, sits between every planner and every tool. Side effects only happen if a proposed call clears a fixed sequence of checks: tool registered, descriptor hash matches its pin, args match the schema, capability covers the call, IFC labels join cleanly, policy engine allows it, human approval clears (when required), and egress secret scan finds nothing. The first deny short-circuits the sequence, an allow runs the tool, and every outcome — including approvals — lands in the audit log with full provenance.
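The check chain is easiest to see as code. The sketch below is hypothetical: only the ordering of the checks comes from the README, and every predicate body is a toy stand-in rather than the repo's implementation.

# Hypothetical sketch of the broker's fixed check sequence. Only the
# ordering is taken from the README; each predicate here is a stand-in.
def invoke(call: dict) -> dict:
    checks = [
        ("tool_registered",   lambda c: c["tool"] in {"github", "email", "sandbox", "memory", "web"}),
        ("descriptor_pinned", lambda c: c["descriptor_hash"] == c["pinned_hash"]),
        ("args_match_schema", lambda c: isinstance(c["args"], dict)),
        ("capability_covers", lambda c: c["tool"] in c["capability_scope"]),
        ("labels_join",       lambda c: "untrusted" not in c["labels"] or c["sink"] != "shell"),
        ("policy_allows",     lambda c: not (c["sink"] == "public" and "private" in c["labels"])),
        ("approval_clears",   lambda c: c.get("approved", True)),
        ("egress_scan",       lambda c: "AKIA" not in str(c["args"])),  # toy secret pattern
    ]
    for name, check in checks:
        if not check(call):
            return {"allowed": False, "denied_by": name}   # first deny short-circuits
    return {"allowed": True, "result": f"ran {call['tool']}"}  # an allow runs the tool

# A private-labelled payload headed for a public sink dies at the policy check:
print(invoke({"tool": "github", "descriptor_hash": "h1", "pinned_hash": "h1",
              "args": {"body": "hello"}, "capability_scope": ["github"],
              "labels": ["private"], "sink": "public"}))
# {'allowed': False, 'denied_by': 'policy_allows'}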

The repository covers the same control surface the agent-capability-control checklist enumerates, with a one-to-one mapping spelled out in the README. Concretely, the codebase contains:

  • labels.py — confidentiality, integrity, and origin labels with lattice join.
  • capabilities.py — object capabilities with attenuation and expiry; attenuation cannot widen (see the sketch after this list).
  • policy_compiler.py and minter.py — task envelopes (resources, recipients, ceilings, TTL) and per-task short-lived capability minting.
  • policy_engine.py — named rules, first-match-wins, deterministic. Rules include no_private_to_public_github, block_network_from_code_sandbox_by_default, rule_external_send_recipient_allowlist, and rule_untrusted_content_cannot_select_shell_command.
  • broker.py — the chokepoint above; also descriptor pinning and an approval queue.
  • egress.py — secret and PII scan over sink content.
  • memory.py — memory guard with quarantine for untrusted-derived writes; tenant and workflow partitioning on read.
  • sandbox.py — code execution profile (env strip, rlimits, ephemeral filesystem; Linux-only resource limits degrade on macOS).
  • audit.py — provenance audit log with redacted hashes, pretty and JSONL output.
  • planner.py and llm_planner.py — the scripted "untrusted" planner and the Claude Opus 4.7 planner. The LLM bridge is the only module that touches the Anthropic SDK; it sanitizes tool descriptors, runs the broker as a deny-as-tool-result loop, and joins all session labels into every arg automatically (so the conservative IFC join is the bridge's responsibility, not the model's).
  • tools/github, email, sandbox, memory, web reference implementations.
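Two of those invariants, the label lattice join and non-widening attenuation, are compact enough to sketch. The class shapes below are illustrative assumptions, not the repo's actual API.

# Hypothetical shapes for labels.py and capabilities.py. The invariants are
# real (join is a least upper bound; attenuation only narrows); the classes
# are stand-ins for illustration.
from dataclasses import dataclass

LEVELS = ["public", "internal", "secret"]  # assumed confidentiality lattice

@dataclass(frozen=True)
class Label:
    confidentiality: str
    origins: frozenset  # provenance: which sources have touched the data

    def join(self, other: "Label") -> "Label":
        # Mixing data can only raise the label: take the higher level and
        # the union of origins.
        hi = max(self.confidentiality, other.confidentiality, key=LEVELS.index)
        return Label(hi, self.origins | other.origins)

@dataclass(frozen=True)
class Capability:
    tools: frozenset   # which tools the holder may invoke
    expires_at: float  # TTL from the task envelope

    def attenuate(self, tools) -> "Capability":
        # Intersection means the child can never cover a tool the parent
        # did not already cover: attenuation cannot widen.
        return Capability(self.tools & frozenset(tools), self.expires_at)

a = Label("internal", frozenset({"email"}))
b = Label("secret", frozenset({"web"}))
assert a.join(b) == b.join(a) == Label("secret", frozenset({"email", "web"}))

cap = Capability(frozenset({"github", "email"}), expires_at=0.0)
assert cap.attenuate({"email", "web"}).tools == frozenset({"email"})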

The nine demos each end with assert statements that name the failure condition, so the asserts are the security claims. Demo 02 (lethal trifecta, scripted) and demo 09 Part B (lethal trifecta, real Opus 4.7) are the centerpieces; the rest cover indirect injection through email, descriptor pin attacks, sandbox + IFC into code, memory poisoning quarantine, ambient-authority strip, and the Pattern A read-only research agent in both scripted and LLM forms.
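In practice that looks like a trailing block of asserts over the run's log. The snippet below is a stand-in for the pattern, with assumed field names; it is not a copy of any demo file.

# Illustrative shape of a demo's closing asserts; the log entries here are
# stand-ins, but the pattern (assert messages naming the failure condition)
# is what makes the asserts the security claims.
log = [
    {"tool": "github.read_private", "allowed": True},
    {"tool": "github.publish_public", "allowed": False,
     "denied_by": "no_private_to_public_github"},
]
assert any(e.get("denied_by") == "no_private_to_public_github" for e in log), \
    "broker failed to block private-repo content reaching a public sink"
assert all(e["allowed"] for e in log if e["tool"].endswith("read_private")), \
    "a read that should have been permitted was denied"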

Who it's for

The intended reader is an engineer or security researcher working through the paper who wants to see what each control looks like in code, plus people building or auditing tool-using agent systems who want a worked example of the §11 checklist controls operating against an actual frontier model. It is also useful prior art for anyone designing an agent-broker architecture — the broker chain, the planner-as-untrusted contract, and the LLM bridge's session-wide IFC join are the load-bearing design choices and the README documents why.

It is not a fit for teams looking for a production agent framework, for a hosted broker service, or for a drop-in defense library. The README's "Demo shortcuts and what they don't prove" section names every gap honestly: the egress secret scanner is a backstop no demo actually fires (a deny rule always intercepts first), the sandbox shares the host network namespace so policy-layer denies are not isolation, capability scopes are emulated rather than minted natively against GitHub or IAM, and demo 06 uses hardcoded labels rather than labels derived from a real attacker-content read. Production deployments still need netns or Firecracker or gVisor for isolation, native scoped credentials from downstream services, and a real upstream path for label propagation into memory.

How to use it

Clone the repository, install requirements, and run the demos. The unit tests cover the core invariants — label join associativity, capability attenuation, expiry, and related helpers — and the demo runner exercises every demo, skipping the LLM ones gracefully when no key is set.

git clone https://github.com/binaryninja/secure-agent-ref
cd secure-agent-ref
pip install -r requirements.txt
python3 tests/test_unit.py
python3 tests/test_demos.py
python3 demos/02_lethal_trifecta.py
python3 demos/09_llm_lethal_trifecta.py

For the two LLM demos, copy .env.example to .env and set ANTHROPIC_API_KEY. The code requires Python 3.10 or newer; the sandbox's resource limits are Linux-only and degrade gracefully on macOS.

The right reading order is the paper, then the README's "Inside broker.invoke" sequence diagram, then secagent/broker.py, then a single demo end-to-end (demos/02_lethal_trifecta.py is the shortest centerpiece) with the audit log open. The audit log is where the system tells you what happened: every deny names the rule, the joined provenance label, the sink, and a content-hash redaction of the args.
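Something like the following pulls the denies out of the JSONL stream; the field names and output path are assumptions based on what the README says each deny records, not the repo's documented schema.

# Hypothetical reader for the JSONL audit output; field names are assumed
# from the README's description of what a deny records.
import json

with open("audit.jsonl") as f:  # output path is an assumption
    for line in f:
        entry = json.loads(line)
        if not entry.get("allowed", True):
            print(entry["rule"],       # which named policy rule fired
                  entry["label"],      # the joined provenance label
                  entry["sink"],       # where the data was headed
                  entry["args_hash"])  # content-hash redaction of the args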

Status and roadmap

The repository is actively maintained alongside the paper. The README calls out known gaps that future work would close:

  • The egress secret scanner needs a demo that exercises only it (by bypassing upstream rules) so the backstop has a test.
  • The sandbox needs a Firecracker or gVisor profile to actually enforce isolation rather than rely on policy-layer denies.
  • Capability minting should call out to native scope minting (GitHub fine-grained tokens, IAM session policies, object-store presigned URLs) rather than emulate scope at the broker.
  • Demo 06 should derive memory-write labels from a real upstream untrusted-tool read instead of using hardcoded Label() constants.
  • The audit log's joined-origin rendering should deduplicate origin paths so growing session label lists do not produce visually noisy duplicates.

None of these change the broker's behavior — they raise the realism of specific demos and unblock production adoption.

Source and license

The canonical source is the GitHub repository linked above. No SPDX license has been declared yet; treat the code as paper-companion reference material and ask before redistributing or vendoring it. The third-party dependencies (notably the Anthropic SDK) carry their own licenses and must be reviewed before any deployment.

Responsible use. The codebase is a reference implementation of defenses, not of attacks, but the demos do enumerate attacker behavior — the lethal trifecta, indirect injection, descriptor rug-pulls, and ambient-authority probing — to show what the broker prevents. Use the demos to evaluate your own agent's broker (or the lack of one), not as a starting point for offensive tooling. The README is explicit that demo 09 Part B's red-team framing is cooperative, not a jailbreak; replicating that pattern against a model whose safety policy refuses cooperative evaluation is outside the project's scope.

FAQ

Is secure-agent-ref production-ready?

No. It is reference quality — every control from the paper exists in code and the demos prove the broker is in the path, but several pieces are demonstrators of an idea rather than fully implemented production substitutes. The README's "Demo shortcuts and what they don't prove" section enumerates each gap (the egress scanner is a backstop that no demo actually fires, the sandbox shares the host network namespace so policy-layer denies are not isolation, capability minting emulates scopes rather than calling out to GitHub or IAM for native scoped credentials). Use it as a worked example next to the paper, not as a drop-in library.

What does the real-LLM run actually demonstrate?

Demo 09 runs Claude Opus 4.7 as the planner. Part A is honest about the typical outcome — Opus 4.7's safety training refuses the indirect-injection payload at the prompt level, so no high-impact tool call ever reaches the broker, and the demo records that explicitly rather than claiming the broker did the work. Part B uses a cooperative red-team framing that asks the model to propose the worst-case trajectory so the broker can be observed; the model issues the three lethal-trifecta calls and the broker fires no_private_to_public_github on a model-proposed call. The point is broker-as-system-under-test under real tool-use traffic, not jailbreak research.

Why does it require ANTHROPIC_API_KEY?

Only the two LLM demos (08 and 09) need it; the seven scripted demos and the unit tests run against a hand-written ScriptedPlanner with no LLM dependency. The scripted track exists because the paper's section 13 worst case is a fully untrusted planner, and a scripted demo is the cleanest way to model that — there is no model behavior to reason around. The LLM track exists to prove the broker behaves the same when an actual frontier model is producing the tool calls.
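The scripted planner can be as small as a fixed call list. The class below is a hypothetical shape for the idea, not the repo's planner.py.

# Hypothetical shape of a scripted planner: a fixed sequence of proposed
# tool calls, so broker behavior is observable with zero model nondeterminism.
class ScriptedPlanner:
    def __init__(self, calls):
        self.calls = list(calls)

    def next_call(self, last_result=None):
        # Ignores tool results by design: the worst-case planner does not
        # adapt, it just keeps proposing its attack sequence.
        return self.calls.pop(0) if self.calls else None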
