Setting boundaries : Getting Zero Trust Tool Calling Right For Agentic AI

Recently a friend pointed me to: https://news.ycombinator.com/item?id=45199713

The security concerns here are real and more severe than most realize. I've been looking at AI system security for a few months now, and the the tool-calling attack surface is expanding faster than our solutions can keep up... unfortunately business demand for AI seems to over-ride securing these systems.

The fundamental issue isn't just prompt injection – it's that we're mixing control and data planes without any security boundaries. When an LLM with tool access processes untrusted input, you get three attack vectors:

Intent Hijacking: The attacker's instructions override the user's intent
Tool Chaining: One compromised call enables further exploitation
Context Poisoning: The attacker corrupts the conversation state

To be clear, these problems are native to MCP as demonstrated in various documented attacks. The problem with MCP is that trust is implied, and the MCP specification punts to the implementation to figure it out. Until we have proper security primitives, every MCP deployment is a breach waiting to happen – in this case, OpenAI is a 1000x engineer / attacker and so exaggerates the risk.

Most guardrails "solutions" I've seen are just blocklists (easily bypassed) or prompt engineering (fundamentally flawed). Even recent ideas around training LLM / custom security models for detection offers an at-best probabilistic solution – for safety and security we need guarantees not heuristics.

For us, the breakthrough was in realizing the solution isn't detecting prompt injection - it's in reimagining the security boundaries – for example, the LLM cannot be the security boundary since an LLM trained to follow instructions cannot selectively ignore them.

Think about how mTLS solved web security - instead of trusting the network, every connection is authenticated and encrypted. We extended the same concept for Agentic AI – in this case it's not just about certificates and connections but also understanding the context and intent.

We've been building on a new primitive: "Authenticated Workflows" with zero-trust cryptographic enforcement at the tool layer. Every entity (LLM, tool, agent, user, app) gets a cryptographic identity. When they interact:

Intent gets expressed via policy and signed before the receiving agent sees it
The receiver verifies signatures and enforces policies before execution
Tools become the enforcement boundary (since agents can be compromised)
Policies are cryptographically bound to invocations
Every action creates an attestation chain

For MCP, tools provide natural boundaries for Policy Enforcement Points (PEPs). With LLMs, it's trickier - we need to handle prompts and client-side tool-calling. Effectively we need to defend against two different attack types:

Supply chain attacks (modifying prompts in transit) - we solve this with tamper-evident policy-bound prompts called Authenticated Prompts
Data-based attacks (malicious content in documents) - since prompts evolve as they incorporate data, we expanded Authenticated Prompts to use depth limits and intent binding. Even if a prompt morphs, it's bound by policies at least as restrictive as the original.

Effectively, even when malicious content gets through, it can't break the cryptographic chain. The LLM might be confused, but the tool still verifies:

Effective Permission = User Intent ∩ App Policy ∩ Tool Policy ∩ System Policy ∩ Context State

All layers must agree, cryptographically. No amount of prompt injection can forge signatures.

Practically, a challenge for us has been to make this invisible to developers, just like TLS. They call tool(params) and the security happens transparently. Interestingly, MCP's explicit separation of tools, prompts, and resources makes this easier than general function calling. The protocol already has natural enforcement boundaries - we just need to add the cryptographic verification layer.

We've validated this approach with implementations for both OpenAI (SecureOpenAI) and MCP (SecureMCP). They block injections (including secondary/tertiary injections distributed across files) that would otherwise succeed, while staying transparent to developers.

This isn't about making AI "safer" through training - it's about making it architecturally unable to cause harm regardless of behavior.

Would love to hear how others are approaching this? What kinds of defenses are people building? [Happy to share more details with anyone interested, or if you want to provide feedback on our secure OpenAI/Claude/MCP implementations]

Join the MACAW Private Beta

Get early access to cryptographic verification for your AI agents.

Request Access Learn More