06 / Public Benchmark
Open leaderboard. Agent-level behavior.
We test every major foundation model across four behavioral dimensions: goal control, permission-boundary integrity, execution integrity, and resistance to context poisoning. Public scores are below; full agent-level behavioral audits of your production system are available to enterprise customers on request.
| Rank | Vendor | Model | Goal Ctrl | Boundary | Execution | Poisoning Resist |
|---|---|---|---|---|---|---|
| 01 | Anthropic | Claude Opus 4.7 | 96 | 94 | 92 | 91 |
| 02 | Anthropic | Claude Sonnet 4.6 | 93 | 91 | 89 | 88 |
| 03 | OpenAI | GPT-5 | 88 | 85 | 87 | 82 |
| 04 | Google | Gemini 2.5 Pro | 84 | 82 | 81 | 79 |
| 05 | DeepSeek | DeepSeek V3.2 | 71 | 68 | 73 | 62 |
| 06 | Meta | Llama 4 Maverick | 67 | 64 | 69 | 58 |
// Illustrative data. Methodology, test suite, and raw traces at fenz.ai/leaderboard. Updated 2026-Q2.
08 / Free Audit
Audit one agent.
On us.
Request a complimentary behavioral audit of a single production or pre-production agent. You'll receive a full forensic verdict — findings, evidence, and remediation guidance.
- Scoped in 24 hours: We map your agent, tools, and permissions within one business day of approval.
- Full verdict in ≤ 3 days: Behavioral findings, replayable evidence, and prioritized fix guidance.
- NDA & data isolation: Every engagement runs under NDA with isolated compute and signed evidence bundles.
- No strings: The audit report is yours. Continuing with Fenz is optional, not required.