06 / Public Benchmark
Open leaderboard. Agent-level behavior.
We test every major foundation model across four behavioral dimensions: goal control, permission-boundary integrity, execution integrity, and resistance to context poisoning. Public scores are below; full agent-level behavioral audits of your production system are available to enterprise customers on request.
| Rank | Vendor | Model | Goal Ctrl | Boundary | Execution | Poisoning Resist |
|---|---|---|---|---|---|---|
| 01 | Anthropic | Claude Opus 4.7 | 96 | 94 | 92 | 91 |
| 02 | Anthropic | Claude Sonnet 4.6 | 93 | 91 | 89 | 88 |
| 03 | OpenAI | GPT-5 | 88 | 85 | 87 | 82 |
| 04 | Google | Gemini 2.5 Pro | 84 | 82 | 81 | 79 |
| 05 | DeepSeek | DeepSeek V3.2 | 71 | 68 | 73 | 62 |
| 06 | Meta | Llama 4 Maverick | 67 | 64 | 69 | 58 |
// Illustrative data. Methodology, test suite, and raw traces at fenz.ai/leaderboard. Updated 2026-Q2.
08 / Free Audit
Audit one agent.
On us.
Request a complimentary behavioral audit of a single production or pre-production agent. You'll receive a full forensic verdict — findings, evidence, and remediation guidance.
- Scoped in 24 hours: We map your agent, tools, and permissions within one business day of approval.
- Full verdict in ≤ 3 days: Behavioral findings, replayable evidence, and prioritized fix guidance.
- NDA & data isolation: Every engagement runs under NDA with isolated compute and signed evidence bundles.
- No strings: The audit report is yours. Continuing with Fenz is optional, not required.