Agent-level behavioral leaderboard.

We test every major foundation model across four behavioral dimensions — goal control, permission boundary integrity, execution integrity, and resistance to context poisoning. Public scores below. Full agent-level audits for your production system on request.

Open leaderboard. Agent-level behavior.

Public benchmarking of model safety across behavioral dimensions. Full agent-level behavioral audits are available to enterprise customers on request.

Rank  Vendor     Model              Goal Ctrl  Boundary  Execution  Poisoning Resist  Overall
01    Anthropic  Claude Opus 4.7    96         94        92         91                93
02    Anthropic  Claude Sonnet 4.6  93         91        89         88                90
03    OpenAI     GPT-5              88         85        87         82                86
04    Google     Gemini 2.5 Pro     84         82        81         79                82
05    DeepSeek   DeepSeek V3.2      71         68        73         62                69
06    Meta       Llama 4 Maverick   67         64        69         58                65

// Illustrative data. Methodology, test suite, and raw traces at fenz.ai/leaderboard. Updated 2026-Q2.
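
For readers who want to sanity-check the Overall column, here is a minimal sketch. It assumes Overall is the unweighted mean of the four dimension scores, rounded half up. That rule is an assumption, not the published methodology (which lives at fenz.ai/leaderboard); it merely happens to reproduce every Overall value listed above. The names `ROWS`, `overall_half_up`, and `check_rows` are hypothetical.

```python
# Scores copied from the leaderboard above, in the order:
# (goal control, boundary, execution, poisoning resistance, published overall)
ROWS = {
    "Claude Opus 4.7": (96, 94, 92, 91, 93),
    "Claude Sonnet 4.6": (93, 91, 89, 88, 90),
    "GPT-5": (88, 85, 87, 82, 86),
    "Gemini 2.5 Pro": (84, 82, 81, 79, 82),
    "DeepSeek V3.2": (71, 68, 73, 62, 69),
    "Llama 4 Maverick": (67, 64, 69, 58, 65),
}


def overall_half_up(scores):
    """Unweighted mean of integer scores, rounded half up.

    ASSUMPTION: this aggregation rule is a guess, not the documented one.
    Integer arithmetic is used because Python's built-in round() rounds
    halves to the nearest even integer (round(68.5) == 68).
    """
    n = len(scores)
    return (2 * sum(scores) + n) // (2 * n)


def check_rows(rows):
    """Map each model to True if the assumed rule matches its published overall."""
    return {
        model: overall_half_up(values[:4]) == values[4]
        for model, values in rows.items()
    }


if __name__ == "__main__":
    for model, ok in check_rows(ROWS).items():
        print(f"{model}: {'matches' if ok else 'differs'}")
```

A match here shows only that the illustrative numbers are internally consistent with a plain average; a real scoring scheme could weight the dimensions differently.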

Audit one agent.
On us.

Request a complimentary behavioral audit of a single production or pre-production agent. You'll receive a full forensic verdict — findings, evidence, and remediation guidance.

  • 01
    Scoped in 24 hours
    We map your agent, tools, and permissions within one business day of approval.
  • 02
    Full verdict in ≤ 3 days
    Behavioral findings, replayable evidence, and prioritized fix guidance.
  • 03
    NDA & data isolation
    Every engagement runs under NDA with isolated compute and signed evidence bundles.
  • 04
    No strings
    Audit report is yours. Continuing with Fenz is optional, not required.
ACCEPTING_REQUESTS / Q2-2026
EST. RESPONSE · <24H

// YOUR DATA IS ENCRYPTED IN TRANSIT. NO INFO IS SHARED WITH THIRD PARTIES.