AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models Paper • 2506.14682 • Published Jun 17, 2025
PentestJudge: Judging Agent Behavior Against Operational Requirements Paper • 2508.02921 • Published Aug 4, 2025
SYNTHETIC-1 Collection A collection of tasks & verifiers for reasoning datasets • 9 items • Updated Oct 7, 2025