daVinci-Dev: Agent-native Mid-training for Software Engineering Paper • 2601.18418 • Published 5 days ago • 121
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts Paper • 2601.11044 • Published 15 days ago • 34
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery Paper • 2508.06960 • Published Aug 9, 2025 • 1
PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play? Paper • 2508.10014 • Published Aug 6, 2025
MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding Paper • 2508.15802 • Published Aug 14, 2025