RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques Paper • 2501.14492 • Published Jan 24, 2025 • 27
Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation Paper • 2505.12058 • Published May 17, 2025 • 6