Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory • Paper 2505.15055 • Published May 21