How are evaluation results generated for existing multilingual benchmarks that consist of queries only?
#2 opened by haidequanbu
Thank you to the authors for your contribution to the open-source community.
While reading the paper, I noted that PolyGuard is trained on prompt-response pairs. Could you clarify how the evaluation is conducted on test sets consisting of prompts only, such as MultiJ and XSafety?
Hi, for prompt-only benchmarks, each prompt is paired with an empty response before being sent to the model.
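
In code, this could look like the minimal sketch below, assuming a standard `transformers` chat interface. The model id and the "Human user / AI assistant" input template are illustrative placeholders, not the confirmed PolyGuard format; the exact template should be taken from the PolyGuard model card.

```python
# Minimal sketch: evaluating a prompt-only benchmark by pairing each
# prompt with an empty response string before classification.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ToxicityPrompts/PolyGuard"  # hypothetical model id; check the model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def classify(prompt: str, response: str = "") -> str:
    # Prompt-only benchmarks (e.g. XSafety) supply no model response,
    # so the response slot is simply left empty.
    # NOTE: illustrative template only; use the one from the model card.
    user_msg = f"Human user: {prompt}\nAI assistant: {response}"
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_msg}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=100)
    # Decode only the newly generated tokens (the safety verdict).
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Usage on a prompt-only example: the response argument stays "".
print(classify("Tell me how to pick a lock."))
```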