How to benchmark MMMU properly in SGLang?

#49
by JacobChang - opened

For GLM-5/4.7 in SGLang:
Launch the server:

python3 -m sglang.launch_server \
    --model /Path/to/zai-org/GLM-4.7 \
    --tp 8 \
    --tool-call-parser glm47  \
    --reasoning-parser glm45 
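Before running the full benchmark, it can help to confirm the server answers at all. A minimal sketch, assuming the SGLang server exposes its OpenAI-compatible `/v1/chat/completions` endpoint on port 30000 (the default); the `"default"` model name and the test question are placeholders:

```python
# Hypothetical sanity check against the launched SGLang server.
# Builds a one-question chat-completion payload; the network call is
# only made when run as a script.
import json
import urllib.request

def build_chat_payload(question: str, max_tokens: int = 256) -> dict:
    """Build a minimal OpenAI-style chat-completion payload."""
    return {
        "model": "default",  # SGLang serves the launched model regardless of this name
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    payload = build_chat_payload("What is 2 + 2? Answer with a single digit.")
    req = urllib.request.Request(
        "http://localhost:30000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```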

Benchmark:

python /sgl-workspace/sglang/benchmark/mmmu/bench_sglang.py  \
    --port 30000 --concurrency 900 --parallel 900 \
    --temperature 0 \
    --max-new-tokens 131072
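For reference, the "acc" the script reports is ordinary multiple-choice accuracy: exact match between the extracted answer letter and the gold answer, averaged over samples. A minimal sketch of that computation (the real logic in `bench_sglang.py` may differ, e.g. in how open-ended answers are matched):

```python
# Sketch of multiple-choice accuracy: fraction of predictions whose
# letter exactly matches the gold answer, case/whitespace-insensitive.
def accuracy(predictions: list[str], answers: list[str]) -> float:
    assert len(predictions) == len(answers)
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)
```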

The reported accuracy is about 0.55, but the previous model GLM-4.5V (with Thinking) achieved 0.754, as recorded at https://mmmu-benchmark.github.io/
There is a huge gap between these numbers. Any ideas?
