Q3_K_M.gguf vs UD-Q3_K_XL.gguf
Can you explain why I get better results from Q3_K_M? I use exactly the same parameters: temp 1, top-p 0.95, top-k 64, min-p 0.
So I tested both quants on a reasoning prompt, and Q3_K_M answers correctly 9/10 or 10/10 of the time, while UD-Q3_K_XL only about 7/10. And I tested them more than 10 times each...
Here is the prompt; the correct answer is 36:
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
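For reference, the arithmetic behind the expected answer can be sketched step by step. This is just one common reading of the riddle: the birds' apples count as gained, the 6 won coins are split three ways (me plus 2 teammates), and the city/river part is a distractor:

```python
# Walk through the prompt step by step (one common reading of the riddle).
apples = 10                  # start with 10 apples
coins = 3                    # find 3 gold coins in the river
apples -= 4                  # lose 4 apples
coins += 1                   # gain a gold coin
apples += 3 * 6              # three birds drop 6 apples each
coins += 6 // 3              # 6 won coins split equally among me and 2 teammates
apples += int(coins / 0.5)   # spend all coins; an apple costs 0.5 coins
coins = 0
print(apples)                # 36
```

So a model has to track two running totals and resist the "where is the river?" distractor to land on 36.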
I was confused too. Your observation is in line with their measurements:
| Model | MMLU (5-shot) | Disk Size (GB) | Efficiency |
|---|---|---|---|
| Q3_K_M | 70.70 | 12.51 | 3.58 |
| Q3_K_XL | 70.87 | 12.76 | 3.49 |
But that said, the MMLU 5-shot difference is so slight that I wouldn't have expected the output quality to differ that much! That's intriguing; maybe the KLD metric + MMLU alone aren't enough? Could you produce more than 10 samples? Maybe you still just observed noise.
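As a rough back-of-the-envelope check on the noise question (my own sketch, not from the thread): assume both quants actually answer this prompt correctly 80% of the time, and ask how likely a split at least as extreme as 7/10 vs 9/10 is from sampling variance alone.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.8  # assumption: both quants truly solve the prompt 80% of the time
# P(one run scores <= 7 while the other scores >= 9), in either order.
low = sum(binom_pmf(k, n, p) for k in range(0, 8))    # P(X <= 7)
high = sum(binom_pmf(k, n, p) for k in range(9, 11))  # P(X >= 9)
print(f"P(split at least 7 vs 9): {2 * low * high:.3f}")  # ~0.24
```

With only 10 trials per quant, a gap like that shows up roughly a quarter of the time even when the quants are identical, which is why more samples (or a proper benchmark) are needed before concluding anything.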
Dude, I might have really confused you xDD and I'm really sorry for that: I don't know how I messed up the numbers in my head!
So, I correct myself:
Your observation is NOT in line with their measurements! It's just the opposite! Really, don't ask me how I ended up thinking the MMLU was higher for Q3_K_M! But I really thought it was, so I was honest when I said I was confused too!
So now I'm not confused by their results anymore, but YOUR results are still confusing lol!
But still you might have observed noise.
Sorry again, I confused both of us, me first :D
I can hardly believe it's noise, because I tested both quants on the same question so many times... Why can't you try the same test?
Actually I'm just a regular user, so I have no idea if just one prompt/question reveals the quality difference between two quants.
> Actually I'm just a regular user, so I have no idea if just one prompt/question reveals the quality difference between two quants.
Nah, you're right, it doesn't, and I guess that's why we create benchmark datasets :D
> Why can't you try the same test?
Honestly, I think you're right that there's no point multiplying the samples here. A better idea would be to use a benchmarking framework, or build a simple one, and then evaluate a private dataset :) But I don't feel up to it, personally.
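A "simple one" doesn't have to be much. Here is a minimal harness sketch: it runs each prompt several times against a model callable and reports per-prompt accuracy. The `model` argument is any function mapping a prompt string to an answer string; in practice you'd wrap your llama.cpp server or API client there (that wrapper is hypothetical, plug in your own), and the stub below just makes the harness self-contained:

```python
def run_benchmark(model, dataset, n_samples=25):
    """Run each prompt n_samples times; dataset is a list of
    (prompt, checker) pairs where checker(answer) -> bool."""
    results = {}
    for prompt, checker in dataset:
        correct = sum(checker(model(prompt)) for _ in range(n_samples))
        results[prompt] = correct / n_samples  # per-prompt accuracy
    return results

# Usage with a stub "model" so the harness itself can be tried as-is:
dataset = [("How many apples do I have?", lambda a: "36" in a)]
stub = lambda prompt: "You end up with 36 apples."
print(run_benchmark(stub, dataset, n_samples=5))
```

Swap the stub for a real completion call and the checker for whatever "correct" means per prompt, and you can compare two quants on a private dataset instead of one riddle.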