Q3_K_M.gguf vs UD-Q3_K_XL.gguf
Can you explain why I get better results from Q3_K_M? I use exactly the same parameters: temp 1, top-p 0.95, top-k 64, min-p 0.
So I tested both quants on a reasoning prompt, and Q3_K_M answers correctly 9/10 or 10/10 of the time, while UD-Q3_K_XL only about 7/10. And I tested them more than 10 times each...
Here is the prompt; the correct answer is 36:
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
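For reference, the arithmetic behind the expected answer can be sketched step by step. This is just one common reading of the riddle: the birds' apples count as gained, the 6 won coins are split three ways (me plus 2 teammates), and the city/river part is a distractor:

```python
# Walk through the prompt step by step (one common reading of the riddle).
apples = 10                  # start with 10 apples
coins = 3                    # find 3 gold coins in the river
apples -= 4                  # lose 4 apples
coins += 1                   # gain a gold coin
apples += 3 * 6              # three birds drop 6 apples each
coins += 6 // 3              # 6 won coins split equally among me and 2 teammates
apples += int(coins / 0.5)   # spend all coins; an apple costs 0.5 coins
coins = 0
print(apples)                # 36
```

So a model has to track two running totals and resist the "where is the river?" distractor to land on 36.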
I was confused too. Your observation is in line with their measurements:
| Model | MMLU (5-shot) | Disk Size (GB) | Efficiency |
|---|---|---|---|
| Q3_K_M | 70.70 | 12.51 | 3.58 |
| Q3_K_XL | 70.87 | 12.76 | 3.49 |
But that said, the MMLU 5-shot difference is so slight that I wouldn't have expected the output quality to differ that much! That's intriguing; maybe the KLD metric + MMLU alone aren't enough? Could you produce more than 10 samples? Maybe you still just observed noise.
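As a rough back-of-the-envelope check on the noise question (my own sketch, not from the thread): assume both quants actually answer this prompt correctly 80% of the time, and ask how likely a split at least as extreme as 7/10 vs 9/10 is from sampling variance alone.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.8  # assumption: both quants truly solve the prompt 80% of the time
# P(one run scores <= 7 while the other scores >= 9), in either order.
low = sum(binom_pmf(k, n, p) for k in range(0, 8))    # P(X <= 7)
high = sum(binom_pmf(k, n, p) for k in range(9, 11))  # P(X >= 9)
print(f"P(split at least 7 vs 9): {2 * low * high:.3f}")  # ~0.24
```

With only 10 trials per quant, a gap like that shows up roughly a quarter of the time even when the quants are identical, which is why more samples (or a proper benchmark) are needed before concluding anything.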
Dude, I might have really confused you xDD and I'm really sorry for that: I don't know how I messed up the numbers in my head!
So, I correct myself:
Your observation is NOT in line with their measurements! It's just the opposite! Really, don't ask me how I ended up thinking the MMLU was higher for Q3_K_M! But I really thought it was, so I was honest when I said I was confused too!
So now I'm not confused by their results anymore, but YOUR results are still confusing lol!
But still you might have observed noise.
Sorry again, I confused both of us, me first :D
I can hardly believe it's noise, because I tested both quants on the same question so many times... Why can't you try the same test?
Actually I'm just a regular user, so I have no idea if just one prompt/question reveals the quality difference between two quants.
> Actually I'm just a regular user, so I have no idea if just one prompt/question reveals the quality difference between two quants.
Nah, you're right, it doesn't, and I guess that's why we create benchmark datasets :D
> Why can't you try the same test?
Honestly, I think you're right that there's no point multiplying the samples here. A better idea would be to use a benchmarking framework, or build a simple one, and then evaluate a private dataset :) But I don't feel up to it, personally.
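A "simple one" doesn't have to be much. Here is a minimal harness sketch: it runs each prompt several times against a model callable and reports per-prompt accuracy. The `model` argument is any function mapping a prompt string to an answer string; in practice you'd wrap your llama.cpp server or API client there (that wrapper is hypothetical, plug in your own), and the stub below just makes the harness self-contained:

```python
def run_benchmark(model, dataset, n_samples=25):
    """Run each prompt n_samples times; dataset is a list of
    (prompt, checker) pairs where checker(answer) -> bool."""
    results = {}
    for prompt, checker in dataset:
        correct = sum(checker(model(prompt)) for _ in range(n_samples))
        results[prompt] = correct / n_samples  # per-prompt accuracy
    return results

# Usage with a stub "model" so the harness itself can be tried as-is:
dataset = [("How many apples do I have?", lambda a: "36" in a)]
stub = lambda prompt: "You end up with 36 apples."
print(run_benchmark(stub, dataset, n_samples=5))
```

Swap the stub for a real completion call and the checker for whatever "correct" means per prompt, and you can compare two quants on a private dataset instead of one riddle.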