For small models, is Q5 and above more robust in production?
#1
by Yhyu13 - opened
Though Pareto-optimal, the smallest 0.8B model at Q3 bits loses its competency and often generates infinite repetition at temperature 1.0. Just one thing to keep in mind. Maybe Q5, Q6, or Q8 PRISM is more robust in the real world for small models under 30B.
@Yhyu13 You're spot on: the 0.8B requires completely different temperature, repeat penalty, and other settings, along with higher-BPW treatment. However, the goal of this experiment was to generate a SOTA formula and an accompanying tool that produce the Pareto-frontier recipe without calibration, while meeting or beating current methods that require imatrix and large-dataset post-calibration efforts. I'll be releasing the tool so others can leverage it and generate their own quants.
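For anyone hitting the repetition issue described above, here is a sketch of the kind of sampler adjustments being discussed, using llama.cpp's CLI flags. The model filename and the specific values are illustrative assumptions, not a tested recipe for this model:

```shell
# Hypothetical llama.cpp invocation for a small low-bit quant.
# model-Q5_K_M.gguf is a placeholder path; tune values per model.
./llama-cli -m model-Q5_K_M.gguf \
  --temp 0.7 \            # lower than 1.0 to reduce degeneration
  --repeat-penalty 1.1 \  # penalize recently generated tokens
  -p "Explain quantization in one sentence."
```

Lowering temperature and adding a mild repeat penalty is a common mitigation for repetition loops in heavily quantized small models; the exact values depend on the model and quant level.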
Ex0bit changed discussion status to closed