# Qwen3.6-27B GPTQ Int4
Partial GPTQ Int4 quantization of Qwen/Qwen3.6-27B, produced with the verbatim recipe from Qwen's own Qwen/Qwen3.5-27B-GPTQ-Int4: only the MLP/FFN layers are quantized to Int4; everything else stays BF16.
## Quantization
| Parameter | Value |
|---|---|
| Library | GPTQModel v6.0.3 |
| Bits | 4 |
| Group size | 128 |
| Symmetric | true |
| Desc-act | false |
| True-sequential | true |
| Damp | 0.01 |
| Calibration | 256 samples × 2048 tokens from allenai/c4 |
Kept in BF16 (not quantized): `lm_head`, `embed_tokens`, `.*attn.*` (Gated DeltaNet + Gated Attention), `.*mtp.*`, `.*shared_expert.*`, `.*visual.*`.
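A minimal sketch of reproducing this recipe with GPTQModel. It assumes the library's `QuantizeConfig(dynamic=...)` per-module override, where a `-:` regex prefix skips quantization for matched modules, and that `quantize()` accepts raw text samples; verify the kwarg names against your installed GPTQModel version.

```python
# Hedged sketch of the recipe above, not the exact script used for this repo.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# 256 calibration samples from allenai/c4 (2048-token sequences per the table)
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration = [row["text"] for _, row in zip(range(256), stream)]

cfg = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,
    desc_act=False,
    damp_percent=0.01,
    # Keep everything except MLP/FFN in BF16; lm_head and embed_tokens
    # are left unquantized by GPTQ by default.
    dynamic={
        "-:.*attn.*": {},            # Gated DeltaNet + Gated Attention
        "-:.*mtp.*": {},             # multi-token-prediction head
        "-:.*shared_expert.*": {},
        "-:.*visual.*": {},
    },
)

model = GPTQModel.load("Qwen/Qwen3.6-27B", cfg)
model.quantize(calibration)
model.save("Qwen3.6-27B-GPTQ-Int4")
```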
## Serving on SGLang
Use `--quantization moe_wna16`, not `--quantization gptq`; the plain GPTQ kernel rejects the BF16 attention layers this recipe keeps.
```bash
python -m sglang.launch_server \
  --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
  --quantization moe_wna16 --tp 4 --kv-cache-dtype fp8_e5m2 \
  --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code
```
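Once the server is up, it exposes an OpenAI-compatible API (on port 30000 unless `--port` is set). A quick smoke test:

```python
# Minimal smoke test against the launched server above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

resp = client.chat.completions.create(
    # the model name should match what the server reports at /v1/models
    model="raydelossantos/Qwen3.6-27B-GPTQ-Int4",
    messages=[{"role": "user", "content": "Summarize GPTQ in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```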
### With NEXTN speculative decoding (2-3× faster decode on agentic workloads)
```bash
SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path raydelossantos/Qwen3.6-27B-GPTQ-Int4 \
  --quantization moe_wna16 --tp 4 --mem-fraction-static 0.75 \
  --speculative-algo NEXTN --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --kv-cache-dtype fp8_e5m2 \
  --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code
```
## Hardware tested
- 4× RTX 3090 (96 GB total), TP=4: ≈7.7 GB of weights per GPU; a 100K context fits comfortably (see the back-of-envelope sketch below)
- Should also work on a single A100 40 GB or larger, a single H100, or 2× RTX 4090
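As a rough sanity check of the per-GPU figure, a minimal sketch; the Int4/BF16 parameter split below is an assumed, illustrative breakdown, not a published number.

```python
# Back-of-envelope weight-memory estimate at TP=4.
# The Int4/BF16 split is ASSUMED for illustration only.
GB = 1e9

bf16_params = 11e9  # assumed: attention, shared experts, MTP head, embeddings
int4_params = 16e9  # assumed: MoE/MLP expert weights quantized to 4-bit

int4_bytes = int4_params * 0.5 * 1.12  # 4-bit packing + ~12% scales/zeros at group size 128
bf16_bytes = bf16_params * 2.0

total_gb = (int4_bytes + bf16_bytes) / GB
print(f"weights ≈ {total_gb:.1f} GB total, ≈ {total_gb / 4:.1f} GB per GPU at TP=4")
```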
## Acknowledgments
- Recipe from Qwen/Qwen3.5-27B-GPTQ-Int4
- GPTQModel by ModelCloud