Kimi-K2.6-MoE-Smart-Quant (MLX)
MoE-aware mixed-precision quantization of moonshotai/Kimi-K2.6 for Apple Silicon.
Quantization Strategy
Unlike uniform quantization, this conversion allocates bits per component, tuned for the MoE + MLA architecture:
| Component | Bits | Rationale |
|---|---|---|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8/384 fire per token, so very tolerant of low-bit quantization |
| Shared expert (always active) | 6-bit | Every-token path, needs precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layer |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary layer sensitivity |
| Gate/router | unquantized | Tiny params, routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |
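
The table above can be read as a rule that maps each weight path to a bit width. The following is a minimal sketch of such a rule, not the conversion script used for this model. The module-path patterns (`switch_mlp`, `shared_experts`, `self_attn`, `mlp.gate`, `vision`), the function name `bits_for`, the 6-bit fallback for unmatched weights, and the choice to treat the boundary-layer row as a floor of 6 bits are all assumptions made for illustration.

```python
import re
from typing import Optional

BOUNDARY_LAYERS = 3  # first/last N decoder layers get at least 6 bits


def bits_for(path: str, num_layers: int) -> Optional[int]:
    """Return the bit width for a weight path, or None to leave it unquantized.

    All path patterns below are hypothetical; real checkpoint names may differ.
    """
    # Router/gate and vision encoder stay in their original precision.
    if path.endswith("mlp.gate") or "vision" in path:
        return None
    # Embeddings and output head: 8-bit.
    if "lm_head" in path or "embed_tokens" in path:
        return 8

    if "self_attn" in path:
        # Value projections (v_a / v_b in the table) get 8 bits, the rest 6.
        bits = 8 if re.search(r"\.v_[ab]", path) else 6
    elif "shared_experts" in path:
        bits = 6
    elif "switch_mlp" in path or ".experts." in path:
        bits = 4
    else:
        bits = 6  # assumption: unmatched weights fall back to 6-bit

    # Assumption: the boundary-layer row acts as a floor, never a reduction.
    match = re.search(r"layers\.(\d+)\.", path)
    if match:
        idx = int(match.group(1))
        if idx < BOUNDARY_LAYERS or idx >= num_layers - BOUNDARY_LAYERS:
            bits = max(bits, 6)
    return bits
```

A function of this shape could back a per-layer quantization callback, if the conversion tool in use exposes one. The layer count is passed in explicitly because this card does not state it.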
Effective average: ~4.5 bits per weight (bpw), aiming for near-6-bit quality at near-4-bit size.
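
An effective average like this is a parameter-weighted mean of the per-component bit widths, optionally plus the per-group metadata that group-wise quantization stores. The sketch below shows the arithmetic only. The component shares in the usage note are illustrative placeholders, not measured values for this model, and the group size and 16-bit scale/bias widths are assumptions, since this card does not state them.

```python
from typing import Iterable, Optional, Tuple


def effective_bpw(
    components: Iterable[Tuple[float, float]],
    group_size: Optional[int] = None,
    scale_bits: int = 16,
    bias_bits: int = 16,
) -> float:
    """Parameter-weighted average bits per weight.

    components: (parameter_count_or_share, bits) pairs for quantized weights.
    group_size: if given, add the per-weight cost of one scale and one bias
                per group (an assumption about the storage format).
    """
    components = list(components)
    total = sum(n for n, _ in components)
    mean_bits = sum(n * b for n, b in components) / total
    overhead = (scale_bits + bias_bits) / group_size if group_size else 0.0
    return mean_bits + overhead
```

For example, with hypothetical shares of 96% at 4-bit, 3% at 6-bit, and 1% at 8-bit, `effective_bpw([(96, 4), (3, 6), (1, 8)])` gives the raw weighted mean, and passing `group_size=64` adds half a bit per weight for the scale and bias.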
Model Details
- Base model: Kimi-K2.6 (1T params, 32B active, 384 experts)
- Architecture: MoE + MLA (kimi_k25)
- Context: 256K tokens
- Modality: Vision + Language (VLM)
- Converted with: mlx-vlm 0.4.2
Usage
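
Since the model was converted with mlx-vlm, loading it should follow the usual mlx-vlm pattern once the weights are available. This is a sketch under assumptions, not a tested recipe for this checkpoint: the repository ID is inferred from this card, the helper name `describe_image` is hypothetical, and the `load` / `apply_chat_template` / `generate` calls follow the general mlx-vlm README usage, which may differ in version 0.4.2 or need extra options for this architecture.

```python
# Repository ID inferred from this card; adjust if the final repo name differs.
MODEL_ID = "mlx-community/Kimi-K2.6-MoE-Smart-Quant"


def describe_image(image_path: str, prompt: str = "Describe this image."):
    """Run one vision + language generation and return mlx-vlm's result."""
    # Imported lazily so this module loads even where mlx-vlm is not installed.
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(MODEL_ID)
    config = load_config(MODEL_ID)

    # Wrap the raw prompt in the model's chat template for a single image.
    formatted = apply_chat_template(processor, config, prompt, num_images=1)

    # The return type depends on the mlx-vlm version (plain text or a result object).
    return generate(model, processor, formatted, [image_path], verbose=False)
```

Calling `describe_image("photo.jpg")` downloads the weights on first use, so it only makes sense on a machine that meets the hardware requirements.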
Hardware Requirements
- Single node: M3/M4 Ultra 192GB+ (fits in ~150GB)
- Distributed: 2x M3 Ultra via JACCL/RDMA for headroom
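
A rough way to check whether the weights fit in unified memory is to multiply the parameter count by the average bits per weight. The sketch below uses only the nominal figures from this card (1T parameters, ~4.5 bpw) and counts weight storage alone, ignoring KV cache and activations. The function names and the `usable_fraction` headroom factor are assumptions for illustration. With those nominal figures the estimate lands well above the ~150GB quoted above, so the numbers are worth re-checking against the published weights.

```python
def weight_footprint_gb(num_params: float, bpw: float) -> float:
    """Weight storage in decimal gigabytes: params * bits / 8 bits per byte."""
    return num_params * bpw / 8 / 1e9


def fits_in_memory(
    weights_gb: float,
    memory_gb_per_node: float,
    nodes: int = 1,
    usable_fraction: float = 0.8,  # assumption: leave ~20% for cache and OS
) -> bool:
    """True if the weights fit in the usable share of total unified memory."""
    return weights_gb <= memory_gb_per_node * nodes * usable_fraction


# Nominal figures from this card: 1T parameters at ~4.5 bits per weight.
nominal_gb = weight_footprint_gb(1e12, 4.5)
single_node_ok = fits_in_memory(nominal_gb, memory_gb_per_node=192)
```

The same helper can be rerun with the real on-disk size of the converted weights to size a single-node or distributed setup.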
Weights are uploading; conversion is in progress.