# Kimi-K2.6-MoE-Smart-Quant (MLX)

MoE-aware mixed-precision quantization of moonshotai/Kimi-K2.6 for Apple Silicon.

## Quantization Strategy

Unlike uniform quantization, this model uses per-component bit allocation tuned to the MoE + MLA architecture:

| Component | Bits | Rationale |
|---|---|---|
| Routed experts (384 SwitchLinear) | 4-bit | Only 8/384 fire per token; very tolerant of low bits |
| Shared expert (always active) | 6-bit | Every-token path, needs precision |
| MLA value projections (v_a/v_b) | 8-bit | Most sensitive attention weights |
| MLA other projections (q_a/q_b/kv_a/kv_b/o) | 6-bit | Latent compression layers |
| lm_head + embed_tokens | 8-bit | Output quality |
| First/last 3 decoder layers | 6-bit | Boundary-layer sensitivity |
| Gate/router | unquantized | Tiny params, routing-critical |
| Vision encoder | unquantized | Preserved via mlx-vlm |

Effective average: ~4.5 bpw, giving near-6-bit quality at near-4-bit size. A sketch of the allocation logic follows below.
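
The exact conversion script for this repo is not published; the following is a minimal sketch of the bit-allocation logic, assuming mlx-lm's `quant_predicate` hook, which receives `(path, module, config)` and returns `False` to skip a module or a dict of quantization parameters. The path patterns, layer count, and precedence between overlapping rules are illustrative assumptions, not taken from this repo.

```python
import mlx.nn as nn

NUM_LAYERS = 61  # hypothetical; read the real count from config["num_hidden_layers"]

def smart_quant_predicate(path: str, module: nn.Module, config: dict):
    """Per-module quantization policy: False = keep full precision,
    otherwise a dict of parameters for the module's to_quantized()."""
    if not hasattr(module, "to_quantized"):
        return False  # module has no quantized counterpart

    # Gate/router: tiny parameter count but routing-critical -> full precision.
    if path.endswith("mlp.gate"):
        return False

    # Vision encoder: preserved unquantized via mlx-vlm.
    if "vision" in path:
        return False

    # Output head and embeddings: 8-bit for output quality.
    if "lm_head" in path or "embed_tokens" in path:
        return {"bits": 8, "group_size": 64}

    # MLA value projections: most sensitive attention weights -> 8-bit.
    if "v_a_proj" in path or "v_b_proj" in path:
        return {"bits": 8, "group_size": 64}

    # First/last 3 decoder layers: 6-bit for boundary-layer sensitivity.
    parts = path.split(".")
    if "layers" in parts:
        idx = int(parts[parts.index("layers") + 1])
        if idx < 3 or idx >= NUM_LAYERS - 3:
            return {"bits": 6, "group_size": 64}

    # Other MLA projections and the always-active shared expert: 6-bit.
    if any(k in path for k in ("q_a_proj", "q_b_proj", "kv_a_proj",
                               "kv_b_proj", "o_proj", "shared_expert")):
        return {"bits": 6, "group_size": 64}

    # Routed experts (SwitchLinear) and everything else: 4-bit.
    return {"bits": 4, "group_size": 64}
```

Note that with group size 64, quantized weights also carry per-group scale/bias overhead, which helps explain why the effective average lands above a naive weighted mix of the nominal bit widths.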

## Model Details

- Base model: Kimi-K2.6 (1T params, 32B active, 384 experts)
- Architecture: MoE + MLA (kimi_k25)
- Context: 256K tokens
- Modality: Vision + Language (VLM)
- Converted with: mlx-vlm 0.4.2

## Usage

### Hardware Requirements

- Single node: M3/M4 Ultra 192GB+ (fits in ~150GB)
- Distributed: 2x M3 Ultra via JACCL/RDMA for headroom

Weights are uploading; conversion is in progress.
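
Once the upload completes, loading should follow the standard mlx-vlm pattern. A sketch, with the prompt and image path as placeholders:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Kimi-K2.6-MoE-Smart-Quant"

# Load the quantized weights plus the paired processor and config.
model, processor = load(model_path)
config = load_config(model_path)

image = ["example.jpg"]  # placeholder image path
prompt = "Describe this image."

# Wrap the prompt in the model's chat template, declaring one image slot.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```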

