How to use from the MLX library
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("JANGQ-AI/MiniMax-M2.7-JANGTQ")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

MiniMax-M2.7-JANGTQ

MiniMax M2.7 at 47 GB on disk (down from the ~230 GB FP8 source), using 2-bit JANGTQ2 quantization in the JANGTQ-PRESTACK layout: routed experts are pre-stacked on disk, so a cold load needs no runtime restacking pass and no runtime cache sidecar.

  • Source: MiniMaxAI/MiniMax-M2.7 (MiniMax M2 architecture, FP8 E4M3 block-128 native, 196K context, 62 layers, 256 routed experts top-8)
  • Quantization: JANGTQ2 — 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on attention / shared expert / embed / lm_head + fp16 passthrough on RMSNorms / router gate / expert_bias
  • Routed-expert layout: pre-stacked along axis 0 (block_sparse_moe.switch_mlp.<proj>.tq_packed shape [256, out, packed_in]) per the JANGTQ-PRESTACK STANDARD — no runtime restacking, no jangtq_stacked.safetensors sidecar
  • Bundle size: 47 GB on-disk across 51 shards
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
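The `tq_packed` shape `[256, out, packed_in]` suggests that the 2-bit codes along the input axis are packed into wider words. The following is a minimal sketch of one way such packing can work, assuming 16 two-bit codes per 32-bit word with the first code in the lowest bits. The card does not state the word size or bit order, and `pack_2bit` / `unpack_2bit` are hypothetical helpers, not part of jang-tools.

```python
def pack_2bit(codes):
    """Pack 2-bit codes (ints in 0..3) into 32-bit words, 16 codes per word.

    Assumption: the first code of each group occupies the lowest two bits.
    """
    if len(codes) % 16 != 0:
        raise ValueError("number of codes must be a multiple of 16")
    words = []
    for start in range(0, len(codes), 16):
        word = 0
        for offset, code in enumerate(codes[start:start + 16]):
            if not 0 <= code <= 3:
                raise ValueError("each code must fit in 2 bits")
            word |= code << (2 * offset)
        words.append(word)
    return words


def unpack_2bit(words):
    """Inverse of pack_2bit: recover 16 two-bit codes from each word."""
    codes = []
    for word in words:
        for offset in range(16):
            codes.append((word >> (2 * offset)) & 0b11)
    return codes


# One row with in_features = 1536 (one of the sizes named in the card).
in_features = 1536
codes = [i % 4 for i in range(in_features)]
packed = pack_2bit(codes)
packed_in = len(packed)  # in_features / 16 under this packing assumption
roundtrip_ok = unpack_2bit(packed) == codes
```

Under this assumption a row of 1536 input features occupies 96 words, which is where a `packed_in` dimension much smaller than `in_features` would come from.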

What's new in this build (2026-05-04)

This bundle is shipped in JANGTQ-PRESTACK layout — the routed-expert TurboQuant tensors are stacked along axis 0 directly in the main shards. Wins vs the previous per-expert layout:

| Metric | Old (per-expert) | This (pre-stacked) |
|---|---|---|
| First-load time | ~5-10 s restacking pass | mx.load() direct (~14 s incl warmup) |
| Decode tok/s | reference | identical (same MXTQ codec, same fused decode kernels) |
| Bundle size | ~57 GB | ~47 GB (smaller by virtue of removing per-expert metadata duplication) |
| Loader path | streaming hydrate + per-expert restack | generic loader's prestack branch |
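To make the difference between the two loader paths concrete, here is a rough sketch of what a per-expert restacking pass has to do and what the pre-stacked layout avoids. The per-expert key pattern (`...experts.<i>.<proj>.tq_packed`) is a hypothetical naming chosen for illustration, nested lists stand in for tensors, and `restack` is not the jang-tools implementation.

```python
import re

# Hypothetical per-expert key pattern; the real bundle's key names may differ.
PER_EXPERT = re.compile(
    r"^(?P<prefix>.+)\.experts\.(?P<idx>\d+)\.(?P<proj>\w+)\.tq_packed$"
)


def restack(per_expert_tensors):
    """Group per-expert tensors into one stacked entry per projection.

    The result maps a switch_mlp-style key to a list ordered by expert index,
    i.e. expert index becomes axis 0 of the stacked tensor.
    """
    groups = {}
    for key, tensor in per_expert_tensors.items():
        match = PER_EXPERT.match(key)
        if match is None:
            continue  # not a routed-expert tensor
        stacked_key = f"{match['prefix']}.switch_mlp.{match['proj']}.tq_packed"
        groups.setdefault(stacked_key, {})[int(match["idx"])] = tensor
    return {
        key: [by_index[i] for i in sorted(by_index)]
        for key, by_index in groups.items()
    }


# Old layout: one entry per expert, which the loader must restack at load time.
old_layout = {
    "layers.0.block_sparse_moe.experts.1.w1.tq_packed": [[5, 6]],
    "layers.0.block_sparse_moe.experts.0.w1.tq_packed": [[1, 2]],
}
stacked = restack(old_layout)
# New layout: the stacked entry is what is stored on disk, so this pass is skipped.
```

In the pre-stacked bundle the equivalent of `stacked` is already what the shards contain, which is why the loader can hand the tensors over without a regrouping step.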

What's in the bundle

| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 62 layers, pre-stacked along axis 0) | FP8 E4M3 + F32 weight_scale_inv | 2-bit MXTQ + sidecar codebook |
| Attention (q/k/v/o, q/k norms) | FP8 E4M3 / BF16 | 8-bit affine g=64 |
| embed_tokens / lm_head | BF16 | 8-bit affine g=64 |
| RMSNorm / router gate / e_score_correction_bias | BF16 / F32 | fp16 / fp32 passthrough |
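For readers unfamiliar with the "8-bit affine g=64" entry, the sketch below shows group-wise affine quantization in plain Python: each group of 64 weights gets its own scale and bias, and a weight is reconstructed as `code * scale + bias`. This illustrates the general scheme only. It is not the MLX kernel, and the rounding and clamping details here are assumptions.

```python
import math


def quantize_group(values, bits=8):
    """Affine-quantize one group: value ~= code * scale + bias."""
    levels = (1 << bits) - 1          # 255 for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                  # constant group: any positive scale works
        scale = 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes plus per-group scale/bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split a row into groups of group_size and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


# Toy row of 128 weights -> two groups of 64.
row = [math.sin(0.1 * i) for i in range(128)]
groups = quantize_row(row)
restored = [v for codes, scale, bias in groups
            for v in dequantize_group(codes, scale, bias)]
```

Because every group carries its own scale and bias, the reconstruction error of each weight is bounded by half of that group's scale.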

The bundle also ships a jangtq_runtime.safetensors sidecar (~25 KB) for Swift runtimes. It holds the codebooks and sign-flip vectors for in_features={1536, 3072} with seed=42 and bits=2.
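The codebooks and sign-flip vectors mentioned above correspond to the "Hadamard-rotated, Lloyd-Max optimized" description of the codec. The sketch below illustrates that general idea on a toy vector: flip signs with a seeded random vector, apply an orthonormal Walsh-Hadamard transform, fit a 4-level (2-bit) codebook with Lloyd-Max iterations, then invert the steps. It is an assumption-laden illustration, not the MXTQ codec: the length is 64 (a power of two, unlike 1536 or 3072), the seed only mirrors the card's `seed=42`, and all function names are hypothetical.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]


def nearest(value, codebook):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda k: abs(value - codebook[k]))


def lloyd_max(samples, levels=4, iters=25):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)]
                for k in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for v in samples:
            buckets[nearest(v, codebook)].append(v)
        codebook = [sum(b) / len(b) if b else c
                    for b, c in zip(buckets, codebook)]
    return sorted(codebook)


rng = random.Random(42)
n = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
weights = [rng.gauss(0.0, 0.02) for _ in range(n)]

# Encode: sign flip, rotate, then map each value to a 2-bit codebook index.
rotated = fwht([w * s for w, s in zip(weights, signs)])
codebook = lloyd_max(rotated)
codes = [nearest(v, codebook) for v in rotated]

# Decode: look up levels, rotate back (the transform is its own inverse), unflip.
restored = [v * s for v, s in zip(fwht([codebook[c] for c in codes]), signs)]
```

Only the codebook and the sign vector are needed to decode, which is consistent with a sidecar of a few kilobytes.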

Loading (Python)

# Install the loader and the MLX runtime first:
# pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

# Load the quantized model and its tokenizer
model, tokenizer = load_jangtq_model("JANGQ-AI/MiniMax-M2.7-JANGTQ")

The loader detects the pre-stacked layout via jang_config.routed_expert_layout == "prestacked" and routes through the generic JANGTQ loader's prestack branch. Decode applies the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV fusion patches automatically.
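The layout check described above can be pictured as a small config dispatch. This sketch assumes the config is a JSON object with a top-level `routed_expert_layout` key and that a missing key means the older per-expert layout. Both points are assumptions, and `pick_loader_branch` is a hypothetical helper that does not call jang-tools.

```python
import json


def pick_loader_branch(config_text):
    """Return which loader path a bundle would take, based on its config.

    Assumption: bundles without the key use the older per-expert layout.
    """
    config = json.loads(config_text)
    layout = config.get("routed_expert_layout", "per_expert")
    return "prestack" if layout == "prestacked" else "restack"


new_bundle = json.dumps({"routed_expert_layout": "prestacked"})
old_bundle = json.dumps({})  # hypothetical config from before the layout key existed

branch_new = pick_loader_branch(new_bundle)
branch_old = pick_loader_branch(old_bundle)
```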

Reasoning + tools

  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: minimax
  • Default mode: thinking ON (chat template opens <think> for the assistant); pass enable_thinking=False to skip reasoning
  • Cache: kv (standard MLA-free MoE attention cache)
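Since thinking is on by default and the chat template opens `<think>` for the assistant, generated text may contain only the closing tag. The sketch below shows one way to separate reasoning from the final answer in that case. It is a plain regex illustration, not the qwen3 reasoning parser itself, and `split_reasoning` is a hypothetical helper.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Split model output into (reasoning, answer).

    If the prompt already opened <think>, the output holds only </think>,
    so the opening tag is restored before matching.
    """
    if "<think>" not in text and "</think>" in text:
        text = "<think>" + text
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


# Output where the template opened <think> in the prompt:
reasoning, answer = split_reasoning("Add the numbers.\n</think>\nThe sum is 4.")
```

With `enable_thinking=False` the output would be expected to contain no reasoning block, in which case the function returns an empty reasoning string and the text unchanged.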

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
  • Base model: MiniMaxAI — M2.7 architecture