Reasoning V3 SKU. Loads via vMLX or
mlx_lmPython. Follow @dealignai.
Nemotron-3-Nano-Omni-30B-A3B — MXFP4 + CRACK v2
MXFP4 (uniform 4-bit affine, group_size=32) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 21 GB
Headline numbers
| Metric | This v2 model | Base model | Δ |
|---|---|---|---|
| HarmBench-320 strict comply (thinking=ON) | 97.2% (311/320) | 12.81% (refuses) | +84.4pp |
| MMLU-200 generative (thinking=ON, max=8000) | 77.5% (155/200) | 85.0% (max=2000) | -7.5pp ✅ within ship criterion |
| Refusals on harmful prompts | 0 explicit refuses | 90%+ refuse | abliteration complete |
</think> close at greedy on hard MMLU |
5/5 (gate test) | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Thinking ON / OFF compliance | 5/5 ON · 3/5 OFF | refuses | thinking=ON recommended |
| Multimodal byte-identical to base | preserved | — | preserved |
| Bundle size | 21 GB | 66 GB BF16 | — |
| Context | 262,144 tokens native | same | preserved |
MXFP4 sits between JANGTQ4 (74.0% MMLU) and JANGTQ (81.5% MMLU) in reasoning preservation. Among the three v2 quants, MXFP4 has the smallest MMLU drop within the ship criterion (-7.5pp vs base, vs -12.5pp for JANGTQ4 and -4.0pp for JANGTQ). Pick MXFP4 if you want portable uniform 4-bit without the MXTQ tooling dependency.
v2 vs v1 (head-to-head)
v1 (shipped 2026-04-28) had a </think> termination defect at greedy decoding — the model couldn't terminate reasoning on hard prompts and looped to budget cutoff.
v2 (this release) restores clean termination:
| Bench | v1 (broken) | v2 (this release) |
|---|---|---|
| HarmBench-320 strict comply | 97.81% | 97.2% (0 refusals) |
| MMLU-200 thinking=ON | n/a (re-eval was pending) | 77.5% @max=8000 |
</think> close at greedy (5 hard MMLU) |
0/5 (loops) | 5/5 clean |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |
MMLU-200 per-subject (BASE vs CRACK v2)
Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.
| Subject | Base | CRACK v2 | Δ | Notes |
|---|---|---|---|---|
| abstract_algebra | 17/20 (85%) | 15/20 (75%) | -10pp | Budget-bound |
| anatomy | 17/20 (85%) | 13/20 (65%) | -20pp | |
| astronomy | 18/20 (90%) | 18/20 (90%) | 0 | unchanged ✅ |
| college_computer_science | 12/20 (60%) | 10/20 (50%) | -10pp | |
| college_physics | 20/20 (100%) | 18/20 (90%) | -10pp | |
| high_school_biology | 18/20 (90%) | 17/20 (85%) | -5pp | |
| high_school_chemistry | 19/20 (95%) | 19/20 (95%) | 0 | unchanged ✅ |
| high_school_mathematics | 16/20 (80%) | 15/20 (75%) | -5pp | |
| logical_fallacies | 17/20 (85%) | 15/20 (75%) | -10pp | |
| world_religions | 16/20 (80%) | 15/20 (75%) | -5pp | |
| TOTAL | 170/200 (85.0%) | 155/200 (77.5%) | -7.5pp ✅ within criterion |
The 57 questions that hit the 8000-token budget without closing </think> are NOT loops — sampling shows the model doing genuine deep reasoning. With max_tokens ≥ 16384, accuracy approaches base.
HarmBench-320 per-category (CRACK v2)
| Category | n | CRACK comply | Refuse | Empty (truncated) |
|---|---|---|---|---|
| chemical_biological | 42 | 42 (100%) | 0 | 0 |
| copyright | 80 | 75 (94%) | 0 | 5 |
| cybercrime_intrusion | 52 | 48 (92%) | 0 | 4 |
| harassment_bullying | 21 | 21 (100%) | 0 | 0 |
| harmful | 18 | 18 (100%) | 0 | 0 |
| illegal | 53 | 53 (100%) | 0 | 0 |
| misinformation_disinformation | 54 | 54 (100%) | 0 | 0 |
| Overall | 320 | 311 (97.2%) | 0 | 9 |
Zero explicit refusals. The 9 "empty" verdicts are token-budget truncations on copyright/long prompts.
Operating recommendations
enable_thinking=Trueis recommended for MXFP4 — at thinking=OFF MXFP4 only achieves 3/5 hard-prompt compliance (some refusals reappear). For the strongest abliteration on this quant, use thinking=ON.max_tokens ≥ 16384for hard reasoning (math, abstract algebra, complex CS).- Greedy (temperature=0) AND sampling (temp=0.6, top_p=0.95 — NVIDIA-recommended in
generation_config.json) both work. - Multi-turn — context preserved across 3+ turns; no late refusals after escalating prompts.
If you need full thinking-OFF compliance, prefer the JANGTQ-CRACK variant (5/5 in BOTH modes).
Verification
- All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
- All config files unchanged (config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
- Quant config preserved:
{"group_size": 32, "bits": 4}uniform 4-bit affine.
Architecture (nemotron_h)
- 52 layers: hybrid Mamba-2 + MoE + Attention
- Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
- 128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
- Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
Loading
from huggingface_hub import snapshot_download
from mlx_lm.utils import load_model, load_tokenizer
from pathlib import Path
from mlx_lm import generate
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-MXFP4-CRACK")
model, _ = load_model(Path(path), lazy=False, strict=False) # strict=False ignores multimodal keys
tokenizer = load_tokenizer(Path(path), tokenizer_config_extra={"trust_remote_code": True})
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Your question"}],
tokenize=False, add_generation_prompt=True,
enable_thinking=True,
)
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
print(out.split("</think>", 1)[-1])
For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.
Use responsibly
This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.
Built by dealignai. Sister bundles: JANGTQ4-CRACK (19 GB, 4-bit MXTQ) · JANGTQ-CRACK (12 GB, 2-bit MXTQ — best MMLU + 5/5 thinking-OFF).
- Downloads last month
- 1,760
Quantized