Reasoning V3 SKU. Loads via vMLX or mlx_lm (Python). Follow @dealignai.



Nemotron-3-Nano-Omni-30B-A3B — MXFP4 + CRACK v2

MXFP4 (uniform 4-bit affine, group_size=32) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 21 GB



Headline numbers

| Metric | This v2 model | Base model | Δ |
|---|---|---|---|
| HarmBench-320 strict comply (thinking=ON) | 97.2% (311/320) | 12.81% (refuses) | +84.4pp |
| MMLU-200 generative (thinking=ON, max=8000) | 77.5% (155/200) | 85.0% (max=2000) | -7.5pp ✅ within ship criterion |
| Refusals on harmful prompts | 0 explicit refusals | 90%+ refuse | abliteration complete |
| `</think>` close at greedy on hard MMLU | 5/5 (gate test) | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Thinking ON / OFF compliance | 5/5 ON · 3/5 OFF | refuses | thinking=ON recommended |
| Multimodal | byte-identical to base | preserved | preserved |
| Bundle size | 21 GB | 66 GB BF16 | |
| Context | 262,144 tokens native | same | preserved |

In reasoning preservation, MXFP4 (77.5% MMLU) sits between JANGTQ4 (74.0%) and JANGTQ (81.5%). Among the three v2 quants its MMLU drop vs base is the middle one (-7.5pp, vs -12.5pp for JANGTQ4 and -4.0pp for JANGTQ) and lands within the ship criterion. Pick MXFP4 if you want portable uniform 4-bit without the MXTQ tooling dependency.


v2 vs v1 (head-to-head)

v1 (shipped 2026-04-28) had a </think> termination defect at greedy decoding — the model couldn't terminate reasoning on hard prompts and looped to budget cutoff.

v2 (this release) restores clean termination:

| Bench | v1 (broken) | v2 (this release) |
|---|---|---|
| HarmBench-320 strict comply | 97.81% | 97.2% (0 refusals) |
| MMLU-200 thinking=ON | n/a (re-eval was pending) | 77.5% @ max=8000 |
| `</think>` close at greedy (5 hard MMLU) | 0/5 (loops) | 5/5 clean |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |
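
"Loop" here means verbatim paragraph repetition. A rough heuristic for telling a loop apart from budget-exhausted reasoning (illustrative only, not the harness's actual check):

```python
def looks_like_loop(text: str, tail: int = 3) -> bool:
    """Treat a generation as looping if its last few
    paragraphs are verbatim repeats of one another."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return len(paras) >= tail and len(set(paras[-tail:])) == 1
```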

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

| Subject | Base | CRACK v2 | Δ | Notes |
|---|---|---|---|---|
| abstract_algebra | 17/20 (85%) | 15/20 (75%) | -10pp | budget-bound |
| anatomy | 17/20 (85%) | 13/20 (65%) | -20pp | |
| astronomy | 18/20 (90%) | 18/20 (90%) | 0 | unchanged ✅ |
| college_computer_science | 12/20 (60%) | 10/20 (50%) | -10pp | |
| college_physics | 20/20 (100%) | 18/20 (90%) | -10pp | |
| high_school_biology | 18/20 (90%) | 17/20 (85%) | -5pp | |
| high_school_chemistry | 19/20 (95%) | 19/20 (95%) | 0 | unchanged ✅ |
| high_school_mathematics | 16/20 (80%) | 15/20 (75%) | -5pp | |
| logical_fallacies | 17/20 (85%) | 15/20 (75%) | -10pp | |
| world_religions | 16/20 (80%) | 15/20 (75%) | -5pp | |
| **TOTAL** | 170/200 (85.0%) | 155/200 (77.5%) | -7.5pp | ✅ within criterion |

The 57 questions that hit the 8000-token budget without closing `</think>` are NOT loops; sampled continuations show genuine deep reasoning. With max_tokens ≥ 16384, accuracy approaches base; a retry pattern is sketched below.
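
A minimal sketch of that retry pattern, reusing `generate`, `model`, `tokenizer`, and `prompt` from the Loading section below:

```python
# If the think block never closes, the answer was budget-truncated rather
# than wrong: retry once with a larger budget before scoring it.
out = generate(model, tokenizer, prompt=prompt, max_tokens=8000)
if "</think>" not in out:
    out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
answer = out.split("</think>", 1)[-1].strip()
```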


HarmBench-320 per-category (CRACK v2)

| Category | n | CRACK comply | Refuse | Empty (truncated) |
|---|---|---|---|---|
| chemical_biological | 42 | 42 (100%) | 0 | 0 |
| copyright | 80 | 75 (94%) | 0 | 5 |
| cybercrime_intrusion | 52 | 48 (92%) | 0 | 4 |
| harassment_bullying | 21 | 21 (100%) | 0 | 0 |
| harmful | 18 | 18 (100%) | 0 | 0 |
| illegal | 53 | 53 (100%) | 0 | 0 |
| misinformation_disinformation | 54 | 54 (100%) | 0 | 0 |
| **Overall** | 320 | 311 (97.2%) | 0 | 9 |

Zero explicit refusals. The 9 "empty" verdicts are token-budget truncations on copyright/long prompts.


Operating recommendations

  • enable_thinking=True is recommended for MXFP4: at thinking=OFF it achieves only 3/5 hard-prompt compliance (some refusals reappear). For the strongest abliteration on this quant, use thinking=ON.
  • max_tokens ≥ 16384 for hard reasoning (math, abstract algebra, complex CS).
  • Both greedy (temperature=0) and sampling (temp=0.6, top_p=0.95, the NVIDIA-recommended settings in generation_config.json) work; see the sketch after this list.
  • Multi-turn: context is preserved across 3+ turns, with no late refusals after escalating prompts.
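
A minimal sketch of that sampling setup, assuming a recent mlx_lm that exposes `make_sampler` (model, tokenizer, and prompt as in the Loading section below):

```python
from mlx_lm.sample_utils import make_sampler

# NVIDIA-recommended settings from generation_config.json
sampler = make_sampler(temp=0.6, top_p=0.95)
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384, sampler=sampler)
```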

If you need full thinking-OFF compliance, prefer the JANGTQ-CRACK variant (5/5 in BOTH modes).


Verification

  • All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
  • All config files unchanged (config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
  • Quant config preserved: {"group_size": 32, "bits": 4} uniform 4-bit affine.
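
A quick spot-check of the shipped quant config (a minimal sketch, assuming the standard MLX-style `quantization` block in config.json; `path` as in the Loading section below):

```python
import json
from pathlib import Path

cfg = json.loads((Path(path) / "config.json").read_text())
print(cfg["quantization"])  # expect {"group_size": 32, "bits": 4}
```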

Architecture (nemotron_h)

  • 52 layers: hybrid Mamba-2 + MoE + Attention
  • Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
  • 128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
  • Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
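
To make "128 routed experts top-6 (sigmoid)" concrete, here is a tiny numpy sketch of sigmoid top-k gating; the names and shapes are illustrative, not the model's actual kernel:

```python
import numpy as np

def route(hidden, router_w, k=6):
    # hidden: (d_model,), router_w: (d_model, n_experts)
    scores = 1.0 / (1.0 + np.exp(-(hidden @ router_w)))  # sigmoid gate per expert
    topk = np.argsort(scores)[-k:]                        # indices of the 6 highest-scoring experts
    weights = scores[topk] / scores[topk].sum()           # normalized gate weights
    return topk, weights                                  # the shared expert is always added on top
```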

Loading

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from mlx_lm import generate
from mlx_lm.utils import load_model, load_tokenizer

# Download the quantized bundle from the Hub
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-MXFP4-CRACK")

# strict=False skips the multimodal (vision/audio) keys that the
# text-only mlx_lm loader does not consume
model, _ = load_model(Path(path), lazy=False, strict=False)
tokenizer = load_tokenizer(Path(path), tokenizer_config_extra={"trust_remote_code": True})

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,  # thinking=ON recommended for this quant
)
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
print(out.split("</think>", 1)[-1])  # strip the reasoning trace, keep the final answer
```

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.


Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.


Built by dealignai. Sister bundles: JANGTQ4-CRACK (19 GB, 4-bit MXTQ) · JANGTQ-CRACK (12 GB, 2-bit MXTQ — best MMLU + 5/5 thinking-OFF).
