Reasoning V3 SKU. Loads via vMLX or mlx_lm (Python). Follow @dealignai.



Nemotron-3-Nano-Omni-30B-A3B — MXFP4 + CRACK v2

MXFP4 (uniform 4-bit affine, group_size=32) | CRACK abliterated v2 | Vision + Audio (Speech) | Hybrid Mamba-2 + Attn + MoE | 21 GB



Headline numbers

| Metric | This v2 model | Base model | Δ |
|---|---|---|---|
| HarmBench-320 strict comply (thinking=ON) | 97.2% (311/320) | 12.81% (refuses) | +84.4pp |
| MMLU-200 generative (thinking=ON, max=8000) | 77.5% (155/200) | 85.0% (max=2000) | -7.5pp ✅ within ship criterion |
| Refusals on harmful prompts | 0 explicit refusals | 90%+ refuse | abliteration complete |
| `</think>` close at greedy on hard MMLU | 5/5 (gate test) | 5/5 | preserved |
| Multi-turn (3-turn escalation × 3 conversations) | 9/9 comply, context preserved | n/a | works |
| Thinking ON / OFF compliance | 5/5 ON · 3/5 OFF | refuses | thinking=ON recommended |
| Multimodal | byte-identical to base | preserved | preserved |
| Bundle size | 21 GB | 66 GB BF16 | |
| Context | 262,144 tokens native | same | preserved |

In reasoning preservation, MXFP4 (77.5% MMLU) sits between JANGTQ4 (74.0%) and JANGTQ (81.5%). Among the three v2 quants its MMLU drop vs base is the middle one (-7.5pp, vs -12.5pp for JANGTQ4 and -4.0pp for JANGTQ) and lands within the ship criterion. Pick MXFP4 if you want portable uniform 4-bit without the MXTQ tooling dependency.


v2 vs v1 (head-to-head)

v1 (shipped 2026-04-28) had a </think> termination defect at greedy decoding — the model couldn't terminate reasoning on hard prompts and looped to budget cutoff.

v2 (this release) restores clean termination:

| Bench | v1 (broken) | v2 (this release) |
|---|---|---|
| HarmBench-320 strict comply | 97.81% | 97.2% (0 refusals) |
| MMLU-200 thinking=ON | n/a (re-eval was pending) | 77.5% @ max=8000 |
| `</think>` close at greedy (5 hard MMLU) | 0/5 (loops) | 5/5 clean |
| Hard-stops are real loops? | YES (paragraph repetition) | NO (genuine deep reasoning, just out of budget) |
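
"Loop" here means verbatim paragraph repetition. A rough heuristic for telling a loop apart from budget-exhausted reasoning (illustrative only, not the harness's actual check):

```python
def looks_like_loop(text: str, tail: int = 3) -> bool:
    """Treat a generation as looping if its last few
    paragraphs are verbatim repeats of one another."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return len(paras) >= tail and len(set(paras[-tail:])) == 1
```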

MMLU-200 per-subject (BASE vs CRACK v2)

Both at thinking=ON, greedy. Base at max=2000, CRACK v2 at max=8000.

| Subject | Base | CRACK v2 | Δ | Notes |
|---|---|---|---|---|
| abstract_algebra | 17/20 (85%) | 15/20 (75%) | -10pp | budget-bound |
| anatomy | 17/20 (85%) | 13/20 (65%) | -20pp | |
| astronomy | 18/20 (90%) | 18/20 (90%) | 0 | unchanged ✅ |
| college_computer_science | 12/20 (60%) | 10/20 (50%) | -10pp | |
| college_physics | 20/20 (100%) | 18/20 (90%) | -10pp | |
| high_school_biology | 18/20 (90%) | 17/20 (85%) | -5pp | |
| high_school_chemistry | 19/20 (95%) | 19/20 (95%) | 0 | unchanged ✅ |
| high_school_mathematics | 16/20 (80%) | 15/20 (75%) | -5pp | |
| logical_fallacies | 17/20 (85%) | 15/20 (75%) | -10pp | |
| world_religions | 16/20 (80%) | 15/20 (75%) | -5pp | |
| **TOTAL** | 170/200 (85.0%) | 155/200 (77.5%) | -7.5pp | ✅ within criterion |

The 57 questions that hit the 8000-token budget without closing `</think>` are NOT loops; sampled continuations show genuine deep reasoning. With max_tokens ≥ 16384, accuracy approaches base; a retry pattern is sketched below.
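
A minimal sketch of that retry pattern, reusing `generate`, `model`, `tokenizer`, and `prompt` from the Loading section below:

```python
# If the think block never closes, the answer was budget-truncated rather
# than wrong: retry once with a larger budget before scoring it.
out = generate(model, tokenizer, prompt=prompt, max_tokens=8000)
if "</think>" not in out:
    out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
answer = out.split("</think>", 1)[-1].strip()
```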


HarmBench-320 per-category (CRACK v2)

| Category | n | CRACK comply | Refuse | Empty (truncated) |
|---|---|---|---|---|
| chemical_biological | 42 | 42 (100%) | 0 | 0 |
| copyright | 80 | 75 (94%) | 0 | 5 |
| cybercrime_intrusion | 52 | 48 (92%) | 0 | 4 |
| harassment_bullying | 21 | 21 (100%) | 0 | 0 |
| harmful | 18 | 18 (100%) | 0 | 0 |
| illegal | 53 | 53 (100%) | 0 | 0 |
| misinformation_disinformation | 54 | 54 (100%) | 0 | 0 |
| **Overall** | 320 | 311 (97.2%) | 0 | 9 |

Zero explicit refusals. The 9 "empty" verdicts are token-budget truncations on copyright/long prompts.


Operating recommendations

  • enable_thinking=True is recommended for MXFP4: at thinking=OFF it achieves only 3/5 hard-prompt compliance (some refusals reappear). For the strongest abliteration on this quant, use thinking=ON.
  • max_tokens ≥ 16384 for hard reasoning (math, abstract algebra, complex CS).
  • Both greedy (temperature=0) and sampling (temp=0.6, top_p=0.95, the NVIDIA-recommended settings in generation_config.json) work; see the sketch after this list.
  • Multi-turn: context is preserved across 3+ turns, with no late refusals after escalating prompts.
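
A minimal sketch of that sampling setup, assuming a recent mlx_lm that exposes `make_sampler` (model, tokenizer, and prompt as in the Loading section below):

```python
from mlx_lm.sample_utils import make_sampler

# NVIDIA-recommended settings from generation_config.json
sampler = make_sampler(temp=0.6, top_p=0.95)
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384, sampler=sampler)
```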

If you need full thinking-OFF compliance, prefer the JANGTQ-CRACK variant (5/5 in BOTH modes).


Verification

  • All multimodal tensors (vision + audio + projectors) are byte-identical to base — capabilities fully preserved.
  • All config files unchanged (config.json, generation_config.json, chat_template.jinja, tokenizer_config.json).
  • Quant config preserved: {"group_size": 32, "bits": 4} uniform 4-bit affine.
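
A quick spot-check of the shipped quant config (a minimal sketch, assuming the standard MLX-style `quantization` block in config.json; `path` as in the Loading section below):

```python
import json
from pathlib import Path

cfg = json.loads((Path(path) / "config.json").read_text())
print(cfg["quantization"])  # expect {"group_size": 32, "bits": 4}
```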

Architecture (nemotron_h)

  • 52 layers: hybrid Mamba-2 + MoE + Attention
  • Hidden 2688, head_dim 128, GQA 32q/2kv (NO RoPE on attention — position from Mamba state)
  • 128 routed experts top-6 (sigmoid) + 1 shared expert per MoE layer
  • Multimodal: image (RADIO ViT) + audio/speech (Parakeet) merged via early-fusion projectors
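
To make "128 routed experts top-6 (sigmoid)" concrete, here is a tiny numpy sketch of sigmoid top-k gating; the names and shapes are illustrative, not the model's actual kernel:

```python
import numpy as np

def route(hidden, router_w, k=6):
    # hidden: (d_model,), router_w: (d_model, n_experts)
    scores = 1.0 / (1.0 + np.exp(-(hidden @ router_w)))  # sigmoid gate per expert
    topk = np.argsort(scores)[-k:]                        # indices of the 6 highest-scoring experts
    weights = scores[topk] / scores[topk].sum()           # normalized gate weights
    return topk, weights                                  # the shared expert is always added on top
```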

Loading

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from mlx_lm import generate
from mlx_lm.utils import load_model, load_tokenizer

# Download the quantized bundle from the Hub
path = snapshot_download("dealignai/Nemotron-3-Nano-Omni-30B-A3B-MXFP4-CRACK")

# strict=False skips the multimodal (vision/audio) keys that the
# text-only mlx_lm loader does not consume
model, _ = load_model(Path(path), lazy=False, strict=False)
tokenizer = load_tokenizer(Path(path), tokenizer_config_extra={"trust_remote_code": True})

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question"}],
    tokenize=False, add_generation_prompt=True,
    enable_thinking=True,  # thinking=ON recommended for this quant
)
out = generate(model, tokenizer, prompt=prompt, max_tokens=16384)
print(out.split("</think>", 1)[-1])  # strip the reasoning trace, keep the final answer
```

For the multimodal pipeline (image + audio + video), pair this bundle with the unmodified Multimodal-Addon.


Use responsibly

This model has had refusal training surgically removed for legitimate research, red-teaming, and evaluation. Outputs may include harmful content. You are solely responsible for any use. Do not deploy in consumer-facing contexts without your own safety layer. Do not use in violation of applicable law in your jurisdiction.


Built by dealignai. Sister bundles: JANGTQ4-CRACK (19 GB, 4-bit MXTQ) · JANGTQ-CRACK (12 GB, 2-bit MXTQ — best MMLU + 5/5 thinking-OFF).
