majentik's picture
Upload folder using huggingface_hub
0240fe4 verified
|
raw
history blame
2.39 kB
metadata
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags:
  - nemotron
  - multimodal
  - turboquant
  - kv-cache
  - gguf
  - combo-card

Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF IQ4_XS + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack of Nemotron-3-Nano-Omni-30B-A3B-Reasoning at GGUF IQ4_XS.

No new weights are published here. This card describes a runtime configuration: load the weights from majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS (forthcoming in Phase 2.2 of the publication plan) and apply the KV-cache modifier documented in majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant.

Modality matrix

Modality Encoder Quantization in this variant
Text LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) per the variant suffix
Image CRADIO v4-H BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file)
Audio Parakeet-TDT-0.6B-v2 BF16 (same rationale)
Video Parakeet-TDT-0.6B-v2 + frame sampler BF16 (≤ 2 min, 256 frames @ 2 FPS)

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

Runtime quirks

llama.cpp

Use llama-mtmd-cli for multimodal inference; pass --mmproj mmproj-F16.gguf (see majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16).

Do NOT use CUDA 13.2 — produces gibberish. Pin CUDA 12.x or use the Metal/CPU paths.

Ollama

Text-only; multimodal is blocked because Ollama doesn't yet support the mmproj split-file pattern.

Reasoning mode

enable_thinking defaults to True. To disable extended reasoning (e.g., for latency-sensitive cases), pass enable_thinking=False to the chat template / generate call. No separate "no-think" variant card exists — this is a runtime flag, not a model variant.