Qwen3.6-35B-A3B-uncensored-heretic-FP8

Native FP8 E4M3 block-wise quantization of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic, matching the official Qwen FP8 format exactly.

|        | BF16     | FP8 (this)    |
|--------|----------|---------------|
| Size   | 66 GB    | 34 GB         |
| tok/s  | ~180     | 226           |
| Format | bfloat16 | float8_e4m3fn |

How to handle thinking tokens

The model has trained-in always-thinks behavior: it emits <think>...</think> blocks regardless of any enable_thinking: false chat template flag. Do not try to suppress this with logit_bias on the special tokens. That corrupts generation: on prompts that engage the thinking pathway the model returns one-token role-marker garbage (Action, assistant, Human) and stops. An earlier version of this README recommended that workaround; it was wrong.

The correct architecture is to let the model emit thinking and strip the markup post-emit on a single content channel:

  1. Drop --reasoning-parser qwen3 from your vLLM serve command. With it on, the parser greedily classifies all output before </think> as reasoning_content. If the model truncates without closing </think> (common in long agent loops), the entire answer ends up in reasoning_content and content is empty. See vllm-project/vllm#40816.
  2. Strip <think>...</think> markup downstream in your gateway / client. Whether the model emits empty <think></think> (common on simple prompts) or a full thinking block, treat it as a uniform inline-markup case. Empty / whitespace-only blocks should be dropped silently. Closed blocks can be salvaged to a separate reasoning_content channel for observability if you want to surface thinking in your UI.

A reference implementation of this strip pipeline (LiteLLM custom callback) is at protoLabsAI/homelab-iac stacks/ai/config/litellm/callbacks/thinking_normalizer.py. Validated end-to-end with this model under the gateway-strip pipeline: clean output, no role-marker garbage, thinking captured to observability metadata, tool calls land cleanly outside <think> blocks.
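If you are not running LiteLLM, the same post-emit strip takes only a few lines of client-side code. The sketch below illustrates the two rules from step 2 (drop empty blocks silently, salvage closed blocks to a separate reasoning field); the function name and return shape are illustrative and not taken from the reference implementation:

import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (content, reasoning_content).

    Empty / whitespace-only <think></think> blocks are dropped silently;
    closed blocks are salvaged so thinking can be surfaced in a UI.
    Illustrative sketch only; see thinking_normalizer.py for the real callback.
    """
    reasoning = [m.strip() for m in THINK_BLOCK.findall(text) if m.strip()]
    content = THINK_BLOCK.sub("", text).strip()
    return content, "\n\n".join(reasoning)

Call it as content, reasoning = strip_thinking(raw_output), return content to the client, and attach reasoning to whatever observability metadata your gateway keeps.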

Recommended Sampling Parameters

from openai import OpenAI

# The vLLM server from the Usage section below exposes an OpenAI-compatible
# API on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "What is the capital of France?"}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=messages,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
)

These follow the Qwen3.5 recommended settings for instruct/non-thinking mode. Do not include logit_bias on the think tokens — see above.

Usage with vLLM

vllm serve protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.85 \
    --chat-template <path-to-nothink-template> \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --language-model-only

Key flags:

  • --language-model-only — required for the Mamba/SSM hybrid architecture
  • --chat-template — use a Qwen3.5 nothink template (drops the <think>\n priming on assistant generation, even though the model will emit its own <think>...</think> anyway)
  • --enable-auto-tool-choice --tool-call-parser qwen3_xml — enables structured tool calling
  • Notably absent: --reasoning-parser qwen3. See "How to handle thinking tokens" above.

Verified behavior

Under the gateway-strip pipeline (no logit_bias, no --reasoning-parser), tested 2026-05-01 on RTX PRO 6000 Blackwell with vLLM 0.20.0:

| Probe | Result |
|-------|--------|
| Simple chat ("capital of France?") | 13 tokens: <think></think>\n\nThe capital of France is **Paris**. Empty thinking block, gateway strips it, clean answer reaches client |
| Long-form prose (200-word internal monologue) | 343 tokens, ends cleanly, no hedging, voice consistent with prompt |
| JSON schema compliance | Valid JSON with all required keys + correct types on first try |
| Multi-turn instruction-following | Coherent critique with continuity across turns |
| Tool-call (function calling) | ~130 tokens of thinking before tool call, tool call lands outside <think> block, args clean |

Tool-call thinking is a real latency cost (~130 tokens / ~500ms per turn through this model) but is a model-level behavior, not a quantization artifact — the official Qwen/Qwen3.6-35B-A3B-FP8 shows the same pattern.
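With --enable-auto-tool-choice --tool-call-parser qwen3_xml set on the server, tool calls come back through the standard OpenAI tools interface rather than as text. A minimal sketch, reusing the client from the sampling-parameters example above (the get_weather tool and its schema are illustrative, not part of the test harness):

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# The qwen3_xml parser returns structured tool calls; any pre-call thinking
# stays in the message text and is stripped by the gateway as usual.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)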

Refusal calibration

The base model (llmfan46/Qwen3.6-35B-A3B-uncensored-heretic) was trained to remove "rejections, objections, pushbacks, lecturing, censorship, softening and deflections" — its model card reports an 88% reduction in refusals vs the original Qwen/Qwen3.6-35B-A3B on a custom 100-prompt eval, with a KL divergence of 0.0015 from the original. This FP8 quant preserves that uncensored calibration on standardized refusal benchmarks.

XSTEST head-to-head (n=450, full dataset)

| Metric | Qwen/Qwen3.6-35B-A3B-FP8 (official) | This model (heretic FP8) | Δ |
|--------|-------------------------------------|--------------------------|---|
| Compliance on safe prompts | 67.2% | 88.8% | +21.6 pp |
| Over-refusal rate | 31.2% | 10.4% | −20.8 pp |
| Refusal on harmful prompts | 99.5% | 63.0% | −36.5 pp |
| Compliance on harmful prompts | 0.5% | 36.0% | +35.5 pp |

The shift is bidirectional and consistent with the uncensored direction the upstream tune was designed for: less reflexive refusal on safe-but-sensitive prompts (over-refusal drops by ~21 pp) and less guardrail on harmful prompts (refusal rate drops by ~37 pp). FP8 quantization preserves both effects.

Methodology

  • Dataset: xstest — 250 safe prompts paired with 200 contrast unsafe prompts, designed to detect over-refusal vs proper refusal calibration.
  • Pipeline: gateway-strip (no --reasoning-parser, no logit_bias). Inline <think>...</think> markup stripped post-emit by a LiteLLM custom callback before responses reach the judge. Reference impl: thinking_normalizer.py.
  • Judge: Qwen/Qwen3.6-27B-FP8 (thinking enabled), classifying each (prompt, response) as comply / refuse / partial / unknown.
  • Sample sizes: heretic FP8 — n=450 (full dataset, 2026-05-02). Official Qwen3.6-35B-A3B-FP8 baseline — n=450 (full dataset, 2026-04-29, same harness).
  • Hardware: RTX PRO 6000 Blackwell, vLLM 0.20.0.

Note: this is a different methodology from the base model card's custom 100-prompt eval. The two numbers (88% reduction upstream vs the per-bucket shifts here) cannot be directly compared, but both point in the same direction. The per-bucket xstest numbers are the more transparent comparison since the official Qwen3.6-35B-A3B-FP8 was run against the same dataset, judge, and harness.

When this matters

  • Creative writing, red-team eval, scenario planning: the over-refusal drop (31.2% → 10.4%) means the model is less likely to refuse safe-but-sensitive prompts (medical, legal, historical, fictional violence). This is the value prop.
  • General-purpose assistant deployment: the harmful-prompt compliance shift (0.5% → 36.0%) is a real safety surface change. Self-harm category in particular is high-compliance (89% in spot checks). If exposing this model to general traffic, plan for category-level guardrails (gateway-side or system-prompt-level) rather than relying on the model's own refusal calibration.
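A category-level gateway guardrail can be as simple as a pre-call check that short-circuits flagged requests before they reach the model. The sketch below only illustrates the shape of such a check; the classify_category hook, the blocked-category list, and the refusal text are placeholders, not part of this repo or its gateway config:

# Hypothetical gateway-side pre-filter. classify_category is your own
# moderation step (a small classifier, a moderation API, a keyword list, ...).
BLOCKED_CATEGORIES = {"self-harm"}

def guarded_completion(client, messages, classify_category):
    if classify_category(messages[-1]["content"]) in BLOCKED_CATEGORIES:
        # Short-circuit: flagged prompts never reach the uncensored model.
        return "I can't help with that."
    response = client.chat.completions.create(
        model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
        messages=messages,
    )
    return response.choices[0].message.content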

Known Issues

  • Always thinks: trained-in behavior — see "How to handle thinking tokens" for the correct mitigation.
  • Broken vision: image inputs produce degenerate output (repeated !!!! characters), inherited from the base model. Use --language-model-only and do not pass image content.

Quantization Details

  • Method: Block-wise FP8 E4M3 with [128, 128] per-block scaling
  • Scale convention: weight_scale_inv (direct scale value, matching Qwen/DeepSeek convention)
  • MoE handling: Packed 3D expert tensors unpacked to per-expert 2D weights; fused gate_up_proj split into separate gate_proj + up_proj
  • Preserved in BF16: Embeddings, LM head, LayerNorm, MoE router gates, Mamba/SSM parameters (conv1d, A_log, D, dt_bias, in_proj), shared expert gates — 402 modules total via modules_to_not_convert
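The per-block scaling scheme reduces to the following. This is a minimal sketch of the math only, not the actual quantize_native_fp8_lowmem.py implementation (which adds shard streaming, MoE unpacking, and the modules_to_not_convert bookkeeping):

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_blockwise_fp8(weight: torch.Tensor, block: int = 128):
    """Quantize a 2D weight to FP8 E4M3 with [block, block] scaling groups.

    Returns the FP8 tensor plus a per-block weight_scale_inv in the
    direct-scale convention described above (dequant = fp8_value * scale).
    """
    rows, cols = weight.shape
    n_rb = (rows + block - 1) // block
    n_cb = (cols + block - 1) // block
    q = torch.empty_like(weight, dtype=torch.float8_e4m3fn)
    scale_inv = torch.empty(n_rb, n_cb, dtype=torch.float32)

    for i in range(n_rb):
        for j in range(n_cb):
            blk = weight[i*block:(i+1)*block, j*block:(j+1)*block].float()
            scale = blk.abs().max().clamp(min=1e-12) / FP8_MAX   # per-block scale
            q[i*block:(i+1)*block, j*block:(j+1)*block] = (blk / scale).to(torch.float8_e4m3fn)
            scale_inv[i, j] = scale                              # stored as weight_scale_inv
    return q, scale_inv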

Quantization Script

The quantization script (quantize_native_fp8_lowmem.py) is included in this repo. It processes safetensors shards one tensor at a time, peaking at ~4GB RAM regardless of model size.

python quantize_native_fp8_lowmem.py llmfan46/Qwen3.6-35B-A3B-uncensored-heretic

Key features:

  • Shard-by-shard streaming — never loads full model into memory
  • Auto-generates modules_to_not_convert for config.json
  • Unpacks packed MoE experts and fused gate/up projections
  • Produces format identical to official Qwen FP8 releases
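The low peak memory comes from the shard-by-shard streaming pattern sketched below. This is a simplified illustration: output naming and the BF16 skip logic are assumptions, and quantize_blockwise_fp8 refers to the sketch in Quantization Details above, not to the script's own routine:

from safetensors import safe_open
from safetensors.torch import save_file

def quantize_shard(in_path: str, out_path: str, skip_names: set[str]) -> None:
    """Re-quantize one safetensors shard, loading a single tensor at a time."""
    out_tensors = {}
    with safe_open(in_path, framework="pt") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)              # only this tensor is in RAM
            if name in skip_names or tensor.ndim != 2:
                out_tensors[name] = tensor           # modules_to_not_convert stay BF16
            else:
                q, scale_inv = quantize_blockwise_fp8(tensor)
                out_tensors[name] = q
                out_tensors[name.replace(".weight", ".weight_scale_inv")] = scale_inv
    save_file(out_tensors, out_path)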

Hardware Tested

  • NVIDIA RTX PRO 6000 Blackwell (96GB VRAM, SM 12.0)
  • CUDA 12.8, Driver 595.45.04
  • vLLM 0.20.0

Credits
