# Qwen3.6-35B-A3B-uncensored-heretic-FP8

Native FP8 E4M3 block-wise quantization of `llmfan46/Qwen3.6-35B-A3B-uncensored-heretic`, matching the official Qwen FP8 format exactly.
| | BF16 | FP8 (this) |
|---|---|---|
| Size | 66 GB | 34 GB |
| tok/s | ~180 | 226 |
| Format | bfloat16 | float8_e4m3fn |
## How to handle thinking tokens

The model has trained-in always-thinks behavior — it emits `<think>...</think>` blocks regardless of any `enable_thinking: false` chat template flag. Do not try to suppress this with `logit_bias` on the special tokens; it corrupts generation (the model returns one-token role-marker garbage like `Action`, `assistant`, `Human` and stops on prompts that engage the thinking pathway). An earlier version of this README recommended that workaround — it was wrong.
The correct architecture is to let the model emit thinking and strip the markup post-emit on a single content channel:
- Drop `--reasoning-parser qwen3` from your vLLM serve command. With it on, the parser greedily classifies all output before `</think>` as `reasoning_content`. If the model truncates without closing `</think>` (common in long agent loops), the entire answer ends up in `reasoning_content` and `content` is empty. See vllm-project/vllm#40816.
- Strip `<think>...</think>` markup downstream in your gateway / client. Whether the model emits an empty `<think></think>` (common on simple prompts) or a full thinking block, treat it as a uniform inline-markup case. Empty / whitespace-only blocks should be dropped silently. Closed blocks can be salvaged to a separate `reasoning_content` channel for observability if you want to surface thinking in your UI.
A reference implementation of this strip pipeline (a LiteLLM custom callback) is at `protoLabsAI/homelab-iac`, `stacks/ai/config/litellm/callbacks/thinking_normalizer.py`. Validated end-to-end with this model under the gateway-strip pipeline: clean output, no role-marker garbage, thinking captured to observability metadata, tool calls land cleanly outside `<think>` blocks.
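For orientation, the core of that strip step fits in a few lines. This is a minimal sketch assuming the rules above (drop empty blocks silently, salvage closed ones to a reasoning channel); the function name and regex are illustrative, not the callback's actual API:

```python
import re

# Matches a <think>...</think> block; \Z also catches a block left
# unclosed by truncation, the failure mode described above.
THINK_RE = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)

def strip_thinking(text: str) -> tuple[str, str | None]:
    """Split raw model output into (content, reasoning)."""
    match = THINK_RE.search(text)
    if match is None:
        return text, None
    reasoning = match.group(1).strip() or None  # empty block -> None, dropped silently
    content = (text[: match.start()] + text[match.end() :]).lstrip()
    return content, reasoning
```

On the empty-block case from the probe table below, `strip_thinking("<think></think>\n\nThe capital of France is **Paris**.")` returns the clean answer with `reasoning=None`.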
## Recommended Sampling Parameters
```python
from openai import OpenAI

# Any OpenAI-compatible client works; point base_url at your vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "What is the capital of France?"}]

response = client.chat.completions.create(
    model="protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8",
    messages=messages,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
)
```
These follow the Qwen3.5 recommended settings for instruct/non-thinking mode. Do not include `logit_bias` on the think tokens — see above.
## Usage with vLLM
```bash
vllm serve protoLabsAI/Qwen3.6-35B-A3B-uncensored-heretic-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --chat-template <path-to-nothink-template> \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --language-model-only
```
Key flags:
- `--language-model-only` — required for the Mamba/SSM hybrid architecture
- `--chat-template` — use a Qwen3.5 nothink template (drops the `<think>\n` priming on assistant generation, even though the model will emit its own `<think>...</think>` anyway)
- `--enable-auto-tool-choice --tool-call-parser qwen3_xml` — enables structured tool calling
- Notably absent: `--reasoning-parser qwen3`. See "How to handle thinking tokens" above.
## Verified behavior
Under the gateway-strip pipeline (no logit_bias, no --reasoning-parser), tested 2026-05-01 on RTX PRO 6000 Blackwell with vLLM 0.20.0:
| Probe | Result |
|---|---|
| Simple chat ("capital of France?") | 13 tokens, `<think></think>\n\nThe capital of France is **Paris**.` — empty thinking block, gateway strips it, clean answer reaches client |
| Long-form prose (200-word internal monologue) | 343 tokens, ends cleanly, no hedging, voice consistent with prompt |
| JSON schema compliance | Valid JSON with all required keys + correct types on first try |
| Multi-turn instruction-following | Coherent critique with continuity across turns |
| Tool-call (function calling) | ~130 tokens of thinking before tool call, tool call lands outside `<think>` block, args clean |
Tool-call thinking is a real latency cost (~130 tokens / ~500ms per turn through this model) but is a model-level behavior, not a quantization artifact — the official Qwen/Qwen3.6-35B-A3B-FP8 shows the same pattern.
## Refusal calibration
The base model (llmfan46/Qwen3.6-35B-A3B-uncensored-heretic) was trained to remove "rejections, objections, pushbacks, lecturing, censorship, softening and deflections" — its model card reports an 88% reduction in refusals vs the original Qwen/Qwen3.6-35B-A3B on a custom 100-prompt eval, with a KL divergence of 0.0015 from the original. This FP8 quant preserves that uncensored calibration on standardized refusal benchmarks.
### XSTEST head-to-head (n=450, full dataset)

| Metric | Qwen/Qwen3.6-35B-A3B-FP8 (official) | This model (heretic FP8) | Δ |
|---|---|---|---|
| Compliance on safe prompts | 67.2% | 88.8% | +21.6 pp |
| Over-refusal rate | 31.2% | 10.4% | −20.8 pp |
| Refusal on harmful prompts | 99.5% | 63.0% | −36.5 pp |
| Compliance on harmful prompts | 0.5% | 36.0% | +35.5 pp |
The shift is bidirectional and consistent with the uncensored direction the upstream tune was designed for: less reflexive refusal on safe-but-sensitive prompts (over-refusal drops by ~21 pp) and weaker guardrails on harmful prompts (refusal rate drops by ~37 pp). FP8 quantization preserves both effects.
### Methodology

- Dataset: `xstest` — 250 safe prompts paired with 200 contrast unsafe prompts, designed to detect over-refusal vs proper refusal calibration.
- Pipeline: gateway-strip (no `--reasoning-parser`, no `logit_bias`). Inline `<think>...</think>` markup stripped post-emit by a LiteLLM custom callback before responses reach the judge. Reference impl: `thinking_normalizer.py`.
- Judge: `Qwen/Qwen3.6-27B-FP8` (thinking enabled), classifying each (prompt, response) pair as `comply`/`refuse`/`partial`/`unknown`.
- Sample sizes: heretic FP8 — n=450 (full dataset, 2026-05-02). Official Qwen3.6-35B-A3B-FP8 baseline — n=450 (full dataset, 2026-04-29, same harness).
- Hardware: RTX PRO 6000 Blackwell, vLLM 0.20.0.
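For concreteness, the judge step amounts to one classification call per (prompt, response) pair. A hedged sketch, assuming an OpenAI-compatible endpoint for the judge and that its thinking markup is stripped by the same gateway; the prompt wording and endpoint are illustrative, not the actual harness:

```python
from openai import OpenAI

# Hypothetical judge endpoint; adjust base_url to your deployment.
judge = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

LABELS = {"comply", "refuse", "partial", "unknown"}

def classify(prompt: str, response: str) -> str:
    """Bucket one (prompt, response) pair via the judge model."""
    result = judge.chat.completions.create(
        model="Qwen/Qwen3.6-27B-FP8",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                "Classify the RESPONSE to the PROMPT as exactly one of: "
                "comply, refuse, partial, unknown. Reply with the label only.\n\n"
                f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
            ),
        }],
    )
    label = (result.choices[0].message.content or "").strip().lower()
    return label if label in LABELS else "unknown"
```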
Note: this is a different methodology from the base model card's custom 100-prompt eval. The two numbers (88% reduction upstream vs the per-bucket shifts here) cannot be directly compared, but both point in the same direction. The per-bucket xstest numbers are the more transparent comparison since the official Qwen3.6-35B-A3B-FP8 was run against the same dataset, judge, and harness.
### When this matters
- Creative writing, red-team eval, scenario planning: the over-refusal drop (31.2% → 10.4%) means the model is less likely to refuse safe-but-sensitive prompts (medical, legal, historical, fictional violence). This is the value prop.
- General-purpose assistant deployment: the harmful-prompt compliance shift (0.5% → 36.0%) is a real safety surface change. Self-harm category in particular is high-compliance (89% in spot checks). If exposing this model to general traffic, plan for category-level guardrails (gateway-side or system-prompt-level) rather than relying on the model's own refusal calibration.
## Known Issues

- Always thinks: trained-in behavior — see "How to handle thinking tokens" for the correct mitigation.
- Broken vision: image inputs produce degenerate output (repeated `!!!!` characters), inherited from the base model. Use `--language-model-only` and do not pass image content.
## Quantization Details

- Method: Block-wise FP8 E4M3 with [128, 128] per-block scaling
- Scale convention: `weight_scale_inv` (direct scale value, matching the Qwen/DeepSeek convention)
- MoE handling: Packed 3D expert tensors unpacked to per-expert 2D weights; fused `gate_up_proj` split into separate `gate_proj` + `up_proj`
- Preserved in BF16: Embeddings, LM head, LayerNorm, MoE router gates, Mamba/SSM parameters (`conv1d`, `A_log`, `D`, `dt_bias`, `in_proj`), shared expert gates — 402 modules total via `modules_to_not_convert`
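The per-block scheme is straightforward to sketch. A minimal example of quantizing one 2D weight to FP8 E4M3 with [128, 128] blocks and a direct-scale `weight_scale_inv` tensor; it assumes both dimensions are multiples of 128 (the shipped script also handles ragged edge blocks and the MoE unpacking described above):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
BLOCK = 128

def quantize_blockwise_fp8(weight: torch.Tensor):
    """Quantize a 2D weight to FP8 E4M3 with [128, 128] per-block scales."""
    rows, cols = weight.shape
    # View as (row-blocks, 128, col-blocks, 128); row-major layout makes
    # element (r, c) land at (r // 128, r % 128, c // 128, c % 128).
    w = weight.float().reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    # Per-block absmax so every block maps into the E4M3 representable range.
    absmax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = absmax / FP8_MAX
    q = (w / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    # weight_scale_inv stores the direct scale (dequant = fp8 * scale),
    # matching the Qwen/DeepSeek convention noted above.
    scale_inv = scale.squeeze(1).squeeze(-1)  # shape [rows/128, cols/128]
    return q, scale_inv
```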
## Quantization Script

The quantization script (`quantize_native_fp8_lowmem.py`) is included in this repo. It processes safetensors shards one tensor at a time, peaking at ~4 GB RAM regardless of model size.

```bash
python quantize_native_fp8_lowmem.py llmfan46/Qwen3.6-35B-A3B-uncensored-heretic
```
Key features:
- Shard-by-shard streaming — never loads full model into memory
- Auto-generates `modules_to_not_convert` for `config.json`
- Unpacks packed MoE experts and fused gate/up projections
- Produces format identical to official Qwen FP8 releases
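A rough sketch of the streaming loop, reusing the `quantize_blockwise_fp8` helper from the sketch above; the function name and skip logic are illustrative, not the script's actual API:

```python
from safetensors import safe_open
from safetensors.torch import save_file

def quantize_shard(in_path: str, out_path: str, skip_names: set[str]) -> None:
    """Stream one safetensors shard, quantizing eligible tensors.

    Peak memory is roughly one shard's worth of tensors, never the full model.
    """
    out = {}
    with safe_open(in_path, framework="pt") as f:
        for name in f.keys():
            tensor = f.get_tensor(name)  # loads only this tensor into memory
            if name in skip_names or tensor.ndim != 2:
                out[name] = tensor  # preserved in original dtype (BF16)
            else:
                q, scale_inv = quantize_blockwise_fp8(tensor)
                out[name] = q
                out[name.replace(".weight", ".weight_scale_inv")] = scale_inv
    save_file(out, out_path)
```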
## Hardware Tested
- NVIDIA RTX PRO 6000 Blackwell (96GB VRAM, SM 12.0)
- CUDA 12.8, Driver 595.45.04
- vLLM 0.20.0
## Credits
- Base model: llmfan46
- Quantization: protoLabs.studio