# Dream-v0-Instruct-7B-FP8

*TevunahAi Professional Quantization*

**First FP8 quantized Dream model for native PyTorch/transformers inference.**
This is an FP8 quantized version of Dream-v0-Instruct-7B, a diffusion-based large language model from HKU NLP Group.
## What is Dream?
Dream 7B is a Diffusion Large Language Model (dLLM). Unlike traditional autoregressive models (GPT, LLaMA, Claude) that generate text left-to-right, one token at a time, Dream uses parallel denoising to refine the entire sequence simultaneously.
Key advantages:
- **Bidirectional context modeling** - considers the full context in both directions
- **Flexible text generation order** - not constrained to left-to-right
- **Superior planning abilities** - excels at tasks requiring multi-step reasoning
- **Adjustable quality-speed tradeoff** - control inference steps for your needs
## Quantization Details
| Property | Value |
|---|---|
| Base Model | Dream-v0-Instruct-7B |
| Quantization | FP8 Dynamic (Weight-only) |
| Method | llmcompressor FP8_DYNAMIC |
| Calibration | Data-free |
| Storage Size | ~8.7GB |
| VRAM Required | ~10GB |
| Quantization Time | 1.7 minutes |
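
For reference, a data-free `FP8_DYNAMIC` pass with llm-compressor generally looks like the sketch below. This is not the exact production script: the `ignore` list is an assumption (the output head is commonly left unquantized), and import paths vary between llm-compressor releases.

```python
from transformers import AutoModel, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Dream-org/Dream-v0-Instruct-7B"

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8_DYNAMIC is data-free: weight scales are computed directly from the
# weights themselves, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],  # assumption: keep the output head unquantized
)

oneshot(model=model, recipe=recipe)

# Write the compressed checkpoint in compressed-tensors format.
model.save_pretrained("Dream-v0-Instruct-7B-FP8", save_compressed=True)
tokenizer.save_pretrained("Dream-v0-Instruct-7B-FP8")
```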
## Quantization Infrastructure
Professional hardware ensures consistent, high-quality quantization:
- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
- Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
## Memory Comparison
| Precision | Size | VRAM Required |
|---|---|---|
| BF16 | ~14 GB | ~16 GB |
| FP8 | ~8.7 GB | ~10 GB |
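
These sizes follow from simple per-parameter arithmetic (figures below are approximate):

```python
params = 7e9  # ~7B parameters

bf16_gb = params * 2 / 1e9  # 2 bytes per BF16 weight -> ~14 GB
fp8_gb  = params * 1 / 1e9  # 1 byte per FP8 weight   -> ~7 GB

# Embeddings, norms, and per-channel scales stay in higher precision,
# which brings the checkpoint up to the observed ~8.7 GB on disk.
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 linear weights ~{fp8_gb:.0f} GB")
```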
## Usage

### With Transformers (Required for Diffusion Models)

**Note:** Dream uses a custom diffusion architecture that requires `transformers` with `trust_remote_code=True`. It is not compatible with standard inference frameworks such as vLLM.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-v0-Instruct-7B-FP8"

# Load FP8 model - weights decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",      # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,  # Required for the diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
)
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=256,       # More steps = better quality
    temperature=0.7,
    top_p=0.9,
    alg="entropy",
    alg_temp=0.0,
)

# Decode and strip everything after the stop token
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)
```
## Requirements

```bash
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```
System Requirements:
- ~10GB VRAM (FP8 weights decompress to BF16 during inference)
- CUDA 11.8 or newer
- PyTorch 2.1+ with CUDA support
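
An optional sanity check before loading: since the FP8 weights are decompressed to BF16 rather than executed natively, any CUDA 11.8+ GPU with roughly 10GB of free VRAM should work.

```python
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
free, total = torch.cuda.mem_get_info()  # bytes
print(f"PyTorch {torch.__version__}, compute capability {major}.{minor}")
print(f"Free VRAM: {free / 1e9:.1f} / {total / 1e9:.1f} GB")  # want ~10+ GB free
```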
## Generation Parameters

| Parameter | Description | Recommended Values |
|---|---|---|
| `steps` | Number of diffusion steps | 128-512 (more = better quality) |
| `max_new_tokens` | Maximum tokens to generate | 256-1024 |
| `temperature` | Randomness (higher = more creative) | 0.7-1.0 |
| `top_p` | Nucleus sampling threshold | 0.9-0.95 |
| `alg` | Decoding algorithm | `"entropy"` |
| `alg_temp` | Algorithm temperature | 0.0 |
**Quality vs Speed** (captured as code presets below):
- **Fast (128 steps):** quick responses, good for simple queries
- **Balanced (256 steps):** default setting, good quality
- **High Quality (512 steps):** best output, slower generation
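
These profiles can be expressed as simple keyword presets on top of the usage example above (the preset names are illustrative, not part of the model API):

```python
PRESETS = {
    "fast":     dict(steps=128, max_new_tokens=256),   # quick, simple queries
    "balanced": dict(steps=256, max_new_tokens=512),   # default
    "quality":  dict(steps=512, max_new_tokens=1024),  # best output, slowest
}

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    temperature=0.7,
    top_p=0.9,
    alg="entropy",
    alg_temp=0.0,
    **PRESETS["balanced"],
)
```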
## Important Notes

- Use `diffusion_generate()`, **not** `generate()`: Dream is a diffusion model!
- Requires `trust_remote_code=True` for the custom diffusion architecture
- FP8 decompresses to BF16 during inference (~10GB VRAM)
- Stop token cleanup: split the response on `<|endoftext|>`
- Context length: 2048 tokens
- Not compatible with vLLM; requires transformers with custom code
## Why FP8 for Dream?
Benefits:
- Smaller download size (~8.7GB vs ~14GB BF16)
- Faster model loading from disk
- Storage efficiency for model archives
- Compatible with the standard transformers workflow
Trade-offs:
- Decompresses to BF16 during inference (~10GB VRAM)
- No runtime memory benefit (diffusion models need full precision)
- Not vLLM compatible (custom architecture)
FP8 primarily benefits storage and download speed for this model.
## Diffusion vs Autoregressive

**Traditional Autoregressive (GPT-style):**

`The quick → brown → fox → jumps → ...`

Generates left-to-right, one token at a time.

**Diffusion (Dream):**

`[noise] → [rough text] → [refined text] → [final output]`

Generates the entire sequence through iterative refinement.
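
A toy sketch of the difference in control flow (`next_token` and `denoise` are hypothetical stand-ins, not the model's actual API):

```python
# Autoregressive: one token per step; each step sees only the prefix.
def autoregressive(model, prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(model.next_token(tokens))  # hypothetical call
    return tokens

# Diffusion: start fully masked, refine all positions over `steps` passes,
# each pass conditioning on the whole sequence in both directions.
def diffusion(model, prompt_tokens, n_new, steps):
    tokens = list(prompt_tokens) + ["<mask>"] * n_new
    for _ in range(steps):
        tokens = model.denoise(tokens)  # hypothetical call
    return tokens
```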
**As a result, Dream is better at:**
- Long-range planning and coherence
- Complex reasoning tasks
- Bidirectional context understanding
- Flexible generation strategies
## Use Cases
Dream excels at tasks requiring:
- Long-form writing with complex structure
- Multi-step reasoning and problem solving
- Text revision and refinement
- Planning-heavy tasks like stories, essays, and arguments
- Tasks requiring global coherence
## Original Model
This quantization is based on Dream-org/Dream-v0-Instruct-7B by HKU NLP Group.
For comprehensive information about the diffusion LM architecture, training methodology, evaluation benchmarks, and research papers, please refer to the original model card.
## License
This model inherits the Apache 2.0 License from the original Dream model.
## Acknowledgments
- Original Model: Dream-org / HKU NLP Group - Pioneering diffusion-based language models
- Quantization Framework: Neural Magic's llm-compressor
- Quantized by: TevunahAi
## Citation
If you use Dream, please cite the original paper:
```bibtex
@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}
```
Professional AI Model Quantization by TevunahAi
Enterprise-grade quantization on specialized hardware