# Dream-v0-Instruct-7B-FP8

*TevunahAi Professional Quantization*

**First FP8 quantized Dream model for native PyTorch/transformers inference.**
This is an FP8 quantized version of Dream-v0-Instruct-7B, a diffusion-based large language model from HKU NLP Group.
## What is Dream?
Dream 7B is a Diffusion Large Language Model (dLLM). Unlike traditional autoregressive models (GPT, LLaMA, Claude) that generate text left-to-right, one token at a time, Dream uses parallel denoising to refine the entire sequence simultaneously.
Key advantages:
- **Bidirectional context modeling** - considers the full context in both directions
- **Flexible text generation order** - not constrained to left-to-right
- **Superior planning abilities** - excels at tasks requiring multi-step reasoning
- **Adjustable quality-speed tradeoff** - control inference steps for your needs
## Quantization Details
| Property | Value |
|---|---|
| Base Model | Dream-v0-Instruct-7B |
| Quantization | FP8 Dynamic (Weight-only) |
| Method | llmcompressor FP8_DYNAMIC |
| Calibration | Data-free |
| Storage Size | ~8.7GB |
| VRAM Required | ~10GB |
| Quantization Time | 1.7 minutes |
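
For reference, a data-free `FP8_DYNAMIC` pass with llm-compressor generally looks like the sketch below. This is not the exact production script: the `ignore` list is an assumption (the output head is commonly left unquantized), and import paths vary between llm-compressor releases.

```python
from transformers import AutoModel, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Dream-org/Dream-v0-Instruct-7B"

model = AutoModel.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8_DYNAMIC is data-free: weight scales are computed directly from the
# weights themselves, so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],  # assumption: keep the output head unquantized
)

oneshot(model=model, recipe=recipe)

# Write the compressed checkpoint in compressed-tensors format.
model.save_pretrained("Dream-v0-Instruct-7B-FP8", save_compressed=True)
tokenizer.save_pretrained("Dream-v0-Instruct-7B-FP8")
```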
## Quantization Infrastructure
Professional hardware ensures consistent, high-quality quantization:
- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
- Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
## Memory Comparison
| Precision | Size | VRAM Required |
|---|---|---|
| BF16 | ~14 GB | ~16 GB |
| FP8 | ~8.7 GB | ~10 GB |
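
These sizes follow from simple per-parameter arithmetic (figures below are approximate):

```python
params = 7e9  # ~7B parameters

bf16_gb = params * 2 / 1e9  # 2 bytes per BF16 weight -> ~14 GB
fp8_gb  = params * 1 / 1e9  # 1 byte per FP8 weight   -> ~7 GB

# Embeddings, norms, and per-channel scales stay in higher precision,
# which brings the checkpoint up to the observed ~8.7 GB on disk.
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 linear weights ~{fp8_gb:.0f} GB")
```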
## Usage

### With Transformers (Required for Diffusion Models)

**Note:** Dream uses a custom diffusion architecture that requires `transformers` with `trust_remote_code=True`. It is not compatible with standard inference frameworks such as vLLM.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-v0-Instruct-7B-FP8"

# Load FP8 model - weights decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",      # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,  # Required for the diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
)
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=256,       # More steps = better quality
    temperature=0.7,
    top_p=0.9,
    alg="entropy",
    alg_temp=0.0,
)

# Decode and strip everything after the stop token
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)
```
## Requirements

```bash
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```
System Requirements:
- ~10GB VRAM (FP8 weights decompress to BF16 during inference)
- CUDA 11.8 or newer
- PyTorch 2.1+ with CUDA support
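
An optional sanity check before loading: since the FP8 weights are decompressed to BF16 rather than executed natively, any CUDA 11.8+ GPU with roughly 10GB of free VRAM should work.

```python
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
free, total = torch.cuda.mem_get_info()  # bytes
print(f"PyTorch {torch.__version__}, compute capability {major}.{minor}")
print(f"Free VRAM: {free / 1e9:.1f} / {total / 1e9:.1f} GB")  # want ~10+ GB free
```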
## Generation Parameters

| Parameter | Description | Recommended Values |
|---|---|---|
| `steps` | Number of diffusion steps | 128-512 (more = better quality) |
| `max_new_tokens` | Maximum tokens to generate | 256-1024 |
| `temperature` | Randomness (higher = more creative) | 0.7-1.0 |
| `top_p` | Nucleus sampling threshold | 0.9-0.95 |
| `alg` | Decoding algorithm | `"entropy"` |
| `alg_temp` | Algorithm temperature | 0.0 |
**Quality vs Speed** (captured as code presets below):
- **Fast (128 steps):** quick responses, good for simple queries
- **Balanced (256 steps):** default setting, good quality
- **High Quality (512 steps):** best output, slower generation
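
These profiles can be expressed as simple keyword presets on top of the usage example above (the preset names are illustrative, not part of the model API):

```python
PRESETS = {
    "fast":     dict(steps=128, max_new_tokens=256),   # quick, simple queries
    "balanced": dict(steps=256, max_new_tokens=512),   # default
    "quality":  dict(steps=512, max_new_tokens=1024),  # best output, slowest
}

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    temperature=0.7,
    top_p=0.9,
    alg="entropy",
    alg_temp=0.0,
    **PRESETS["balanced"],
)
```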
## Important Notes

- Use `diffusion_generate()`, **not** `generate()`: Dream is a diffusion model!
- Requires `trust_remote_code=True` for the custom diffusion architecture
- FP8 decompresses to BF16 during inference (~10GB VRAM)
- Stop token cleanup: split the response on `<|endoftext|>`
- Context length: 2048 tokens
- Not compatible with vLLM; requires transformers with custom code
## Why FP8 for Dream?
Benefits:
- Smaller download size (~8.7GB vs ~14GB BF16)
- Faster model loading from disk
- Storage efficiency for model archives
- Compatible with the standard transformers workflow
Trade-offs:
- Decompresses to BF16 during inference (~10GB VRAM)
- No runtime memory benefit (diffusion models need full precision)
- Not vLLM compatible (custom architecture)
FP8 primarily benefits storage and download speed for this model.
## Diffusion vs Autoregressive

**Traditional Autoregressive (GPT-style):**

`The quick → brown → fox → jumps → ...`

Generates left-to-right, one token at a time.

**Diffusion (Dream):**

`[noise] → [rough text] → [refined text] → [final output]`

Generates the entire sequence through iterative refinement.
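
A toy sketch of the difference in control flow (`next_token` and `denoise` are hypothetical stand-ins, not the model's actual API):

```python
# Autoregressive: one token per step; each step sees only the prefix.
def autoregressive(model, prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(model.next_token(tokens))  # hypothetical call
    return tokens

# Diffusion: start fully masked, refine all positions over `steps` passes,
# each pass conditioning on the whole sequence in both directions.
def diffusion(model, prompt_tokens, n_new, steps):
    tokens = list(prompt_tokens) + ["<mask>"] * n_new
    for _ in range(steps):
        tokens = model.denoise(tokens)  # hypothetical call
    return tokens
```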
**As a result, Dream is better at:**
- Long-range planning and coherence
- Complex reasoning tasks
- Bidirectional context understanding
- Flexible generation strategies
## Use Cases
Dream excels at tasks requiring:
- Long-form writing with complex structure
- Multi-step reasoning and problem solving
- Text revision and refinement
- Planning-heavy tasks like stories, essays, and arguments
- Tasks requiring global coherence
## Original Model
This quantization is based on Dream-org/Dream-v0-Instruct-7B by HKU NLP Group.
For comprehensive information about the diffusion LM architecture, training methodology, evaluation benchmarks, and research papers, please refer to the original model card.
## License
This model inherits the Apache 2.0 License from the original Dream model.
## Acknowledgments
- Original Model: Dream-org / HKU NLP Group - Pioneering diffusion-based language models
- Quantization Framework: Neural Magic's llm-compressor
- Quantized by: TevunahAi
## Citation
If you use Dream, please cite the original paper:
```bibtex
@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}
```
Professional AI Model Quantization by TevunahAi
Enterprise-grade quantization on specialized hardware