Dream-v0-Instruct-7B-FP8

TevunahAi Professional Quantization

First FP8 quantized Dream model for native PyTorch/transformers inference.

This is an FP8 quantized version of Dream-v0-Instruct-7B, a diffusion-based large language model from HKU NLP Group.

What is Dream?

Dream 7B is a Diffusion Large Language Model (dLLM). Unlike traditional autoregressive models (GPT, LLaMA, Claude), which generate text left-to-right one token at a time, Dream uses parallel denoising to refine the entire sequence simultaneously.

Key advantages:

  • Bidirectional context modeling - considers full context in both directions
  • Flexible text generation order - not constrained to left-to-right
  • Superior planning abilities - excels at tasks requiring multi-step reasoning
  • Adjustable quality-speed tradeoff - control inference steps for your needs

Quantization Details

| Property          | Value                     |
|-------------------|---------------------------|
| Base Model        | Dream-v0-Instruct-7B      |
| Quantization      | FP8 Dynamic (Weight-only) |
| Method            | llmcompressor FP8_DYNAMIC |
| Calibration       | Data-free                 |
| Storage Size      | ~8.7 GB                   |
| VRAM Required     | ~10 GB                    |
| Quantization Time | 1.7 minutes               |
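
For reference, a data-free FP8_DYNAMIC run with llm-compressor usually follows the pattern sketched below. This is a minimal illustration of the method listed above, not the exact script used for this release; the output directory name is a placeholder, and the oneshot import path differs between llm-compressor versions.

from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

base_id = "Dream-org/Dream-v0-Instruct-7B"
model = AutoModel.from_pretrained(base_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)

# FP8_DYNAMIC quantizes the Linear weights to FP8 without a calibration dataset,
# which is why the Calibration row above says "Data-free".
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Write compressed safetensors plus the quantization config.
model.save_pretrained("Dream-v0-Instruct-7B-FP8", save_compressed=True)
tokenizer.save_pretrained("Dream-v0-Instruct-7B-FP8")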

Quantization Infrastructure

Professional hardware ensures consistent, high-quality quantization:

  • CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
  • Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

Memory Comparison

| Precision | Size    | VRAM Required |
|-----------|---------|---------------|
| BF16      | ~14 GB  | ~16 GB        |
| FP8       | ~8.7 GB | ~10 GB        |
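
If you want to sanity-check these numbers on your own hardware, a small helper like the one below (run after loading the model as shown in the Usage section) reports the parameter footprint and the peak CUDA allocation; exact figures will vary with GPU, driver, and sequence length.

import torch

def report_memory(model):
    # Parameter footprint as tracked by transformers (bytes -> GiB).
    print(f"parameter footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")
    # Peak CUDA allocation observed so far, including activations from any forward pass.
    if torch.cuda.is_available():
        print(f"peak CUDA allocation: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")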

Usage

With Transformers (Required for Diffusion Models)

Note: Dream uses a custom diffusion architecture that requires transformers with trust_remote_code=True. It is not compatible with standard inference frameworks like vLLM.

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-v0-Instruct-7B-FP8"

# Load FP8 model - will decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",  # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,  # Required for diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True
)

input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=256,  # More steps = better quality
    temperature=0.7,
    top_p=0.9,
    alg="entropy",
    alg_temp=0.,
)

# Decode and clean up response
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)

Requirements

pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors

System Requirements:

  • ~10GB VRAM (FP8 weights decompress to BF16 during inference)
  • CUDA 11.8 or newer
  • PyTorch 2.1+ with CUDA support

Generation Parameters

| Parameter      | Description                    | Recommended Values              |
|----------------|--------------------------------|---------------------------------|
| steps          | Number of diffusion steps      | 128-512 (more = better quality) |
| max_new_tokens | Maximum tokens to generate     | 256-1024                        |
| temperature    | Randomness (higher = creative) | 0.7-1.0                         |
| top_p          | Nucleus sampling threshold     | 0.9-0.95                        |
| alg            | Decoding algorithm             | "entropy"                       |
| alg_temp       | Algorithm temperature          | 0.0                             |

Quality vs Speed (a configuration sketch follows this list):

  • Fast (128 steps): Quick responses, good for simple queries
  • Balanced (256 steps): Default setting, good quality
  • High Quality (512 steps): Best output, slower generation
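
As a concrete illustration, the sketch below maps these profiles onto diffusion_generate arguments. It reuses model, input_ids, and attention_mask from the Usage example above; the preset names and values are just suggestions, not settings prescribed by the original model.

# Hypothetical presets for the quality/speed profiles above.
PRESETS = {
    "fast":     {"steps": 128, "max_new_tokens": 256},
    "balanced": {"steps": 256, "max_new_tokens": 512},
    "quality":  {"steps": 512, "max_new_tokens": 512},
}

def generate_with_preset(name):
    cfg = PRESETS[name]
    return model.diffusion_generate(
        input_ids,
        attention_mask=attention_mask,
        temperature=0.7,
        top_p=0.9,
        alg="entropy",
        alg_temp=0.,
        **cfg,
    )

output = generate_with_preset("balanced")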

Important Notes

  1. โš ๏ธ Use diffusion_generate() not generate() - Dream is a diffusion model!
  2. โš ๏ธ Requires trust_remote_code=True for custom diffusion architecture
  3. ๐Ÿ“ฆ FP8 decompresses to BF16 during inference (~10GB VRAM)
  4. ๐Ÿ” Stop token cleanup: split response on <|endoftext|>
  5. ๐Ÿ“ Context length: 2048 tokens
  6. ๐Ÿšซ Not compatible with vLLM - requires transformers with custom code
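
As a sketch of notes 4 and 5, the helpers below trim an over-long prompt to the 2048-token window and strip everything after the first stop token. They reuse the tokenizer from the Usage example; the trimming strategy (keeping the most recent tokens) is an assumption, not something the original model card prescribes.

MAX_CONTEXT = 2048

def build_inputs(messages, max_new_tokens=512):
    # Tokenize the chat and make sure prompt + generation fits in the context window.
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
    )
    budget = MAX_CONTEXT - max_new_tokens
    if inputs.input_ids.shape[1] > budget:
        # Keep the most recent tokens; adapt if you would rather keep the beginning.
        inputs["input_ids"] = inputs.input_ids[:, -budget:]
        inputs["attention_mask"] = inputs.attention_mask[:, -budget:]
    return inputs

def clean_response(text):
    # Drop everything after the first stop token.
    return text.split("<|endoftext|>")[0].strip()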

Why FP8 for Dream?

Benefits:

  • Smaller download size (~8.7 GB vs ~14 GB BF16)
  • Faster model loading from disk
  • Storage efficiency for model archives
  • Compatible with standard transformers workflow

Trade-offs:

  • Decompresses to BF16 during inference (~10 GB VRAM)
  • No runtime memory benefit (diffusion models need full precision)
  • Not vLLM compatible (custom architecture)

FP8 primarily benefits storage and download speed for this model.
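
You can see the storage-vs-runtime distinction directly by inspecting the loaded parameters; this small check (assuming the model is loaded as in the Usage section) should report BF16 tensors, because the FP8 weights are decompressed at load time.

from collections import Counter

dtype_counts = Counter(str(p.dtype) for p in model.parameters())
print(dtype_counts)  # expected to be dominated by torch.bfloat16 after decompression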

Diffusion vs Autoregressive

Traditional Autoregressive (GPT-style):

The quick → brown → fox → jumps → ...

Generates left-to-right, one token at a time.

Diffusion (Dream):

[noise] → [rough text] → [refined text] → [final output]

Generates entire sequence through iterative refinement.
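
The contrast can be made concrete with a toy sketch that has nothing to do with Dream's actual training or decoding rules: an autoregressive loop appends one token at a time, while a masked-diffusion-style loop starts fully masked and fills positions over a few refinement steps (real models choose positions by confidence or entropy rather than at random).

import random

VOCAB = ["the", "quick", "brown", "fox", "jumps"]

def autoregressive_toy(length=5):
    # Append one (randomly chosen) token at a time, strictly left to right.
    seq = []
    for _ in range(length):
        seq.append(random.choice(VOCAB))
    return seq

def diffusion_toy(length=5, steps=3):
    # Start fully masked, then unmask a few positions per refinement step.
    seq = ["[MASK]"] * length
    masked = list(range(length))
    for _ in range(steps):
        random.shuffle(masked)
        for pos in masked[: max(1, length // steps)]:
            seq[pos] = random.choice(VOCAB)
        masked = [p for p in masked if seq[p] == "[MASK]"]
    for pos in masked:  # fill any leftovers in a final pass
        seq[pos] = random.choice(VOCAB)
    return seq

print(autoregressive_toy())
print(diffusion_toy())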

As a result, Dream tends to be better at:

  • Long-range planning and coherence
  • Complex reasoning tasks
  • Bidirectional context understanding
  • Flexible generation strategies

Use Cases

Dream excels at tasks requiring:

  • Long-form writing with complex structure
  • Multi-step reasoning and problem solving
  • Text revision and refinement
  • Planning-heavy tasks like stories, essays, arguments
  • Tasks requiring global coherence

Original Model

This quantization is based on Dream-org/Dream-v0-Instruct-7B by HKU NLP Group.

For comprehensive information about:

  • Diffusion LM architecture
  • Training methodology
  • Evaluation benchmarks
  • Research papers

Please refer to the original model card.

License

This model inherits the Apache 2.0 License from the original Dream model.

Acknowledgments
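
Thanks to the HKU NLP Group (Dream-org) for the original Dream-v0-Instruct-7B model, and to the llm-compressor and compressed-tensors projects used for the FP8 quantization and checkpoint format.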

Citation

If you use Dream, please cite the original paper:

@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}

Professional AI Model Quantization by TevunahAi

Enterprise-grade quantization on specialized hardware

View all models | Contact for custom quantization
