Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (arXiv:2504.11651)
This model is a DFloat11 losslessly compressed version of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B. It is 32% smaller than the original BFloat16 model, yet produces bit-identical outputs and runs efficiently on GPUs.
| Metric | DeepSeek-R1-0528-Qwen3-8B (BFloat16) | DeepSeek-R1-0528-Qwen3-8B (DFloat11) |
|---|---|---|
| Model Size | 16.38 GB | 11.16 GB |
| Peak GPU Memory (1024-token generation) | 16.53 GB | 12.56 GB |
| Generation Time (1024 tokens, on an A100 GPU) | 47 seconds | 75 seconds |
We apply Huffman coding to the exponent bits of BFloat16 model weights, which are highly compressible. We leverage hardware-aware algorithmic designs to enable highly efficient, on-the-fly weight decompression directly on the GPU. Find out more in our research paper.
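To see why the exponent bits compress so well, here is an illustrative sketch (not the DFloat11 implementation): it counts the exponent values in a BFloat16 checkpoint and builds a Huffman code over them. The choice of checkpoint, the helper `huffman_code_lengths`, and the final size estimate (1 sign bit + Huffman-coded exponent + 7 mantissa bits) are assumptions for illustration only.

```python
# Illustrative sketch only (not the DFloat11 implementation): measure how
# compressible the 8 exponent bits of a BFloat16 checkpoint are by building
# a Huffman code over the observed exponent values.
import heapq
from collections import Counter

import torch
from transformers import AutoModelForCausalLM

# Assumption: any BFloat16 checkpoint works; loading this 8B model on CPU
# needs roughly 16 GB of RAM.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", torch_dtype=torch.bfloat16
)

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # every merge adds one bit to these symbols' codes
        heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

exponent_counts = Counter()
for param in model.parameters():
    if param.dtype is torch.bfloat16:
        # BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
        bits = param.detach().contiguous().view(torch.int16).flatten().to(torch.int32)
        exponents = ((bits >> 7) & 0xFF).bincount(minlength=256)
        exponent_counts.update({e: int(c) for e, c in enumerate(exponents) if c})

total = sum(exponent_counts.values())
lengths = huffman_code_lengths(exponent_counts)
avg_exp_bits = sum(exponent_counts[s] * lengths[s] for s in exponent_counts) / total

print(f"Average Huffman code length for exponents: {avg_exp_bits:.2f} bits (raw: 8)")
# Assumed size estimate: sign and mantissa stay uncompressed (1 + 7 bits),
# only the exponent is entropy-coded.
print(f"Estimated compressed size: {(1 + avg_exp_bits + 7) / 16:.1%} of BFloat16")
```

Because most BFloat16 weights cluster in a narrow range, only a handful of exponent values dominate, so the average Huffman code length comes out well below the 8 raw bits.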
Install the `dfloat11` pip package (this installs the CUDA kernel automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

```bash
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
```
To use the DFloat11 model, run the following example code in Python:
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfloat11 import DFloat11Model

model_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DFloat11Model.from_pretrained(model_name, device_map="auto")

prompt = "Give me an introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start_time = time.time()

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)

torch.cuda.synchronize()
end_time = time.time()

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(f"Latency: {end_time - start_time:.2f} seconds")
print(f"GPU Peak Memory Usage: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f'Prompt: {prompt}')
print(f'Response: {content}')
```
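Because the compression is lossless, the DFloat11 model should reproduce the original BFloat16 model's outputs exactly. Below is a minimal sketch of such a check under greedy decoding; the 128-token limit and the prompt are arbitrary choices, and it assumes each model fits in GPU memory when loaded one at a time.

```python
# Sketch of a losslessness check: under greedy decoding, the DFloat11 model
# should generate exactly the same tokens as the original BFloat16 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfloat11 import DFloat11Model

base_name = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
df11_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"

tokenizer = AutoTokenizer.from_pretrained(base_name)
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me an introduction to large language model."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt")

def greedy_generate(model):
    # Greedy decoding keeps the comparison deterministic.
    with torch.no_grad():
        return model.generate(
            **inputs.to(model.device), max_new_tokens=128, do_sample=False
        ).cpu()

bf16_model = AutoModelForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.bfloat16, device_map="auto"
)
bf16_ids = greedy_generate(bf16_model)
del bf16_model
torch.cuda.empty_cache()  # free GPU memory before loading the DFloat11 model

df11_model = DFloat11Model.from_pretrained(df11_name, device_map="auto")
df11_ids = greedy_generate(df11_model)

assert torch.equal(bf16_ids, df11_ids), "Outputs differ!"
print("Greedy outputs are bit-identical.")
```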
Base model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B