Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (arXiv:2504.11651)
This model is a DFloat11 losslessly compressed version of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B. It is 32% smaller than the original BFloat16 model, yet produces bit-identical outputs and runs efficiently on GPUs.
| Metric | DeepSeek-R1-0528-Qwen3-8B (BFloat16) | DeepSeek-R1-0528-Qwen3-8B (DFloat11) |
|---|---|---|
| Model Size | 16.38 GB | 11.16 GB |
| Peak GPU Memory (1024-token generation) | 16.53 GB | 12.56 GB |
| Generation Time (1024 tokens, on an A100 GPU) | 47 seconds | 75 seconds |
We apply Huffman coding to the exponent bits of BFloat16 model weights, which are highly compressible. We leverage hardware-aware algorithmic designs to enable highly efficient, on-the-fly weight decompression directly on the GPU. Find out more in our research paper.
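To see why the exponent bits compress so well, here is an illustrative sketch (not the DFloat11 implementation): it counts the exponent values in a BFloat16 checkpoint and builds a Huffman code over them. The choice of checkpoint, the helper `huffman_code_lengths`, and the final size estimate (1 sign bit + Huffman-coded exponent + 7 mantissa bits) are assumptions for illustration only.

```python
# Illustrative sketch only (not the DFloat11 implementation): measure how
# compressible the 8 exponent bits of a BFloat16 checkpoint are by building
# a Huffman code over the observed exponent values.
import heapq
from collections import Counter

import torch
from transformers import AutoModelForCausalLM

# Assumption: any BFloat16 checkpoint works; loading this 8B model on CPU
# needs roughly 16 GB of RAM.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B", torch_dtype=torch.bfloat16
)

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over `freqs`."""
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # every merge adds one bit to these symbols' codes
        heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths

exponent_counts = Counter()
for param in model.parameters():
    if param.dtype is torch.bfloat16:
        # BFloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits.
        bits = param.detach().contiguous().view(torch.int16).flatten().to(torch.int32)
        exponents = ((bits >> 7) & 0xFF).bincount(minlength=256)
        exponent_counts.update({e: int(c) for e, c in enumerate(exponents) if c})

total = sum(exponent_counts.values())
lengths = huffman_code_lengths(exponent_counts)
avg_exp_bits = sum(exponent_counts[s] * lengths[s] for s in exponent_counts) / total

print(f"Average Huffman code length for exponents: {avg_exp_bits:.2f} bits (raw: 8)")
# Assumed size estimate: sign and mantissa stay uncompressed (1 + 7 bits),
# only the exponent is entropy-coded.
print(f"Estimated compressed size: {(1 + avg_exp_bits + 7) / 16:.1%} of BFloat16")
```

Because most BFloat16 weights cluster in a narrow range, only a handful of exponent values dominate, so the average Huffman code length comes out well below the 8 raw bits.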
Install the `dfloat11` pip package (this installs the CUDA kernel automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

```bash
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
```
To use the DFloat11 model, run the following example code in Python:
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfloat11 import DFloat11Model

model_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DFloat11Model.from_pretrained(model_name, device_map="auto")

prompt = "Give me an introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start_time = time.time()

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)

torch.cuda.synchronize()
end_time = time.time()

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(f"Latency: {end_time - start_time:.2f} seconds")
print(f"GPU Peak Memory Usage: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f'Prompt: {prompt}')
print(f'Response: {content}')
```
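Because the compression is lossless, the DFloat11 model should reproduce the original BFloat16 model's outputs exactly. Below is a minimal sketch of such a check under greedy decoding; the 128-token limit and the prompt are arbitrary choices, and it assumes each model fits in GPU memory when loaded one at a time.

```python
# Sketch of a losslessness check: under greedy decoding, the DFloat11 model
# should generate exactly the same tokens as the original BFloat16 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfloat11 import DFloat11Model

base_name = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
df11_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"

tokenizer = AutoTokenizer.from_pretrained(base_name)
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me an introduction to large language model."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt")

def greedy_generate(model):
    # Greedy decoding keeps the comparison deterministic.
    with torch.no_grad():
        return model.generate(
            **inputs.to(model.device), max_new_tokens=128, do_sample=False
        ).cpu()

bf16_model = AutoModelForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.bfloat16, device_map="auto"
)
bf16_ids = greedy_generate(bf16_model)
del bf16_model
torch.cuda.empty_cache()  # free GPU memory before loading the DFloat11 model

df11_model = DFloat11Model.from_pretrained(df11_name, device_map="auto")
df11_ids = greedy_generate(df11_model)

assert torch.equal(bf16_ids, df11_ids), "Outputs differ!"
print("Greedy outputs are bit-identical.")
```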
Base model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B