12/07/2025. All Files Uploaded.
Grok-2-FP8 (TevunahAi Quantization)
First public FP8 quantization of xAI's Grok-2 (270B MoE)
Model Details
| Property | Value |
|---|---|
| Base Model | xai-org/grok-2 |
| Total Parameters | 269.5B |
| Active Parameters | ~115B (2 of 8 experts) |
| Architecture | 64 layers, 8192 hidden, GQA (64/8 heads) |
| Quantization | FP8 (E4M3FN) per-channel |
| Original Size | ~539 GB (BF16) |
| Quantized Size | ~272 GB (FP8) |
| Compression | 1.98x |
Why FP8?
Storage & Download Benefits
| Connection | BF16 (539GB) | FP8 (272GB) | Time Saved |
|---|---|---|---|
| 100 Mbps | ~12 hours | ~6 hours | 6 hours |
| 50 Mbps | ~24 hours | ~12 hours | 12 hours |
| 25 Mbps | ~48 hours | ~24 hours | 24 hours |
Not everyone has fast internet. Even with hardware to run 270B models, downloading 539GB is painful. This cuts it in half.
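The table above is plain arithmetic; a quick sanity check (the 272 GB FP8 figure also includes the per-channel scales and the tensors kept in BF16, and real downloads add protocol overhead):

```python
params = 269.5e9               # total parameters, from the table above
bf16_gb = params * 2 / 1e9     # 2 bytes/param -> ~539 GB
fp8_gb = 272                   # ~1 byte/param plus scales and BF16-kept tensors

def hours_to_download(size_gb, mbps):
    """GB -> gigabits, divide by link speed in Gbps, convert seconds to hours."""
    return size_gb * 8 / (mbps / 1000) / 3600

print(f"BF16: {bf16_gb:.0f} GB, ~{hours_to_download(bf16_gb, 100):.0f} h at 100 Mbps")
print(f"FP8:  {fp8_gb:.0f} GB, ~{hours_to_download(fp8_gb, 100):.0f} h at 100 Mbps")
```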
Quality
- >99.9% cosine similarity to original BF16 weights
- ~2% relative error - near-lossless
- Per-channel scales preserve accuracy
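For users who want to reproduce these numbers against a local copy of the original BF16 weights, a minimal per-tensor check might look like the sketch below (the repo's `dequantize.py --verify` mode is the supported path; this assumes tensor names match between the two checkpoints):

```python
import torch

def compare(original: torch.Tensor, dequantized: torch.Tensor):
    """Return (cosine similarity, relative error) between two weight tensors."""
    a = original.to(torch.float32).flatten()
    b = dequantized.to(torch.float32).flatten()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    rel_err = ((a - b).norm() / a.norm()).item()
    return cos, rel_err
```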
What's Included
- FP8 quantized weights (18 shards, ~272GB)
- Per-channel dequantization scales
- HuggingFace-compatible model code
- Custom tokenizer for xAI's format
- Dequantization script
Architecture Notes
Grok-2 has a unique architecture not found in other MoE models:
```
Each Layer:
├── pre_attn_norm → self_attn → post_attn_norm
├── pre_moe_norm
├── PARALLEL:
│   ├── Shared MLP (32768 intermediate)
│   └── Sparse MoE (8 experts, top-2, 16384 each)
└── post_moe_norm
```
Key difference: most MoE models (e.g., Mixtral) run either a dense MLP or a sparse MoE block in each layer. Grok-2 runs both paths in parallel on the same input and sums their outputs (sketched below). This is why existing vLLM loaders don't work yet.
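A minimal sketch of that parallel block, assuming the dimensions from the table above; module names, activations, and routing details are simplified illustrations, not the exact implementation in `modeling_grok2.py` (the real experts use gated w1/w2/w3 projections):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelMoEBlock(nn.Module):
    """Shared dense MLP and sparse MoE run on the same input; outputs are summed."""
    def __init__(self, hidden=8192, shared_inter=32768, expert_inter=16384,
                 n_experts=8, top_k=2):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.Linear(hidden, shared_inter),
                                        nn.GELU(),
                                        nn.Linear(shared_inter, hidden))
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, expert_inter),
                          nn.GELU(),
                          nn.Linear(expert_inter, hidden))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                       # x: [tokens, hidden]
        dense_out = self.shared_mlp(x)          # always-on dense path
        weights = F.softmax(self.router(x), dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        moe_out = torch.zeros_like(x)
        for k in range(self.top_k):             # route each token to its top-k experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e
                if mask.any():
                    moe_out[mask] += topw[mask, k].unsqueeze(-1) * expert(x[mask])
        return dense_out + moe_out              # the two paths are summed, not either/or
```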
Usage
Dequantize to BF16 for Inference
Standard transformers can't compute with FP8 weights directly. Dequantize first:
```python
import torch
from safetensors import safe_open
from pathlib import Path
from tqdm import tqdm

def load_and_dequantize(model_path):
    """Load FP8 weights and dequantize to BF16."""
    model_path = Path(model_path)
    weights, scales = {}, {}

    # Collect raw FP8 tensors and their per-channel scales from every shard
    for shard in sorted(model_path.glob("*.safetensors")):
        with safe_open(str(shard), framework="pt") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                if key.endswith('.scale'):
                    scales[key[:-6]] = tensor   # strip the '.scale' suffix
                else:
                    weights[key] = tensor

    # Dequantize: bf16 = fp8 / scale (scale broadcast over dimension 0)
    dequantized = {}
    for key, tensor in tqdm(weights.items()):
        if key in scales:
            dequant = tensor.to(torch.float32) / scales[key].unsqueeze(-1)
            dequantized[key] = dequant.to(torch.bfloat16)
        else:
            dequantized[key] = tensor           # BF16-preserved tensors pass through

    return dequantized

# Load weights (point this at a local download of the repo)
state_dict = load_and_dequantize("TevunahAi/Grok-2-FP8")
```
Full Loading Example
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Load config
config = AutoConfig.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)

# Initialize model (allocates the full model in CPU RAM; see Hardware Requirements)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load dequantized weights
state_dict = load_and_dequantize("TevunahAi/Grok-2-FP8")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)
```
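Continuing from the loading example above (and given enough memory, see Hardware Requirements), generation should follow the standard transformers pattern. Note that this path is untested on real hardware, per the limitations below, and the plain-string prompt here is an assumption rather than xAI's chat template:

```python
prompt = "Explain mixture-of-experts routing in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```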
Standalone Dequantization Script
```bash
# Dequantize and save as BF16
python dequantize.py --input ./Grok-2-FP8 --output ./Grok-2-BF16

# Verify quality against original
python dequantize.py --input ./Grok-2-FP8 --verify ./original-grok-2
```
Hardware Requirements
For BF16 Inference (after dequantization)
- ~540GB RAM/VRAM required
- 8x H100 80GB recommended
- 8x A100 80GB works
Future: Native FP8 Inference
When vLLM/SGLang add Grok-2 architecture support:
- ~272GB VRAM with FP8 kernels
- Would run on 4x H100 80GB
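Once that support exists, serving could look roughly like vLLM's standard FP8 path below. This is purely illustrative and will not work today, since the Grok-2 architecture is not yet implemented in vLLM:

```python
from vllm import LLM, SamplingParams

# Hypothetical invocation, assuming future Grok-2 architecture support in vLLM
llm = LLM(model="TevunahAi/Grok-2-FP8",
          quantization="fp8",
          tensor_parallel_size=4,      # e.g. 4x H100 80GB
          trust_remote_code=True)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```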
Current Limitations
| Feature | Status |
|---|---|
| Quantization | ✅ Verified (99.9%+ cosine sim) |
| Dequantization | ✅ Verified |
| Model loading | ✅ Verified (1.2TB memory test) |
| Text generation | ⚠️ Not tested (hardware limited) |
| vLLM support | ❌ Architecture not implemented |
| SGLang support | ❌ Architecture not implemented |
Note: Full inference testing requires >540GB memory. Quantization quality was verified mathematically. Users with appropriate hardware can validate generation quality.
Quantization Details
What's Quantized (FP8)
- Attention: Q, K, V, O projections
- Shared MLP: gate, up, down projections
- MoE Experts: w1, w2, w3 (8 experts Γ 64 layers)
- Total: 1,984 tensors
What's Preserved (BF16)
- Embeddings (embed_tokens, lm_head)
- All layer norms
- MoE router gates
- Total: 323 tensors
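Expressed as a simple predicate (illustrative only; the authoritative tensor lists live in `tevunahai_quant_info.json`, and the name fragments below are assumptions):

```python
BF16_KEYWORDS = ("embed_tokens", "lm_head", "norm", "router")  # assumed name fragments

def keep_in_bf16(tensor_name: str) -> bool:
    """True for embeddings, layer norms, and MoE router gates; everything else is FP8."""
    return any(kw in tensor_name for kw in BF16_KEYWORDS)
```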
Dequantization Formula
bf16_weight = fp8_weight / scale
Scale is per output channel (dimension 0).
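In tensor terms, each quantized weight ships with a matching `.scale` tensor holding one value per output row. A shape-level sketch (values and sizes are illustrative; requires a PyTorch build with float8 support):

```python
import torch

# Illustrative shapes for one projection matrix
fp8_weight = torch.randn(8192, 8192).to(torch.float8_e4m3fn)  # [out, in]
scale = torch.rand(8192) + 0.5                                # [out], one scale per output channel

# bf16 = fp8 / scale, with the scale broadcast along dimension 0
bf16_weight = (fp8_weight.to(torch.float32) / scale.unsqueeze(-1)).to(torch.bfloat16)
```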
Files
```
├── config.json                     # Model config
├── configuration_grok2.py          # Config class
├── modeling_grok2.py               # Model implementation
├── tokenization_grok2.py           # Tokenizer
├── tokenizer.tok.json              # Vocabulary
├── tokenizer_config.json           # Tokenizer config
├── __init__.py                     # Module init
├── dequantize.py                   # Dequantization script
├── load_grok2_fp8.py               # Loading helper
├── model.safetensors.index.json    # Weight index
├── pytorch_model-*.safetensors     # FP8 weights (~272GB)
└── tevunahai_quant_info.json       # Quantization metadata
```
License
Inherits Grok-2 License from xAI.
Acknowledgments
- xAI for releasing Grok-2
- Original model: xai-org/grok-2
Quantized by TevunahAi