12/07/2025: All files uploaded.

Grok-2-FP8 (TevunahAi Quantization)

First public FP8 quantization of xAI's Grok-2 (270B MoE)

Model Details

Property            Value
Base Model          xai-org/grok-2
Total Parameters    269.5B
Active Parameters   ~115B (2 of 8 experts)
Architecture        64 layers, 8192 hidden size, GQA (64 query / 8 KV heads)
Quantization        FP8 (E4M3FN), per-channel
Original Size       ~539 GB (BF16)
Quantized Size      ~272 GB (FP8)
Compression         1.98x

Why FP8?

Storage & Download Benefits

Connection   BF16 (539 GB)   FP8 (272 GB)   Time Saved
100 Mbps     ~12 hours       ~6 hours       6 hours
50 Mbps      ~24 hours       ~12 hours      12 hours
25 Mbps      ~48 hours       ~24 hours      24 hours

Not everyone has fast internet. Even with the hardware to run a 270B-parameter model, downloading 539 GB is painful; FP8 cuts that roughly in half.
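
The table follows from simple bandwidth arithmetic. A quick sanity check (assuming decimal gigabytes and ideal sustained throughput, no protocol overhead):

def download_hours(size_gb, link_mbps):
    """Ideal transfer time in hours: gigabytes -> megabits, divided by link speed."""
    return size_gb * 8 * 1000 / link_mbps / 3600

print(round(download_hours(539, 100), 1))  # ~12.0 h for BF16 at 100 Mbps
print(round(download_hours(272, 100), 1))  # ~6.0 h for FP8 at 100 Mbps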

Quality

  • >99.9% cosine similarity to original BF16 weights
  • ~2% relative error - near-lossless
  • Per-channel scales preserve accuracy
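
These figures can be reproduced by comparing any dequantized tensor against its BF16 original. A minimal sketch (the tensor key in the comment is illustrative):

import torch

def quant_error_stats(original, dequantized):
    """Cosine similarity and relative L2 error between an original and a dequantized weight."""
    a = original.to(torch.float32).flatten()
    b = dequantized.to(torch.float32).flatten()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    rel = ((a - b).norm() / a.norm()).item()
    return cos, rel

# Example (key name is illustrative):
# quant_error_stats(orig["model.layers.0.self_attn.q_proj.weight"],
#                   deq["model.layers.0.self_attn.q_proj.weight"])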

What's Included

  • FP8 quantized weights (18 shards, ~272GB)
  • Per-channel dequantization scales
  • HuggingFace-compatible model code
  • Custom tokenizer for xAI's format
  • Dequantization script

Architecture Notes

Grok-2 uses a layer structure not found in most other MoE models:

Each Layer:
β”œβ”€β”€ pre_attn_norm β†’ self_attn β†’ post_attn_norm
β”œβ”€β”€ pre_moe_norm
β”œβ”€β”€ PARALLEL:
β”‚   β”œβ”€β”€ Shared MLP (32768 intermediate) 
β”‚   └── Sparse MoE (8 experts, top-2, 16384 each)
└── post_moe_norm

Key difference: most MoE models (e.g. Mixtral) run either a dense MLP or a sparse MoE block in each layer. Grok-2 runs both in parallel and sums their outputs, which is why existing vLLM loaders don't work yet.
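
To make the layout concrete, here is a toy, self-contained layer that mirrors the diagram above. It is an illustration only, not the modeling_grok2.py implementation: module names, LayerNorm in place of the real norms, and the exact residual placement are assumptions.

import torch
import torch.nn as nn

class ToyGrok2Layer(nn.Module):
    """Toy illustration of the parallel shared-MLP + sparse-MoE layer (not the real code)."""

    def __init__(self, hidden=64, shared_inter=128, expert_inter=32, n_experts=8, top_k=2):
        super().__init__()
        norm = lambda: nn.LayerNorm(hidden)   # the real model uses RMSNorm-style norms
        self.pre_attn_norm, self.post_attn_norm = norm(), norm()
        self.pre_moe_norm, self.post_moe_norm = norm(), norm()
        self.self_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.shared_mlp = nn.Sequential(nn.Linear(hidden, shared_inter), nn.SiLU(),
                                        nn.Linear(shared_inter, hidden))
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, expert_inter), nn.SiLU(),
                          nn.Linear(expert_inter, hidden))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # Attention block with pre/post norms and a residual connection
        residual = x
        h = self.pre_attn_norm(x)
        h, _ = self.self_attn(h, h, h)
        x = residual + self.post_attn_norm(h)

        # Shared dense MLP and top-2 sparse MoE run on the same normed input
        residual = x
        h = self.pre_moe_norm(x)
        weights, idx = torch.topk(self.router(h), self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        moe_out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                moe_out = moe_out + mask * weights[..., k:k + 1] * expert(h)
        ffn_out = self.shared_mlp(h) + moe_out      # outputs are SUMMED, not either/or
        return residual + self.post_moe_norm(ffn_out)

# Quick shape check
layer = ToyGrok2Layer()
print(layer(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])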

Usage

Dequantize to BF16 for Inference

Standard transformers can't compute with FP8 weights directly. Dequantize first:

import torch
from safetensors import safe_open
from pathlib import Path
from tqdm import tqdm

def load_and_dequantize(model_path):
    """Load FP8 weights and dequantize to BF16."""
    model_path = Path(model_path)
    weights, scales = {}, {}
    
    for shard in sorted(model_path.glob("*.safetensors")):
        with safe_open(str(shard), framework="pt") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                if key.endswith('.scale'):
                    scales[key[:-6]] = tensor
                else:
                    weights[key] = tensor
    
    # Dequantize: bf16 = fp8 / scale
    dequantized = {}
    for key, tensor in tqdm(weights.items()):
        if key in scales:
            dequant = tensor.to(torch.float32) / scales[key].unsqueeze(-1)
            dequantized[key] = dequant.to(torch.bfloat16)
        else:
            dequantized[key] = tensor
    
    return dequantized

# Load weights from a local copy of the repo (the function globs *.safetensors on disk)
state_dict = load_and_dequantize("./Grok-2-FP8")
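
If the shards are not already on disk, the repository can be fetched with huggingface_hub first and the returned path passed in instead:

from huggingface_hub import snapshot_download

# Download the FP8 shards and code files, then dequantize from the local snapshot
local_dir = snapshot_download("TevunahAi/Grok-2-FP8")
state_dict = load_and_dequantize(local_dir)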

Full Loading Example

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Load config
config = AutoConfig.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)

# Initialize model
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load dequantized weights (load_and_dequantize from above; expects a local directory of shards)
state_dict = load_and_dequantize("./Grok-2-FP8")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)
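
With the model and tokenizer loaded, generation uses the standard transformers API, assuming the custom model class supports the usual generate path. Text generation has not yet been validated on this checkpoint (see Current Limitations), so treat this as an untested sketch:

prompt = "Explain mixture-of-experts models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))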

Standalone Dequantization Script

# Dequantize and save as BF16
python dequantize.py --input ./Grok-2-FP8 --output ./Grok-2-BF16

# Verify quality against original
python dequantize.py --input ./Grok-2-FP8 --verify ./original-grok-2

Hardware Requirements

For BF16 Inference (after dequantization)

  • ~540GB RAM/VRAM required
  • 8x H100 80GB recommended
  • 8x A100 80GB works

Future: Native FP8 Inference

When vLLM/SGLang add Grok-2 architecture support:

  • ~272GB VRAM with FP8 kernels
  • Would run on 4x H100 80GB
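
A rough, weights-only way to estimate GPU count for each precision (the 15% overhead margin is an assumption; KV cache and activations add more in practice, so treat this as a lower bound):

import math

def min_gpus(weight_gb, gpu_gb=80, overhead_frac=0.15):
    """Lower-bound GPU count from weight size plus a rough overhead margin."""
    return math.ceil(weight_gb * (1 + overhead_frac) / gpu_gb)

print(min_gpus(540))  # BF16 weights -> 8x 80GB GPUs
print(min_gpus(272))  # FP8 weights  -> 4x 80GB GPUs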

Current Limitations

Feature           Status
Quantization      ✅ Verified (99.9%+ cosine sim)
Dequantization    ✅ Verified
Model loading     ✅ Verified (1.2TB memory test)
Text generation   ⚠️ Not tested (hardware limited)
vLLM support      ❌ Architecture not implemented
SGLang support    ❌ Architecture not implemented

Note: Full inference testing requires >540 GB of memory. Quantization quality was verified numerically against the original weights (cosine similarity, relative error); users with appropriate hardware are encouraged to validate generation quality.

Quantization Details

What's Quantized (FP8)

  • Attention: Q, K, V, O projections
  • Shared MLP: gate, up, down projections
  • MoE Experts: w1, w2, w3 (8 experts Γ— 64 layers)
  • Total: 1,984 tensors

What's Preserved (BF16)

  • Embeddings (embed_tokens, lm_head)
  • All layer norms
  • MoE router gates
  • Total: 323 tensors

Dequantization Formula

bf16_weight = fp8_weight / scale

Scale is per output channel (dimension 0).
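
For reference, a per-output-channel FP8 round trip consistent with this formula looks roughly like the following; 448 is the E4M3FN maximum representable value, the exact scale selection used for this checkpoint is an assumption, and torch.float8_e4m3fn requires a recent PyTorch build:

import torch

def fp8_roundtrip(weight_bf16):
    """Quantize a 2-D weight to FP8 E4M3 with one scale per output channel, then dequantize."""
    w = weight_bf16.to(torch.float32)
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # per-row max magnitude
    scale = 448.0 / amax                                        # one scale per output channel (dim 0)
    fp8 = (w * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    bf16 = (fp8.to(torch.float32) / scale).to(torch.bfloat16)   # bf16_weight = fp8_weight / scale
    return fp8, scale.squeeze(-1), bf16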

Files

β”œβ”€β”€ config.json                    # Model config
β”œβ”€β”€ configuration_grok2.py         # Config class
β”œβ”€β”€ modeling_grok2.py              # Model implementation
β”œβ”€β”€ tokenization_grok2.py          # Tokenizer
β”œβ”€β”€ tokenizer.tok.json             # Vocabulary
β”œβ”€β”€ tokenizer_config.json          # Tokenizer config
β”œβ”€β”€ __init__.py                    # Module init
β”œβ”€β”€ dequantize.py                  # Dequantization script
β”œβ”€β”€ load_grok2_fp8.py              # Loading helper
β”œβ”€β”€ model.safetensors.index.json   # Weight index
β”œβ”€β”€ pytorch_model-*.safetensors    # FP8 weights (~272GB)
└── tevunahai_quant_info.json      # Quantization metadata

License

Inherits the Grok-2 license from xAI.

Acknowledgments


Quantized by TevunahAi
