12/07/2025: All files uploaded.

Grok-2-FP8 (TevunahAi Quantization)

First public FP8 quantization of xAI's Grok-2 (270B MoE)

Model Details

Property            Value
Base Model          xai-org/grok-2
Total Parameters    269.5B
Active Parameters   ~115B (2 of 8 experts)
Architecture        64 layers, 8192 hidden size, GQA (64 query / 8 KV heads)
Quantization        FP8 (E4M3FN), per-channel
Original Size       ~539 GB (BF16)
Quantized Size      ~272 GB (FP8)
Compression         1.98x

Why FP8?

Storage & Download Benefits

Connection   BF16 (539 GB)   FP8 (272 GB)   Time Saved
100 Mbps     ~12 hours       ~6 hours       6 hours
50 Mbps      ~24 hours       ~12 hours      12 hours
25 Mbps      ~48 hours       ~24 hours      24 hours

Not everyone has fast internet. Even with the hardware to run a 270B-parameter model, downloading 539 GB is painful; FP8 cuts that roughly in half.
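
The table follows from simple bandwidth arithmetic. A quick sanity check (assuming decimal gigabytes and ideal sustained throughput, no protocol overhead):

def download_hours(size_gb, link_mbps):
    """Ideal transfer time in hours: gigabytes -> megabits, divided by link speed."""
    return size_gb * 8 * 1000 / link_mbps / 3600

print(round(download_hours(539, 100), 1))  # ~12.0 h for BF16 at 100 Mbps
print(round(download_hours(272, 100), 1))  # ~6.0 h for FP8 at 100 Mbps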

Quality

  • >99.9% cosine similarity to original BF16 weights
  • ~2% relative error - near-lossless
  • Per-channel scales preserve accuracy
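
These figures can be reproduced by comparing any dequantized tensor against its BF16 original. A minimal sketch (the tensor key in the comment is illustrative):

import torch

def quant_error_stats(original, dequantized):
    """Cosine similarity and relative L2 error between an original and a dequantized weight."""
    a = original.to(torch.float32).flatten()
    b = dequantized.to(torch.float32).flatten()
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
    rel = ((a - b).norm() / a.norm()).item()
    return cos, rel

# Example (key name is illustrative):
# quant_error_stats(orig["model.layers.0.self_attn.q_proj.weight"],
#                   deq["model.layers.0.self_attn.q_proj.weight"])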

What's Included

  • FP8 quantized weights (18 shards, ~272GB)
  • Per-channel dequantization scales
  • HuggingFace-compatible model code
  • Custom tokenizer for xAI's format
  • Dequantization script

Architecture Notes

Grok-2 uses a layer structure not found in most other MoE models:

Each Layer:
β”œβ”€β”€ pre_attn_norm β†’ self_attn β†’ post_attn_norm
β”œβ”€β”€ pre_moe_norm
β”œβ”€β”€ PARALLEL:
β”‚   β”œβ”€β”€ Shared MLP (32768 intermediate) 
β”‚   └── Sparse MoE (8 experts, top-2, 16384 each)
└── post_moe_norm

Key difference: most MoE models (e.g. Mixtral) run either a dense MLP or a sparse MoE block in each layer. Grok-2 runs both in parallel and sums their outputs, which is why existing vLLM loaders don't work yet.
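
To make the layout concrete, here is a toy, self-contained layer that mirrors the diagram above. It is an illustration only, not the modeling_grok2.py implementation: module names, LayerNorm in place of the real norms, and the exact residual placement are assumptions.

import torch
import torch.nn as nn

class ToyGrok2Layer(nn.Module):
    """Toy illustration of the parallel shared-MLP + sparse-MoE layer (not the real code)."""

    def __init__(self, hidden=64, shared_inter=128, expert_inter=32, n_experts=8, top_k=2):
        super().__init__()
        norm = lambda: nn.LayerNorm(hidden)   # the real model uses RMSNorm-style norms
        self.pre_attn_norm, self.post_attn_norm = norm(), norm()
        self.pre_moe_norm, self.post_moe_norm = norm(), norm()
        self.self_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.shared_mlp = nn.Sequential(nn.Linear(hidden, shared_inter), nn.SiLU(),
                                        nn.Linear(shared_inter, hidden))
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, expert_inter), nn.SiLU(),
                          nn.Linear(expert_inter, hidden))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # Attention block with pre/post norms and a residual connection
        residual = x
        h = self.pre_attn_norm(x)
        h, _ = self.self_attn(h, h, h)
        x = residual + self.post_attn_norm(h)

        # Shared dense MLP and top-2 sparse MoE run on the same normed input
        residual = x
        h = self.pre_moe_norm(x)
        weights, idx = torch.topk(self.router(h), self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        moe_out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                moe_out = moe_out + mask * weights[..., k:k + 1] * expert(h)
        ffn_out = self.shared_mlp(h) + moe_out      # outputs are SUMMED, not either/or
        return residual + self.post_moe_norm(ffn_out)

# Quick shape check
layer = ToyGrok2Layer()
print(layer(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])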

Usage

Dequantize to BF16 for Inference

Standard transformers can't compute with FP8 weights directly. Dequantize first:

import torch
from safetensors import safe_open
from pathlib import Path
from tqdm import tqdm

def load_and_dequantize(model_path):
    """Load FP8 weights and dequantize to BF16."""
    model_path = Path(model_path)
    weights, scales = {}, {}
    
    for shard in sorted(model_path.glob("*.safetensors")):
        with safe_open(str(shard), framework="pt") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                if key.endswith('.scale'):
                    scales[key[:-6]] = tensor
                else:
                    weights[key] = tensor
    
    # Dequantize: bf16 = fp8 / scale
    dequantized = {}
    for key, tensor in tqdm(weights.items()):
        if key in scales:
            dequant = tensor.to(torch.float32) / scales[key].unsqueeze(-1)
            dequantized[key] = dequant.to(torch.bfloat16)
        else:
            dequantized[key] = tensor
    
    return dequantized

# Load weights from a local copy of the repo (the function globs *.safetensors on disk)
state_dict = load_and_dequantize("./Grok-2-FP8")
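
If the shards are not already on disk, the repository can be fetched with huggingface_hub first and the returned path passed in instead:

from huggingface_hub import snapshot_download

# Download the FP8 shards and code files, then dequantize from the local snapshot
local_dir = snapshot_download("TevunahAi/Grok-2-FP8")
state_dict = load_and_dequantize(local_dir)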

Full Loading Example

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Load config
config = AutoConfig.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)

# Initialize model
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Load dequantized weights (load_and_dequantize from above; expects a local directory of shards)
state_dict = load_and_dequantize("./Grok-2-FP8")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/Grok-2-FP8", trust_remote_code=True)
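
With the model and tokenizer loaded, generation uses the standard transformers API, assuming the custom model class supports the usual generate path. Text generation has not yet been validated on this checkpoint (see Current Limitations), so treat this as an untested sketch:

prompt = "Explain mixture-of-experts models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))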

Standalone Dequantization Script

# Dequantize and save as BF16
python dequantize.py --input ./Grok-2-FP8 --output ./Grok-2-BF16

# Verify quality against original
python dequantize.py --input ./Grok-2-FP8 --verify ./original-grok-2

Hardware Requirements

For BF16 Inference (after dequantization)

  • ~540GB RAM/VRAM required
  • 8x H100 80GB recommended
  • 8x A100 80GB works

Future: Native FP8 Inference

When vLLM/SGLang add Grok-2 architecture support:

  • ~272GB VRAM with FP8 kernels
  • Would run on 4x H100 80GB
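
A rough, weights-only way to estimate GPU count for each precision (the 15% overhead margin is an assumption; KV cache and activations add more in practice, so treat this as a lower bound):

import math

def min_gpus(weight_gb, gpu_gb=80, overhead_frac=0.15):
    """Lower-bound GPU count from weight size plus a rough overhead margin."""
    return math.ceil(weight_gb * (1 + overhead_frac) / gpu_gb)

print(min_gpus(540))  # BF16 weights -> 8x 80GB GPUs
print(min_gpus(272))  # FP8 weights  -> 4x 80GB GPUs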

Current Limitations

Feature           Status
Quantization      ✅ Verified (99.9%+ cosine sim)
Dequantization    ✅ Verified
Model loading     ✅ Verified (1.2TB memory test)
Text generation   ⚠️ Not tested (hardware limited)
vLLM support      ❌ Architecture not implemented
SGLang support    ❌ Architecture not implemented

Note: Full inference testing requires >540 GB of memory. Quantization quality was verified numerically against the original weights (cosine similarity, relative error); users with appropriate hardware are encouraged to validate generation quality.

Quantization Details

What's Quantized (FP8)

  • Attention: Q, K, V, O projections
  • Shared MLP: gate, up, down projections
  • MoE Experts: w1, w2, w3 (8 experts Γ— 64 layers)
  • Total: 1,984 tensors

What's Preserved (BF16)

  • Embeddings (embed_tokens, lm_head)
  • All layer norms
  • MoE router gates
  • Total: 323 tensors

Dequantization Formula

bf16_weight = fp8_weight / scale

Scale is per output channel (dimension 0).
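
For reference, a per-output-channel FP8 round trip consistent with this formula looks roughly like the following; 448 is the E4M3FN maximum representable value, the exact scale selection used for this checkpoint is an assumption, and torch.float8_e4m3fn requires a recent PyTorch build:

import torch

def fp8_roundtrip(weight_bf16):
    """Quantize a 2-D weight to FP8 E4M3 with one scale per output channel, then dequantize."""
    w = weight_bf16.to(torch.float32)
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)   # per-row max magnitude
    scale = 448.0 / amax                                        # one scale per output channel (dim 0)
    fp8 = (w * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    bf16 = (fp8.to(torch.float32) / scale).to(torch.bfloat16)   # bf16_weight = fp8_weight / scale
    return fp8, scale.squeeze(-1), bf16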

Files

β”œβ”€β”€ config.json                    # Model config
β”œβ”€β”€ configuration_grok2.py         # Config class
β”œβ”€β”€ modeling_grok2.py              # Model implementation
β”œβ”€β”€ tokenization_grok2.py          # Tokenizer
β”œβ”€β”€ tokenizer.tok.json             # Vocabulary
β”œβ”€β”€ tokenizer_config.json          # Tokenizer config
β”œβ”€β”€ __init__.py                    # Module init
β”œβ”€β”€ dequantize.py                  # Dequantization script
β”œβ”€β”€ load_grok2_fp8.py              # Loading helper
β”œβ”€β”€ model.safetensors.index.json   # Weight index
β”œβ”€β”€ pytorch_model-*.safetensors    # FP8 weights (~272GB)
└── tevunahai_quant_info.json      # Quantization metadata

License

Inherits the Grok-2 license from xAI.

Acknowledgments


Quantized by TevunahAi
