Gemma 4 Sparse Autoencoder (SAE)

First open-source Sparse Autoencoder trained on Gemma 4 26B activations.

Built for the ErnOSAgent neural interpretability pipeline.

Architecture

Parameter           Value
Features            131,072
Model Dimension     2,816
Expansion Factor    46.5×
Format              SafeTensors
Source Model        Gemma 4 26B IT (Q4_K_M)
Training Hardware   Apple M3 Ultra (512 GB RAM)
Extraction Layer    Last-layer residual stream

Files

  • gemma4_sae_1m.safetensors – SAE encoder/decoder weights (2.8 GB)
  • feature_map.json – 195 labeled features via automated probing

Usage

With ErnOSAgent (Rust)

# Place weights in the data directory
mkdir -p ~/.ernosagent/sae_training/
# Download weights
huggingface-cli download MettaMazza/gemma4-sae gemma4_sae_1m.safetensors --local-dir ~/.ernosagent/sae_training/
# Download feature map
huggingface-cli download MettaMazza/gemma4-sae feature_map.json --local-dir ~/.ernosagent/sae_training/

# Run ErnOS – the SAE loads automatically
cd ~/Desktop/ErnOSAgent && cargo run --release

With Python

from safetensors import safe_open
import numpy as np

with safe_open("gemma4_sae_1m.safetensors", framework="numpy") as f:
    encoder = f.get_tensor("encoder.weight")  # [131072, 2816]
    decoder = f.get_tensor("decoder.weight")  # [2816, 131072]
    bias = f.get_tensor("encoder.bias")       # [131072]

# Encode activations → sparse features
activations = np.random.randn(2816).astype(np.float32)  # from model
features = np.maximum(0, encoder @ activations + bias)    # ReLU

# Top-k active features
top_k = np.argsort(features)[-20:][::-1]
for idx in top_k:
    if features[idx] > 0:
        print(f"Feature {idx}: {features[idx]:.3f}")
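Features can also be mapped back into the residual stream with the decoder. A minimal sketch, assuming the tensor layout shown above (decoder.weight is [2816, 131072], so decoding is a plain matmul); shapes are scaled down here purely for illustration:

```python
import numpy as np

# Scaled-down stand-ins for illustration; the real SAE uses
# d_model = 2816 and n_features = 131072.
d_model, n_features = 16, 64
rng = np.random.default_rng(0)
decoder = rng.standard_normal((d_model, n_features)).astype(np.float32)
features = np.maximum(0, rng.standard_normal(n_features)).astype(np.float32)

# Decode: sparse feature vector -> dense residual-stream activation.
reconstruction = decoder @ features  # shape (d_model,)
print(reconstruction.shape)
```

Comparing `reconstruction` against the original activation vector (e.g. `np.linalg.norm(activations - reconstruction)`) gives a quick sanity check of reconstruction quality.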

Feature Map

The feature_map.json contains 195 human-interpretable labels mapped to SAE feature indices via automated probing. Categories include:

  • Reasoning: Chain-of-thought, logical deduction, mathematical reasoning
  • Safety: Refusal, deception detection, bias detection, power-seeking
  • Cognitive: Creativity, recall, planning, context integration
  • Emotional: Valence, arousal, emotional tone detection
  • Technical: Code generation, technical depth, language detection
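The exact JSON schema of feature_map.json is not documented here; a plausible lookup sketch, assuming it maps feature indices (as string keys) to label strings — inspect the real file and adjust the parsing to match:

```python
import json

# Hypothetical schema: {"<feature index>": "<label>", ...}.
# Replace this inline string with: json.load(open("feature_map.json"))
feature_map = json.loads('{"12345": "chain-of-thought", "67": "refusal"}')

def label_for(idx: int) -> str:
    """Return the human label for an SAE feature index, if it was probed."""
    return feature_map.get(str(idx), "<unlabeled>")

print(label_for(12345))  # chain-of-thought
print(label_for(99))     # <unlabeled>
```

With only 195 labels over 131,072 features, the vast majority of indices will fall through to the unlabeled case.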

Training

Trained using ErnOSAgent's native SAE training pipeline (cargo run -- --train-sae):

  1. Activation Collection: Extract 2816-dim residual stream vectors from Gemma 4 26B via llama.cpp's native /embedding endpoint
  2. Training: TopK sparse autoencoder with gradient descent (k=64, LR=3e-4)
  3. Probing: Automated feature labeling via targeted prompt pairs

License

MIT – same as ErnOSAgent.

Citation

@misc{mettamazza2026gemma4sae,
  title={Gemma 4 Sparse Autoencoder for Neural Interpretability},
  author={MettaMazza},
  year={2026},
  url={https://huggingface.co/MettaMazza/gemma4-sae}
}