Gemma 4 Sparse Autoencoder (SAE)

First open-source Sparse Autoencoder trained on Gemma 4 26B activations.

Built for the ErnOSAgent neural interpretability pipeline.

Architecture

Parameter           Value
Features            131,072
Model Dimension     2,816
Expansion Factor    46.5×
Format              SafeTensors
Source Model        Gemma 4 26B IT (Q4_K_M)
Training Hardware   Apple M3 Ultra (512 GB RAM)
Extraction Layer    Last-layer residual stream

Files

  • gemma4_sae_1m.safetensors – SAE encoder/decoder weights (2.8 GB)
  • feature_map.json – 195 labeled features via automated probing

Usage

With ErnOSAgent (Rust)

# Place weights in the data directory
mkdir -p ~/.ernosagent/sae_training/
# Download weights
huggingface-cli download MettaMazza/gemma4-sae gemma4_sae_1m.safetensors --local-dir ~/.ernosagent/sae_training/
# Download feature map
huggingface-cli download MettaMazza/gemma4-sae feature_map.json --local-dir ~/.ernosagent/sae_training/

# Run ErnOS – the SAE loads automatically
cd ~/Desktop/ErnOSAgent && cargo run --release

With Python

from safetensors import safe_open
import numpy as np

with safe_open("gemma4_sae_1m.safetensors", framework="numpy") as f:
    encoder = f.get_tensor("encoder.weight")  # [131072, 2816]
    decoder = f.get_tensor("decoder.weight")  # [2816, 131072]
    bias = f.get_tensor("encoder.bias")       # [131072]

# Encode activations → sparse features
activations = np.random.randn(2816).astype(np.float32)  # from model
features = np.maximum(0, encoder @ activations + bias)    # ReLU

# Top-k active features
top_k = np.argsort(features)[-20:][::-1]
for idx in top_k:
    if features[idx] > 0:
        print(f"Feature {idx}: {features[idx]:.3f}")
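Features can also be mapped back into the residual stream with the decoder. A minimal sketch, assuming the tensor layout shown above (decoder.weight is [2816, 131072], so decoding is a plain matmul); shapes are scaled down here purely for illustration:

```python
import numpy as np

# Scaled-down stand-ins for illustration; the real SAE uses
# d_model = 2816 and n_features = 131072.
d_model, n_features = 16, 64
rng = np.random.default_rng(0)
decoder = rng.standard_normal((d_model, n_features)).astype(np.float32)
features = np.maximum(0, rng.standard_normal(n_features)).astype(np.float32)

# Decode: sparse feature vector -> dense residual-stream activation.
reconstruction = decoder @ features  # shape (d_model,)
print(reconstruction.shape)
```

Comparing `reconstruction` against the original activation vector (e.g. `np.linalg.norm(activations - reconstruction)`) gives a quick sanity check of reconstruction quality.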

Feature Map

The feature_map.json contains 195 human-interpretable labels mapped to SAE feature indices via automated probing. Categories include:

  • Reasoning: Chain-of-thought, logical deduction, mathematical reasoning
  • Safety: Refusal, deception detection, bias detection, power-seeking
  • Cognitive: Creativity, recall, planning, context integration
  • Emotional: Valence, arousal, emotional tone detection
  • Technical: Code generation, technical depth, language detection
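The exact JSON schema of feature_map.json is not documented here; a plausible lookup sketch, assuming it maps feature indices (as string keys) to label strings — inspect the real file and adjust the parsing to match:

```python
import json

# Hypothetical schema: {"<feature index>": "<label>", ...}.
# Replace this inline string with: json.load(open("feature_map.json"))
feature_map = json.loads('{"12345": "chain-of-thought", "67": "refusal"}')

def label_for(idx: int) -> str:
    """Return the human label for an SAE feature index, if it was probed."""
    return feature_map.get(str(idx), "<unlabeled>")

print(label_for(12345))  # chain-of-thought
print(label_for(99))     # <unlabeled>
```

With only 195 labels over 131,072 features, the vast majority of indices will fall through to the unlabeled case.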

Training

Trained using ErnOSAgent's native SAE training pipeline (cargo run -- --train-sae):

  1. Activation Collection: Extract 2816-dim residual stream vectors from Gemma 4 26B via llama.cpp's native /embedding endpoint
  2. Training: TopK sparse autoencoder with gradient descent (k=64, LR=3e-4)
  3. Probing: Automated feature labeling via targeted prompt pairs

License

MIT – same as ErnOSAgent.

Citation

@misc{mettamazza2026gemma4sae,
  title={Gemma 4 Sparse Autoencoder for Neural Interpretability},
  author={MettaMazza},
  year={2026},
  url={https://huggingface.co/MettaMazza/gemma4-sae}
}