---
license: mit
library_name: safetensors
tags:
  - sparse-autoencoder
  - interpretability
  - gemma-4
  - mechanistic-interpretability
  - sae
  - neural-interpretability
datasets:
  - self-generated
language:
  - en
base_model: google/gemma-4-27b-it
---

# Gemma 4 Sparse Autoencoder (SAE)

**First open-source Sparse Autoencoder trained on Gemma 4 26B activations.**

Built for the [ErnOSAgent](https://github.com/MettaMazza/ErnOSAgent) neural interpretability pipeline.

## Architecture

| Parameter | Value |
|---|---|
| Features | 131,072 |
| Model Dimension | 2,816 |
| Expansion Factor | 46.5× |
| Format | SafeTensors |
| Source Model | Gemma 4 26B IT (Q4_K_M) |
| Training Hardware | Apple M3 Ultra (512GB RAM) |
| Extraction Layer | Last-layer residual stream |
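
The expansion factor and a rough weight size follow from the two dimensions in the table. The check below is a sketch under two assumptions that this card does not state: that the file holds only the three tensors read in the Python usage example (encoder matrix, decoder matrix, encoder bias), and that they are stored as 32-bit floats.

```python
# Dimensions from the Architecture table.
n_features = 131_072
d_model = 2_816

# Expansion factor: dictionary size relative to the model dimension.
expansion = n_features / d_model

# Parameter count, ASSUMING one encoder matrix, one decoder matrix,
# and one encoder bias (no other tensors).
n_params = 2 * n_features * d_model + n_features

# Size on disk, ASSUMING float32 storage (4 bytes per parameter).
size_gb = n_params * 4 / 1e9

print(f"expansion factor: {expansion:.1f}x")
print(f"parameters: {n_params:,}")
print(f"approx. size: {size_gb:.2f} GB")
```

Under those assumptions the estimate lands close to the 2.8GB file size listed under Files.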

## Files

- `gemma4_sae_1m.safetensors` — SAE encoder/decoder weights (2.8GB)
- `feature_map.json` — 195 labeled features via automated probing
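
The schema of `feature_map.json` is not documented here, so it helps to look at the file before relying on a particular layout. The helper below (`summarize_feature_map` is a hypothetical name, not part of ErnOSAgent) only reports the top-level JSON type, the entry count, and a few sample entries; it assumes nothing about the keys.

```python
import json
from pathlib import Path


def summarize_feature_map(path):
    """Return (top-level type name, entry count, up to three sample entries)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        samples = list(data.items())[:3]
    elif isinstance(data, list):
        samples = data[:3]
    else:
        # Scalar top-level value: nothing to sample beyond the value itself.
        return type(data).__name__, 1, [data]
    return type(data).__name__, len(data), samples


# Only runs the summary if the file has been downloaded next to this script.
path = Path("feature_map.json")
if path.exists():
    kind, count, samples = summarize_feature_map(path)
    print(f"top-level {kind} with {count} entries")
    for sample in samples:
        print(sample)
```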

## Usage

### With ErnOSAgent (Rust)

```bash
# Create the data directory for the SAE files
mkdir -p ~/.ernosagent/sae_training/
# Download weights
huggingface-cli download MettaMazza/gemma4-sae gemma4_sae_1m.safetensors --local-dir ~/.ernosagent/sae_training/
# Download feature map
huggingface-cli download MettaMazza/gemma4-sae feature_map.json --local-dir ~/.ernosagent/sae_training/

# Run ErnOS from your ErnOSAgent checkout (adjust the path); the SAE loads automatically
cd ~/Desktop/ErnOSAgent && cargo run --release
```

### With Python

```python
from safetensors import safe_open
import numpy as np

with safe_open("gemma4_sae_1m.safetensors", framework="numpy") as f:
    encoder = f.get_tensor("encoder.weight")  # [131072, 2816]
    decoder = f.get_tensor("decoder.weight")  # [2816, 131072]
    bias = f.get_tensor("encoder.bias")       # [131072]

# Encode activations → sparse features
# Placeholder input: replace with a real 2816-dim activation vector from the model
activations = np.random.randn(2816).astype(np.float32)
features = np.maximum(0, encoder @ activations + bias)    # ReLU

# Top-k active features
top_k = np.argsort(features)[-20:][::-1]
for idx in top_k:
    if features[idx] > 0:
        print(f"Feature {idx}: {features[idx]:.3f}")
```
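
The example above loads `decoder` but does not use it. The sketch below shows how the decoder could map sparse features back to the activation space, and how a TopK step (the Training section mentions k=64) could be applied on top of the ReLU. Two things are assumptions here: that there is no decoder bias, and that TopK is applied at inference at all; neither is stated in this card. It uses small random matrices so it runs without the weights file, and the helper names `encode` and `decode` are illustrative.

```python
import numpy as np


def encode(x, encoder, bias, k=None):
    """ReLU encoding; if k is given, keep only the k largest activations."""
    features = np.maximum(0.0, encoder @ x + bias)
    if k is not None and k < features.size:
        # Indices of every entry except the k largest values.
        drop = np.argpartition(features, -k)[:-k]
        features[drop] = 0.0
    return features


def decode(features, decoder):
    """Linear reconstruction. ASSUMPTION: no decoder bias term."""
    return decoder @ features


# Toy dimensions so the sketch runs without the real weights;
# the real shapes are [131072, 2816] and [2816, 131072].
rng = np.random.default_rng(0)
d_model, n_features = 16, 128
encoder = rng.standard_normal((n_features, d_model)).astype(np.float32)
decoder = rng.standard_normal((d_model, n_features)).astype(np.float32)
bias = np.zeros(n_features, dtype=np.float32)
x = rng.standard_normal(d_model).astype(np.float32)

features = encode(x, encoder, bias, k=8)
x_hat = decode(features, decoder)
rel_error = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"active features: {np.count_nonzero(features)}")
print(f"relative reconstruction error: {rel_error:.3f}")
```

With random toy weights the error value is meaningless; with the trained weights and real activations, the same ratio gives a quick read on reconstruction quality.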

## Feature Map

`feature_map.json` maps 195 human-readable labels to SAE feature indices; the labels were assigned by automated probing. Categories include:

- **Reasoning**: Chain-of-thought, logical deduction, mathematical reasoning
- **Safety**: Refusal, deception detection, bias detection, power-seeking
- **Cognitive**: Creativity, recall, planning, context integration
- **Emotional**: Valence, arousal, emotional tone detection
- **Technical**: Code generation, technical depth, language detection

## Training

The SAE was trained with ErnOSAgent's native SAE training pipeline (`cargo run -- --train-sae`):

1. **Activation Collection**: Extract 2816-dim residual stream vectors from Gemma 4 26B via llama.cpp's native `/embedding` endpoint
2. **Training**: TopK sparse autoencoder trained by gradient descent (k=64 active features per input, learning rate 3e-4)
3. **Probing**: Automated feature labeling via targeted prompt pairs
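
The probing step is described only as "targeted prompt pairs". One plausible reading, shown here as a sketch and not as ErnOSAgent's actual procedure, is to encode a set of prompts that exhibit a concept and a set that do not, then rank features by the difference in mean activation. The function name `rank_features_by_contrast` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def rank_features_by_contrast(pos_features, neg_features, top_n=5):
    """Rank features by mean activation on positive minus negative prompts.

    pos_features, neg_features: arrays of shape [n_prompts, n_features]
    holding SAE feature activations for each prompt set.
    """
    contrast = pos_features.mean(axis=0) - neg_features.mean(axis=0)
    order = np.argsort(contrast)[::-1][:top_n]
    return [(int(i), float(contrast[i])) for i in order]


# Synthetic activations: feature 3 fires strongly only on the "positive" prompts.
rng = np.random.default_rng(0)
pos = rng.random((4, 10)).astype(np.float32) * 0.1
neg = rng.random((4, 10)).astype(np.float32) * 0.1
pos[:, 3] += 1.0

for idx, score in rank_features_by_contrast(pos, neg, top_n=3):
    print(f"Feature {idx}: contrast {score:.3f}")
```

In a real run, `pos` and `neg` would hold the encoded features for each prompt in a pair set, and the top-ranked index would be the candidate for that label.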

## License

MIT — same as ErnOSAgent.

## Citation

```bibtex
@misc{mettamazza2026gemma4sae,
  title={Gemma 4 Sparse Autoencoder for Neural Interpretability},
  author={MettaMazza},
  year={2026},
  url={https://huggingface.co/MettaMazza/gemma4-sae}
}
```