---
license: mit
datasets:
- openai/gsm8k
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- math
- experiment
- moe
- deepseek
- from-scratch
- tiny-model
- cpu
- deepseek-v3-architecture
---

# Axion1-350K-A250K

> **DeepSeek-V3 architecture scaled to ~344k total parameters (~160k active/token) — runs entirely on CPU.**

Built from scratch as a proof-of-concept that the real DeepSeek-V3 architectural innovations (MLA + DeepSeekMoE + auxiliary-loss-free load balancing) work correctly even at extreme miniaturization.

---

## Architecture

This is **not** a distilled or quantized version of DeepSeek. Every component was implemented from scratch in pure PyTorch, faithfully following the DeepSeek-V3 technical report ([arXiv:2412.19437](https://arxiv.org/abs/2412.19437)). Minimal sketches of the MLA compression path and of the bias-based routing follow the table below.

| Component | DeepSeek-V3 | Axion1 |
|---|---|---|
| Attention | MLA (Multi-head Latent Attention) | ✅ Identical MLA |
| FFN | DeepSeekMoE (256 routed experts) | ✅ MoE (4 routed, top-2) |
| Load balancing | Auxiliary-loss-free (dynamic bias) | ✅ Section 2.3.2 |
| Position | RoPE | ✅ RoPE |
| Normalization | RMSNorm | ✅ RMSNorm |
| Activation | SwiGLU | ✅ SwiGLU |
| Total params | 671B | **344k** |
| Active params/token | 37B | **~160k** |
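To make the "Identical MLA" row concrete, here is a minimal, hypothetical sketch of MLA's low-rank KV compression in plain PyTorch, using this card's dimensions (`d_model=64`, `n_heads=4`, `d_head=16`, `kv_lora_rank=8`). The class and attribute names are illustrative, not the actual API in `model.py`, and the report's decoupled RoPE key path and query compression are omitted for brevity.

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Illustrative sketch: MLA's low-rank KV compression only."""

    def __init__(self, d_model=64, n_heads=4, d_head=16, kv_lora_rank=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: this small latent is all a KV cache would need
        # to store (kv_lora_rank floats/token vs. 2 * n_heads * d_head).
        self.w_dkv = nn.Linear(d_model, kv_lora_rank, bias=False)
        # Up-projections expand the latent back to per-head keys/values.
        self.w_uk = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)

    def forward(self, h):                       # h: (batch, seq, d_model)
        c_kv = self.w_dkv(h)                    # (batch, seq, kv_lora_rank)
        b, s, _ = h.shape
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.d_head)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.d_head)
        return c_kv, k, v
```

The payoff of the down-projection is cache size: per token and layer, only the 8-dimensional latent needs to be kept around, and keys and values can be re-expanded from it on the fly.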
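Similarly, a hedged sketch of the auxiliary-loss-free load balancing the table cites (Section 2.3.2 of the report): a per-expert bias enters the routing scores only for top-k *selection*, never for the gate weights, and is nudged after each batch toward uniform expert load. The class name and the update rate `gamma` are assumptions for illustration, not this repo's implementation.

```python
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    def __init__(self, d_model=64, n_routed=4, top_k=2, gamma=1e-3):
        super().__init__()
        # One learned centroid per routed expert.
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model) * 0.02)
        # Balancing bias: adjusted heuristically, never trained by the loss.
        self.register_buffer("bias", torch.zeros(n_routed))
        self.top_k, self.gamma = top_k, gamma

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores = torch.sigmoid(x @ self.centroids.T)   # affinities s_i
        # The bias influences only *which* experts are selected...
        _, idx = torch.topk(scores + self.bias, self.top_k, dim=-1)
        # ...while the gate weights come from the unbiased affinities.
        gates = torch.gather(scores, -1, idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)
        return idx, gates

    @torch.no_grad()
    def update_bias(self, idx):
        # Tokens routed to each expert in this batch.
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel()).float()
        # Push overloaded experts down, pull underloaded ones up.
        self.bias -= self.gamma * torch.sign(load - load.mean())

# Example: route a batch of 32 token embeddings, then rebalance.
router = BiasBalancedRouter()
idx, gates = router(torch.randn(32, 64))
router.update_bias(idx)
```

Calling `update_bias` after each training step makes persistently overloaded experts less attractive at the next selection, without adding any auxiliary loss term to the objective.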
---

## Model Details

```
d_model            : 64
n_layers           : 4
n_heads            : 4 (MLA)
d_head             : 16
kv_lora_rank       : 8 (MLA KV compression)
q_lora_rank        : 16 (MLA Q compression)
n_shared_experts   : 1
n_routed_experts   : 4 (top-2 activated)
d_ff               : 64 (per expert)
vocab_size         : 1024 (BPE, trained on GSM8K)
max_seq_len        : 512
total_params       : 343,616
active_params/tok  : ~160,000
```

---

## Training

- **Dataset:** [GSM8K](https://huggingface.co/datasets/openai/gsm8k) — grade-school math word problems, converted to plain text in a question / reasoning / answer format
- **Tokenizer:** BPE trained from scratch, vocab size 1024
- **Hardware:** AMD Ryzen 5 5600G — CPU only, 12 threads, 32 GB RAM
- **Speed:** ~1,000–1,100 tokens/sec on CPU
- **Epochs:** 20 | **Final val loss:** ~3.2 | **Total time:** ~115 minutes

### Training Curve

| Epoch | Val Loss |
|-------|----------|
| 1 | 5.49 |
| 2 | 4.59 |
| 3 | 4.30 |
| 5 | 3.88 |
| 7 | 3.66 |
| 9 | 3.54 |
| 20 | ~3.2 |

---

## Usage

```python
from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-official/Axion1-350k-A250k",
    trust_remote_code=True
)
model.eval()
tok = BPETokenizer.load("model.vocab", "model.model")

# Blocks EOS and PAD for the first min_tokens generated tokens
class MinNewTokens(LogitsProcessor):
    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
        self.min_tokens = min_tokens
        self.bad = [eos_id, pad_id]
        self.generated = 0

    def __call__(self, input_ids, scores):
        if self.generated < self.min_tokens:
            for bid in self.bad:
                scores[:, bid] = float("-inf")
        self.generated += 1
        return scores

eos_id = tok.token2id["<eos>"]
pad_id = tok.token2id["<pad>"]

prompt = "# Question:\nWhat is 5 + 3?\n--\n# Answer:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
input_ids = torch.tensor([ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=80,
        temperature=0.9,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        eos_token_id=eos_id,
        pad_token_id=pad_id,
        use_cache=False,
        logits_processor=LogitsProcessorList([
            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
        ]),
    )

new_tokens = output[0][len(ids):].tolist()
# Strip the trailing EOS if present
if new_tokens and new_tokens[-1] == eos_id:
    new_tokens = new_tokens[:-1]
print("Answer:", tok.decode(new_tokens))
```

---

## Scaling Roadmap

| Version | Params | Status |
|---------|--------|--------|
| Axion1-v0.1 (this) | 344k | ✅ Released |
| Axion1-v0.2 | ~1.5M | 🔜 Next |
| Axion1-v0.3 | ~6M | 📅 Planned |
| Axion1-v0.4 | ~24M | 📅 Planned |
| Axion1-v0.5 | ~100M | 📅 Planned |

---

## Files

```
├── model.py             # Full DeepSeek-V3 architecture (MLA + MoE)
├── modeling_axion.py    # HuggingFace wrapper
├── config.json          # Model configuration
├── model.safetensors    # Trained weights
├── model.vocab          # BPE vocabulary
└── model.model          # BPE merge rules
```

---

## Limitations

With only 344k parameters, the model has learned mathematical vocabulary and co-occurrence patterns from GSM8K but cannot reliably solve problems or maintain syntactic coherence. This is expected — the purpose of this release is to demonstrate that the DeepSeek-V3 architectural components work correctly at any scale, and to serve as a foundation for the scaling roadmap above.

---

## Citation

```bibtex
@article{deepseekv3,
  title  = {DeepSeek-V3 Technical Report},
  author = {DeepSeek-AI},
  year   = {2024},
  url    = {https://arxiv.org/abs/2412.19437}
}
```

---

## License

MIT — free to use, modify, and build upon.

---

*Made by [AxionLab](https://huggingface.co/AxionLab-official)*