---
license: mit
datasets:
- openai/gsm8k
language:
- en
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- math
- experiment
- moe
- deepseek
- from-scratch
- tiny-model
- cpu
- deepseek-v3-architecture
---

# Axion1-350K-A250K

> **DeepSeek-V3 architecture scaled to ~344k total parameters (~160k active/token) — runs entirely on CPU.**

Built from scratch as a proof-of-concept that the real DeepSeek-V3 architectural innovations (MLA + DeepSeekMoE + auxiliary-loss-free load balancing) work correctly even at extreme miniaturization.

---

## Architecture

This is **not** a distilled or quantized version of DeepSeek. Every component was implemented from scratch in pure PyTorch, faithfully following the DeepSeek-V3 technical report ([arXiv:2412.19437](https://arxiv.org/abs/2412.19437)). Minimal sketches of the MLA compression path and of the bias-based routing follow the table below.

| Component | DeepSeek-V3 | Axion1 |
|---|---|---|
| Attention | MLA (Multi-head Latent Attention) | ✅ Identical MLA |
| FFN | DeepSeekMoE (256 routed experts) | ✅ MoE (4 routed, top-2) |
| Load balancing | Auxiliary-loss-free (dynamic bias) | ✅ Section 2.3.2 |
| Position | RoPE | ✅ RoPE |
| Normalization | RMSNorm | ✅ RMSNorm |
| Activation | SwiGLU | ✅ SwiGLU |
| Total params | 671B | **344k** |
| Active params/token | 37B | **~160k** |
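To make the "Identical MLA" row concrete, here is a minimal, hypothetical sketch of MLA's low-rank KV compression in plain PyTorch, using this card's dimensions (`d_model=64`, `n_heads=4`, `d_head=16`, `kv_lora_rank=8`). The class and attribute names are illustrative, not the actual API in `model.py`, and the report's decoupled RoPE key path and query compression are omitted for brevity.

```python
import torch
import torch.nn as nn

class MLAKVCompression(nn.Module):
    """Illustrative sketch: MLA's low-rank KV compression only."""

    def __init__(self, d_model=64, n_heads=4, d_head=16, kv_lora_rank=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        # Down-projection: this small latent is all a KV cache would need
        # to store (kv_lora_rank floats/token vs. 2 * n_heads * d_head).
        self.w_dkv = nn.Linear(d_model, kv_lora_rank, bias=False)
        # Up-projections expand the latent back to per-head keys/values.
        self.w_uk = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)

    def forward(self, h):                       # h: (batch, seq, d_model)
        c_kv = self.w_dkv(h)                    # (batch, seq, kv_lora_rank)
        b, s, _ = h.shape
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.d_head)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.d_head)
        return c_kv, k, v
```

The payoff of the down-projection is cache size: per token and layer, only the 8-dimensional latent needs to be kept around, and keys and values can be re-expanded from it on the fly.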
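Similarly, a hedged sketch of the auxiliary-loss-free load balancing the table cites (Section 2.3.2 of the report): a per-expert bias enters the routing scores only for top-k *selection*, never for the gate weights, and is nudged after each batch toward uniform expert load. The class name and the update rate `gamma` are assumptions for illustration, not this repo's implementation.

```python
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    def __init__(self, d_model=64, n_routed=4, top_k=2, gamma=1e-3):
        super().__init__()
        # One learned centroid per routed expert.
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model) * 0.02)
        # Balancing bias: adjusted heuristically, never trained by the loss.
        self.register_buffer("bias", torch.zeros(n_routed))
        self.top_k, self.gamma = top_k, gamma

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores = torch.sigmoid(x @ self.centroids.T)   # affinities s_i
        # The bias influences only *which* experts are selected...
        _, idx = torch.topk(scores + self.bias, self.top_k, dim=-1)
        # ...while the gate weights come from the unbiased affinities.
        gates = torch.gather(scores, -1, idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)
        return idx, gates

    @torch.no_grad()
    def update_bias(self, idx):
        # Tokens routed to each expert in this batch.
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel()).float()
        # Push overloaded experts down, pull underloaded ones up.
        self.bias -= self.gamma * torch.sign(load - load.mean())

# Example: route a batch of 32 token embeddings, then rebalance.
router = BiasBalancedRouter()
idx, gates = router(torch.randn(32, 64))
router.update_bias(idx)
```

Calling `update_bias` after each training step makes persistently overloaded experts less attractive at the next selection, without adding any auxiliary loss term to the objective.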
---

## Model Details

```
d_model            : 64
n_layers           : 4
n_heads            : 4 (MLA)
d_head             : 16
kv_lora_rank       : 8 (MLA KV compression)
q_lora_rank        : 16 (MLA Q compression)
n_shared_experts   : 1
n_routed_experts   : 4 (top-2 activated)
d_ff               : 64 (per expert)
vocab_size         : 1024 (BPE, trained on GSM8K)
max_seq_len        : 512
total_params       : 343,616
active_params/tok  : ~160,000
```

---

## Training

- **Dataset:** [GSM8K](https://huggingface.co/datasets/openai/gsm8k) — grade-school math word problems, converted to plain text in a question / reasoning / answer format
- **Tokenizer:** BPE trained from scratch, vocab size 1024
- **Hardware:** AMD Ryzen 5 5600G — CPU only, 12 threads, 32 GB RAM
- **Speed:** ~1,000–1,100 tokens/sec on CPU
- **Epochs:** 20 | **Final val loss:** ~3.2 | **Total time:** ~115 minutes

### Training Curve

| Epoch | Val Loss |
|-------|----------|
| 1 | 5.49 |
| 2 | 4.59 |
| 3 | 4.30 |
| 5 | 3.88 |
| 7 | 3.66 |
| 9 | 3.54 |
| 20 | ~3.2 |

---

## Usage

```python
from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-official/Axion1-350k-A250k",
    trust_remote_code=True
)
model.eval()
tok = BPETokenizer.load("model.vocab", "model.model")

# Blocks EOS and PAD for the first min_tokens generated tokens
class MinNewTokens(LogitsProcessor):
    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
        self.min_tokens = min_tokens
        self.bad = [eos_id, pad_id]
        self.generated = 0

    def __call__(self, input_ids, scores):
        if self.generated < self.min_tokens:
            for bid in self.bad:
                scores[:, bid] = float("-inf")
        self.generated += 1
        return scores

eos_id = tok.token2id["<eos>"]
pad_id = tok.token2id["<pad>"]

prompt = "# Question:\nWhat is 5 + 3?\n--\n# Answer:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
input_ids = torch.tensor([ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=80,
        temperature=0.9,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        eos_token_id=eos_id,
        pad_token_id=pad_id,
        use_cache=False,
        logits_processor=LogitsProcessorList([
            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
        ]),
    )

new_tokens = output[0][len(ids):].tolist()
# Strip the trailing EOS if present
if new_tokens and new_tokens[-1] == eos_id:
    new_tokens = new_tokens[:-1]
print("Answer:", tok.decode(new_tokens))
```

---

## Scaling Roadmap

| Version | Params | Status |
|---------|--------|--------|
| Axion1-v0.1 (this) | 344k | ✅ Released |
| Axion1-v0.2 | ~1.5M | 🔜 Next |
| Axion1-v0.3 | ~6M | 📅 Planned |
| Axion1-v0.4 | ~24M | 📅 Planned |
| Axion1-v0.5 | ~100M | 📅 Planned |

---

## Files

```
├── model.py             # Full DeepSeek-V3 architecture (MLA + MoE)
├── modeling_axion.py    # HuggingFace wrapper
├── config.json          # Model configuration
├── model.safetensors    # Trained weights
├── model.vocab          # BPE vocabulary
└── model.model          # BPE merge rules
```

---

## Limitations

With only 344k parameters, the model has learned mathematical vocabulary and co-occurrence patterns from GSM8K but cannot reliably solve problems or maintain syntactic coherence. This is expected — the purpose of this release is to demonstrate that the DeepSeek-V3 architectural components work correctly at any scale, and to serve as a foundation for the scaling roadmap above.

---

## Citation

```bibtex
@article{deepseekv3,
  title  = {DeepSeek-V3 Technical Report},
  author = {DeepSeek-AI},
  year   = {2024},
  url    = {https://arxiv.org/abs/2412.19437}
}
```

---

## License

MIT — free to use, modify, and build upon.

---

*Made by [AxionLab](https://huggingface.co/AxionLab-official)*