---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- google/gemma-3-1b-it
tags:
- text-generation-inference
extra_gated_prompt: The information you provide will be collected, stored, processed and shared in accordance with the [Embedl Privacy Policy](https://www.embedl.com/privacy-policy).
---
# gemma-3-1b-it-FlashHead

![FlashHead](https://huggingface.co/datasets/embedl/documentation-images/resolve/main/flashhead.png)

[![GitHub](https://img.shields.io/badge/GitHub-flash--head-black?logo=github)](https://github.com/embedl/flash-head)

**Optimized version of gemma-3-1b-it using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**

Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)

FlashHead matches the gemma-3-1b-it baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.

### Quickstart

```bash
pip install flash-head
vllm serve embedl/gemma-3-1b-it-FlashHead
```

---

## Model Details

| **Field** | **Value** |
|------------|------------|
| **Base Model** | gemma-3-1b-it |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head |
| **Developers** | Embedl |
| **Licenses** | Upstream: Gemma Terms of Use. Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **vLLM Plugin Integration** - compatible with **vLLM (0.14.0+)** via the [`flash-head`](https://github.com/embedl/flash-head) plugin.

---

## Performance

Edge Inference Benchmarks for Gemma-3

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Configuration** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 148 | 1.0× |
| **FlashHead (Embedl)** | **178** | **1.20×** |
| W4A16 baseline | 243 | 1.64× |
| **FlashHead W4A16 (Embedl)** | **336** | **2.27×** |

With W4A16 quantization, FlashHead is **1.38×** faster end to end than the state-of-the-art W4A16 baseline (336 vs. 243 tokens/sec), while keeping accuracy close to the unoptimized model.

**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|--------------|-------------|----------------|--------------|
| **Baseline** | 0.15 | 0.55 | 0.38 | 0.31 | 0.42 |
| **FlashHead** | 0.15 | 0.49 | 0.38 | 0.31 | 0.39 |

FlashHead matches the baseline on MMLU-Pro, BBH, and TruthfulQA, and scores lower on IFEval (0.49 vs. 0.55) and GSM8K (0.39 vs. 0.42).

---

## Installation

```bash
pip install flash-head
```

The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required. It activates automatically at startup.

---

## Usage Examples

**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).

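One way to apply the note above is to try a few context lengths in descending order and keep the first one that loads. The sketch below is illustrative only: `load_with_fallback` and `make_flashhead_llm` are hypothetical helper names, the candidate lengths and the `gpu_memory_utilization=0.95` value are arbitrary choices, and catching `ValueError`/`RuntimeError` is an assumption about how the KV cache memory error surfaces in your vLLM version.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    make_llm: Callable[[int], T],
    candidates: Iterable[int] = (131072, 32768, 8192),
) -> Tuple[int, T]:
    """Try each max_model_len in order; return the first that loads."""
    last_error: Optional[Exception] = None
    for max_len in candidates:
        try:
            return max_len, make_llm(max_len)
        except (ValueError, RuntimeError) as err:
            # Assumption: an undersized KV cache raises one of these types.
            last_error = err
    raise RuntimeError("no candidate max_model_len could be loaded") from last_error


def make_flashhead_llm(max_len: int):
    """Hypothetical factory: build the vLLM engine for a given context length."""
    # Imported lazily so load_with_fallback stays usable without vLLM installed.
    from vllm import LLM

    return LLM(
        model="embedl/gemma-3-1b-it-FlashHead",
        trust_remote_code=True,
        max_model_len=max_len,
        gpu_memory_utilization=0.95,  # illustrative value, tune for your GPU
    )
```

Calling `load_with_fallback(make_flashhead_llm)` returns the context length that succeeded together with the engine. A failed attempt may not release all GPU memory before the next one, so on tight VRAM budgets it can be more reliable to pick a single smaller `max_model_len` up front.
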
### vLLM Inference

```python
from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    # Greedy decoding (temperature=0.0), capped at 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Lower max_model_len if the KV cache does not fit in VRAM.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```

---

## Limitations

- Requires **vLLM >= 0.14.0**
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Advanced mixed precision quantization
- Hugging Face Transformers generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Gemma Terms of Use.
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[models@embedl.com](mailto:models@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/flash-head](https://github.com/embedl/flash-head)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [models@embedl.com](mailto:models@embedl.com)