---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-2
model-index:
- name: DragonLLM/Dragon-3B-Base-alpha
  results:
  - task:
      type: multiple-choice-qa
      name: ARC Challenge
    dataset:
      type: ai2_arc
      name: AI2 ARC (Challenge)
      config: ARC-Challenge
      split: test
    metrics:
    - type: accuracy
      name: Test accuracy
      value: 50.00
  - task:
      type: multiple-choice-qa
      name: ARC Easy
    dataset:
      type: ai2_arc
      name: AI2 ARC (Easy)
      config: ARC-Easy
      split: test
    metrics:
    - type: accuracy
      name: Test accuracy
      value: 76.01
  - task:
      type: commonsense-reasoning
      name: HellaSwag
    dataset:
      type: hellaswag
      name: HellaSwag
      split: validation
    metrics:
    - type: accuracy
      name: Acc
      value: 71.73
  - task:
      type: language-modeling
      name: LAMBADA (word prediction)
    dataset:
      type: lambada
      name: LAMBADA
      split: test
    metrics:
    - type: accuracy
      name: Acc
      value: 65.03
  - task:
      type: commonsense-reasoning
      name: PIQA
    dataset:
      type: piqa
      name: PIQA
      split: validation
    metrics:
    - type: accuracy
      name: Acc
      value: 79.11
  - task:
      type: information-extraction
      name: SWDE
    dataset:
      type: swde
      name: SWDE
      split: test
    metrics:
    - type: accuracy
      name: Acc
      value: 89.92
  - task:
      type: classification
      name: FDA
    dataset:
      type: fda
      name: FDA
      split: test
    metrics:
    - type: accuracy
      name: Acc
      value: 81.13
---

## Highlights

Dragon LLM introduces its new LLM architecture. Built on a new hybrid GDN-Transformer design that outperforms traditional architectures, it can power frugal, sovereign models that can be rapidly specialized on business data and use cases.

The Dragon architecture features:
- A very strong ability to remember past words in the sequence compared to other hybrid approaches, inspired by Hymba (NVIDIA)
- The ability to serve more users simultaneously on equivalent hardware, with better throughput in long-context scenarios
- Extremely efficient learning

It has been validated at large scale by training a 3B model on 3.5T tokens. That model achieves performance comparable to SmolLM3-3B-Base and Qwen3-4B-Base on ARC, HellaSwag, LAMBADA, and PIQA, while being trained on 3-5 times less data.

Why is this important?
- **Proven performance**: the same performance with 3–5× less data.
- **Cuts cost**: more users can be served on the same hardware.
- Ability to deploy in secure environments with hardware constraints (even on CPU).
- **Scales better**: higher throughput and strong long-context handling (long documents, files, code, or contracts).

How has Dragon LLM achieved this?
- By combining the best recent research on LLM architectures, accumulating gains across the whole stack, from deep-layer optimization to attention heads and KV-cache management.
- An agile team able to adapt quickly and test new ideas extremely fast.
- Compute support from the EU Commission (EuroHPC - JUPITER and Leonardo HPC).

What's next?
The next step is to deliver foundation models for this architecture:
- 3B and 7B versions of DragonBase trained on 10T+ tokens
- Chat versions of these models
- Specialized versions for specific industry verticals such as Finance

If you want to know more and get updates on the project, follow us!

If you would like a comprehensive deep dive on the architecture: [read our blog post](https://open.substack.com/pub/dragonllm/p/inside-dragons-architecture?r=3j0al4&utm_campaign=post&utm_medium=web)

## Model Overview



## Model Benchmark

|Benchmarks |Dragon |Qwen3-4B |SmolLM3|
|----|----|----|----|
|ARC Challenge |50% |51.28% |**52.56%**|
|ARC Easy |76.01% |75.97% |**76.81%**|
|HellaSwag |71.73% |54.46% |**75.2%**|
|LAMBADA |65.03% |62.62% |**65.05%**|
|PIQA |**79.11%** |77.86% |78.84%|
|SWDE |89.92% |**91.99%** |88.03%|
|FDA |81.13% |**86.75%** |76.13%|
|Average |**73.27%** |71.56% |73.23%|

All evaluations are performed with lm-eval, with the number of few-shot examples set to 0.

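Below is a minimal reproduction sketch (not from the original card) using the Python API of lm-evaluation-harness. The harness version, the task names, and the `model_args` string are assumptions and may differ from the exact configurations used for this card; the SWDE and FDA tasks may require additional task definitions and are omitted here.

```python
# Hypothetical zero-shot evaluation sketch, assuming lm-eval >= 0.4 is installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DragonLLM/Dragon-3B-Base-alpha,trust_remote_code=True,dtype=auto",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "lambada_openai", "piqa"],
    num_fewshot=0,
    batch_size="auto",
)

# Print the per-task metrics returned by the harness.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```
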
## Limitations

This model is a foundation model, trained on large-scale general-purpose text corpora. It has not been fine-tuned for any specific downstream task. As such:

- It may produce inaccurate or misleading information, particularly for factual or time-sensitive queries.
- It has no understanding of truth or intent and may generate biased, toxic, or harmful content inherited from its training data.
- It is not suitable for direct use in safety-critical or decision-making contexts (e.g., healthcare, finance, law) without additional alignment or validation.
- It does not perform well on tasks requiring domain-specific expertise, numerical precision, or structured reasoning unless further fine-tuned.
- Long or complex prompts may lead to loss of coherence or hallucinations as context length grows.

Fine-tuning, prompt engineering, or evaluation on downstream tasks is recommended before any production use.

## Quickstart

Try it with:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "DragonLLM/Dragon-3B-Base-alpha"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Once upon a time, a valiant knight named Segurant set out on a quest to chase a dragon. He was"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
)

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

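If you prefer to see tokens as they are produced rather than waiting for the full completion, here is a small optional sketch (not from the original card) that reuses the `tokenizer`, `model`, and `inputs` objects from the snippet above with the standard `transformers` streaming helper:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=512, streamer=streamer)
```
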
## Setup

For better performance on GPU, we recommend installing:
- [flash-linear-attention](https://github.com/fla-org/flash-linear-attention): the Gated DeltaNet Triton kernels.
  Install with `pip install flash-linear-attention`

If you use an NVIDIA GPU, you can further improve performance with:
- [flash-attention](https://github.com/Dao-AILab/flash-attention): fast attention kernels.
  Install with `pip install flash-attn --no-build-isolation`
- [causal-conv1d](https://github.com/Dao-AILab/causal-conv1d): a short convolution used as part of the Gated DeltaNet layer.
  Install with `pip install causal-conv1d`
- (optional, recommended only for A100) [flex-head-fa](https://github.com/xiayuqing0622/flex_head_fa): computes attention with different head dimensions for QK and VO, used for differential attention.
  Install with `pip install flex-head-fa --no-build-isolation`
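
As a quick sanity check, the sketch below (not from the original card) reports which of these optional packages are importable in the current environment; the assumed import names `fla`, `flash_attn`, `causal_conv1d`, and `flex_head_fa` may differ from the actual module names. The model still runs without them, only more slowly.

```python
import importlib.util

# Optional acceleration packages (assumed import names) mapped to their pip names.
optional_kernels = {
    "fla": "flash-linear-attention",
    "flash_attn": "flash-attn",
    "causal_conv1d": "causal-conv1d",
    "flex_head_fa": "flex-head-fa",
}

for module_name, pip_name in optional_kernels.items():
    found = importlib.util.find_spec(module_name) is not None
    print(f"{pip_name}: {'installed' if found else 'not installed'}")
```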