# Sunflower-Speech
A speech-native large language model for Ugandan languages. It processes audio directly, without a cascaded ASR stage, enabling real-time spoken interaction with a 55 ms time-to-first-token. Includes GRPO post-training for improved transcription fidelity.
## Architecture
Sunflower-Speech combines three components:
- Audio encoder: Whisper Large V3 fine-tuned on 10 Ugandan languages
- Multimodal projector: 2-layer MLP with 8× frame stacking (Ultravox architecture)
- Language model: Sunflower-32B (Qwen 3 32B adapted for Ugandan languages)
The projector maps Whisper encoder outputs directly into the LLM's embedding space, bypassing explicit transcription and eliminating cascade latency.
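The dataflow above can be sketched in a few lines. This is a minimal NumPy illustration of an Ultravox-style projector (stack 8 consecutive encoder frames, then apply a 2-layer MLP into the LLM embedding space); the dimensions, activation, and random weights are illustrative assumptions, not the released checkpoint.

```python
import numpy as np

def project_audio_features(frames: np.ndarray, w1: np.ndarray, w2: np.ndarray,
                           stack: int = 8) -> np.ndarray:
    """Ultravox-style projector sketch: stack `stack` consecutive encoder
    frames, then map through a 2-layer MLP. Activation is an assumption."""
    t, d = frames.shape
    t = (t // stack) * stack                             # drop remainder frames
    stacked = frames[:t].reshape(t // stack, stack * d)  # (T/8, 8*d)
    hidden = np.maximum(stacked @ w1, 0.0)               # ReLU (assumed)
    return hidden @ w2                                   # (T/8, llm_dim)

rng = np.random.default_rng(0)
enc_dim, hidden_dim, llm_dim = 1280, 4096, 5120  # illustrative sizes
frames = rng.standard_normal((100, enc_dim))     # fake Whisper encoder output
w1 = rng.standard_normal((enc_dim * 8, hidden_dim)) * 0.01
w2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.01
emb = project_audio_features(frames, w1, w2)     # shape (12, 5120)
```

Frame stacking reduces the audio sequence length 8× before it enters the LLM, which is part of what keeps time-to-first-token low.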
## Supported Languages
| Code | Language |
|---|---|
| en | English |
| lug | Luganda |
| ach | Acholi |
| lgg | Lugbara |
| teo | Ateso |
| nyn | Runyankole |
| myx | Lumasaba |
| xog | Lusoga |
| sw | Swahili |
| rw | Kinyarwanda |
## Performance
### Speech Translation (BLEU ↑)
| Direction | Sunflower-Speech | Cascaded | Text-only |
|---|---|---|---|
| lug → eng | 38.2 | 38.3 | 42.1 |
| eng → lug | 30.9 | 31.3 | 34.3 |
| nyn → eng | 22.8 | 24.0 | 26.0 |
| ach → eng | 19.9 | 24.0 | 24.0 |
| lgg → eng | 18.4 | 17.0 | 23.0 |
| teo → eng | 18.2 | 17.5 | 26.0 |
| eng → nyn | 17.4 | 18.0 | 18.5 |
| eng → lgg | 18.3 | 17.5 | 20.0 |
| eng → ach | 16.6 | 16.8 | 17.5 |
| eng → teo | 15.0 | 16.0 | 17.0 |
| Average | 21.6 | 22.0 | 24.8 |
Cascaded: Whisper transcription → Sunflower translation. Text-only: Sunflower translating ground-truth text (oracle upper bound).
### Transcription (WER ↓, median %)
| Language | Sunflower-Speech | + GRPO | Whisper |
|---|---|---|---|
| English | 0.0 | 0.0 | 0.0 |
| Luganda | 22.6 | 16.7 | 5.0 |
| Kinyarwanda | 27.8 | 28.2 | 1.0 |
| Acholi | 41.0 | 36.9 | 16.8 |
| Runyankole | 43.5 | 37.5 | 19.2 |
| Lugbara | 55.5 | 45.8 | 15.8 |
| Lusoga | 58.3 | 50.0 | 28.6 |
| Ateso | 62.5 | 58.6 | 28.6 |
Whisper baseline uses a standard encoder-decoder trained on the same data. GRPO post-training reduces the model's tendency to paraphrase or respond conversationally.
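The GRPO reward used in post-training is not specified here; one plausible shaping, sketched below, scores candidate transcripts by negative word error rate so that verbatim output outranks paraphrases or conversational replies. The `wer` function is a standard word-level Levenshtein distance; the reward form itself is an assumption for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

def transcription_reward(reference: str, candidate: str) -> float:
    """Hypothetical GRPO reward: verbatim transcripts score highest."""
    return -wer(reference, candidate)

# A verbatim candidate outscores a conversational paraphrase
ref = "webale nnyo ssebo"
assert transcription_reward(ref, ref) > transcription_reward(
    ref, "the speaker says thank you very much sir")
```

A reward of this shape directly penalizes the paraphrasing behavior the section describes, since any deviation from the reference words increases WER.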
### Latency (A100-80GB)
| Metric | Value |
|---|---|
| TTFT (p50) | 55 ms |
| TTFT (p90) | 61 ms |
| Cascaded baseline | >1 s |
## Usage
```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("Sunbird/Sunflower-Speech", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Sunbird/Sunflower-Speech")

# Load audio (16 kHz)
inputs = processor(audio=audio_array, sampling_rate=16000, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.decode(outputs[0], skip_special_tokens=True)
```
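The processor expects 16 kHz mono audio, so recordings at other sample rates need resampling first. Below is a minimal, dependency-free sketch using linear interpolation in NumPy; for production quality, a polyphase resampler (e.g. `scipy.signal.resample_poly` or torchaudio) is preferable.

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample mono audio to target_sr via linear interpolation.

    Quick-and-dirty sketch: fine for experiments, but use a proper
    polyphase resampler for quality-sensitive work.
    """
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_new, t_orig, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16,000 samples
audio_441 = np.random.default_rng(0).standard_normal(44100).astype(np.float32)
audio_array = resample_linear(audio_441, 44100)
```

The resulting `audio_array` can be passed to the processor as shown above.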
## Intended Use
- Real-time speech translation for Ugandan languages
- Voice-based instruction following and Q&A
- Accessible interfaces for primarily spoken languages
## Limitations
- Transcription accuracy lower than standalone Whisper (trades accuracy for latency and flexibility)
- May paraphrase or respond conversationally instead of transcribing verbatim
- Requires A100-80GB for full-precision inference
- Evaluated on single-turn interactions only
## Hardware Requirements
| Precision | VRAM |
|---|---|
| FP16 | ~80 GB |
| INT4 (GPTQ/AWQ) | ~20 GB |
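The VRAM figures above are roughly consistent with a back-of-envelope estimate for a 32B-parameter model: weights at 2 bytes/parameter in FP16 versus 0.5 bytes/parameter in INT4, plus a flat allowance for activations, KV cache, and runtime overhead. The overhead constant below is an assumption for illustration, not a measured value.

```python
def vram_estimate_gb(n_params_billion: float, bytes_per_param: float,
                     overhead_gb: float = 6.0) -> float:
    """Rough VRAM estimate: weight memory plus a flat allowance for
    activations, KV cache, and runtime overhead (assumed, not measured)."""
    return n_params_billion * bytes_per_param + overhead_gb

fp16_gb = vram_estimate_gb(32, 2.0)  # ~70 GB: fits an A100-80GB
int4_gb = vram_estimate_gb(32, 0.5)  # ~22 GB: within a single 24 GB card
```

Actual usage depends on sequence length, batch size, and the serving stack, so treat these as order-of-magnitude checks.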
## Citation
```bibtex
@inproceedings{akera2026realtime,
  title     = {Real-Time Spoken Instruction Following and Translation in Ugandan Languages},
  author    = {Akera, Benjamin and Hu, Tim Wenjie and Walukagga, Patrick and Ouma, Evelyn Nafula and Gilbert, Yiga and Mwebaze, Ernest Tonny and Quinn, John},
  booktitle = {Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP)},
  year      = {2026}
}
```
## Acknowledgments
Developed by Sunbird AI. Built on the Ultravox architecture.