
Sunflower-Speech

A speech-native large language model for Ugandan languages. It processes audio directly, without a cascaded ASR stage, enabling real-time spoken interaction at 55 ms time-to-first-token. Includes GRPO post-training for improved transcription fidelity.

Architecture

[Figure: Sunflower-Speech architecture diagram]

Sunflower-Speech combines three components:

  • Audio encoder: Whisper Large V3 fine-tuned on 10 Ugandan languages
  • Multimodal projector: 2-layer MLP with 8× frame stacking (Ultravox architecture)
  • Language model: Sunflower-32B (Qwen 3 32B adapted for Ugandan languages)

The projector maps Whisper encoder outputs directly into the LLM's embedding space, bypassing explicit transcription and eliminating cascade latency.
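The effect of frame stacking on sequence length can be sketched as follows. This is an illustrative toy, not the actual Ultravox code: stack_frames and the padding behavior are assumptions, but the frame counts follow from Whisper Large V3's known geometry (1500 encoder frames per 30 s clip, d_model = 1280).

```python
# Illustrative sketch (not the actual Ultravox implementation): 8x frame
# stacking concatenates every 8 consecutive encoder frames into one vector,
# cutting the audio sequence length by 8x before the 2-layer MLP projector.

def stack_frames(frames, stack_factor=8):
    """Group consecutive encoder frames; zero-pad the tail to a multiple of stack_factor."""
    pad = (-len(frames)) % stack_factor
    frames = frames + [[0.0] * len(frames[0])] * pad
    return [
        sum(frames[i : i + stack_factor], [])  # concatenate along the feature dim
        for i in range(0, len(frames), stack_factor)
    ]

# Whisper Large V3 emits 1500 frames for a 30 s clip (d_model = 1280);
# stacking by 8 yields 188 audio tokens for the LLM.
encoder_frames = [[0.0] * 1280 for _ in range(1500)]
stacked = stack_frames(encoder_frames)
print(len(stacked), len(stacked[0]))  # 188 tokens, each 8 * 1280 = 10240 dims
```

The 10240-dim stacked vectors are what the 2-layer MLP projects down to the LLM's embedding width.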

Supported Languages

Code   Language
en     English
lug    Luganda
ach    Acholi
lgg    Lugbara
teo    Ateso
nyn    Runyankole
myx    Lumasaba
xog    Lusoga
sw     Swahili
rw     Kinyarwanda

Performance

Speech Translation (BLEU ↑)

Direction    Sunflower-Speech    Cascaded    Text-only
lug → eng    38.2                38.3        42.1
eng → lug    30.9                31.3        34.3
nyn → eng    22.8                24.0        26.0
ach → eng    19.9                24.0        24.0
lgg → eng    18.4                17.0        23.0
teo → eng    18.2                17.5        26.0
eng → nyn    17.4                18.0        18.5
eng → lgg    18.3                17.5        20.0
eng → ach    16.6                16.8        17.5
eng → teo    15.0                16.0        17.0
Average      21.6                22.0        24.8

Cascaded: Whisper transcription → Sunflower translation. Text-only: Sunflower translating ground-truth text (oracle upper bound).
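The Average row in the table above is the unweighted mean of the ten per-direction scores, which can be verified directly:

```python
# Sanity check of the Average row: unweighted mean of the ten per-direction
# BLEU scores for each system, taken from the table above.
speech    = [38.2, 30.9, 22.8, 19.9, 18.4, 18.2, 17.4, 18.3, 16.6, 15.0]
cascaded  = [38.3, 31.3, 24.0, 24.0, 17.0, 17.5, 18.0, 17.5, 16.8, 16.0]
text_only = [42.1, 34.3, 26.0, 24.0, 23.0, 26.0, 18.5, 20.0, 17.5, 17.0]

for name, scores in [("speech", speech), ("cascaded", cascaded), ("text-only", text_only)]:
    print(name, round(sum(scores) / len(scores), 1))  # 21.6, 22.0, 24.8
```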

Transcription (WER ↓, median %)

Language      Sunflower-Speech    + GRPO    Whisper
English       0.0                 0.0       0.0
Luganda       22.6                16.7      5.0
Kinyarwanda   27.8                28.2      1.0
Acholi        41.0                36.9      16.8
Runyankole    43.5                37.5      19.2
Lugbara       55.5                45.8      15.8
Lusoga        58.3                50.0      28.6
Ateso         62.5                58.6      28.6

Whisper baseline uses a standard encoder-decoder trained on the same data. GRPO post-training reduces the model's tendency to paraphrase or respond conversationally.
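For reference, word error rate is the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A minimal self-contained implementation (in practice a library such as jiwer would be used):

```python
# Minimal word-level WER: Levenshtein distance between reference and
# hypothesis word sequences, normalized by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("nkwagala nnyo", "nkwagala nyo"))  # 1 substitution / 2 words = 0.5
```

Verbatim-transcription failures of the kind GRPO targets (paraphrasing, conversational replies) show up as large insertion/substitution counts under this metric.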

Latency (A100-80GB)

Metric              Value
TTFT (p50)          55 ms
TTFT (p90)          61 ms
Cascaded baseline   >1 s
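Percentile latencies like the p50/p90 figures above are derived from per-request time-to-first-token samples. A sketch with hypothetical measurements (the sample values below are not the actual benchmark data):

```python
# Computing p50/p90 TTFT from raw per-request measurements.
import statistics

# Hypothetical TTFT samples in milliseconds, for illustration only.
ttft_ms = [52, 54, 55, 55, 56, 57, 58, 59, 60, 62]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
q = statistics.quantiles(ttft_ms, n=100, method="inclusive")
p50, p90 = q[49], q[89]
print(f"TTFT p50 = {p50:.1f} ms, p90 = {p90:.1f} ms")
```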

Usage

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("Sunbird/Sunflower-Speech", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Sunbird/Sunflower-Speech")

# audio_array: mono float waveform sampled at 16 kHz
# (e.g. loaded with librosa.load(path, sr=16000))
inputs = processor(audio=audio_array, sampling_rate=16000, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.decode(outputs[0], skip_special_tokens=True)
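The snippet above assumes audio_array is already a 16 kHz mono float waveform. One stdlib-only way to obtain such an array from a 16-bit PCM WAV file (load_wav_16k is an illustrative helper, not part of the model's API; in practice librosa or soundfile is more convenient and handles resampling):

```python
# Stdlib-only loader for a 16 kHz, mono, 16-bit PCM WAV file, producing
# the list of floats in [-1, 1) that the processor expects as audio input.
import struct
import wave

def load_wav_16k(path):
    with wave.open(path, "rb") as f:
        assert f.getframerate() == 16000, "resample to 16 kHz first"
        assert f.getnchannels() == 1 and f.getsampwidth() == 2, "expect mono 16-bit PCM"
        raw = f.readframes(f.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]  # int16 -> float in [-1, 1)
```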

Intended Use

  • Real-time speech translation for Ugandan languages
  • Voice-based instruction following and Q&A
  • Accessible interfaces for primarily spoken languages

Limitations

  • Transcription accuracy lower than standalone Whisper (trades accuracy for latency and flexibility)
  • May paraphrase or respond conversationally instead of transcribing verbatim
  • Requires A100-80GB for full-precision inference
  • Evaluated on single-turn interactions only

Hardware Requirements

Precision         VRAM
FP16              ~80 GB
INT4 (GPTQ/AWQ)   ~20 GB
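A back-of-envelope check of these figures: weight memory is roughly parameter count times bytes per parameter, and the table's numbers add headroom on top of that.

```python
# Rough VRAM arithmetic: weights alone, before encoder, projector,
# KV cache, and activation overhead.
params = 32e9  # the Sunflower-32B language model dominates the footprint

fp16_gb = params * 2 / 1e9    # 2 bytes/param   -> ~64 GB weights
int4_gb = params * 0.5 / 1e9  # 0.5 bytes/param -> ~16 GB weights

print(f"FP16 weights: ~{fp16_gb:.0f} GB, INT4 weights: ~{int4_gb:.0f} GB")
```

The gap between ~64 GB of FP16 weights and the ~80 GB requirement is the headroom for the Whisper encoder, projector, KV cache, and activations.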

Citation

@inproceedings{akera2026realtime,
  title={Real-Time Spoken Instruction Following and Translation in Ugandan Languages},
  author={Akera, Benjamin and Hu, Tim Wenjie and Walukagga, Patrick and Ouma, Evelyn Nafula and Gilbert, Yiga and Mwebaze, Ernest Tonny and Quinn, John},
  booktitle={Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP)},
  year={2026}
}

Acknowledgments

Developed by Sunbird AI. Built on the Ultravox architecture.
