# Sunflower-Speech
A speech-native large language model for Ugandan languages. It processes audio directly, without a cascaded ASR stage, enabling real-time spoken interaction with a 55 ms time-to-first-token. Includes GRPO post-training for improved transcription fidelity.
## Architecture
Sunflower-Speech combines three components:
- Audio encoder: Whisper Large V3 fine-tuned on 10 Ugandan languages
- Multimodal projector: 2-layer MLP with 8× frame stacking (Ultravox architecture)
- Language model: Sunflower-32B (Qwen 3 32B adapted for Ugandan languages)
The projector maps Whisper encoder outputs directly into the LLM's embedding space, bypassing explicit transcription and eliminating cascade latency.
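The dataflow above can be sketched in a few lines. This is a minimal NumPy illustration of an Ultravox-style projector (stack 8 consecutive encoder frames, then apply a 2-layer MLP into the LLM embedding space); the dimensions, activation, and random weights are illustrative assumptions, not the released checkpoint.

```python
import numpy as np

def project_audio_features(frames: np.ndarray, w1: np.ndarray, w2: np.ndarray,
                           stack: int = 8) -> np.ndarray:
    """Ultravox-style projector sketch: stack `stack` consecutive encoder
    frames, then map through a 2-layer MLP. Activation is an assumption."""
    t, d = frames.shape
    t = (t // stack) * stack                             # drop remainder frames
    stacked = frames[:t].reshape(t // stack, stack * d)  # (T/8, 8*d)
    hidden = np.maximum(stacked @ w1, 0.0)               # ReLU (assumed)
    return hidden @ w2                                   # (T/8, llm_dim)

rng = np.random.default_rng(0)
enc_dim, hidden_dim, llm_dim = 1280, 4096, 5120  # illustrative sizes
frames = rng.standard_normal((100, enc_dim))     # fake Whisper encoder output
w1 = rng.standard_normal((enc_dim * 8, hidden_dim)) * 0.01
w2 = rng.standard_normal((hidden_dim, llm_dim)) * 0.01
emb = project_audio_features(frames, w1, w2)     # shape (12, 5120)
```

Frame stacking reduces the audio sequence length 8× before it enters the LLM, which is part of what keeps time-to-first-token low.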
## Supported Languages
| Code | Language |
|---|---|
| en | English |
| lug | Luganda |
| ach | Acholi |
| lgg | Lugbara |
| teo | Ateso |
| nyn | Runyankole |
| myx | Lumasaba |
| xog | Lusoga |
| sw | Swahili |
| rw | Kinyarwanda |
## Performance
### Speech Translation (BLEU ↑)
| Direction | Sunflower-Speech | Cascaded | Text-only |
|---|---|---|---|
| lug → eng | 38.2 | 38.3 | 42.1 |
| eng → lug | 30.9 | 31.3 | 34.3 |
| nyn → eng | 22.8 | 24.0 | 26.0 |
| ach → eng | 19.9 | 24.0 | 24.0 |
| lgg → eng | 18.4 | 17.0 | 23.0 |
| teo → eng | 18.2 | 17.5 | 26.0 |
| eng → nyn | 17.4 | 18.0 | 18.5 |
| eng → lgg | 18.3 | 17.5 | 20.0 |
| eng → ach | 16.6 | 16.8 | 17.5 |
| eng → teo | 15.0 | 16.0 | 17.0 |
| Average | 21.6 | 22.0 | 24.8 |
Cascaded: Whisper transcription → Sunflower translation. Text-only: Sunflower translating ground-truth text (oracle upper bound).
### Transcription (WER ↓, median %)
| Language | Sunflower-Speech | + GRPO | Whisper |
|---|---|---|---|
| English | 0.0 | 0.0 | 0.0 |
| Luganda | 22.6 | 16.7 | 5.0 |
| Kinyarwanda | 27.8 | 28.2 | 1.0 |
| Acholi | 41.0 | 36.9 | 16.8 |
| Runyankole | 43.5 | 37.5 | 19.2 |
| Lugbara | 55.5 | 45.8 | 15.8 |
| Lusoga | 58.3 | 50.0 | 28.6 |
| Ateso | 62.5 | 58.6 | 28.6 |
Whisper baseline uses a standard encoder-decoder trained on the same data. GRPO post-training reduces the model's tendency to paraphrase or respond conversationally.
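The GRPO reward used in post-training is not specified here; one plausible shaping, sketched below, scores candidate transcripts by negative word error rate so that verbatim output outranks paraphrases or conversational replies. The `wer` function is a standard word-level Levenshtein distance; the reward form itself is an assumption for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

def transcription_reward(reference: str, candidate: str) -> float:
    """Hypothetical GRPO reward: verbatim transcripts score highest."""
    return -wer(reference, candidate)

# A verbatim candidate outscores a conversational paraphrase
ref = "webale nnyo ssebo"
assert transcription_reward(ref, ref) > transcription_reward(
    ref, "the speaker says thank you very much sir")
```

A reward of this shape directly penalizes the paraphrasing behavior the section describes, since any deviation from the reference words increases WER.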
### Latency (A100-80GB)
| Metric | Value |
|---|---|
| TTFT (p50) | 55 ms |
| TTFT (p90) | 61 ms |
| Cascaded baseline | >1 s |
## Usage
```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("Sunbird/Sunflower-Speech", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Sunbird/Sunflower-Speech")

# Load audio (16 kHz)
inputs = processor(audio=audio_array, sampling_rate=16000, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.decode(outputs[0], skip_special_tokens=True)
```
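The processor expects 16 kHz mono audio, so recordings at other sample rates need resampling first. Below is a minimal, dependency-free sketch using linear interpolation in NumPy; for production quality, a polyphase resampler (e.g. `scipy.signal.resample_poly` or torchaudio) is preferable.

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample mono audio to target_sr via linear interpolation.

    Quick-and-dirty sketch: fine for experiments, but use a proper
    polyphase resampler for quality-sensitive work.
    """
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    t_orig = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_new, t_orig, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16,000 samples
audio_441 = np.random.default_rng(0).standard_normal(44100).astype(np.float32)
audio_array = resample_linear(audio_441, 44100)
```

The resulting `audio_array` can be passed to the processor as shown above.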
## Intended Use
- Real-time speech translation for Ugandan languages
- Voice-based instruction following and Q&A
- Accessible interfaces for primarily spoken languages
## Limitations
- Transcription accuracy lower than standalone Whisper (trades accuracy for latency and flexibility)
- May paraphrase or respond conversationally instead of transcribing verbatim
- Requires A100-80GB for full-precision inference
- Evaluated on single-turn interactions only
## Hardware Requirements
| Precision | VRAM |
|---|---|
| FP16 | ~80 GB |
| INT4 (GPTQ/AWQ) | ~20 GB |
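The VRAM figures above are roughly consistent with a back-of-envelope estimate for a 32B-parameter model: weights at 2 bytes/parameter in FP16 versus 0.5 bytes/parameter in INT4, plus a flat allowance for activations, KV cache, and runtime overhead. The overhead constant below is an assumption for illustration, not a measured value.

```python
def vram_estimate_gb(n_params_billion: float, bytes_per_param: float,
                     overhead_gb: float = 6.0) -> float:
    """Rough VRAM estimate: weight memory plus a flat allowance for
    activations, KV cache, and runtime overhead (assumed, not measured)."""
    return n_params_billion * bytes_per_param + overhead_gb

fp16_gb = vram_estimate_gb(32, 2.0)  # ~70 GB: fits an A100-80GB
int4_gb = vram_estimate_gb(32, 0.5)  # ~22 GB: within a single 24 GB card
```

Actual usage depends on sequence length, batch size, and the serving stack, so treat these as order-of-magnitude checks.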
## Citation
```bibtex
@inproceedings{akera2026realtime,
  title     = {Real-Time Spoken Instruction Following and Translation in Ugandan Languages},
  author    = {Akera, Benjamin and Hu, Tim Wenjie and Walukagga, Patrick and Ouma, Evelyn Nafula and Gilbert, Yiga and Mwebaze, Ernest Tonny and Quinn, John},
  booktitle = {Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP)},
  year      = {2026}
}
```
## Acknowledgments
Developed by Sunbird AI. Built on the Ultravox architecture.