AfriHuBERT: Kikuyu + Luhya ASR

Model Card for afrihubert-kikuyu-luhya

A multilingual automatic speech recognition (ASR) model for Kikuyu and Luhya, two Bantu languages spoken primarily in Kenya. Built on top of AfriHuBERT using a two-stage fine-tuning approach with LoRA adaptation to prevent catastrophic forgetting.

Model Details

Model Description

This model performs automatic speech recognition for two Kenyan Bantu languages, Kikuyu and Luhya. It was fine-tuned from AfriHuBERT using a two-stage approach: first fully fine-tuned on Kikuyu, then LoRA-adapted jointly on both Kikuyu and Luhya to add Luhya support without catastrophic forgetting of Kikuyu.

  • Developed by: Joan Kinoti
  • Model type: Automatic Speech Recognition (CTC)
  • Language(s): Kikuyu (ki), Luhya (luy)
  • License: MIT
  • Finetuned from: ajesujoba/AfriHuBERT

Direct Use

Transcribe spoken Kikuyu or Luhya audio to text. Suitable for:

  • Voice interfaces for Kenyan languages
  • Transcription pipelines for Kikuyu and Luhya audio content
  • Research on low-resource African language ASR

Downstream Use [optional]

Can be integrated into larger pipelines for translation, keyword spotting, or voice-controlled applications targeting Kikuyu and Luhya speakers.

Bias, Risks, and Limitations

  • Trained on a limited dataset โ€” performance may degrade on out-of-domain speakers or recording conditions
  • Character-level vocabulary may struggle with loanwords or proper nouns not seen during training
  • Performance may vary across dialects within Kikuyu and Luhya

How to Get Started with the Model

Use the code below to get started with the model.

import torch
import soundfile as sf
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained("JoanKinoti/afrihubert-kikuyu-luhya")
processor = Wav2Vec2Processor.from_pretrained("JoanKinoti/afrihubert-kikuyu-luhya")
model.eval()

def transcribe(audio_path: str) -> str:
    audio, sr = sf.read(audio_path)
    # Downmix stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Resample to 16 kHz if needed
    if sr != 16000:
        import torchaudio
        audio = torchaudio.functional.resample(
            torch.tensor(audio, dtype=torch.float32).unsqueeze(0), sr, 16000
        ).squeeze(0).numpy()
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]

print(transcribe("your_audio.wav"))

Training Details

Training Data

Kikuyu: MCAA1-MSU/anv_data_ke, used for the initial full fine-tune on Kikuyu speech.

Luhya: Mozilla Data Collective (Luhya), used jointly with Kikuyu during the LoRA adaptation stage.

Both datasets consist of audio recordings with corresponding transcriptions. Audio was resampled to 16kHz mono and transcriptions were character-level tokenized using a shared 37-token vocabulary covering both languages.
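The shared character vocabulary can be built by collecting every character that appears in the transcriptions of both languages. A minimal sketch of that idea follows; the special-token names and the sample strings are illustrative assumptions, not the released model's exact 37 tokens:

```python
def build_char_vocab(transcriptions):
    """Collect every character seen across both languages' training text."""
    chars = set()
    for text in transcriptions:
        chars.update(text.lower())
    # CTC-style vocabularies typically reserve a pad/blank token and
    # replace the space with an explicit word-boundary token ("|").
    vocab = {"<pad>": 0, "<unk>": 1, "|": 2}
    for ch in sorted(chars):
        if ch == " ":
            continue
        vocab.setdefault(ch, len(vocab))
    return vocab

# Placeholder sample sentences (not from the actual datasets):
vocab = build_char_vocab(["nĩ wega mũno", "mulembe muno"])
```

In practice the vocabulary would be built from the full transcription set of both corpora, so characters unique to either language (such as Kikuyu's ĩ and ũ) end up in the same joint token table.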

Training Procedure

Stage 1 (full fine-tune on Kikuyu): AfriHuBERT was fully fine-tuned on Kikuyu speech data, producing a strong Kikuyu ASR checkpoint.

Stage 2 (joint LoRA adaptation on Kikuyu + Luhya): LoRA adapters were applied on top of the Kikuyu checkpoint and trained jointly on both languages using balanced sampling to prevent catastrophic forgetting. The final LoRA adapters were merged into the base model for deployment.

Preprocessing

  • Audio resampled to 16kHz mono
  • Peak normalization with soft tanh clipping
  • Maximum audio length: 20 seconds
  • Character-level tokenization with 37-token joint vocabulary

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • LM head learning rate: 1e-3
  • LoRA layers learning rate: 3e-5
  • Warmup ratio: 0.1
  • Training steps: 43,200
  • Trainable parameters: 618,277 (~0.65% of total)
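The two learning rates imply two optimizer parameter groups: one for the CTC head and one for the LoRA layers. A minimal sketch of that setup follows; matching parameters by the substrings "lm_head" and "lora" in their names is an assumption about the checkpoint's naming, not confirmed training code:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with 1e-3 for the LM head and 3e-5 for everything else
    trainable (the LoRA layers)."""
    head, lora = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # frozen base weights are excluded entirely
        (head if "lm_head" in name else lora).append(p)
    return torch.optim.AdamW([
        {"params": head, "lr": 1e-3},   # LM head
        {"params": lora, "lr": 3e-5},   # LoRA layers
    ])
```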

Evaluation

Testing Data, Factors & Metrics

Metrics

  • Word Error Rate (WER): the word-level edit distance between predicted and reference transcriptions, divided by the number of reference words. Lower is better.
  • Character Error Rate (CER): the percentage of incorrectly predicted characters (substitutions, deletions, or insertions) relative to the total characters in the reference text. Lower is better.
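Both metrics reduce to a Levenshtein edit distance, computed over words for WER and over characters for CER. A minimal sketch (in practice a library such as jiwer would be used):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)
```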

Results

Language | Samples | WER   | CER
Kikuyu   | 2,033   | 39.9% | 8.8%
Luhya    | 668     | 44.2% | 13.6%

Summary

The model achieves strong character-level accuracy on both languages, with Kikuyu CER of 8.8% and Luhya CER of 13.6%. The higher WER relative to CER is expected for a character-level CTC model: individual character substitutions cause entire word mismatches. Sample predictions show the model captures most phonetic content correctly, with errors mainly on similar-sounding characters and word boundaries.

The results demonstrate that joint LoRA training successfully preserved Kikuyu performance while adding Luhya support, with no catastrophic forgetting observed.

Model Card Authors [optional]

Joan Kinoti

Model Card Contact

JoanKinoti on Hugging Face
