AfriHuBERT: Kikuyu + Luhya ASR

Model Card for afrihubert-kikuyu-luhya

A multilingual automatic speech recognition (ASR) model for Kikuyu and Luhya, two Bantu languages spoken primarily in Kenya. Built on top of AfriHuBERT using a two-stage fine-tuning approach with LoRA adaptation to prevent catastrophic forgetting.

Model Details

Model Description

This model performs automatic speech recognition for two Kenyan Bantu languages, Kikuyu and Luhya. It was fine-tuned from AfriHuBERT using a two-stage approach: first fully fine-tuned on Kikuyu, then LoRA-adapted jointly on both Kikuyu and Luhya to add Luhya support without catastrophic forgetting of Kikuyu.

  • Developed by: Joan Kinoti
  • Model type: Automatic Speech Recognition (CTC)
  • Language(s): Kikuyu (ki), Luhya (luy)
  • License: MIT
  • Finetuned from: ajesujoba/AfriHuBERT

Direct Use

Transcribe spoken Kikuyu or Luhya audio to text. Suitable for:

  • Voice interfaces for Kenyan languages
  • Transcription pipelines for Kikuyu and Luhya audio content
  • Research on low-resource African language ASR

Downstream Use [optional]

Can be integrated into larger pipelines for translation, keyword spotting, or voice-controlled applications targeting Kikuyu and Luhya speakers.

Bias, Risks, and Limitations

  • Trained on a limited dataset โ€” performance may degrade on out-of-domain speakers or recording conditions
  • Character-level vocabulary may struggle with loanwords or proper nouns not seen during training
  • Performance may vary across dialects within Kikuyu and Luhya

How to Get Started with the Model

Use the code below to get started with the model.

import torch
import soundfile as sf
from transformers import HubertForCTC, Wav2Vec2Processor

model = HubertForCTC.from_pretrained("JoanKinoti/afrihubert-kikuyu-luhya")
processor = Wav2Vec2Processor.from_pretrained("JoanKinoti/afrihubert-kikuyu-luhya")
model.eval()

def transcribe(audio_path: str) -> str:
    audio, sr = sf.read(audio_path)
    # Downmix stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Resample to 16 kHz if needed
    if sr != 16000:
        import torchaudio
        audio = torchaudio.functional.resample(
            torch.tensor(audio, dtype=torch.float32).unsqueeze(0), sr, 16000
        ).squeeze(0).numpy()
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]

print(transcribe("your_audio.wav"))

Training Details

Training Data

Kikuyu: MCAA1-MSU/anv_data_ke, used for the initial full fine-tune on Kikuyu speech.

Luhya: Mozilla Data Collective (Luhya), used jointly with Kikuyu during the LoRA adaptation stage.

Both datasets consist of audio recordings with corresponding transcriptions. Audio was resampled to 16kHz mono and transcriptions were character-level tokenized using a shared 37-token vocabulary covering both languages.
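The shared character vocabulary can be built by collecting every character that appears in the transcriptions of both languages. A minimal sketch of that idea follows; the special-token names and the sample strings are illustrative assumptions, not the released model's exact 37 tokens:

```python
def build_char_vocab(transcriptions):
    """Collect every character seen across both languages' training text."""
    chars = set()
    for text in transcriptions:
        chars.update(text.lower())
    # CTC-style vocabularies typically reserve a pad/blank token and
    # replace the space with an explicit word-boundary token ("|").
    vocab = {"<pad>": 0, "<unk>": 1, "|": 2}
    for ch in sorted(chars):
        if ch == " ":
            continue
        vocab.setdefault(ch, len(vocab))
    return vocab

# Placeholder sample sentences (not from the actual datasets):
vocab = build_char_vocab(["nĩ wega mũno", "mulembe muno"])
```

In practice the vocabulary would be built from the full transcription set of both corpora, so characters unique to either language (such as Kikuyu's ĩ and ũ) end up in the same joint token table.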

Training Procedure

Stage 1 (full fine-tune on Kikuyu): AfriHuBERT was fully fine-tuned on Kikuyu speech data, producing a strong Kikuyu ASR checkpoint.

Stage 2 (joint LoRA adaptation on Kikuyu + Luhya): LoRA adapters were applied on top of the Kikuyu checkpoint and trained jointly on both languages using balanced sampling to prevent catastrophic forgetting. The final LoRA adapters were merged into the base model for deployment.

Preprocessing

  • Audio resampled to 16kHz mono
  • Peak normalization with soft tanh clipping
  • Maximum audio length: 20 seconds
  • Character-level tokenization with 37-token joint vocabulary

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Optimizer: AdamW
  • LM head learning rate: 1e-3
  • LoRA layers learning rate: 3e-5
  • Warmup ratio: 0.1
  • Training steps: 43,200
  • Trainable parameters: 618,277 (~0.65% of total)
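The two learning rates imply two optimizer parameter groups: one for the CTC head and one for the LoRA layers. A minimal sketch of that setup follows; matching parameters by the substrings "lm_head" and "lora" in their names is an assumption about the checkpoint's naming, not confirmed training code:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with 1e-3 for the LM head and 3e-5 for everything else
    trainable (the LoRA layers)."""
    head, lora = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # frozen base weights are excluded entirely
        (head if "lm_head" in name else lora).append(p)
    return torch.optim.AdamW([
        {"params": head, "lr": 1e-3},   # LM head
        {"params": lora, "lr": 3e-5},   # LoRA layers
    ])
```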

Evaluation

Testing Data, Factors & Metrics

Metrics

  • Word Error Rate (WER): the word-level edit distance between predicted and reference transcriptions, divided by the number of reference words. Lower is better.
  • Character Error Rate (CER): the percentage of incorrectly predicted characters (substitutions, deletions, or insertions) relative to the total characters in the reference text. Lower is better.
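Both metrics reduce to a Levenshtein edit distance, computed over words for WER and over characters for CER. A minimal sketch (in practice a library such as jiwer would be used):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)
```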

Results

Language | Samples | WER   | CER
Kikuyu   | 2,033   | 39.9% | 8.8%
Luhya    | 668     | 44.2% | 13.6%

Summary

The model achieves strong character-level accuracy on both languages, with Kikuyu CER of 8.8% and Luhya CER of 13.6%. The higher WER relative to CER is expected for a character-level CTC model: individual character substitutions cause entire word mismatches. Sample predictions show the model captures most phonetic content correctly, with errors mainly on similar-sounding characters and word boundaries.

The results demonstrate that joint LoRA training successfully preserved Kikuyu performance while adding Luhya support, with no catastrophic forgetting observed.

Model Card Authors [optional]

Joan Kinoti

Model Card Contact

JoanKinoti on Hugging Face
