Multilingual Whisper (Uz/En/Ru): Fine-tuned Speech-to-Text Model

A fine-tuned Whisper Small model tuned for balanced transcription quality across Uzbek, English, and Russian.
It is trained on a balanced multilingual dataset, targets real-world speech transcription, and performs competitively against strong open-source and commercial STT solutions.


Model Details

Model Description

This model extends OpenAI Whisper Small by fine-tuning it on a multilingual speech mixture, aiming to deliver robust ASR performance for Uzbek, English, and Russian speakers.
The goal was to reduce the performance gap between languages, in particular by improving Uzbek speech recognition, where public ASR resources are scarce.

  • Model type: Automatic Speech Recognition (ASR)
  • Language(s): Uzbek 🇺🇿, English 🇬🇧, Russian 🇷🇺
  • License: Apache-2.0
  • Finetuned from: openai/whisper-small
  • Intended usage: Real-time & offline speech-to-text

Training datasets:

  • DavronSherbaev/uzbekvoice-filtered
  • telegram-voice-messages (private collection)
  • navaistt-open-datasets
  • sovaai/russian-audiobooks
  • librispeech

Evaluation

Word Error Rate (WER) Comparison

All WER results were obtained using the same test set. The test set consists of real-world voice messages collected from public Telegram groups. It contains approximately 2 hours of audio data in total. The dataset will be made publicly available soon.
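For readers who want to reproduce this kind of comparison on their own recordings, the following is a minimal sketch of a corpus-level WER computation. It is not the authors' evaluation script: the text normalization (lowercasing and whitespace splitting) and the function names `edit_distance` and `corpus_wer` are assumptions made for illustration, and a different normalization will give different numbers.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word edits divided by total reference words, over (reference, hypothesis) pairs."""
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        # Assumed normalization: lowercase and split on whitespace only
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        total_edits += edit_distance(ref, hyp)
        total_ref_words += len(ref)
    if total_ref_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_ref_words
```

Passing a list of (reference transcript, model output) pairs to `corpus_wer` returns a fraction; multiply by 100 to compare with percentage figures.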

Model                                   | WER ↓
Whisper-small-uz-v1                     | 34.5%
Gemini (Commercial)                     | 36.21%
NavaiSTT v2 (Open-Source medium model)  | 35.14%
Aisha STT (Commercial)                  | 41.71%

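On this test set the model achieves the lowest WER (34.5%) of the compared systems, ahead of NavaiSTT v2 (35.14%), Gemini (36.21%), and Aisha STT (41.71%). This suggests good generalization to informal real-world speech, although the margin over NavaiSTT v2 is small (0.64 percentage points).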
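To put the reported gaps in perspective, the short sketch below computes the relative WER reduction of this model against each baseline. The WER values are copied from the comparison above; the dictionary and the helper name `relative_reduction` are illustrative only.

```python
# WER values in percent, copied from the comparison in this section
wer_percent = {
    "Whisper-small-uz-v1": 34.5,
    "Gemini": 36.21,
    "NavaiSTT v2": 35.14,
    "Aisha STT": 41.71,
}


def relative_reduction(baseline: float, ours: float) -> float:
    """Relative WER reduction in percent: how much lower `ours` is than `baseline`."""
    return (baseline - ours) / baseline * 100


ours = wer_percent["Whisper-small-uz-v1"]
gaps = {
    name: relative_reduction(wer, ours)
    for name, wer in wer_percent.items()
    if name != "Whisper-small-uz-v1"
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% relative WER reduction")
```

With these figures the relative reductions come out at roughly 1.8% against NavaiSTT v2, 4.7% against Gemini, and 17.3% against Aisha STT. On a test set of about 2 hours, the smallest of these gaps is best read as indicative.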


Usage Example

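from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

model_id = "OvozifyLabs/whisper-small-uz-v1"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# torchaudio returns a (channels, samples) tensor at the file's native sample rate
audio, sr = torchaudio.load("audio.wav")

# Whisper expects mono audio at 16 kHz: downmix the channels and resample if needed
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# The feature extractor pads or truncates the input to a 30-second window
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(text)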