language:
  - az
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - whisper
  - azerbaijani
  - asr
  - speech
  - fine-tuned
base_model: openai/whisper-small
datasets:
  - LocalDoc/azerbaijani_asr
  - LocalDoc/fleurs-azerbaijani-asr
metrics:
  - wer
  - cer
model-index:
  - name: azerbaijani-whisper-small
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: LocalDoc/fleurs-azerbaijani-asr
          name: FLEURS Azerbaijani
          split: test
        metrics:
          - type: wer
            value: 20.54
            name: WER
          - type: cer
            value: 5.72
            name: CER

Azerbaijani Whisper Small

Fine-tuned openai/whisper-small for Azerbaijani automatic speech recognition.

Performance

| Model | Params | WER | CER |
|---|---|---|---|
| whisper-small (baseline) | 242M | 52.17% | 14.52% |
| whisper-medium (baseline) | 769M | 34.54% | 9.00% |
| whisper-large-v3 (baseline) | 1543M | 21.00% | 5.51% |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% |

This model achieves a lower WER than whisper-large-v3 (20.54% vs. 21.00%) with about 6x fewer parameters (242M vs. 1543M); its CER is slightly higher (5.72% vs. 5.51%).

Evaluated on the FLEURS Azerbaijani test set.
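
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER can be computed. It is not the evaluation script used for this model card. The helper names (`normalize`, `edit_distance`, `error_rate`) are hypothetical, and the normalization only approximates the "lowercase, no punctuation" rule stated under Benchmark Details. Two further assumptions: errors are pooled over the whole corpus rather than averaged per utterance, and spaces count as characters for CER.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace.

    Caveat: str.lower() is not locale-aware, so the Azerbaijani
    dotted/dotless I pair (İ/i, I/ı) may need special handling.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' (WER) or 'char' (CER)."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref, hyp = normalize(ref), normalize(hyp)
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total


refs = ["Salam, dünya!"]
hyps = ["salam dunya"]
print(error_rate(refs, hyps, "word"))            # 50.0
print(round(error_rate(refs, hyps, "char"), 2))  # 9.09
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100%, which is how whisper-tiny reaches 104.48% in the Benchmark Details table.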

Usage

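pip install --upgrade transformers

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf
import numpy as np

# Load the fine-tuned processor (feature extractor + tokenizer) and model
processor = WhisperProcessor.from_pretrained("LocalDoc/azerbaijani-whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("LocalDoc/azerbaijani-whisper-small")

# Read the file; sr is its native sample rate
audio, sr = sf.read("audio.wav")

# Downmix multi-channel audio to mono by averaging the channels
if len(audio.shape) > 1:
    audio = audio.mean(axis=1)

# Resample to the 16 kHz rate the model expects
audio = librosa.resample(np.asarray(audio, dtype=np.float32), orig_sr=sr, target_sr=16000)
sr = 16000

# Convert the waveform to model input features
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
# Force Azerbaijani transcription instead of relying on language detection
forced_ids = processor.get_decoder_prompt_ids(language="az", task="transcribe")

# Inference only, so gradients are not needed
with torch.no_grad():
    ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)

# Decode token IDs to text, dropping special tokens
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)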

Note: Audio must be 16 kHz mono. If your audio has a different sample rate, resample it with librosa.resample() as shown above. Audio passed at any other sample rate will be transcribed incorrectly.
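
To check ahead of time whether a file already meets the 16 kHz mono requirement, a small pre-flight check along these lines may help. This is a sketch, not part of the model's tooling: `check_wav` is a hypothetical helper, and it relies on Python's standard-library `wave` module, which reads only uncompressed PCM WAV files.

```python
import wave

TARGET_SR = 16000  # sample rate required by the note above


def check_wav(path):
    """Return (sample_rate, channels, ready) for a PCM WAV file.

    `ready` is True only when the file is already 16 kHz mono,
    i.e. no downmixing or resampling would be needed.
    """
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
    ready = sample_rate == TARGET_SR and channels == 1
    return sample_rate, channels, ready
```

For example, `check_wav("audio.wav")` on a 44.1 kHz stereo recording would return `(44100, 2, False)`, meaning the file needs the downmix and resample steps before inference.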

Requirements

pip install transformers torch soundfile librosa

Benchmark Details

All models were evaluated on the FLEURS Azerbaijani test split (921 samples) with the same text normalization (lowercased, punctuation removed).

| Model | Params | WER | CER | RTF (GPU) |
|---|---|---|---|---|
| whisper-tiny | 38M | 104.48% | 53.93% | 0.033 |
| whisper-base | 73M | 82.63% | 30.35% | 0.032 |
| whisper-small | 242M | 52.17% | 14.52% | 0.053 |
| whisper-medium | 769M | 34.54% | 9.00% | 0.097 |
| whisper-large-v3 | 1543M | 21.00% | 5.51% | 0.129 |
| whisper-large-v3-turbo | 809M | 22.99% | 6.55% | 0.024 |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% | ~0.05 |
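
The model card does not say how RTF was measured. Assuming the usual definition (real-time factor = processing time divided by audio duration, so lower is faster), a measurement could be sketched as below. `real_time_factor` and its `transcribe` callback are hypothetical names, and the warm-up and run counts are arbitrary choices rather than the benchmark's actual settings.

```python
import time


def real_time_factor(transcribe, audio, sample_rate, warmup=1, runs=3):
    """Mean processing time per run divided by the audio duration.

    `transcribe` is any callable that takes the waveform and runs inference.
    On a GPU it should wait for the device to finish before returning
    (e.g. by calling torch.cuda.synchronize()), otherwise the timing is
    too optimistic.
    """
    duration = len(audio) / sample_rate  # seconds of audio
    for _ in range(warmup):              # untimed runs to absorb one-off setup cost
        transcribe(audio)
    start = time.perf_counter()
    for _ in range(runs):
        transcribe(audio)
    elapsed = (time.perf_counter() - start) / runs
    return elapsed / duration
```

Under this definition, an RTF of 0.129 would correspond to roughly 1.29 s of compute for 10 s of audio, and 0.05 to roughly 0.5 s.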

License

Apache 2.0