language:
  - az
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - whisper
  - azerbaijani
  - asr
  - speech
  - fine-tuned
base_model: openai/whisper-small
datasets:
  - LocalDoc/azerbaijani_asr
  - LocalDoc/fleurs-azerbaijani-asr
metrics:
  - wer
  - cer
model-index:
  - name: azerbaijani-whisper-small
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: LocalDoc/fleurs-azerbaijani-asr
          name: FLEURS Azerbaijani
          split: test
        metrics:
          - type: wer
            value: 20.54
            name: WER
          - type: cer
            value: 5.72
            name: CER

Azerbaijani Whisper Small

Fine-tuned openai/whisper-small for Azerbaijani automatic speech recognition.

Performance

| Model | Params | WER | CER |
|---|---|---|---|
| whisper-small (baseline) | 242M | 52.17% | 14.52% |
| whisper-medium (baseline) | 769M | 34.54% | 9.00% |
| whisper-large-v3 (baseline) | 1543M | 21.00% | 5.51% |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% |

This model achieves a lower WER than whisper-large-v3 (20.54% vs. 21.00%) with about 6x fewer parameters (242M vs. 1543M). Its CER is slightly higher (5.72% vs. 5.51%).

Evaluated on the FLEURS Azerbaijani test set.
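To put the reported figures in perspective, the sketch below derives the parameter ratio and the relative error changes directly from the table. The dictionary layout and the `relative_change` helper are illustrative names, not part of the model's tooling.

```python
# Figures copied from the table above: (parameters in millions, WER %, CER %).
RESULTS = {
    "whisper-small": (242, 52.17, 14.52),
    "whisper-large-v3": (1543, 21.00, 5.51),
    "azerbaijani-whisper-small": (242, 20.54, 5.72),
}


def relative_change(new: float, old: float) -> float:
    """Return the relative change from old to new in percent (negative = lower)."""
    return (new - old) / old * 100.0


ft_params, ft_wer, ft_cer = RESULTS["azerbaijani-whisper-small"]
base_params, base_wer, base_cer = RESULTS["whisper-small"]
large_params, large_wer, large_cer = RESULTS["whisper-large-v3"]

size_ratio = large_params / ft_params
print(f"large-v3 / fine-tuned parameter ratio: {size_ratio:.1f}x")
print(f"WER vs. whisper-small:    {relative_change(ft_wer, base_wer):+.1f}%")
print(f"WER vs. whisper-large-v3: {relative_change(ft_wer, large_wer):+.1f}%")
print(f"CER vs. whisper-large-v3: {relative_change(ft_cer, large_cer):+.1f}%")
```

The largest gain is against the whisper-small baseline the model was fine-tuned from. The margin over whisper-large-v3 is much narrower.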

Usage

pip install --upgrade transformers

import numpy as np
import torch
import librosa
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the fine-tuned processor (feature extractor + tokenizer) and model
processor = WhisperProcessor.from_pretrained("LocalDoc/azerbaijani-whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("LocalDoc/azerbaijani-whisper-small")

# Read the file; soundfile returns (samples,) or (samples, channels)
audio, sr = sf.read("audio.wav")

# Downmix multi-channel audio to mono
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# Resample to the 16 kHz rate Whisper expects
audio = librosa.resample(np.asarray(audio, dtype=np.float32), orig_sr=sr, target_sr=16000)
sr = 16000

# Extract input features and build the Azerbaijani transcription prompt
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="az", task="transcribe")

# Generate token ids without tracking gradients
with torch.no_grad():
    ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)

# Decode the first (and only) sequence to text
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)

Note: Audio must be 16 kHz mono. Downmix multi-channel audio to mono and, if the sample rate differs, resample with librosa.resample() as shown above. Passing audio at a different sample rate without resampling will produce incorrect results.
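The preprocessing described in the note can be wrapped in a small helper. This is a minimal sketch: `to_16k_mono` is a hypothetical name, it assumes `soundfile`-style arrays shaped `(samples,)` or `(samples, channels)`, and it imports librosa only when resampling is actually needed.

```python
import numpy as np

TARGET_SR = 16000  # sample rate expected by the Whisper feature extractor


def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile returns (samples, channels); average the channels.
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
    if sr != TARGET_SR:
        import librosa  # imported lazily; only required for resampling

        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    return audio
```

With this helper, loading reduces to `audio = to_16k_mono(*sf.read("audio.wav"))`, and the result can be passed to `processor(audio, sampling_rate=16000, return_tensors="pt")`.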

Requirements

pip install transformers torch soundfile librosa

Benchmark Details

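All models were evaluated on the FLEURS Azerbaijani test split (921 samples) with the same text normalization (lowercased, punctuation removed). RTF is the real-time factor measured on GPU: processing time divided by audio duration, so lower is faster.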

| Model | Params | WER | CER | RTF (GPU) |
|---|---|---|---|---|
| whisper-tiny | 38M | 104.48% | 53.93% | 0.033 |
| whisper-base | 73M | 82.63% | 30.35% | 0.032 |
| whisper-small | 242M | 52.17% | 14.52% | 0.053 |
| whisper-medium | 769M | 34.54% | 9.00% | 0.097 |
| whisper-large-v3 | 1543M | 21.00% | 5.51% | 0.129 |
| whisper-large-v3-turbo | 809M | 22.99% | 6.55% | 0.024 |
| azerbaijani-whisper-small | 242M | 20.54% | 5.72% | ~0.05 |
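The scoring script itself is not included in this card, so the following is only a sketch of how WER and CER are commonly computed under the stated normalization (lowercase, no punctuation). The function names and the punctuation regex are assumptions, and the actual evaluation may differ in details such as how spaces are counted for CER.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation (assumed normalization).

    Unicode letters such as ə, ş and ğ are kept. Note that str.lower()
    is not Turkic-aware: it maps "I" to "i", not to the Azerbaijani "ı".
    """
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word") -> float:
    """Corpus-level WER (unit="word") or CER (unit="char") in percent."""
    # For CER, spaces are counted as characters in this sketch.
    split = str.split if unit == "word" else list
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = split(normalize(ref))
        hyp_tokens = split(normalize(hyp))
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100.0 * errors / total
```

Because insertions count as errors while the denominator is the reference length, the rate can exceed 100%. For example, `error_rate(["salam"], ["sa lam dünya"])` scores three errors against one reference word, which is the same effect behind whisper-tiny's 104.48% WER in the table.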

License

Apache 2.0