Cosmobillian / turkish_whisper_for_noisy_datas

A Whisper-large-v3 model fine-tuned for noisy Turkish speech recognition (short utterances, real-world environments).

🔎 Model Summary

Base model: openai/whisper-large-v3
Language: Turkish (tr)
Task: Automatic Speech Recognition (ASR) – Transcription
Domain: Noisy / real-world audio (street, phone mic, background noise, reverb, etc.)
Input audio: mono, 16 kHz, short segments (≈ 3–8 seconds)
Fine-tuning type: Full-model checkpoint; only the decoder was fine-tuned, the encoder was kept frozen

This model is designed for robust speech-to-text on noisy Turkish audio, especially:

mobile / cheap microphone recordings
mild background music or chatter
echo / reverb (rooms, corridors, etc.)

It is no longer a general multilingual model; decoding is heavily biased towards Turkish.

✅ Intended Use

Primary use-case:

Transcribing short Turkish utterances with background noise (e.g. real calls, vlogs, “in the wild” recordings).

Good for:

Prototypes of Turkish ASR systems
Voice-enabled assistants for Turkish users
Noisy datasets (phone, street, public places, YouTube-like content)

Not ideal for:

Long-form audio without chunking (podcasts, single recordings of a minute or more)
High-stakes applications (medical/legal dictation) without manual review
Clean studio speech where smaller Whisper models already perform very well

⚙️ Training Details

Note: This is a custom fine-tuned model; base capabilities come from openai/whisper-large-v3.

Base model: openai/whisper-large-v3
Fine-tuned on: Private Turkish dataset of short (~5s) audio clips
- Noisy, real-world conditions
- Paired with manually prepared transcriptions
Sampling rate: 16 kHz, mono
Loss: Cross-entropy with label smoothing
Strategy:
- Encoder frozen (only decoder fine-tuned)
- Small learning rate to avoid catastrophic forgetting
- Short training (1 epoch) to adapt to noise style while preserving base knowledge
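
The encoder-frozen strategy above could be set up roughly as follows. This is a minimal sketch, not the training script used for this model (which is not published): `freeze_encoder` and `count_trainable` are hypothetical helper names, and the only library call assumed is `get_encoder()`, which `WhisperForConditionalGeneration` provides.

```python
def freeze_encoder(model):
    """Disable gradients for every encoder parameter; the decoder stays trainable.

    Assumes `model` exposes `get_encoder()`, as Hugging Face
    WhisperForConditionalGeneration does. Hypothetical helper, not the
    author's training code.
    """
    for param in model.get_encoder().parameters():
        param.requires_grad = False
    return model


def count_trainable(model):
    """Return (trainable, total) parameter counts as a quick sanity check."""
    params = list(model.parameters())
    trainable = sum(p.numel() for p in params if p.requires_grad)
    total = sum(p.numel() for p in params)
    return trainable, total
```

With the model loaded as in the Quickstart, calling `freeze_encoder(model)` and then `count_trainable(model)` should report fewer trainable than total parameters. An optimizer would then be built only from parameters that still have `requires_grad=True`.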

The exact dataset is not public; treat this model as a research / experimental release.
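
For reference, the label-smoothed cross-entropy listed under Loss can be written out for a single token as below. This is an illustrative pure-Python sketch using the common uniform-smoothing definition. The smoothing factor used in training is not stated in this card, so `epsilon=0.1` is only a placeholder assumption, and `label_smoothed_nll` is a hypothetical name.

```python
import math


def label_smoothed_nll(logits, target, epsilon=0.1):
    """Cross-entropy for one token with uniform label smoothing.

    The target distribution puts (1 - epsilon) on the correct class and
    spreads epsilon evenly over all classes. `epsilon` here is an assumed
    placeholder, not the value used to train this model.
    """
    # Numerically stable log-softmax
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                     # plain cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # cross-entropy against uniform
    return (1.0 - epsilon) * nll + epsilon * uniform
```

With `epsilon=0.0` this reduces to ordinary cross-entropy. A positive `epsilon` penalizes over-confident predictions, which is one reason it is often paired with short, low-learning-rate fine-tuning runs.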

📊 Evaluation

The model has been manually checked on several noisy Turkish utterances. Qualitatively:

Much more robust to background noise than vanilla Whisper on the same custom data
Better handling of casual/spontaneous speech (hesitations, filler words, etc.)
Occasionally produces grammatically imperfect sentences (as expected from ASR)

There is no official WER benchmark on a public dataset yet (e.g. Common Voice, FLEURS).
If you use this model in a paper or product, please:

Benchmark on your own dev/test set
Share WER / CER numbers if possible 🙏
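
To make the requested WER / CER numbers easy to produce, here is a small dependency-free sketch based on Levenshtein distance. The function names are illustrative, and it applies no text normalization (casing, punctuation), which you would likely want to add and report alongside your numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One deleted word out of four reference words
print(wer("bugün hava çok güzel", "bugün hava güzel"))
```

For a whole test set, sum the edit distances and the reference lengths across utterances before dividing, rather than averaging per-utterance rates.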

🚀 Quickstart (Hugging Face transformers)

# Notebook cell (e.g. Colab); in a shell, drop the leading "!". PyTorch must also be installed.
!pip install -q transformers soundfile librosa

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "Cosmobillian/turkish_whisper_for_noisy_datas"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Force the language/task prompt (Turkish + transcribe)
forced_ids = processor.get_decoder_prompt_ids(
    language="turkish",
    task="transcribe",
)
model.config.forced_decoder_ids = forced_ids
if hasattr(model, "generation_config"):
    model.generation_config.forced_decoder_ids = forced_ids


def load_audio(path, target_sr=16000):
    """Load an audio file as a mono float waveform at `target_sr` Hz."""
    # librosa resamples while loading when `sr` is given
    audio, sr = librosa.load(path, sr=target_sr, mono=True)
    return audio, sr


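def chunked_transcribe(path, chunk_sec=30.0, stride_sec=5.0, max_new_tokens=256):
    """Transcribe a file in overlapping chunks and join the chunk texts.

    Consecutive chunks overlap by `stride_sec`, so words spoken in the overlap
    can appear twice in the joined output.
    """
    # A stride as long as the chunk would never advance and loop forever
    if stride_sec >= chunk_sec:
        raise ValueError("stride_sec must be smaller than chunk_sec")

    speech, sr = load_audio(path, 16000)

    chunk_size = int(chunk_sec * sr)
    stride_size = int(stride_sec * sr)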

    texts = []
    start = 0

    while start < len(speech):
        end = start + chunk_size
        chunk = speech[start:end]

        if len(chunk) == 0:
            break

        inputs = processor(
            chunk,
            sampling_rate=sr,
            return_tensors="pt",
        )
        input_features = inputs.input_features.to(device)

        with torch.no_grad():
            generated_ids = model.generate(
                input_features,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                num_beams=1,
                no_repeat_ngram_size=3,
                repetition_penalty=1.2,
            )

        text = processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0]

        texts.append(text)

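        # Stop once this chunk has reached the end of the audio; otherwise the
        # final overlap would be transcribed again as a tiny extra chunk.
        if end >= len(speech):
            break

        # Move to the next chunk, stepping back by the stride so chunks overlap
        start = end - stride_size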

    return " ".join(texts)


# EXAMPLE USAGE
AUDIO_PATH = "/content/uzun_kayit.wav"
full_text = chunked_transcribe(AUDIO_PATH, chunk_sec=30, stride_sec=5, max_new_tokens=256)

print("Full transcription:\n")
print(full_text)
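
Because consecutive chunks overlap by `stride_sec`, joining the chunk texts with spaces can repeat the words spoken in the overlap. One possible post-processing step, sketched below as an assumption rather than part of the original recipe, drops the longest run of words that ends one chunk and also starts the next. `merge_chunk_texts` is a hypothetical helper, and exact word matching will miss overlaps that the model transcribed differently in the two chunks.

```python
def merge_chunk_texts(texts, max_overlap_words=20):
    """Join chunk transcripts, removing words duplicated at chunk boundaries."""
    merged = []
    for text in texts:
        words = text.split()
        # Find the longest k where the last k merged words equal the first k new words
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        for k in range(limit, 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


print(merge_chunk_texts(["bugün hava çok", "hava çok güzel"]))
# prints: bugün hava çok güzel
```

In `chunked_transcribe`, this would stand in for the final `" ".join(texts)`.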

Model size: 2B params
Tensor type: F32
Format: Safetensors
