Cosmobillian / turkish_whisper_for_noisy_datas
A Whisper-large-v3 model fine-tuned for noisy Turkish speech recognition (short utterances, real-world environments).
Model Summary
- Base model: openai/whisper-large-v3
- Language: Turkish (tr)
- Task: Automatic Speech Recognition (ASR), transcription
- Domain: Noisy / real-world audio (street, phone mic, background noise, reverb, etc.)
- Input audio: mono, 16 kHz, short segments (≈ 3-8 seconds)
- Fine-tuning type: Full-model checkpoint with decoder-focused fine-tuning (encoder frozen)
This model is designed to perform robust speech-to-text for noisy Turkish audio, especially:
- mobile / cheap microphone recordings
- mild background music or chatter
- echo / reverb (rooms, corridors etc.)
It is no longer a general-purpose multilingual model; decoding is heavily biased towards Turkish.
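As a quick pre-flight check against the input format listed above (mono, 16 kHz, roughly 3-8 s), a minimal standard-library sketch could look like the following. The helper name `check_wav` and the duration bounds are illustrative assumptions, not part of the model's API, and the function only handles PCM WAV files.

```python
import wave


def check_wav(path, target_sr=16000, min_sec=3.0, max_sec=8.0):
    """Report whether a PCM WAV file matches the expected input format.

    The 3-8 s bounds mirror the segment length in the summary; they are a
    guideline for this sketch, not a hard limit enforced by the model.
    """
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        duration = wf.getnframes() / float(sample_rate)
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_sec": duration,
        "is_mono": channels == 1,
        "is_target_sr": sample_rate == target_sr,
        "is_short_segment": min_sec <= duration <= max_sec,
    }
```

Files that fail the mono or sample-rate check can still be used after resampling and downmixing; the check only flags recordings that differ from what the model was tuned on.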
Intended Use
Primary use-case:
- Transcribing short Turkish utterances with background noise (e.g. real calls, vlogs, "in the wild" recordings).
Good for:
- Prototypes of Turkish ASR systems
- Voice-enabled assistants for Turkish users
- Noisy datasets (phone, street, public places, YouTube-like content)
Not ideal for:
- Long-form audio without chunking (podcasts, 1+ minute single shot)
- High-stakes applications (medical/legal dictation) without manual review
- Clean studio speech where smaller Whisper models already perform very well
Training Details
Note: This is a custom fine-tuned model; base capabilities come from openai/whisper-large-v3.
- Base model: openai/whisper-large-v3
- Fine-tuned on: Private Turkish dataset of short (~5s) audio clips
  - Noisy, real-world conditions
  - Paired with manually prepared transcriptions
- Sampling rate: 16 kHz, mono
- Loss: Cross-entropy with label smoothing
- Strategy:
  - Encoder frozen (only decoder fine-tuned)
  - Small learning rate to avoid catastrophic forgetting
  - Short training (1 epoch) to adapt to noise style while preserving base knowledge
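The encoder-frozen strategy above can be sketched with the `freeze_encoder()` helper on `WhisperForConditionalGeneration`. This is not the author's training script, which is not published. The tiny randomly initialised config below is an assumption chosen so the snippet runs without downloading the large checkpoint.

```python
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Tiny random-weight config (illustrative only) so the sketch runs offline.
config = WhisperConfig(
    d_model=64,
    encoder_layers=2,
    decoder_layers=2,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=128,
    decoder_ffn_dim=128,
)
model = WhisperForConditionalGeneration(config)

# Disable gradients for every encoder parameter; the decoder stays trainable.
model.freeze_encoder()

# Note: decoder cross-attention layers are named "encoder_attn", so filter on
# the "model.encoder." prefix rather than on the substring "encoder".
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print(f"trainable tensors: {len(trainable)}, frozen tensors: {len(frozen)}")
```

With the real checkpoint, the model would instead be loaded via `WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")` before calling `freeze_encoder()`, and only the parameters with `requires_grad=True` would be handed to the optimizer.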
Exact dataset is not public; this model should be treated as research / experimental.
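For readers unfamiliar with the loss named above, here is a small standard-library sketch of cross-entropy with label smoothing for a single token position. It follows the common formulation (the one used by PyTorch's `CrossEntropyLoss(label_smoothing=...)`), in which a fraction `eps` of the target mass is spread uniformly over all classes. The smoothing value used to train this model is not stated, so the `0.1` below is purely an assumed example.

```python
import math


def label_smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy for one position with uniform label smoothing.

    Target distribution: (1 - eps) on `target` plus eps / K on every class,
    where K is the number of classes.
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                      # ordinary cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)    # mean loss over all classes
    return (1.0 - eps) * nll + eps * uniform


logits = [2.0, 0.5, -1.0]
print(label_smoothed_ce(logits, target=0, eps=0.0))  # plain cross-entropy
print(label_smoothed_ce(logits, target=0, eps=0.1))  # smoothed variant
```

Smoothing penalises over-confident predictions, which is one plausible reason to pair it with a short, low-learning-rate fine-tune on a small dataset.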
Evaluation
The model has been manually checked on several noisy Turkish utterances. Qualitatively:
- Much more robust to background noise than vanilla Whisper on the same custom data
- Better handling of casual/spontaneous speech (hesitations, filler words, etc.)
- Occasionally produces grammatically imperfect sentences (as expected from ASR)
There is no official WER benchmark on a public dataset yet (e.g. Common Voice, FLEURS).
If you use this model in a paper or product, please:
- Benchmark on your own dev/test set
- Share WER / CER numbers if possible
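To make the benchmarking suggestion above concrete, here is a dependency-free sketch of word error rate (WER) and character error rate (CER) based on Levenshtein edit distance. The function names are illustrative. For real evaluations, an established package (for example `jiwer`) and a consistent text-normalisation step are advisable; neither is shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


ref = "bugün hava çok güzel"
hyp = "bugün hava güzel"
print(f"WER: {wer(ref, hyp):.2f}")
print(f"CER: {cer(ref, hyp):.2f}")
```

Reported numbers are only comparable when references and hypotheses are normalised the same way (casing, punctuation, numerals), so it helps to state the normalisation alongside the scores.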
Quickstart (Hugging Face Transformers)
# In a notebook (e.g. Colab), install the dependencies first:
# !pip install -q transformers soundfile librosa torch
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
MODEL_ID = "Cosmobillian/turkish_whisper_for_noisy_datas"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(device)
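# Force the language/task prompt (Turkish + transcribe).
# Note: newer Transformers releases may warn that forced_decoder_ids is
# deprecated in favour of passing language= and task= to generate().
forced_ids = processor.get_decoder_prompt_ids(
    language="turkish",
    task="transcribe",
)
model.config.forced_decoder_ids = forced_ids
if hasattr(model, "generation_config"):
    model.generation_config.forced_decoder_ids = forced_ids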
def load_audio(path, target_sr=16000):
    """Load an audio file as mono and resample to target_sr if needed."""
    audio, sr = librosa.load(path, sr=None, mono=True)
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
        sr = target_sr
    return audio, sr
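def chunked_transcribe(path, chunk_sec=30.0, stride_sec=5.0, max_new_tokens=256):
    """Transcribe a file in overlapping chunks and join the chunk texts.

    Consecutive chunks overlap by stride_sec, so words spoken in the overlap
    may appear twice in the joined output.
    """
    speech, sr = load_audio(path, 16000)
    chunk_size = int(chunk_sec * sr)
    stride_size = int(stride_sec * sr)
    if stride_size >= chunk_size:
        # Otherwise the window would never advance.
        raise ValueError("stride_sec must be smaller than chunk_sec")
    texts = []
    start = 0
    while start < len(speech):
        end = start + chunk_size
        chunk = speech[start:end]
        if len(chunk) == 0:
            break
        inputs = processor(
            chunk,
            sampling_rate=sr,
            return_tensors="pt",
        )
        input_features = inputs.input_features.to(device)
        with torch.no_grad():
            generated_ids = model.generate(
                input_features,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                num_beams=1,
                no_repeat_ngram_size=3,
                repetition_penalty=1.2,
            )
        text = processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0]
        texts.append(text)
        # Stop once this chunk has reached the end of the audio; otherwise
        # the tail would be transcribed a second time.
        if end >= len(speech):
            break
        # Move to the next chunk, stepping back by the stride to overlap.
        start = end - stride_size
    return " ".join(texts)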
# EXAMPLE USAGE
AUDIO_PATH = "/content/uzun_kayit.wav"
full_text = chunked_transcribe(AUDIO_PATH, chunk_sec=30, stride_sec=5, max_new_tokens=256)
print("Full transcription:\n")
print(full_text)
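Because consecutive chunks overlap by `stride_sec`, joining chunk transcripts with a plain space can repeat the words spoken in the overlap. One possible mitigation, sketched below, is to drop the longest run of words that both ends one transcript and starts the next. `merge_overlap` and `merge_all` are hypothetical helpers, not part of this model or of Transformers. Exact word matching is a simplifying assumption: ASR output for the same audio often differs slightly between chunks, so this will not catch every duplicate.

```python
def merge_overlap(prev_text, next_text, max_words=20):
    """Append next_text to prev_text, dropping words duplicated at the seam."""
    prev_words = prev_text.split()
    next_words = next_text.split()
    limit = min(max_words, len(prev_words), len(next_words))
    # Try the longest candidate overlap first, then shrink.
    for k in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-k:]]
        head = [w.lower() for w in next_words[:k]]
        if tail == head:
            return " ".join(prev_words + next_words[k:])
    # No shared words at the seam: fall back to plain concatenation.
    return " ".join(prev_words + next_words)


def merge_all(texts):
    """Fold a list of chunk transcripts into one string."""
    merged = ""
    for text in texts:
        merged = merge_overlap(merged, text)
    return merged


chunks = ["bugün hava çok güzel", "çok güzel ama biraz rüzgarlı"]
print(merge_all(chunks))
```

To try it with the quickstart function, `chunked_transcribe` would need to return its `texts` list instead of the joined string, which is left as an exercise since it changes the original interface.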