You need to abide by Terms of Use to access this model

This model is released for non-commercial research and educational purposes only.

By requesting access, you agree to:

We strongly recommend verifying all outputs against original audio, especially when working with sensitive recordings.

Automatic Speech Recognition for Northeastern Yiddish (Phonemic Orthography)

This model is a version of Wav2Vec-BERT 2.0 fine-tuned on a subset of the Corpus of Spoken Yiddish in Europe (CSYE) for automatic speech recognition in Northeastern Yiddish (also known as Litvish or Lithuanian Yiddish). The model outputs a phonemic representation of Yiddish using a Hebrew-based orthography in precomposed Unicode. This output can be respelled in standard Yiddish by transliterating and then detransliterating the text with the yiddish package.

This is the PHON-44 model from: Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]

Description

  • Base model: facebook/w2v-bert-2.0
  • Orthography: Phonemic Hebrew-based script in precomposed Unicode (with Alphabetic Presentation Forms)
  • Training data: 30.83 hours from 42 Northeastern Yiddish speakers from CSYE
  • Training seed: 44 (lowest WER of 5 random seeds)
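
The difference between precomposed and decomposed Unicode matters when comparing this model's output with other Yiddish text. The sketch below uses only Python's standard library. The `PRECOMPOSED` table and the `precompose` and `decompose` helpers are illustrative names, the table covers only a handful of letters, and none of this is the mapping used to prepare the training data. Note that Unicode NFC normalization never produces the Hebrew presentation forms (they are composition exclusions), so an explicit mapping is needed to go in that direction.

```python
import unicodedata

# A few Yiddish letters: decomposed sequence -> precomposed code point
# from the Alphabetic Presentation Forms block (illustrative subset only).
PRECOMPOSED = {
    "\u05D0\u05B7": "\uFB2E",  # pasekh alef
    "\u05D0\u05B8": "\uFB2F",  # komets alef
    "\u05D1\u05BF": "\uFB4C",  # veys
    "\u05E4\u05BF": "\uFB4E",  # fey
    "\u05F2\u05B7": "\uFB1F",  # pasekh tsvey yudn
}


def precompose(text: str) -> str:
    """Replace known base+diacritic sequences with single code points."""
    for decomposed, single in PRECOMPOSED.items():
        text = text.replace(decomposed, single)
    return text


def decompose(text: str) -> str:
    """NFD splits presentation forms back into base letter + diacritic."""
    return unicodedata.normalize("NFD", text)


# "vos" spelled with a decomposed komets alef: 5 code points.
word = "\u05D5\u05D5\u05D0\u05B8\u05E1"
print(len(word), len(precompose(word)))
```

Two strings that look identical on screen can therefore differ in length and fail an equality check, which is worth keeping in mind before scoring or searching transcripts.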

Performance

In-domain (CSYE, Holocaust testimonies)

13,111 segments (8.58 hours) from 12 unseen speakers

  • WER: 37.22%
  • CER: 12.81%

Cross-domain (REYD, audiobooks)

3,632 utterances (5.32 hours) from 2 narrators

  • WER: 24.32%
  • CER: 5.88%
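
For readers unfamiliar with these metrics: WER and CER are edit distances (substitutions, deletions, and insertions) divided by the length of the reference, counted over words and over characters respectively. The sketch below is a plain-Python illustration on a made-up romanized example. It is not the scoring code behind the figures above, and details such as text normalization, whether spaces count toward CER, and how scores are aggregated over a test set are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate for one reference/hypothesis pair."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate; spaces are counted as characters here."""
    return edit_distance(ref, hyp) / len(ref)


reference = "a gutn tog aykh"
hypothesis = "a gutn tok aykh"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

One wrong character makes a whole word wrong, which is why WER is typically much higher than CER on the same output.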

Terms of Use

This model is fine-tuned on transcribed Holocaust survivor testimonies from the CSYE, sourced from the USC Shoah Foundation Visual History Archive. It may only be used for non-commercial research and educational purposes, including Holocaust testimony preservation and accessibility, consistent with the CSYE Terms of Use and the USC Shoah Foundation Terms of Use. Users must request access to the ASR model using the form above.

Demo

An interactive demo notebook is available on Google Colab.

By default, the notebook loads a sample utterance from the REYD audiobook corpus; you can also supply your own single-speaker audio file. Output is provided in four orthographies: PHON (the model's native output), ROM (YIVO transliteration), STD (standard Yiddish with loshn-koydesh words spelled phonemically), and STD with proper loshn-koydesh spellings. Results are saved as a plain text file.
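
For running the model outside the notebook, a minimal sketch with the Hugging Face `transformers` library might look like the following. It is not taken from the demo notebook. It assumes that access to the gated repository has been granted, that the repository ships processor files loadable with `AutoProcessor`, that `librosa` is installed for audio loading, and that input audio should be resampled to 16 kHz. The `transcribe` helper and its `token` parameter are illustrative names. It produces only the PHON output, not the other three orthographies.

```python
from typing import Optional

MODEL_ID = "ibleaman/w2v-bert-2.0-yiddish-northeastern"  # ID shown on this model's page
SAMPLE_RATE = 16_000  # assumption: 16 kHz input


def transcribe(audio_path: str, token: Optional[str] = None) -> str:
    """Return the model's PHON output for one single-speaker audio file."""
    # Imported lazily so this module loads without the heavy dependencies.
    import librosa
    import torch
    from transformers import AutoProcessor, Wav2Vec2BertForCTC

    # `token` is a Hugging Face access token for the gated repository.
    processor = AutoProcessor.from_pretrained(MODEL_ID, token=token)
    model = Wav2Vec2BertForCTC.from_pretrained(MODEL_ID, token=token)
    model.eval()

    # Load as mono and resample to the assumed input rate.
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: most likely token per frame, then collapse.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


# Example call (hypothetical file name):
# print(transcribe("sample.wav", token="hf_..."))
```

Long recordings would need to be split into shorter segments first; this sketch passes the whole file through the model in one go.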

Automatic transcription of multi-speaker files is beyond the scope of this demo, but feel free to message me for details. (You will probably need to diarize your audio first; see our tutorial.)

Orthographic Preprocessing

A notebook documenting the orthographic preprocessing pipeline used for ASR training in the HTRes-2026 paper is available on Google Colab.

The notebook demonstrates the three stages of preprocessing applied to the original romanized CSYE transcripts:

  • ROM: text-based filtering, which removes segments containing borrowings, filled pauses, uncertain transcriptions, or partial words
  • STD: conversion to standard Hebrew orthography with decomposed Unicode
  • PHON: conversion to the phonemic Hebrew representation with precomposed Unicode, used to train this model
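
The first stage, text-based filtering, can be pictured as a yes/no test applied to each segment. In the sketch below every marker is a hypothetical placeholder: `<en>` tags for borrowings, `(?)` for uncertain transcriptions, a trailing hyphen for partial words, and a short filled-pause list. The actual CSYE transcription conventions and the notebook's rules may differ.

```python
import re

# All conventions below are hypothetical stand-ins, not CSYE's real markup.
FILLED_PAUSES = {"eh", "em", "hm"}
UNCERTAIN = re.compile(r"\(\?\)")      # e.g. "gezogt (?)"
BORROWING = re.compile(r"<[a-z]{2}>")  # e.g. "<en>train</en>"
PARTIAL = re.compile(r"\w-(?=\s|$)")   # e.g. "gevoy-" (word cut off)


def keep_segment(text: str) -> bool:
    """Return True if the segment has none of the excluded phenomena."""
    if UNCERTAIN.search(text) or BORROWING.search(text) or PARTIAL.search(text):
        return False
    return not any(tok.lower() in FILLED_PAUSES for tok in text.split())


segments = [
    "ikh bin geboyrn in vilne",
    "ikh bin eh geboyrn",
    "mir hobn gevoy- gevoynt dortn",
    "er hot gezogt (?) azoy",
    "a <en>train</en> iz gekumen",
]
kept = [s for s in segments if keep_segment(s)]
print(kept)
```

Dropping whole segments rather than deleting the offending tokens keeps each remaining transcript aligned with its audio.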

Citation

If you use this model, please cite:

Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]

Research Support

This material is based upon work supported by the National Science Foundation under Award No. BCS-2142797. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
