You need to abide by Terms of Use to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

This model is released for non-commercial research and educational purposes only.

By requesting access, you agree to:

Abide by the CSYE Terms of Use and USC Shoah Foundation Terms of Use
Properly cite our research paper from the HTRes-2026 workshop [Link coming soon]

We strongly recommend verifying all outputs against original audio, especially when working with sensitive recordings.

Automatic Speech Recognition for Northeastern Yiddish (Phonemic Orthography)

This model is a version of Wav2Vec-BERT 2.0 fine-tuned on a subset of the Corpus of Spoken Yiddish in Europe (CSYE) for automatic speech recognition in Northeastern Yiddish (also known as Litvish or Lithuanian Yiddish). The model outputs a phonemic representation of Yiddish using a Hebrew-based orthography in precomposed Unicode. This output can be respelled in standard Yiddish by transliterating and then detransliterating the text with the yiddish package.

This is the PHON-44 model from: Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]

Description

Base model: facebook/w2v-bert-2.0
Orthography: Phonemic Hebrew-based script in precomposed Unicode (with Alphabetic Presentation Forms)
Training data: 30.83 hours from 42 Northeastern Yiddish speakers from CSYE
Training seed: 44 (lowest WER of 5 random seeds)
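
The "precomposed Unicode" in the orthography line can be illustrated with Python's standard `unicodedata` module. This is only a sketch: the three code points below are illustrative examples from the Alphabetic Presentation Forms block, not the model's documented character inventory, and the helper names are hypothetical. One practical point is that these Hebrew presentation forms are excluded from Unicode composition, so NFC normalization will not recombine them; recomposing requires an explicit mapping.

```python
import unicodedata

# Illustrative precomposed code points (assumption: examples only, not the
# model's documented vocabulary).
PRECOMPOSED = [
    "\uFB2E",  # HEBREW LETTER ALEF WITH PATAH
    "\uFB2F",  # HEBREW LETTER ALEF WITH QAMATS
    "\uFB1F",  # HEBREW LIGATURE YIDDISH YOD YOD PATAH
]


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def to_decomposed(text):
    """Split precomposed characters into base letter + combining mark (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_precomposed(text, table=PRECOMPOSED):
    """Recombine base + mark sequences using an explicit table.

    NFC cannot do this for these characters because they are
    composition exclusions, so each sequence is replaced by hand.
    """
    for pre in table:
        text = text.replace(unicodedata.normalize("NFD", pre), pre)
    return text


for ch in PRECOMPOSED:
    print(code_points(ch), "->", code_points(to_decomposed(ch)))
```

If model output is passed through a generic NFC or NFD normalization step downstream, it will end up decomposed, which matters when comparing it against precomposed reference text.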

Performance

In-domain (CSYE, Holocaust testimonies)

13,111 segments (8.58 hours) from 12 unseen speakers

WER: 37.22%
CER: 12.81%

Cross-domain (REYD, audiobooks)

3,632 utterances (5.32 hours) from 2 narrators

WER: 24.32%
CER: 5.88%
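
For readers unfamiliar with the metrics, WER and CER are conventionally computed as edit distance divided by reference length, over words and characters respectively. The sketch below implements those standard definitions; the scoring tool and text normalization behind the numbers reported above are not stated here, so this is not the paper's evaluation script. The function names are hypothetical, and the reference and hypothesis strings are invented romanized placeholders, not CSYE or REYD data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits / total reference units.

    unit="word" gives WER; unit="char" gives CER (spaces counted as
    characters here, which is an assumption about the convention).
    """
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total


# Invented placeholder strings for illustration only.
refs = ["ikh bin geven a kind"]
hyps = ["ikh bin geveyn kind"]
print(f"WER: {error_rate(refs, hyps, 'word'):.2%}")
print(f"CER: {error_rate(refs, hyps, 'char'):.2%}")
```

Because a single misspelled word counts as one full word error but only a few character errors, WER is typically much higher than CER, as in the figures above.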

Terms of Use

This model is fine-tuned on transcribed Holocaust survivor testimonies from the CSYE, sourced from the USC Shoah Foundation Visual History Archive. It may only be used for non-commercial research and educational purposes, including Holocaust testimony preservation and accessibility, consistent with the CSYE Terms of Use and the USC Shoah Foundation Terms of Use. Users must request access to the ASR model using the form above.

Demo

An interactive demo notebook is available on Google Colab.

By default, the notebook loads a sample utterance from the REYD audiobook corpus; you can also supply your own single-speaker audio file. Output is provided in four orthographies: PHON (the model's native output), ROM (YIVO transliteration), STD (standard Yiddish with loshn-koydesh words spelled phonemically), and STD with proper loshn-koydesh spellings. Results are saved as a plain text file.

Automatic transcription of multi-speaker files is beyond the scope of this demo, but feel free to message me for more details. (You will probably need to diarize your audio first; see our tutorial.)
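
Outside the notebook, the model can presumably be loaded with the Hugging Face `transformers` library, like other Wav2Vec-BERT 2.0 CTC checkpoints. The following is a minimal sketch under several assumptions, not the notebook's actual code: that the repository ships processor files loadable with `AutoProcessor`, that you have been granted access and are authenticated with Hugging Face, and that the audio is mono at 16 kHz (the rate expected by the base model's feature extractor). The `transcribe` helper is hypothetical. It returns PHON output only; the other orthographies described above require further conversion.

```python
MODEL_ID = "ibleaman/w2v-bert-2.0-yiddish-northeastern"
TARGET_SR = 16_000  # assumption: sampling rate expected by the feature extractor


def transcribe(waveform, sampling_rate=TARGET_SR, model_id=MODEL_ID):
    """Greedy CTC transcription of one single-speaker utterance.

    waveform: 1-D sequence or array of float samples (mono).
    Returns the decoded string in the model's phonemic orthography.
    """
    if sampling_rate != TARGET_SR:
        raise ValueError(f"resample audio to {TARGET_SR} Hz first")

    # Heavy dependencies are imported lazily so that the module can be
    # imported without them installed.
    import torch
    from transformers import AutoProcessor, Wav2Vec2BertForCTC

    # Gated repository: requires accepted terms and a logged-in token.
    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2BertForCTC.from_pretrained(model_id)
    model.eval()

    inputs = processor(audio=waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy decoding: most likely token per frame, then CTC collapse.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

A typical call would load an audio file with a library such as `librosa` or `torchaudio`, resample it to 16 kHz, and pass the resulting array to `transcribe`. Long recordings would likely need to be segmented first.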

Orthographic Preprocessing

A notebook documenting the orthographic preprocessing pipeline used for ASR training in the HTRes-2026 paper is available on Google Colab.

The notebook demonstrates the three stages of preprocessing applied to original CSYE romanized transcripts: text-based filtering (removing segments with borrowings, filled pauses, uncertain transcriptions, partial words; ROM); conversion to standard Hebrew orthography with decomposed Unicode (STD); and conversion to the phonemic Hebrew representation with precomposed Unicode (PHON), used to train this model.

Citation

If you use this model, please cite:

Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]

Research Support

This material is based upon work supported by the National Science Foundation under Award No. BCS-2142797. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Model size: 0.6B params
Tensor type: F32 (Safetensors)
Repository: ibleaman/w2v-bert-2.0-yiddish-northeastern (fine-tuned from facebook/w2v-bert-2.0)