How to use ibleaman/w2v-bert-2.0-yiddish-northeastern with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="ibleaman/w2v-bert-2.0-yiddish-northeastern")

    # Load model directly
    from transformers import AutoProcessor, AutoModelForCTC

    processor = AutoProcessor.from_pretrained("ibleaman/w2v-bert-2.0-yiddish-northeastern")
    model = AutoModelForCTC.from_pretrained("ibleaman/w2v-bert-2.0-yiddish-northeastern")
You must agree to the Terms of Use to access this model
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
This model is released for non-commercial research and educational purposes only.
By requesting access, you agree to:
- Abide by the CSYE Terms of Use and USC Shoah Foundation Terms of Use
- Properly cite our research paper from the HTRes-2026 workshop [Link coming soon]
We strongly recommend verifying all outputs against original audio, especially when working with sensitive recordings.
Automatic Speech Recognition for Northeastern Yiddish (Phonemic Orthography)
This model is a version of Wav2Vec-BERT 2.0 fine-tuned on a subset of the
Corpus of Spoken Yiddish in Europe (CSYE) for
automatic speech recognition in Northeastern Yiddish (also known as Litvish
or Lithuanian Yiddish). The model outputs a phonemic representation of Yiddish
using a Hebrew-based orthography in precomposed Unicode. This output can be
respelled in standard Yiddish by transliterating it into romanization and then
detransliterating the result with the yiddish package.
This is the PHON-44 model from: Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]
Description
- Base model: facebook/w2v-bert-2.0
- Orthography: Phonemic Hebrew-based script in precomposed Unicode (with Alphabetic Presentation Forms)
- Training data: 30.83 hours of speech from 42 Northeastern Yiddish speakers in CSYE
- Training seed: 44 (lowest WER among 5 random seeds)
Performance
In-domain (CSYE, Holocaust testimonies)
13,111 segments (8.58 hours) from 12 unseen speakers
- WER: 37.22%
- CER: 12.81%
Cross-domain (REYD, audiobooks)
3,632 utterances (5.32 hours) from 2 narrators
- WER: 24.32%
- CER: 5.88%
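WER and CER are edit-distance metrics: the number of word-level (or character-level) substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below is a minimal pure-Python illustration of that definition, not the evaluation script used for the paper. The function names and the whitespace tokenization are assumptions, and details such as text normalization or whether spaces count as characters may differ from the paper's setup.

```python
from typing import Callable, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    # Before processing reference item i, prev[j] is the distance between
    # the first i-1 reference items and the first j hypothesis items.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference item missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis item)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str],
               tokenize: Callable[[str], Sequence]) -> float:
    """Corpus-level rate: total edits divided by total reference length."""
    edits = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return edits / total


def wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Word error rate, assuming whitespace-separated words."""
    return error_rate(refs, hyps, str.split)


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate; in this sketch, spaces count as characters."""
    return error_rate(refs, hyps, list)


# Placeholder strings, not real transcripts: one substitution and one deletion
# against a four-word reference.
print(wer(["a b c d"], ["a x c"]))  # 0.5
```

Because the rates here are pooled over the whole test set rather than averaged per segment, long segments weigh more than short ones. That pooling is a common convention but is an assumption about how the reported figures were computed.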
Terms of Use
This model is fine-tuned on transcribed Holocaust survivor testimonies from the CSYE, sourced from the USC Shoah Foundation Visual History Archive. It may only be used for non-commercial research and educational purposes, including Holocaust testimony preservation and accessibility, consistent with the CSYE Terms of Use and the USC Shoah Foundation Terms of Use. Users must request access to the ASR model using the form above.
Demo
An interactive demo notebook is available on Google Colab.
By default, the notebook loads a sample utterance from the REYD audiobook corpus, or you can supply your own single-speaker audio file. Output is provided in four orthographies: PHON (the model's native output), ROM (YIVO transliteration), STD (standard Yiddish with loshn-koydesh words spelled phonemically), and STD with proper loshn-koydesh spellings. Results are saved as a plain text file.
Automatic transcription of multi-speaker files is beyond the scope of this demo, but feel free to message me for more details. (You will probably need to diarize your audio first; see our tutorial.)
Orthographic Preprocessing
A notebook documenting the orthographic preprocessing pipeline used for ASR training in the HTRes-2026 paper is available on Google Colab.
The notebook demonstrates the three stages of preprocessing applied to the original romanized CSYE transcripts: (1) text-based filtering, which removes segments containing borrowings, filled pauses, uncertain transcriptions, or partial words (ROM); (2) conversion to standard Yiddish orthography in Hebrew script with decomposed Unicode (STD); and (3) conversion to the phonemic Hebrew-script representation with precomposed Unicode (PHON), which was used to train this model.
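The difference between decomposed and precomposed encodings can be inspected with Python's standard unicodedata module. The sketch below is illustrative only and is not taken from the preprocessing notebook; the helper name and the choice of pasekh alef as the example character are assumptions made for demonstration. It also shows a practical caveat: the Hebrew presentation forms are excluded from Unicode canonical composition, so NFC normalization splits a precomposed character into base letter plus combining mark instead of preserving it.

```python
import unicodedata

# Example character chosen for illustration (an assumption, not a claim about
# the model's exact vocabulary).
PASEKH_ALEF = "\uFB2E"       # precomposed: HEBREW LETTER ALEF WITH PATAH
DECOMPOSED = "\u05D0\u05B7"  # decomposed: HEBREW LETTER ALEF + HEBREW POINT PATAH


def uses_combining_marks(text: str) -> bool:
    """Return True if the text contains any nonspacing combining mark (category Mn)."""
    return any(unicodedata.category(ch) == "Mn" for ch in text)


print(unicodedata.name(PASEKH_ALEF))      # HEBREW LETTER ALEF WITH PATAH
print(uses_combining_marks(PASEKH_ALEF))  # False: a single precomposed code point
print(uses_combining_marks(DECOMPOSED))   # True: base letter plus combining mark

# Both NFD and NFC yield the decomposed sequence, because the presentation
# form is a composition exclusion and is never recomposed.
print(unicodedata.normalize("NFD", PASEKH_ALEF) == DECOMPOSED)  # True
print(unicodedata.normalize("NFC", PASEKH_ALEF) == DECOMPOSED)  # True
```

In practice this means that applying NFC or NFD to precomposed text would change its code points, so string comparisons against other text should be done only after both sides have been brought to the same form.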
Citation
If you use this model, please cite:
Bleaman, Isaac L. 2026. Automatic Transcription of Holocaust Testimonies in Yiddish: Orthographic Comparison and Cross-Domain Validation. Proceedings of the Second Workshop on Holocaust Testimonies as Language Resources (HTRes-2026). [Link coming soon.]
Research Support
This material is based upon work supported by the National Science Foundation under Award No. BCS-2142797. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.