# normalisationS2S-nontokenised
A character-level sequence-to-sequence encoder-decoder transformer model for the normalisation of Classical Tibetan, converting diplomatic (non-standard, abbreviated) Tibetan manuscript text into Standard Classical Tibetan. This is the non-tokenised variant of the model — input and output are raw Tibetan Unicode character sequences without prior word segmentation.
This model is part of the PaganTibet project and accompanies the paper:
Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in Proceedings of the AI4CHIEF conference, Springer.
Please cite the paper and the training repository when using this model.
## Model Description
Classical Tibetan manuscripts present major normalisation challenges: extensive abbreviations, non-standard orthography, scribal variation, and a near-complete absence of gold-standard parallel data. This model addresses these challenges using a hybrid approach combining a neural sequence-to-sequence transformer with optional rule-based pre-/post-processing and KenLM n-gram language model ranking (the latter applied at inference time; see the Inference scripts).
The model operates at the character level on non-tokenised input, meaning it processes raw Tibetan Unicode strings (syllables and punctuation) without word segmentation. Results from Meelen & Griffiths (2026) show that, for non-tokenised text, the neural model alone performs reasonably well on standard Buddhist texts, while more challenging diplomatic corpora benefit from the addition of rule-based processing and n-gram ranking.
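The n-gram ranking step can be illustrated with a toy character-level language model standing in for KenLM at inference time; the smoothing scheme, n-gram order, and any corpus or candidate strings used with it are invented for the example:

```python
from collections import Counter
import math

def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_lm(corpus, n=3):
    """Count character n-grams and their (n-1)-gram histories."""
    grams, hists = Counter(), Counter()
    for line in corpus:
        grams.update(char_ngrams(line, n))
        hists.update(char_ngrams(line, n - 1))
    return grams, hists, n

def score(lm, text):
    """Add-one-smoothed log-probability of a candidate string."""
    grams, hists, n = lm
    vocab = len(grams) + 1
    return sum(
        math.log((grams[g] + 1) / (hists[g[:-1]] + vocab))
        for g in char_ngrams(text, n)
    )

def rank(lm, candidates):
    """Return candidate normalisations best-first, as the ranker would."""
    return sorted(candidates, key=lambda c: score(lm, c), reverse=True)
```

In the actual pipeline the candidates would be beam-search outputs of the neural model, rescored against a LM trained on Standard Classical Tibetan.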
## Architecture
- Type: Character-level encoder-decoder transformer (Seq2Seq)
- Layers: 4
- Attention heads: 8
- Optimiser: Adam (lr = 0.0005, β1 = 0.9, β2 = 0.997)
- Label smoothing: 0.1
- Framework: PyTorch
- Training hardware: RTX ADA 6000 GPU (~5–6 hours training time)
Full hyperparameter settings are reported in the Appendix of Meelen & Griffiths (2026).
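The reported configuration can be sketched in PyTorch; `d_model` and the character vocabulary size below are illustrative guesses, not values from the paper:

```python
import torch
from torch import nn

# Illustrative dimensions; only layers, heads, optimiser settings, and
# label smoothing come from the model card above.
D_MODEL, VOCAB = 256, 512

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=8,                 # attention heads
    num_encoder_layers=4,    # encoder depth
    num_decoder_layers=4,    # decoder depth
    batch_first=True,
)
embed = nn.Embedding(VOCAB, D_MODEL)   # character id -> vector
proj = nn.Linear(D_MODEL, VOCAB)       # decoder state -> character logits

optimizer = torch.optim.Adam(
    model.parameters(), lr=5e-4, betas=(0.9, 0.997)
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```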
## Training Data
The model was trained on the dataset pagantibet/normalisation-S2S-training (~2 million rows), which combines:
- Gold-standard data: 7,421 manually normalised parallel sentence pairs from the PaganTibet corpus.
- Augmented data: The gold data was substantially expanded using four data augmentation strategies, each designed to simulate the kinds of variation found in historical Tibetan manuscripts:
  - Random noise injection: Probabilistic character substitutions, diacritic variations, and orthographic inconsistencies calibrated to realistic manuscript variation frequencies (following Huang et al. 2023).
  - OCR-based noise simulation: OCR-realistic noise patterns generated using the nlpaug library.
  - Rule-based diplomatic transformations: Stochastic application of character replacements reflecting common scribal conventions in historical Tibetan manuscripts.
  - Dictionary-based augmentation: Insertion of entries from a custom Tibetan abbreviation dictionary (~10,000 abbreviation–expansion pairs) to help the model learn abbreviation resolution.
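A minimal sketch of the first strategy, random noise injection, assuming a hand-made confusion table; the table used for training is calibrated to manuscript variation frequencies and is not reproduced here:

```python
import random

# Invented, purely illustrative confusion table of visually or
# orthographically confusable Tibetan characters.
CONFUSIONS = {
    "ས": ["ལ"],   # similar letter shapes
    "ང": ["ད"],
    "ི": ["ེ"],   # vowel-sign variation
}

def inject_noise(text, prob=0.1, seed=None):
    """Replace each character with a confusable one with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < prob:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)
```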
Additional training data was derived from the Standard Classical Tibetan ACTib corpus (>180 million words; Meelen & Roux 2020), processed into manuscript-length lines using the createTiblines.py script.
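For illustration, chunking running text into manuscript-length lines along tsheg (syllable-delimiter) boundaries might look like the following; the actual createTiblines.py logic may differ:

```python
TSHEG = "\u0f0b"  # ་ the Tibetan syllable delimiter

def to_lines(text, syllables_per_line=8):
    """Split tsheg-delimited text into lines of at most N syllables."""
    syls = [s for s in text.split(TSHEG) if s]
    lines = []
    for i in range(0, len(syls), syllables_per_line):
        chunk = syls[i:i + syllables_per_line]
        lines.append(TSHEG.join(chunk) + TSHEG)
    return lines
```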
Full details of the data preparation and augmentation pipeline are described in the GitHub repository.
## Intended Use
This model is intended for:
- Normalisation of diplomatic Classical Tibetan texts into Standard Classical Tibetan, as a preprocessing step for downstream NLP tasks (e.g. tokenisation, tagging, translation).
- Digital humanities workflows for processing historical Tibetan manuscripts, particularly texts with heavy abbreviation or non-standard orthography.
- Research on low-resource historical text normalisation.
Note: Based on the results in Meelen & Griffiths (2026), normalisation should be applied before tokenisation in the processing pipeline. For challenging diplomatic texts, combining this model with the KenLM n-gram ranker and rule-based pre-/post-processing (see Inference) yields the best results.
## How to Use
The model can be used with the inference scripts provided in the PaganTibet normalisation repository. Six inference modes are available, ranging from rule-based only to combined neural + n-gram + rule-based pipelines:
```bash
# Run on a GPU cluster via Slurm
sbatch tibetan-inference-flexible.sh

# Or run directly
python3 tibetan-inference-flexible.py
```
See the Inference ReadMe for full usage details and configuration options.
## Evaluation
The training script includes a built-in beam-search evaluation. Standalone evaluation is available via the evaluation script, which reports:
- CER (Character Error Rate)
- Precision, Recall, F1
- Correction Precision (CP) and Correction Recall (CR) (following Huang et al. 2023) for a more accurate picture of normalisation effectiveness
- Bootstrapped Confidence Intervals (1,000 iterations) for small test sets
```bash
sbatch evaluate-model.sh
# or
python3 evaluate_model.py
```
Full evaluation results including confidence intervals and example predictions are available in the non-tokenised Evaluations directory of the repository.
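As an illustration, CER and a percentile bootstrap confidence interval can be computed as follows; this is a minimal sketch, and the repository's evaluate_model.py additionally reports Precision/Recall/F1 and CP/CR:

```python
import random

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character Error Rate: edit distance over reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)

def bootstrap_ci(pairs, iters=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean CER over a small test set."""
    rng = random.Random(seed)
    means = sorted(
        sum(cer(h, r) for h, r in (rng.choice(pairs) for _ in range(len(pairs))))
        / len(pairs)
        for _ in range(iters)
    )
    lo = means[int(iters * alpha / 2)]
    hi = means[int(iters * (1 - alpha / 2)) - 1]
    return lo, hi
```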
## Related Models and Resources
All models and datasets from the PaganTibet normalisation project are collected in the Normalisation collection on Hugging Face.
| Resource | Link |
|---|---|
| Training dataset | pagantibet/normalisation-S2S-training |
| Abbreviation dictionary | pagantibet/Tibetan-abbreviation-dictionary |
| Training & inference code | github.com/pagantibet/normalisation |
| ACTib corpus | Zenodo (Meelen & Roux 2020) |
| PaganTibet project | pagantibet.com |
## Citation
If you use this model, please cite the accompanying paper and the repository:
```bibtex
@inproceedings{meelen-griffiths-2026-tibetan-normalisation,
  author    = {Meelen, Marieke and Griffiths, R.M.},
  title     = {Historical Tibetan Normalisation: rule-based vs neural \& n-gram LM methods for extremely low-resource languages},
  booktitle = {Proceedings of the AI4CHIEF conference},
  publisher = {Springer},
  year      = {2026}
}
```
## License
This model is released under CC BY-NC-SA 4.0. It may be used freely for non-commercial research and educational purposes, with attribution and under the same licence terms.
## Funding
This work was partially funded by the European Union (ERC, Pagan Tibet, grant no. 101097364). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency.