---
title: README
emoji: 🔥
colorFrom: red
colorTo: blue
sdk: static
pinned: false
---

# CoMMA: Corpus of Multilingual Medieval Archives

**CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

Original paper:

```bib
@unpublished{clerice:hal-05299220,
  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
  URL = {https://inria.hal.science/hal-05299220},
  NOTE = {working paper or preprint},
  YEAR = {2025},
  MONTH = Oct,
  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
  HAL_ID = {hal-05299220},
  HAL_VERSION = {v1},
}
```

### 🏛️ What’s Inside

* **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
* **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.

Demos:

- Browse the corpus: https://comma.inria.fr
- Try our normalization model: https://huggingface.co/spaces/comma-project/pre-editorial-normalization

**License:** CC-BY 4.0