| title: README | |
| emoji: 🔥 | |
| colorFrom: red | |
| colorTo: blue | |
| sdk: static | |
| pinned: false | |
| # CoMMA: Corpus of Multilingual Medieval Archives | |
| **CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts. | |
| Original paper: | |
| ```bib | |
| @unpublished{clerice:hal-05299220, | |
| TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}}, | |
| AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t}, | |
| URL = {https://inria.hal.science/hal-05299220}, | |
| NOTE = {working paper or preprint}, | |
| YEAR = {2025}, | |
| MONTH = Oct, | |
| KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus}, | |
| PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf}, | |
| HAL_ID = {hal-05299220}, | |
| HAL_VERSION = {v1}, | |
| } | |
| ``` | |
| ### 🏛️ What’s Inside | |
| * **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin. | |
| * **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax. | |
| Demos: | |
| - Browse the corpus: https://comma.inria.fr | |
| - Try our normalization model: https://huggingface.co/spaces/comma-project/pre-editorial-normalization | |
| **License:** CC-BY 4.0 | |