---
title: README
emoji: 🔥
colorFrom: red
colorTo: blue
sdk: static
pinned: false
---
# CoMMA: Corpus of Multilingual Medieval Archives
**CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.
Original paper:
```bib
@unpublished{clerice:hal-05299220,
TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
URL = {https://inria.hal.science/hal-05299220},
NOTE = {working paper or preprint},
YEAR = {2025},
MONTH = Oct,
KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
HAL_ID = {hal-05299220},
HAL_VERSION = {v1},
}
```
### 🏛️ What’s Inside
* **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
* **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.
Demos:
- Browse the corpus: https://comma.inria.fr
- Try our normalization model: https://huggingface.co/spaces/comma-project/pre-editorial-normalization
**License:** CC-BY 4.0