---
title: README
emoji: 🔥
colorFrom: red
colorTo: blue
sdk: static
pinned: false
---

# CoMMA: Corpus of Multilingual Medieval Archives

**CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

Original paper:

```bib
@unpublished{clerice:hal-05299220,
  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
  URL = {https://inria.hal.science/hal-05299220},
  NOTE = {working paper or preprint},
  YEAR = {2025},
  MONTH = Oct,
  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
  HAL_ID = {hal-05299220},
  HAL_VERSION = {v1},
}
```

### 🏛️ What’s Inside
* **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
* **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.

Demos:
- Browse the corpus at https://comma.inria.fr
- Try our normalization model at https://huggingface.co/spaces/comma-project/pre-editorial-normalization

**License:** CC-BY 4.0