title: README
emoji: 🔥
colorFrom: red
colorTo: blue
sdk: static
pinned: false

CoMMA: Corpus of Multilingual Medieval Archives

CoMMA is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

Original paper:

@unpublished{clerice:hal-05299220,
  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
  URL = {https://inria.hal.science/hal-05299220},
  NOTE = {working paper or preprint},
  YEAR = {2025},
  MONTH = Oct,
  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
  HAL_ID = {hal-05299220},
  HAL_VERSION = {v1},
}

🏛️ What’s Inside

  • Multilingual Corpora: Annotated texts in Old French and Medieval Latin.
  • Specialized Models: Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.

See a demo:

License: CC-BY 4.0