
title: README
emoji: 🔥
colorFrom: red
colorTo: blue
sdk: static
pinned: false

CoMMA: Corpus of Multilingual Medieval Archives

CoMMA is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

Original paper:

@unpublished{clerice:hal-05299220,
  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
  URL = {https://inria.hal.science/hal-05299220},
  NOTE = {working paper or preprint},
  YEAR = {2025},
  MONTH = Oct,
  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
  HAL_ID = {hal-05299220},
  HAL_VERSION = {v1},
}

🏛️ What’s Inside

Multilingual Corpora: Annotated texts in Old French and Medieval Latin.
Specialized Models: Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.
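
To see which corpora and models are currently published, one option is to query the Hugging Face Hub. The sketch below is only an illustration: it assumes the datasets and models live under the same `comma-project` namespace as this Space, and `summarize_org` is a hypothetical helper name, not part of any CoMMA tooling. It relies on `HfApi.list_datasets` and `HfApi.list_models` from the `huggingface_hub` package.

```python
def summarize_org(api, org="comma-project"):
    """Return the dataset and model repo ids published under `org`.

    `api` is expected to behave like `huggingface_hub.HfApi`.
    The default `org` is an assumption taken from this Space's namespace.
    """
    datasets = sorted(d.id for d in api.list_datasets(author=org))
    models = sorted(m.id for m in api.list_models(author=org))
    return {"datasets": datasets, "models": models}


# Usage (requires `pip install huggingface_hub` and network access):
#   from huggingface_hub import HfApi
#   for kind, repo_ids in summarize_org(HfApi()).items():
#       print(kind, repo_ids)
```

Passing the API client in as an argument keeps the helper easy to try against a stub before hitting the network.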

Demos:

Browse the corpus: https://comma.inria.fr
Try our normalization model: https://huggingface.co/spaces/comma-project/pre-editorial-normalization

License: CC-BY 4.0