	---
	title: README
	emoji: 🔥
	colorFrom: red
	colorTo: blue
	sdk: static
	pinned: false
	---

	# CoMMA: Corpus of Multilingual Medieval Archives

	CoMMA is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.

	Original paper:

	```bib
	@unpublished{clerice:hal-05299220,
	TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
	AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
	URL = {https://inria.hal.science/hal-05299220},
	NOTE = {working paper or preprint},
	YEAR = {2025},
	MONTH = Oct,
	KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
	PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
	HAL_ID = {hal-05299220},
	HAL_VERSION = {v1},
	}
	```

	### 🏛️ What’s Inside
	* Multilingual Corpora: Annotated texts in Old French and Medieval Latin.
	* Specialized Models: Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.

	Demos:
	- Browse the corpus: https://comma.inria.fr
	- Try our normalization model: https://huggingface.co/spaces/comma-project/pre-editorial-normalization

	License: CC-BY 4.0