---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí va ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---

# DistilRoBERTa-base-ca
|
|
## Model description
|
|
This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).
|
|
It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the Knowledge Distillation implementation
from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
|
|
The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads.
This adds up to a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model.
As a result, the model is lighter and faster than the original, at the cost of slightly lower performance.
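As a quick sketch of how the model can be used for masked-word prediction with the `transformers` library (the repository ID below is an assumption based on this card's title; replace it if the actual Hugging Face Hub ID differs):

```python
from transformers import pipeline


def predict_masked(text: str):
    """Fill in the <mask> token in a Catalan sentence.

    NOTE: the model ID is assumed from this card's title;
    adjust it if the Hub repository is named differently.
    """
    unmasker = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca-v2")
    return unmasker(text)


if __name__ == "__main__":
    # Each prediction carries the candidate token and its score.
    for p in predict_masked("Salvador Dalí va viure a <mask>."):
        print(p["token_str"], round(p["score"], 3))
```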
|
|
## Training
|
|
### Training procedure
|
|
This model was trained using Knowledge Distillation,
a technique for shrinking networks to a reasonable size while minimizing the loss in performance.
|
|
It consists of distilling a large language model (the teacher) into a more
lightweight, energy-efficient, and production-friendly model (the student).
|
|
In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
As a result, the student achieves lower inference time and can run on commodity hardware.
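The core of this setup is a soft-target loss that pushes the student's output distribution toward the teacher's. A minimal, dependency-free sketch in plain Python (the logits below are hypothetical; the actual training objective also combines this term with the masked-LM and cosine-embedding losses described in the DistilBERT paper):

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    output distributions, scaled by T^2 as in the DistilBERT paper."""
    p = softmax(teacher_logits, temperature)  # teacher "soft labels"
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over a 3-word vocabulary:
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(student, teacher)  # > 0; zero only if they match
```

Minimizing this loss over the training corpus is what lets the 6-layer student approximate the 12-layer teacher's predictions.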
|
|
### Training data
|
|
The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
|
|
| | Corpus | Size (GB) | |
| |--------------------------|-----------:| |
| | Catalan Crawling | 13.00 | |
| | RacoCatalá | 8.10 | |
| | Catalan Oscar | 4.00 | |
| | CaWaC | 3.60 | |
| | Cat. General Crawling | 2.50 | |
| | Wikipedia | 1.10 | |
| | DOGC | 0.78 | |
| | Padicat | 0.63 | |
| | ACN | 0.42 | |
| | Nació Digital | 0.42 | |
| | Cat. Government Crawling | 0.24 | |
| | Vilaweb | 0.06 | |
| | Catalan Open Subtitles | 0.02 | |
| | Tweets | 0.02 | |
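Summed, the corpora above amount to roughly 34.9 GB of raw text. A quick check of the table's figures:

```python
# Corpus sizes (GB) copied from the table above.
sizes = {
    "Catalan Crawling": 13.00, "RacoCatalá": 8.10, "Catalan Oscar": 4.00,
    "CaWaC": 3.60, "Cat. General Crawling": 2.50, "Wikipedia": 1.10,
    "DOGC": 0.78, "Padicat": 0.63, "ACN": 0.42, "Nació Digital": 0.42,
    "Cat. Government Crawling": 0.24, "Vilaweb": 0.06,
    "Catalan Open Subtitles": 0.02, "Tweets": 0.02,
}
total_gb = round(sum(sizes.values()), 2)  # 34.89 GB in total
```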
|
|
## Evaluation
|
|
### Evaluation benchmark
|
|
This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:
|
|
| Dataset   | Task | Total   | Train  | Dev   | Test  |
| |:----------|:----|:--------|:-------|:------|:------| |
| | AnCora | NER | 13,581 | 10,628 | 1,427 | 1,526 | |
| | AnCora | POS | 16,678 | 13,123 | 1,709 | 1,846 | |
| | STS-ca | STS | 3,073 | 2,073 | 500 | 500 | |
| | TeCla | TC | 137,775 | 110,203| 13,786| 13,786| |
| | TE-ca | RTE | 21,163 | 16,930 | 2,116 | 2,117 | |
| | CatalanQA | QA | 21,427 | 17,135 | 2,157 | 2,135 | |
| | XQuAD-ca | QA | - | - | - | 1,189 | |
|
|
### Evaluation results
|
|
This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
|
|
| Model \ Task            | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
| | ------------------------|:-------|:-------|:-------------|:-----------|:----------|:----------------|:------------------------------| |
| | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 | |
| | DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 | |
|
|
<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca (which has no train set).