---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí va ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---

# DistilRoBERTa-base-ca
|
|
## Model description
|
|
This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).
|
|
It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the Knowledge Distillation implementation
from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
|
|
The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads.
This adds up to a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model.
As a result, the model is lighter and faster than the original, at the cost of slightly lower performance.
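As a quick sketch of how the model can be used for masked-word prediction with the `transformers` library (the repository ID below is an assumption based on this card's title; replace it if the actual Hugging Face Hub ID differs):

```python
from transformers import pipeline


def predict_masked(text: str):
    """Fill in the <mask> token in a Catalan sentence.

    NOTE: the model ID is assumed from this card's title;
    adjust it if the Hub repository is named differently.
    """
    unmasker = pipeline("fill-mask", model="projecte-aina/distilroberta-base-ca-v2")
    return unmasker(text)


if __name__ == "__main__":
    # Each prediction carries the candidate token and its score.
    for p in predict_masked("Salvador Dalí va viure a <mask>."):
        print(p["token_str"], round(p["score"], 3))
```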
|
|
## Training
|
|
### Training procedure
|
|
This model was trained using Knowledge Distillation,
a technique for shrinking networks to a reasonable size while minimizing the loss in performance.
|
|
It consists of distilling a large language model (the teacher) into a more
lightweight, energy-efficient, and production-friendly model (the student).
|
|
In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of a larger teacher model.
As a result, the student achieves lower inference time and can run on commodity hardware.
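The core of this setup is a soft-target loss that pushes the student's output distribution toward the teacher's. A minimal, dependency-free sketch in plain Python (the logits below are hypothetical; the actual training objective also combines this term with the masked-LM and cosine-embedding losses described in the DistilBERT paper):

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    output distributions, scaled by T^2 as in the DistilBERT paper."""
    p = softmax(teacher_logits, temperature)  # teacher "soft labels"
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over a 3-word vocabulary:
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(student, teacher)  # > 0; zero only if they match
```

Minimizing this loss over the training corpus is what lets the 6-layer student approximate the 12-layer teacher's predictions.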
|
|
### Training data
|
|
The training corpus consists of several corpora gathered from web crawling and public corpora, as shown in the table below:
|
|
| | Corpus | Size (GB) | |
| |--------------------------|-----------:| |
| | Catalan Crawling | 13.00 | |
| | RacoCatalá | 8.10 | |
| | Catalan Oscar | 4.00 | |
| | CaWaC | 3.60 | |
| | Cat. General Crawling | 2.50 | |
| | Wikipedia | 1.10 | |
| | DOGC | 0.78 | |
| | Padicat | 0.63 | |
| | ACN | 0.42 | |
| | Nació Digital | 0.42 | |
| | Cat. Government Crawling | 0.24 | |
| | Vilaweb | 0.06 | |
| | Catalan Open Subtitles | 0.02 | |
| | Tweets | 0.02 | |
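Summed, the corpora above amount to roughly 34.9 GB of raw text. A quick check of the table's figures:

```python
# Corpus sizes (GB) copied from the table above.
sizes = {
    "Catalan Crawling": 13.00, "RacoCatalá": 8.10, "Catalan Oscar": 4.00,
    "CaWaC": 3.60, "Cat. General Crawling": 2.50, "Wikipedia": 1.10,
    "DOGC": 0.78, "Padicat": 0.63, "ACN": 0.42, "Nació Digital": 0.42,
    "Cat. Government Crawling": 0.24, "Vilaweb": 0.06,
    "Catalan Open Subtitles": 0.02, "Tweets": 0.02,
}
total_gb = round(sum(sizes.values()), 2)  # 34.89 GB in total
```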
|
|
## Evaluation
|
|
### Evaluation benchmark
|
|
This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:
|
|
| Dataset   | Task | Total   | Train  | Dev   | Test  |
| |:----------|:----|:--------|:-------|:------|:------| |
| | AnCora | NER | 13,581 | 10,628 | 1,427 | 1,526 | |
| | AnCora | POS | 16,678 | 13,123 | 1,709 | 1,846 | |
| | STS-ca | STS | 3,073 | 2,073 | 500 | 500 | |
| | TeCla | TC | 137,775 | 110,203| 13,786| 13,786| |
| | TE-ca | RTE | 21,163 | 16,930 | 2,116 | 2,117 | |
| | CatalanQA | QA | 21,427 | 17,135 | 2,157 | 2,135 | |
| | XQuAD-ca | QA | - | - | - | 1,189 | |
|
|
### Evaluation results
|
|
This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
|
|
| Model \ Task            | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
| | ------------------------|:-------|:-------|:-------------|:-----------|:----------|:----------------|:------------------------------| |
| | RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 | |
| | DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 | |
|
|
<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca (which has no train set).