| --- |
| language: |
| - is |
| - da |
| - sv |
| - 'no' |
| - fo |
| widget: |
| - text: Fina lilla<mask>, jag vill inte bliva stur. |
| - text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn. |
| - text: Það er vorhret á<mask>, napur vindur sem hvín. |
| - text: Ja, Gud signi<mask>, mítt land. |
| - text: Alle dyrene i<mask> må være venner. |
| tags: |
| - roberta |
| - icelandic |
| - norwegian |
| - faroese |
| - danish |
| - swedish |
| - masked-lm |
| - pytorch |
| license: agpl-3.0 |
| datasets: |
| - vesteinn/FC3 |
| - vesteinn/IC3 |
| - mideind/icelandic-common-crawl-corpus-IC3 |
| - NbAiLab/NCC |
| - DDSC/partial-danish-gigaword-no-twitter |
| --- |
| |
| # ScandiBERT |
|
|
| Note: The model was updated on 2022-09-27. |
|
|
| The model was trained on the data shown in the table below. The batch size was 8.8k, and training ran for 72 epochs on 24 V100 cards, taking about 2 weeks. |
|
|
| | Language  | Data                                    | Size   | |
| |-----------|-----------------------------------------|--------| |
| | Icelandic | See IceBERT paper                       | 16 GB  | |
| | Danish    | Danish Gigaword Corpus (incl. Twitter)  | 4.7 GB | |
| | Norwegian | NCC corpus                              | 42 GB  | |
| | Swedish   | Swedish Gigaword Corpus                 | 3.4 GB | |
| | Faroese   | FC3 + Sosialurin + Bible                | 69 MB  | |
|
|
|
|
| Note: A half-trained model was uploaded here at an earlier date. It has since been removed and replaced by the updated model. |
|
|
| This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. At the time of writing, it is the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/ |
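
Since the model is tagged `masked-lm`, one way to try it is through the `fill-mask` pipeline of the `transformers` library. The snippet below is a minimal sketch, not an official usage recipe: the hub identifier `vesteinn/ScandiBERT` is an assumption (this card does not state it), the helper name `top_predictions` is hypothetical, and the sentences are copied from the widget examples in the front matter above.

```python
# Assumption: the model is published under this hub id; the card does not state it.
MODEL_ID = "vesteinn/ScandiBERT"

# Widget sentences from this card's front matter, one per language.
WIDGET_EXAMPLES = [
    "Fina lilla<mask>, jag vill inte bliva stur.",       # Swedish
    "Það er vorhret á<mask>, napur vindur sem hvín.",    # Icelandic
    "Ja, Gud signi<mask>, mítt land.",                   # Faroese
    "Alle dyrene i<mask> må være venner.",               # Norwegian
]


def top_predictions(text, model_id=MODEL_ID, k=5):
    """Return the k most likely fillers for the single <mask> in `text`.

    Hypothetical helper: it downloads the model on first use and
    returns a list of (token_string, score) pairs.
    """
    # Imported lazily so the module loads even without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=k)]
```

Calling `top_predictions(WIDGET_EXAMPLES[0])` should return candidate tokens with their scores. Each call rebuilds the pipeline, so for more than a handful of sentences it is probably better to create the pipeline once and reuse it.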
|
|
| If you find this model useful, please cite: |
|
|
| ``` |
| @inproceedings{snaebjarnarson-etal-2023-transfer, |
| title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese", |
| author = "Snæbjarnarson, Vésteinn and |
| Simonsen, Annika and |
| Glavaš, Goran and |
| Vulić, Ivan", |
| booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)", |
| month = "may 22--24", |
| year = "2023", |
| address = "Tórshavn, Faroe Islands", |
| publisher = {Link{\"o}ping University Electronic Press, Sweden}, |
| } |
| ``` |