---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-base
  results:
  - task:
      type: word-similarity
    dataset:
      name: SimLex-999
      type: simlex999
    metrics:
    - type: spearman
      value: 0.162
---

# OGBert-2M-Base

A tiny (2.1M parameter) ModernBERT-based masked language model for glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) - Sentence embedding version with mean pooling + L2 normalization

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |

## Training

- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: Standard 15% token masking

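To make the 15% masking rate concrete, here is a minimal sketch of BERT-style masking in plain Python. Only the 15% rate comes from this card. The 80/10/10 split (mask token, random token, unchanged), the function name `mask_tokens`, the `mask_id` value, and the example token IDs are assumptions for illustration, and the sketch does not exclude special tokens the way a real training pipeline normally would.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM training example.

    labels holds -100 (the conventional "ignore" index for the loss)
    everywhere except at positions selected for prediction.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # assumed 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # assumed 10%: random token
        # assumed remaining 10%: keep the original token
    return inputs, labels


# Hypothetical token IDs and mask ID; 8192 matches the vocab size of this model.
ids = list(range(100, 120))
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=8192, seed=0)
print(inputs)
print(labels)
```

In the `transformers` library, `DataCollatorForLanguageModeling` applies this kind of masking through its `mlm_probability` argument; whether it was used to train this model is not stated here.
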
## Performance

### Word Similarity (SimLex-999)

**SimLex-999** contains 999 word pairs with human similarity ratings. The score reported here is the Spearman correlation (ρ) between the model's cosine similarities and those ratings; higher values mean closer agreement with human judgments of word similarity.

| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-2M-Base** | **2.1M** | **0.162** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |

OGBert-2M-Base scores about **2.3x higher** than BERT-base on SimLex-999 (0.162 vs. 0.070) with **52x fewer parameters** (2.1M vs. 110M).

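The SimLex-999 score combines two steps: a cosine similarity per word pair, then a Spearman rank correlation against the human ratings. The sketch below shows both in plain Python. The function names and the toy vectors and ratings are made up for illustration (they are not SimLex-999 data), and how word vectors were extracted from each model is not stated in this card.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy word-pair vectors and human ratings (illustrative values only).
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
human = [9.0, 5.5, 1.2]
model_sims = [cosine(u, v) for u, v in pairs]
print(spearman(model_sims, human))
```
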
## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')
result = fill_mask('The financial <|mask|> was approved.')
```

**Output:**

| Rank | Token | Score |
|------|-------|-------|
| 1 | report | 0.031 |
| 2 | transaction | 0.025 |
| 3 | system | 0.021 |
| 4 | audit | 0.019 |
| 5 | account | 0.017 |

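The fill-mask pipeline returns a list of dictionaries, each with `score`, `token`, `token_str`, and `sequence` keys. As a small sketch, the hypothetical helper `format_predictions` below turns such a list into a ranked Markdown table. The sample input reuses the tokens and scores from the table above and omits the `token` and `sequence` keys, so it runs without downloading the model.

```python
def format_predictions(results, top_k=5):
    """Render fill-mask predictions as a ranked Markdown table."""
    lines = ["| Rank | Token | Score |", "|------|-------|-------|"]
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)[:top_k]
    for rank, r in enumerate(ranked, start=1):
        # token_str can carry surrounding whitespace depending on the tokenizer
        lines.append(f"| {rank} | {r['token_str'].strip()} | {r['score']:.3f} |")
    return "\n".join(lines)


# Sample shaped like pipeline output, using the values reported above.
sample = [
    {"token_str": "report", "score": 0.031},
    {"token_str": "transaction", "score": 0.025},
    {"token_str": "system", "score": 0.021},
    {"token_str": "audit", "score": 0.019},
    {"token_str": "account", "score": 0.017},
]
print(format_predictions(sample))
```

With the real pipeline, `format_predictions(result)` would take the `result` list from the snippet above directly.
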
### Direct Model Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# outputs.logits has shape (batch, sequence_length, vocab_size)
```

### For Sentence Embeddings

Use [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) instead. It adds mean pooling and L2 normalization, which makes it better suited to similarity search.

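For readers unfamiliar with the two post-processing steps named above, here is a minimal plain-Python sketch of mean pooling followed by L2 normalization. It is not the code of ogbert-2m-sentence; the function names, the `eps` guard, and the toy vectors are assumptions for illustration.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy example: three token vectors, the last one is padding (mask = 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = l2_normalize(mean_pool(tokens, mask))
print(sentence_vec)
```

Once vectors have unit length, their dot product equals their cosine similarity, which is why this normalization is convenient for similarity search.
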
## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0