---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-base
  results:
  - task:
      type: word-similarity
    dataset:
      name: SimLex-999
      type: simlex999
    metrics:
    - type: spearman
      value: 0.162
---

# OGBert-2M-Base

A tiny (2.1M parameter) ModernBERT-based masked language model for glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) - Sentence embedding version with mean pooling + L2 normalization

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
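
As a quick sanity check, the configuration values and parameter count in the table above can be read directly from the released checkpoint (a small sketch, not part of the original tooling for this card):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-base')
print(config.hidden_size, config.num_hidden_layers,
      config.num_attention_heads, config.vocab_size)

model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~2.1M expected
```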

## Training

- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: Standard 15% token masking (see the sketch below)
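
The masking setup matches the standard `transformers` data collator. A minimal sketch, for illustration only; the exact training script is not included in this card:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')

# Standard MLM collator: 15% of tokens are selected for prediction
# (with the usual 80% mask / 10% random / 10% unchanged split).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = collator([tokenizer('A lexeme is a unit of lexical meaning.')])
print(batch['input_ids'])  # some tokens replaced by the mask token id
print(batch['labels'])     # -100 at unmasked positions, original ids at masked ones
```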

## Performance

### Word Similarity (SimLex-999)

**SimLex-999** measures the Spearman correlation between a model's cosine similarities and human similarity ratings on 999 word pairs. Higher values indicate closer alignment with human judgments of word similarity.

| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-2M-Base** | **2.1M** | **0.162** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |

OGBert-2M-Base reaches **2.3x** the SimLex-999 correlation of BERT-base with roughly **52x fewer parameters**.
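
For reference, this kind of SimLex-999 evaluation can be reproduced roughly as follows. The embedding extraction below (mean-pooling the last hidden states over each word's subword tokens) is an assumption for illustration and may differ from the exact procedure behind the numbers above:

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

def embed(word: str) -> torch.Tensor:
    # Mean-pool the last hidden state over all tokens of the isolated word.
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden.mean(dim=0)

def simlex_spearman(pairs):
    # pairs: iterable of (word1, word2, human_score) read from the SimLex-999 file
    model_scores = [
        torch.cosine_similarity(embed(w1), embed(w2), dim=0).item()
        for w1, w2, _ in pairs
    ]
    human_scores = [score for _, _, score in pairs]
    return spearmanr(model_scores, human_scores).correlation
```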

## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')
result = fill_mask('The financial <|mask|> was approved.')
```

**Output:**
| Rank | Token | Score |
|------|-------|-------|
| 1 | report | 0.031 |
| 2 | transaction | 0.025 |
| 3 | system | 0.021 |
| 4 | audit | 0.019 |
| 5 | account | 0.017 |
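
Continuing the snippet above, each entry in `result` is a dict with `token_str`, `score`, `token`, and `sequence`, so the table can be printed directly:

```python
for rank, prediction in enumerate(result, start=1):
    print(rank, prediction['token_str'], round(prediction['score'], 3))
```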

### Direct Model Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
outputs = model(**inputs)
```
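
Continuing from the snippet above, one way to turn the raw logits into token predictions at the mask position (a sketch, assuming the tokenizer's configured mask token is used in the input as shown):

```python
# Find the mask position and decode the top candidates there.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_logits = outputs.logits[0, mask_positions[0]]
top_ids = mask_logits.topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```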

### For Sentence Embeddings

Use [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) instead, which includes mean pooling and L2 normalization for optimal similarity search.
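
If only the base model is available, mean pooling with L2 normalization can be approximated as in the sketch below. The `sentence_embedding` helper is purely illustrative; the dedicated sentence model may differ in the details, so prefer it for real similarity search:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state            # (1, seq_len, 128)
    mask = inputs['attention_mask'].unsqueeze(-1)              # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # mean over real tokens
    return torch.nn.functional.normalize(pooled, p=2, dim=1)   # L2 normalization

a = sentence_embedding('A glossary defines specialized terms.')
b = sentence_embedding('A dictionary explains the meaning of words.')
print(torch.matmul(a, b.T).item())  # cosine similarity of the normalized embeddings
```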

## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0