---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - transformers
  - modernbert
  - fill-mask
  - masked-language-model
pipeline_tag: fill-mask
datasets:
  - mjbommar/ogbert-v1-mlm
model-index:
  - name: ogbert-2m-base
    results:
      - task:
          type: word-similarity
        dataset:
          name: SimLex-999
          type: simlex999
        metrics:
          - type: spearman
            value: 0.162
---

# OGBert-2M-Base

A tiny (2.1M parameter) ModernBERT-based masked language model for glossary and domain-specific text.

Related models:

- `mjbommar/ogbert-2m-sentence`: sentence-embedding variant with mean pooling and L2 normalization (see "For Sentence Embeddings" below)

## Model Details

| Property | Value |
|---|---|
| Architecture | ModernBERT |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence length | 1,024 tokens |
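
These values can be checked against the released checkpoint by inspecting its configuration. A minimal sketch; the attribute names below are the standard Hugging Face config fields, not anything specific to this card:

```python
from transformers import AutoConfig

# Load the checkpoint's configuration and print the headline hyperparameters.
config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-base')
print(config.hidden_size, config.num_hidden_layers,
      config.num_attention_heads, config.vocab_size)
# Expected per the table above: 128 4 4 8192
```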

## Training

- Task: Masked Language Modeling (MLM)
- Dataset: `mjbommar/ogbert-v1-mlm`, derived from OpenGloss, a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- Masking: standard 15% token masking (see the sketch below)
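
For reference, 15% token masking is typically configured through the standard Hugging Face data collator. The snippet below is a sketch of that setup, not the actual training script used for this model:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Sketch only: standard 15% MLM masking via the Hugging Face collator.
tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,               # masked language modeling objective
    mlm_probability=0.15,   # mask 15% of input tokens
)
```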

## Performance

### Word Similarity (SimLex-999)

SimLex-999 measures the Spearman correlation between model cosine similarities and human judgments on 999 word pairs. Higher values indicate better alignment with human perception of word similarity.

| Model | Params | SimLex-999 (ρ) |
|---|---|---|
| OGBert-2M-Base | 2.1M | 0.162 |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |

OGBert-2M-Base achieves a 2.3x higher SimLex-999 correlation than BERT-base with roughly 52x fewer parameters.
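
The SimLex-999 protocol is simple to reproduce in outline: embed each word, take the cosine similarity per pair, and correlate with the human ratings. The sketch below uses mean-pooled token embeddings and placeholder pairs; the exact embedding and evaluation choices behind the scores above are not specified in this card.

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

def embed(word: str) -> torch.Tensor:
    # Mean-pool the token embeddings for a single word (an assumption,
    # not the card's stated protocol).
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

# Placeholder pairs with made-up human ratings; the real benchmark has 999 pairs.
pairs = [('coast', 'shore', 8.0), ('car', 'bicycle', 4.0), ('old', 'new', 1.0)]
model_sims = [torch.cosine_similarity(embed(a), embed(b), dim=0).item()
              for a, b, _ in pairs]
human_scores = [score for _, _, score in pairs]
rho, _ = spearmanr(model_sims, human_scores)
print(f'Spearman rho: {rho:.3f}')
```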

## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')
result = fill_mask('The financial <|mask|> was approved.')
```

Output:

| Rank | Token | Score |
|---|---|---|
| 1 | report | 0.031 |
| 2 | transaction | 0.025 |
| 3 | system | 0.021 |
| 4 | audit | 0.019 |
| 5 | account | 0.017 |
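
Each entry in `result` is a dict containing (among other keys) `token_str` and `score`, which is where the table above comes from. For example:

```python
# Print rank, predicted token, and probability for each candidate fill.
for rank, prediction in enumerate(result, start=1):
    print(rank, prediction['token_str'].strip(), round(prediction['score'], 3))
```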

### Direct Model Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
outputs = model(**inputs)
```
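
To turn those logits into fill predictions, read out the top-k tokens at the masked position. A minimal sketch, assuming the tokenizer registers `<|mask|>` as its mask token so that `mask_token_id` is set:

```python
import torch

# Locate the masked position(s) and rank candidate tokens by probability.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)
mask_logits = outputs.logits[mask_positions]              # [num_masks, vocab]
top = torch.topk(mask_logits.softmax(dim=-1), k=5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(tokenizer.decode(int(token_id)), float(score))
```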

### For Sentence Embeddings

Use `mjbommar/ogbert-2m-sentence` instead, which adds mean pooling and L2 normalization for similarity search.
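
For illustration, mean pooling with L2 normalization over this base model looks roughly like the sketch below; for actual similarity search, prefer the dedicated sentence model above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

sentences = ['The glossary definition is clear.', 'A dictionary entry explains a term.']
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # [batch, seq, hidden]

# Mean-pool over non-padding tokens, then L2-normalize each sentence vector.
mask = inputs['attention_mask'].unsqueeze(-1).float()     # [batch, seq, 1]
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
embeddings = F.normalize(mean_pooled, p=2, dim=1)

print(torch.cosine_similarity(embeddings[0], embeddings[1], dim=0))
```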

## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0