How to use mamei16/chonky_distilbert-base-multilingual-cased with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="mamei16/chonky_distilbert-base-multilingual-cased")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("mamei16/chonky_distilbert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained("mamei16/chonky_distilbert-base-multilingual-cased")
```
Model Details
Model Description
A fine-tune of distilbert/distilbert-base-multilingual-cased, trained on nearly 11 billion tokens from more than 34 million Wikipedia articles to predict paragraph breaks. The model can be used to split arbitrary natural language text into semantic chunks.
Model Sources
- Repository: https://github.com/mamei16/chonky
- Demo: https://huggingface.co/spaces/mamei16/chonky_chunk
Uses
This model can be used as the chunker in a retrieval-augmented generation (RAG) pipeline, where higher-quality chunks may improve downstream retrieval and answer quality.
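To illustrate where a chunker fits in such a pipeline, here is a minimal sketch. The retrieval step below is a deliberately naive bag-of-words overlap scorer standing in for a real embedding model, and the example chunks are made up; in practice the chunks would come from the splitter shown later in this card.

```python
import re
from collections import Counter

def retrieve(chunks, query, top_k=1):
    """Rank chunks by naive bag-of-words overlap with the query.
    A real RAG pipeline would embed chunks and queries instead."""
    tokenize = lambda s: Counter(re.findall(r"\w+", s.lower()))
    q = tokenize(query)
    def overlap(chunk):
        c = tokenize(chunk)
        return sum(min(q[w], c[w]) for w in q)
    return sorted(chunks, key=overlap, reverse=True)[:top_k]

# In a real pipeline the chunks would come from the splitter:
#   chunks = list(splitter(document_text))
chunks = [
    "The model predicts paragraph breaks in natural language text.",
    "Bananas are rich in potassium.",
]
best = retrieve(chunks, "Where are paragraph breaks predicted?")
```

The quality of the chunks passed to the retriever is exactly what a semantic chunker is meant to improve: chunks that respect topical boundaries are easier to match against a query than fixed-size windows.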
Bias, Risks, and Limitations
This model has been fine-tuned on non-fictional natural language from Wikipedia. As such, it may not work as well on fictional texts containing dialogue, poems, mathematical expressions, or code.
How to Get Started with the Model
```shell
pip install git+https://github.com/mamei16/chonky
```

Usage:

```python
from chonky import ParagraphSplitter

splitter = ParagraphSplitter(device="cpu", model_id="mamei16/chonky_distilbert-base-multilingual-cased")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")

# Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
# --
# The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
# --
```
Training Details
Training Data
Link: https://huggingface.co/datasets/mamei16/multilingual-wikipedia-paragraphs
Note that the data has been pre-tokenized and truncated using the tokenizer from distilbert/distilbert-base-multilingual-cased.
The training data is based on wikimedia/wikipedia. Although that dataset claims that unwanted sections such as "References", "See more", etc. have been removed, this is not the case. Hence, an approximate, partly manual procedure was used to identify and remove those sections: the most common short paragraphs appearing in the last 500 characters of the articles in each language were collected, then each was translated to identify the section headers corresponding to the English "References" etc.
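The first half of that heuristic can be sketched as follows. This is a simplified, illustrative version (the 500-character window follows the description above; the length threshold and toy articles are assumptions), showing how frequent short trailing paragraphs surface as candidate section headers:

```python
from collections import Counter

def trailing_paragraphs(article, window=500, max_len=40):
    """Collect short paragraphs from the last `window` characters of an article."""
    tail = article[-window:]
    return [p.strip() for p in tail.split("\n") if 0 < len(p.strip()) <= max_len]

def common_trailing_sections(articles, top_n=3):
    """Count the most frequent short trailing paragraphs across a corpus.
    Paragraphs that recur across many articles are candidate section
    headers such as 'References' in that language."""
    counts = Counter()
    for article in articles:
        counts.update(trailing_paragraphs(article))
    return counts.most_common(top_n)

# Toy corpus for illustration
articles = [
    "Long body text ...\nReferences\nSome citation",
    "Other body text ...\nReferences\nAnother citation",
    "More text ...\nSee also\nRelated article",
]
```

Running `common_trailing_sections(articles)` on a real corpus would rank the language's "References"-style headers near the top, which can then be translated and blacklisted.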
Another issue was the existence of several million articles, most notably in Waray, Cebuano, and Swedish, that had been written by a bot named Lsjbot. To avoid artificially inflating low-resource languages and to ensure that most (if not all) articles were written by humans, these articles were removed.
Lastly, articles containing too many extremely long or short paragraphs were removed, as well as stub articles.
Training Procedure
Preprocessing
All articles were truncated to a maximum of 512 tokens.
For designing the training strategy, the following desirable traits were identified:
- Avoid catastrophic forgetting
- Avoid sudden drastic changes in distribution
- Boost low-resource languages
- Uphold good performance on high-resource/most commonly used languages
- Must be implementable with static training data
In the end, the training data was generated by combining the datasets of all 104 languages using temperature sampling without replacement, with τ = 2. This boosts low-resource languages, and languages "die out" one by one spread over the entire epoch (see figure below), thus avoiding sudden large changes in distribution. Furthermore, the model keeps seeing high-resource languages until the end of training, making it likely to maintain good performance on them. At the same time, the linear learning rate schedule ensures that by the time most low-resource languages are exhausted, the learning rate is already low, making catastrophic forgetting less likely.
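The effect of the temperature can be sketched numerically: with temperature sampling, a language whose share of the corpus is q_i is drawn with probability proportional to q_i^(1/τ), which flattens the distribution toward low-resource languages. The article counts below are made up for illustration:

```python
def temperature_probs(sizes, tau=2.0):
    """Sampling probabilities p_i ∝ q_i**(1/tau), where q_i is each
    language's share of the total corpus. tau > 1 flattens the
    distribution, boosting low-resource languages."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical article counts, not the real dataset statistics
sizes = {"en": 6_000_000, "sv": 2_000_000, "cv": 50_000}
probs = temperature_probs(sizes, tau=2.0)
```

With τ = 2 the low-resource language's sampling probability rises well above its raw corpus share, while the high-resource language's falls, which is exactly the boosting behaviour described above.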
Training Hyperparameters
- Training regime: fp16 mixed precision
- Epochs: 1
- Batch size: 64
- Start learning rate: 2e-5
- Optimizer: Adam
- Weight decay: 0.01
- Loss: NLLLoss
- Label Smoothing factor: 0.1
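As a rough sketch, these settings might map to Hugging Face `TrainingArguments` as below. This is an assumed mapping for illustration, not the actual training script; the output directory name is hypothetical, and the linear schedule corresponds to `lr_scheduler_type="linear"`:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; not the original script.
args = TrainingArguments(
    output_dir="chonky-multilingual",   # assumed name
    num_train_epochs=1,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    fp16=True,
    label_smoothing_factor=0.1,
)
```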
Evaluation
MTCB Nano Benchmark (Aggregated Score)
Note: This benchmark is English only (and includes code in multiple programming languages)
Score = Mean of mean_recall, mean_precision, mean_mrr, and mean_ndcg across k=[1, 3, 5, 10] (Metrics reference)
| Model / Chunker | Chunk Size 512 | Chunk Size 1024 | Chunk Size 2048 | Avg Score |
|---|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 0.5621 | 0.5621 | 0.5621 | 0.5621 |
| mamei16/chonky_mdistilbert-base-english-cased | 0.5517 | 0.5517 | 0.5517 | 0.5517 |
| mamei16/chonky_distilbert_base_uncased_1.1 | 0.5342 | 0.5342 | 0.5342 | 0.5342 |
| mirth/chonky_modernbert_base_1 | 0.5305 | 0.5305 | 0.5305 | 0.5305 |
| mamei16/chonky_distilbert-base-multilingual-cased | 0.5294 | 0.5294 | 0.5294 | 0.5294 |
| mirth/chonky_distilbert_base_uncased_1 | 0.5116 | 0.5116 | 0.5116 | 0.5116 |
| RecursiveChunker | 0.4596 | 0.5214 | 0.5431 | 0.5080 |
| SentenceChunker | 0.4612 | 0.5026 | 0.5263 | 0.4967 |
| TokenChunker | 0.3155 | 0.4338 | 0.4801 | 0.4098 |
| SemanticChunker_potion-32M | 0.4022 | 0.4021 | 0.4019 | 0.4021 |
| SemanticChunker_potion-multi-128M | 0.4004 | 0.3999 | 0.3991 | 0.4001 |
| SemanticChunker_potion-8M | 0.3987 | 0.3966 | 0.3966 | 0.3973 |
Model Implementation Details
| Benchmark Name | Implementation |
|---|---|
| RecursiveChunker | RecursiveChunker(chunk_size=chunk_size) |
| SentenceChunker | SentenceChunker(chunk_size=chunk_size) |
| TokenChunker | TokenChunker(chunk_size=chunk_size) |
| SemanticChunker_potion-32M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M") |
| SemanticChunker_potion-multi-128M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M") |
| SemanticChunker_potion-8M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M") |
Testing Data
The testing data can be found in the "test" split of each language in the dataset.
Testing Metrics
Due to the extreme class imbalance, the F1-score was chosen as the main evaluation metric.
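For reference, a minimal sketch of the metric: treating each token as a binary "paragraph break" / "no break" decision, F1 balances precision and recall on the rare positive class, whereas plain accuracy would look high even for a model that never predicts a break. The toy labels below are made up:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 on the positive ('paragraph break') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy token labels: 1 = paragraph break, 0 = no break (heavy class imbalance)
y_true = [0] * 18 + [1, 0, 1]
y_pred = [0] * 18 + [1, 1, 0]
```

Here the model gets one of two breaks right and raises one false alarm, giving F1 = 0.5, while accuracy would still be 19/21 ≈ 0.90 because of the dominant negative class.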
Results
| Language | F1 Score |
|---|---|
| Chechen | 0.994 |
| Cebuano | 0.993 |
| Newari | 0.989 |
| Volapük | 0.989 |
| Minangkabau | 0.984 |
| Bishnupriya | 0.982 |
| Malagasy | 0.971 |
| Haitian Creole | 0.966 |
| Tatar | 0.96 |
| Waray | 0.956 |
| Piedmontese | 0.936 |
| South Azerbaijani | 0.934 |
| Ido | 0.916 |
| Telugu | 0.912 |
| Kazakh | 0.907 |
| Welsh | 0.897 |
| Serbo-Croatian | 0.893 |
| Aragonese | 0.886 |
| Basque | 0.879 |
| Lombard | 0.879 |
| Tajik | 0.876 |
| Urdu | 0.876 |
| Kyrgyz | 0.872 |
| Chuvash | 0.868 |
| Marathi | 0.865 |
| Dutch | 0.854 |
| Sundanese | 0.851 |
| Ukrainian | 0.848 |
| Serbian | 0.847 |
| Polish | 0.841 |
| Luxembourgish | 0.84 |
| Slovak | 0.84 |
| Hungarian | 0.834 |
| Armenian | 0.833 |
| Malay | 0.832 |
| Latin | 0.83 |
| French | 0.829 |
| Swedish | 0.829 |
| Bosnian | 0.828 |
| Bavarian | 0.826 |
| German | 0.826 |
| Belarusian | 0.825 |
| Korean | 0.821 |
| Slovenian | 0.821 |
| Persian | 0.82 |
| Italian | 0.82 |
| Uzbek | 0.82 |
| Japanese | 0.818 |
| Swahili | 0.818 |
| Macedonian | 0.817 |
| English | 0.815 |
| Georgian | 0.814 |
| Indonesian | 0.813 |
| Occitan | 0.813 |
| Romanian | 0.813 |
| Russian | 0.813 |
| Vietnamese | 0.812 |
| Norwegian | 0.811 |
| Portuguese | 0.811 |
| Afrikaans | 0.81 |
| Bulgarian | 0.809 |
| Catalan | 0.807 |
| Czech | 0.807 |
| Scots | 0.806 |
| Tamil | 0.805 |
| Western Frisian | 0.804 |
| Arabic | 0.804 |
| Turkish | 0.803 |
| Bashkir | 0.801 |
| Spanish | 0.8 |
| Lithuanian | 0.797 |
| Asturian | 0.796 |
| Breton | 0.796 |
| Norwegian Nynorsk | 0.795 |
| Galician | 0.794 |
| Bangla | 0.793 |
| Latvian | 0.793 |
| Estonian | 0.792 |
| Danish | 0.787 |
| Azerbaijani | 0.784 |
| Sicilian | 0.783 |
| Finnish | 0.781 |
| Javanese | 0.781 |
| Hindi | 0.777 |
| Greek | 0.772 |
| Gujarati | 0.772 |
| Low Saxon | 0.765 |
| Tagalog | 0.764 |
| Croatian | 0.763 |
| Irish | 0.751 |
| Hebrew | 0.75 |
| Icelandic | 0.743 |
| Malayalam | 0.742 |
| Kannada | 0.73 |
| Yoruba | 0.729 |
| Chinese | 0.727 |
| Thai | 0.725 |
| Albanian | 0.724 |
| Punjabi | 0.724 |
| Mongolian | 0.722 |
| Burmese | 0.711 |
| Classical Chinese | 0.706 |
| Western Punjabi | 0.658 |
| Nepali | 0.593 |
Mean F1 Score: 0.824
Technical Specifications
Hardware
RTX 5090