How to use mamei16/chonky_distilbert-base-multilingual-cased with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="mamei16/chonky_distilbert-base-multilingual-cased")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("mamei16/chonky_distilbert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained("mamei16/chonky_distilbert-base-multilingual-cased")
```
Model Details
Model Description
A fine-tune of distilbert/distilbert-base-multilingual-cased, trained on nearly 11 billion tokens from more than 34 million Wikipedia articles to predict paragraph breaks. The model can be used to split arbitrary natural language text into semantic chunks.
Model Sources
- Repository: https://github.com/mamei16/chonky
- Demo: https://huggingface.co/spaces/mamei16/chonky_chunk
Uses
This model can be used as the chunker in a retrieval-augmented generation (RAG) pipeline, where higher-quality chunks may improve downstream retrieval and answer quality.
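To illustrate where a chunker fits in such a pipeline, here is a minimal sketch. The retrieval step below is a deliberately naive bag-of-words overlap scorer standing in for a real embedding model, and the example chunks are made up; in practice the chunks would come from the splitter shown later in this card.

```python
import re
from collections import Counter

def retrieve(chunks, query, top_k=1):
    """Rank chunks by naive bag-of-words overlap with the query.
    A real RAG pipeline would embed chunks and queries instead."""
    tokenize = lambda s: Counter(re.findall(r"\w+", s.lower()))
    q = tokenize(query)
    def overlap(chunk):
        c = tokenize(chunk)
        return sum(min(q[w], c[w]) for w in q)
    return sorted(chunks, key=overlap, reverse=True)[:top_k]

# In a real pipeline the chunks would come from the splitter:
#   chunks = list(splitter(document_text))
chunks = [
    "The model predicts paragraph breaks in natural language text.",
    "Bananas are rich in potassium.",
]
best = retrieve(chunks, "Where are paragraph breaks predicted?")
```

The quality of the chunks passed to the retriever is exactly what a semantic chunker is meant to improve: chunks that respect topical boundaries are easier to match against a query than fixed-size windows.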
Bias, Risks, and Limitations
This model has been fine-tuned on non-fictional natural language from Wikipedia. As such, it may not work as well on fictional texts containing dialogue, poems, mathematical expressions, or code.
How to Get Started with the Model
```shell
pip install git+https://github.com/mamei16/chonky
```

Usage:

```python
from chonky import ParagraphSplitter

splitter = ParagraphSplitter(device="cpu", model_id="mamei16/chonky_distilbert-base-multilingual-cased")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")

# Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
# --
# The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
# --
```
Training Details
Training Data
Link: https://huggingface.co/datasets/mamei16/multilingual-wikipedia-paragraphs
Note that the data has been pre-tokenized and truncated using the tokenizer from distilbert/distilbert-base-multilingual-cased.
The training data is based on wikimedia/wikipedia. Although that dataset claims that unwanted sections such as "References", "See more", etc. have been removed, this is not the case. Hence, an approximate, partly manual procedure was used to identify and remove those sections: the most common short paragraphs appearing in the last 500 characters of the articles in each language were collected, then each was translated to identify the section headers corresponding to the English "References" etc.
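The first half of that heuristic can be sketched as follows. This is a simplified, illustrative version (the 500-character window follows the description above; the length threshold and toy articles are assumptions), showing how frequent short trailing paragraphs surface as candidate section headers:

```python
from collections import Counter

def trailing_paragraphs(article, window=500, max_len=40):
    """Collect short paragraphs from the last `window` characters of an article."""
    tail = article[-window:]
    return [p.strip() for p in tail.split("\n") if 0 < len(p.strip()) <= max_len]

def common_trailing_sections(articles, top_n=3):
    """Count the most frequent short trailing paragraphs across a corpus.
    Paragraphs that recur across many articles are candidate section
    headers such as 'References' in that language."""
    counts = Counter()
    for article in articles:
        counts.update(trailing_paragraphs(article))
    return counts.most_common(top_n)

# Toy corpus for illustration
articles = [
    "Long body text ...\nReferences\nSome citation",
    "Other body text ...\nReferences\nAnother citation",
    "More text ...\nSee also\nRelated article",
]
```

Running `common_trailing_sections(articles)` on a real corpus would rank the language's "References"-style headers near the top, which can then be translated and blacklisted.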
Another issue was the existence of several million articles, most notably in Waray, Cebuano, and Swedish, that had been written by a bot named Lsjbot. To avoid artificially inflating low-resource languages and to ensure that most (if not all) articles were written by humans, these articles were removed.
Lastly, articles containing too many extremely long or short paragraphs were removed, as well as stub articles.
Training Procedure
Preprocessing
All articles were truncated to a maximum of 512 tokens.
For designing the training strategy, the following desirable traits were identified:
- Avoid catastrophic forgetting
- Avoid sudden drastic changes in distribution
- Boost low-resource languages
- Uphold good performance on high-resource/most commonly used languages
- Must be implementable with static training data
In the end, the training data was generated by combining the datasets of all 104 languages using temperature sampling without replacement, with τ = 2. This boosts low-resource languages, and languages "die out" one by one spread over the entire epoch (see figure below), thus avoiding sudden large changes in distribution. Furthermore, the model keeps seeing high-resource languages until the end of training, making it likely to maintain good performance on them. At the same time, the linear learning rate schedule ensures that by the time most low-resource languages are exhausted, the learning rate is already low, making catastrophic forgetting less likely.
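The effect of the temperature can be sketched numerically: with temperature sampling, a language whose share of the corpus is q_i is drawn with probability proportional to q_i^(1/τ), which flattens the distribution toward low-resource languages. The article counts below are made up for illustration:

```python
def temperature_probs(sizes, tau=2.0):
    """Sampling probabilities p_i ∝ q_i**(1/tau), where q_i is each
    language's share of the total corpus. tau > 1 flattens the
    distribution, boosting low-resource languages."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical article counts, not the real dataset statistics
sizes = {"en": 6_000_000, "sv": 2_000_000, "cv": 50_000}
probs = temperature_probs(sizes, tau=2.0)
```

With τ = 2 the low-resource language's sampling probability rises well above its raw corpus share, while the high-resource language's falls, which is exactly the boosting behaviour described above.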
Training Hyperparameters
- Training regime: fp16 mixed precision
- Epochs: 1
- Batch size: 64
- Start learning rate: 2e-5
- Optimizer: Adam
- Weight decay: 0.01
- Loss: NLLLoss
- Label Smoothing factor: 0.1
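As a rough sketch, these settings might map to Hugging Face `TrainingArguments` as below. This is an assumed mapping for illustration, not the actual training script; the output directory name is hypothetical, and the linear schedule corresponds to `lr_scheduler_type="linear"`:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; not the original script.
args = TrainingArguments(
    output_dir="chonky-multilingual",   # assumed name
    num_train_epochs=1,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    fp16=True,
    label_smoothing_factor=0.1,
)
```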
Evaluation
MTCB Nano Benchmark (Aggregated Score)
Note: This benchmark is English only (and includes code in multiple programming languages)
Score = Mean of mean_recall, mean_precision, mean_mrr, and mean_ndcg across k=[1, 3, 5, 10] (Metrics reference)
| Model / Chunker | Chunk Size 512 | Chunk Size 1024 | Chunk Size 2048 | Avg Score |
|---|---|---|---|---|
| mirth/chonky_modernbert_large_1 | 0.5621 | 0.5621 | 0.5621 | 0.5621 |
| mamei16/chonky_mdistilbert-base-english-cased | 0.5517 | 0.5517 | 0.5517 | 0.5517 |
| mamei16/chonky_distilbert_base_uncased_1.1 | 0.5342 | 0.5342 | 0.5342 | 0.5342 |
| mirth/chonky_modernbert_base_1 | 0.5305 | 0.5305 | 0.5305 | 0.5305 |
| mamei16/chonky_distilbert-base-multilingual-cased | 0.5294 | 0.5294 | 0.5294 | 0.5294 |
| mirth/chonky_distilbert_base_uncased_1 | 0.5116 | 0.5116 | 0.5116 | 0.5116 |
| RecursiveChunker | 0.4596 | 0.5214 | 0.5431 | 0.5080 |
| SentenceChunker | 0.4612 | 0.5026 | 0.5263 | 0.4967 |
| TokenChunker | 0.3155 | 0.4338 | 0.4801 | 0.4098 |
| SemanticChunker_potion-32M | 0.4022 | 0.4021 | 0.4019 | 0.4021 |
| SemanticChunker_potion-multi-128M | 0.4004 | 0.3999 | 0.3991 | 0.4001 |
| SemanticChunker_potion-8M | 0.3987 | 0.3966 | 0.3966 | 0.3973 |
Model Implementation Details
| Benchmark Name | Implementation |
|---|---|
| RecursiveChunker | RecursiveChunker(chunk_size=chunk_size) |
| SentenceChunker | SentenceChunker(chunk_size=chunk_size) |
| TokenChunker | TokenChunker(chunk_size=chunk_size) |
| SemanticChunker_potion-32M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-32M") |
| SemanticChunker_potion-multi-128M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-multilingual-128M") |
| SemanticChunker_potion-8M | SemanticChunker(chunk_size=chunk_size, embedding_model="minishlab/potion-base-8M") |
Testing Data
The testing data can be found in the "test" split of each language in the dataset.
Testing Metrics
Due to the extreme class imbalance, the F1-score was chosen as the main evaluation metric.
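For reference, a minimal sketch of the metric: treating each token as a binary "paragraph break" / "no break" decision, F1 balances precision and recall on the rare positive class, whereas plain accuracy would look high even for a model that never predicts a break. The toy labels below are made up:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 on the positive ('paragraph break') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy token labels: 1 = paragraph break, 0 = no break (heavy class imbalance)
y_true = [0] * 18 + [1, 0, 1]
y_pred = [0] * 18 + [1, 1, 0]
```

Here the model gets one of two breaks right and raises one false alarm, giving F1 = 0.5, while accuracy would still be 19/21 ≈ 0.90 because of the dominant negative class.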
Results
| Language | F1 Score |
|---|---|
| Chechen | 0.994 |
| Cebuano | 0.993 |
| Newari | 0.989 |
| Volapük | 0.989 |
| Minangkabau | 0.984 |
| Bishnupriya | 0.982 |
| Malagasy | 0.971 |
| Haitian Creole | 0.966 |
| Tatar | 0.96 |
| Waray | 0.956 |
| Piedmontese | 0.936 |
| South Azerbaijani | 0.934 |
| Ido | 0.916 |
| Telugu | 0.912 |
| Kazakh | 0.907 |
| Welsh | 0.897 |
| Serbo-Croatian | 0.893 |
| Aragonese | 0.886 |
| Basque | 0.879 |
| Lombard | 0.879 |
| Tajik | 0.876 |
| Urdu | 0.876 |
| Kyrgyz | 0.872 |
| Chuvash | 0.868 |
| Marathi | 0.865 |
| Dutch | 0.854 |
| Sundanese | 0.851 |
| Ukrainian | 0.848 |
| Serbian | 0.847 |
| Polish | 0.841 |
| Luxembourgish | 0.84 |
| Slovak | 0.84 |
| Hungarian | 0.834 |
| Armenian | 0.833 |
| Malay | 0.832 |
| Latin | 0.83 |
| French | 0.829 |
| Swedish | 0.829 |
| Bosnian | 0.828 |
| Bavarian | 0.826 |
| German | 0.826 |
| Belarusian | 0.825 |
| Korean | 0.821 |
| Slovenian | 0.821 |
| Persian | 0.82 |
| Italian | 0.82 |
| Uzbek | 0.82 |
| Japanese | 0.818 |
| Swahili | 0.818 |
| Macedonian | 0.817 |
| English | 0.815 |
| Georgian | 0.814 |
| Indonesian | 0.813 |
| Occitan | 0.813 |
| Romanian | 0.813 |
| Russian | 0.813 |
| Vietnamese | 0.812 |
| Norwegian | 0.811 |
| Portuguese | 0.811 |
| Afrikaans | 0.81 |
| Bulgarian | 0.809 |
| Catalan | 0.807 |
| Czech | 0.807 |
| Scots | 0.806 |
| Tamil | 0.805 |
| Western Frisian | 0.804 |
| Arabic | 0.804 |
| Turkish | 0.803 |
| Bashkir | 0.801 |
| Spanish | 0.8 |
| Lithuanian | 0.797 |
| Asturian | 0.796 |
| Breton | 0.796 |
| Norwegian Nynorsk | 0.795 |
| Galician | 0.794 |
| Bangla | 0.793 |
| Latvian | 0.793 |
| Estonian | 0.792 |
| Danish | 0.787 |
| Azerbaijani | 0.784 |
| Sicilian | 0.783 |
| Finnish | 0.781 |
| Javanese | 0.781 |
| Hindi | 0.777 |
| Greek | 0.772 |
| Gujarati | 0.772 |
| Low Saxon | 0.765 |
| Tagalog | 0.764 |
| Croatian | 0.763 |
| Irish | 0.751 |
| Hebrew | 0.75 |
| Icelandic | 0.743 |
| Malayalam | 0.742 |
| Kannada | 0.73 |
| Yoruba | 0.729 |
| Chinese | 0.727 |
| Thai | 0.725 |
| Albanian | 0.724 |
| Punjabi | 0.724 |
| Mongolian | 0.722 |
| Burmese | 0.711 |
| Classical Chinese | 0.706 |
| Western Punjabi | 0.658 |
| Nepali | 0.593 |
Mean F1 Score: 0.824
Technical Specifications
Hardware
RTX 5090