SultanR committed · Commit 31ce063 · verified · 1 Parent(s): decba57

Create README.md

Files changed (1)
  1. README.md +47 -0
README.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ language:
+ - ar
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-classification
+ base_model: jhu-clsp/mmBERT-small
+ tags:
+ - quality-classifier
+ - data-filtering
+ - pretraining
+ ---
+
+ <p align="center">
+ <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
+ <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
+ </a>
+ </p>
+
+ # mmBERT Arabic Quality Classifier
+
+ A text quality classifier for Arabic pretraining data, trained from [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small). Used to create [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ).
+
+ This model implements the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)) but uses mmBERT as the encoder for improved Arabic understanding.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Arabic-Quality-Classifier")
+ result = classifier("النص العربي هنا")
+ ```
+
+ ## Citation
+
+ ```bib
+ @misc{alrashed2025mixminmatch,
+ title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
+ author={Sultan Alrashed and Francesco Orabona},
+ year={2025},
+ eprint={2512.18834v2},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2512.18834v2},
+ }
+ ```
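
Note (not part of the commit): the Usage snippet in the README scores a single string. Since the card describes the model as a data-filtering classifier, a filtering loop might look roughly like the sketch below. Everything beyond the `pipeline(...)` call is an assumption: the helper `keep_high_quality`, the label name `"LABEL_1"`, and the `0.9` threshold are hypothetical and not taken from the README, which does not document the model's label set. Inspect `classifier.model.config.id2label` before choosing a label.

```python
from typing import Callable, Dict, List, Sequence


def keep_high_quality(
    texts: Sequence[str],
    classify: Callable[[List[str]], List[Dict[str, object]]],
    keep_label: str,
    min_score: float = 0.5,
) -> List[str]:
    """Return the texts whose top prediction is `keep_label` with score >= `min_score`.

    `classify` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, one per input. A transformers
    text-classification pipeline called on a list has this shape.
    """
    predictions = classify(list(texts))
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == keep_label and pred["score"] >= min_score:
            kept.append(text)
    return kept


# Hypothetical usage with the model from the README (downloads the weights):
#
# from transformers import pipeline
# classifier = pipeline(
#     "text-classification",
#     model="AdaMLLab/mmBERT-Arabic-Quality-Classifier",
# )
# print(classifier.model.config.id2label)  # check the real label names first
# docs = ["النص العربي هنا"]
# kept = keep_high_quality(docs, classifier, keep_label="LABEL_1", min_score=0.9)
```

Passing the classifier in as a callable keeps the filtering logic testable without loading the model, and makes the label name and threshold explicit choices rather than hidden defaults.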