---
license: cc-by-nc-4.0
language:
  - ar
  - en
tags:
  - tokenizer
  - sarf
  - bilingual
  - arabic
  - english
  - math
  - code
  - sentencepiece-style
library_name: tokenizers
---

# SARFTokenizer v0.3.1 — 4-domain (AR / EN / Math / Code) at 100k vocab

A 4-domain tokenizer at **100,000 vocabulary** with an Arabic-focused normalization pipeline. Adds **math** and **code** to the bilingual AR/EN coverage of v0.2 **without regressing Arabic** — and pushes Arabic CpT to **4.004**, the highest we have measured on any tokenizer at any vocab size.

## The headline — what we actually claim

**SOTA on every domain at any published vocab tier.** v0.3.1 is simultaneously the best Arabic, best English, best math, and best code tokenizer we have measured, beating GPT-5.4-mini / GPT-5.5 `o200k_base` on every domain at half the vocab size.

### Benchmark — 1,200-document held-out 4-domain eval

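300 docs each of Arabic, English, math (FineMath-4plus), and code (Nemotron-Code), with a 2,000-char cap per doc and `add_special_tokens=False`. No external preprocessing: each tokenizer's own normalizer/pre-tokenizer runs naturally. The AR / EN / MATH / CODE columns report characters per token (CpT, higher is better); Parity AR/EN is AR CpT ÷ EN CpT.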

|   Rank | Tokenizer                                   |       Vocab |        AR |        EN |      MATH |      CODE | Parity AR/EN |
| -----: | ------------------------------------------- | ----------: | --------: | --------: | --------: | --------: | -----------: |
| **🥇** | **SARFTokenizer v0.3.1**                    | **100,000** | **4.004** | **3.733** | **4.243** | **4.200** |    **1.073** |
|      2 | SARFTokenizer v0.2                          |      65,000 |     3.683 |     3.522 |     3.922 |     3.913 |        1.046 |
|      3 | Qwen3.6-35B-A3B                             |     248,077 |     3.129 |     2.985 |     3.233 |     3.432 |        1.048 |
|      4 | tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) |     200,019 |     3.087 |     3.409 |     3.505 |     3.622 |        0.906 |
|      5 | ALLaM-7B-Instruct-preview                   |      64,000 |     2.854 |     2.518 |     3.000 |     3.250 |        1.133 |
|     6t | google/gemma-4-31B-it                       |     262,144 |     2.833 |     3.069 |     3.242 |     3.383 |        0.923 |
|     6t | google/gemma-3-1b-pt                        |     262,145 |     2.833 |     3.069 |     3.242 |     3.384 |        0.923 |
|      8 | google/gemma-2-2b                           |     256,000 |     2.779 |     3.117 |     3.269 |     3.383 |        0.892 |
|      9 | QCRI/Fanar-1-9B-Instruct                    |     128,256 |     2.778 |     3.047 |     3.221 |     3.346 |        0.911 |
|     10 | Qwen2.5-0.5B                                |     151,665 |     2.583 |     2.923 |     3.299 |     3.512 |        0.884 |
|     11 | Hala-350M                                   |      64,400 |     2.219 |     3.220 |     3.367 |     3.477 |        0.689 |
|     12 | Kimi-K2.6                                   |     163,840 |     2.074 |     3.239 |     3.520 |     3.630 |        0.640 |
|     13 | tiktoken/cl100k_base (GPT-4)                |     100,277 |     1.429 |     3.066 |     3.479 |     3.607 |        0.466 |
|     14 | Falcon-7B                                   |      65,024 |     0.991 |     2.720 |     3.108 |     3.210 |        0.364 |

### Token and cost comparison — AR / EN / Math / Code only

CpT means **characters per token**. In other words:

* AR CpT **4.004** means about **4.004 Arabic characters = 1 token**.
* EN CpT **3.733** means about **3.733 English characters = 1 token**.
* Higher CpT means fewer tokens for the same text.
* Estimated tokens for a text block are calculated as: `characters ÷ CpT`.

#### Token count comparison per 1M characters

This table shows how many tokens each tokenizer would produce for **1,000,000 characters** in each domain.

| Tokenizer / Model                           |       Vocab |    AR CpT | AR tokens / 1M chars |    EN CpT | EN tokens / 1M chars |  Math CpT | Math tokens / 1M chars |  Code CpT | Code tokens / 1M chars |
| ------------------------------------------- | ----------: | --------: | -------------------: | --------: | -------------------: | --------: | ---------------------: | --------: | ---------------------: |
| **SARFTokenizer v0.3.1**                    | **100,000** | **4.004** |          **249,750** | **3.733** |          **267,881** | **4.243** |            **235,682** | **4.200** |            **238,095** |
| SARFTokenizer v0.2                          |      65,000 |     3.683 |              271,518 |     3.522 |              283,930 |     3.922 |                254,972 |     3.913 |                255,558 |
| Qwen3.6-35B-A3B                             |     248,077 |     3.129 |              319,591 |     2.985 |              335,008 |     3.233 |                309,310 |     3.432 |                291,375 |
| tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) |     200,019 |     3.087 |              323,939 |     3.409 |              293,341 |     3.505 |                285,307 |     3.622 |                276,091 |

#### Token change vs `o200k_base`

Each cell gives the change in token count relative to `o200k_base` for the same text: **fewer** is better, **more** is worse.

| Tokenizer / Model                           | AR token change | EN token change | Math token change | Code token change |
| ------------------------------------------- | --------------: | --------------: | ----------------: | ----------------: |
| **SARFTokenizer v0.3.1**                    | **22.9% fewer** |  **8.7% fewer** |   **17.4% fewer** |   **13.8% fewer** |
| SARFTokenizer v0.2                          |     16.2% fewer |      3.2% fewer |       10.6% fewer |        7.4% fewer |
| Qwen3.6-35B-A3B                             |      1.3% fewer |  **14.2% more** |     **8.4% more** |     **5.5% more** |
| tiktoken/o200k_base (GPT-5.4-mini, GPT-5.5) |        baseline |        baseline |          baseline |          baseline |
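
The percentages in this table follow directly from the CpT values, because the token count for a fixed text scales as 1 ÷ CpT. A minimal sketch of that arithmetic (the function name `token_change_vs_baseline` is ours, chosen for illustration; the CpT values are copied from the benchmark table):

```python
def token_change_vs_baseline(cpt: float, baseline_cpt: float) -> float:
    """Fraction of tokens saved relative to the baseline tokenizer.

    Tokens for a fixed text are chars / CpT, so the saving is
    1 - (chars / cpt) / (chars / baseline_cpt) = 1 - baseline_cpt / cpt.
    A positive result means fewer tokens than the baseline.
    """
    return 1.0 - baseline_cpt / cpt


# CpT values copied from the benchmark table; o200k_base is the baseline.
O200K = {"AR": 3.087, "EN": 3.409, "MATH": 3.505, "CODE": 3.622}
SARF_V031 = {"AR": 4.004, "EN": 3.733, "MATH": 4.243, "CODE": 4.200}

for domain, baseline in O200K.items():
    change = token_change_vs_baseline(SARF_V031[domain], baseline)
    print(f"{domain}: {change:.1%} fewer tokens")
```

Swapping in another tokenizer's CpT values gives its row; a negative result means it produces more tokens than `o200k_base`.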

#### Estimated API cost per 1M characters

The cost formulas are:

`tokens per 1M characters = 1,000,000 ÷ CpT`

`cost per 1M characters = price per 1M tokens ÷ CpT`

The per-token *price* doesn't tell you what you'll actually pay — what matters is how many tokens each tokenizer produces for your text. A tokenizer with higher CpT lets you fit more characters into the same number of tokens, so at the **same per-token price** the per-character cost is lower. The bar chart and table below use that fact directly.
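
The two formulas can be sketched in a few lines of Python. The function names are illustrative rather than part of any library, and the $0.15 rate is the notional input price from the pricing assumptions in this README:

```python
def tokens_per_million_chars(cpt: float) -> float:
    """Tokens produced for 1,000,000 characters at a given CpT."""
    return 1_000_000 / cpt


def cost_per_million_chars(price_per_million_tokens: float, cpt: float) -> float:
    """Dollar cost of 1,000,000 characters at a given per-token price."""
    return price_per_million_tokens / cpt


# Same per-token price ($0.15 per 1M input tokens), two different CpT values
# taken from the Arabic column of the benchmark table.
for name, cpt in [("SARFTokenizer v0.3.1", 4.004), ("Qwen3.6-35B-A3B", 3.129)]:
    tokens = tokens_per_million_chars(cpt)
    cost = cost_per_million_chars(0.15, cpt)
    print(f"{name}: {tokens:,.0f} tokens, ${cost:.3f} per 1M Arabic chars")
```

Because the price is held fixed, the only thing that differs between the two printed lines is the CpT, which is the point of the comparison.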

Pricing references:

* Qwen3.6-35B-A3B DeepInfra pricing: [https://deepinfra.com/Qwen/Qwen3.6-35B-A3B](https://deepinfra.com/Qwen/Qwen3.6-35B-A3B)
* OpenAI API pricing for GPT models: [https://openai.com/api/pricing/](https://openai.com/api/pricing/)

#### API pricing assumptions

For the chart and tables below, **all SARFTokenizer rows are priced at Qwen3.6-35B's DeepInfra rate** ($0.15 input / $0.95 output per 1M tokens). SARFTokenizer is a tokenizer, not a hosted API — applying Qwen's rate is a notional anchor that lets us compare against the cheapest hosted peer in this set. The hosted peers (Qwen, GPT-5.4 mini, GPT-5.5) keep their *own* real prices.

| Model / Tokenizer                       | Input price / 1M tokens | Cached input / 1M tokens | Output price / 1M tokens | Pricing status                                          |
| --------------------------------------- | ----------------------: | -----------------------: | -----------------------: | ------------------------------------------------------- |
| SARFTokenizer v0.3.1 *(notional Qwen)*  |                   $0.15 |                        — |                    $0.95 | Tokenizer only; using Qwen's DeepInfra rate as anchor   |
| SARFTokenizer v0.2 *(notional Qwen)*    |                   $0.15 |                        — |                    $0.95 | Tokenizer only; using Qwen's DeepInfra rate as anchor   |
| Qwen3.6-35B-A3B on DeepInfra            |                   $0.15 |                        — |                    $0.95 | Hosted API price; no cached tier in listing             |
| GPT-5.4 mini                            |                   $0.75 |                   $0.075 |                    $4.50 | Hosted API price; cached input = 10% of uncached        |
| GPT-5.5                                 |                   $5.00 |                    $0.50 |                   $30.00 | Hosted API price; cached input = 10% of uncached        |

##### API cost per 1M tokens

The chart below shows the raw per-token pricing for each model and the notional Qwen anchor used for the SARFTokenizer rows. OpenAI's cached input rates (10% of uncached) appear as their own bars. This is the **pricing tier** view — the per-character analysis that exploits SARFTokenizer's CpT advantage starts in the next section.

![API cost comparison per 1M tokens](api_cost_comparison_per_1m_tokens.png)

**Lower is better — shorter bars mean less cost per 1M tokens.**

Hatched bars = notional pricing (SARFTokenizer rows). Linear y-axis. Three pricing tiers are visible: Qwen tier (input + output = $1.10), GPT-5.4 mini tier ($5.25), and GPT-5.5 tier ($35.00). Output tokens are the expensive side of every price list: 6× the uncached input price on both OpenAI models and about 6.3× the input price on Qwen. The compression and reduction figures in the next sections build on these raw prices.

##### Characters delivered per 1M tokens

The per-token chart above shows the **billing rate**. This chart shows the **content rate**: how much text each token encodes. The token count is the same, but each tokenizer packs a different number of characters into it. SARFTokenizer's compression advantage, invisible on the price chart, shows up here as taller bars.

GPT-5.4 mini and GPT-5.5 share the same tokenizer (`o200k_base`), so they collapse to a single bar group here.

![Characters delivered per 1M tokens](characters_per_1m_tokens.png)

**Higher is better — taller bars mean more characters packed into each token.**

Reading across the Arabic bars (blue): SARFTokenizer v0.3.1 packs 4.00M characters into 1M tokens, v0.2 packs 3.68M, Qwen packs 3.13M, and `o200k_base` packs 3.09M. At the same per-token price, those gaps **are** the cost advantage: more characters per token means fewer billed tokens for the same text. The 21.9% Arabic cost reduction vs Qwen in the price-reduction table below is exactly this gap, re-expressed in dollars.

The pattern holds across all four domains, with v0.3.1 leading every column — strongest on math (4.24M chars / 1M tokens) and weakest (still ahead) on English (3.73M).

#### Price reduction from hosted peers to SARFTokenizer v0.3.1

Comparing total (input + output) cost per 1M Arabic characters. SARFTokenizer v0.3.1 is priced at Qwen's DeepInfra rate (notional); hosted peers use their own real prices.

| Peer                     | Peer total / 1M AR chars | SARF v0.3.1 / 1M AR chars | **Reduction** |
| ------------------------ | -----------------------: | ------------------------: | ------------: |
| Qwen3.6-35B on DeepInfra |                  $0.3515 |                   $0.2747 |    **−21.9%** |
| GPT-5.4 mini             |                  $1.7007 |                   $0.2747 |    **−83.8%** |
| GPT-5.5                  |                 $11.3379 |                   $0.2747 |    **−97.6%** |

Formula: `reduction = 1 − (SARF_total ÷ peer_total) = 1 − (SARF_rate × peer_CpT) ÷ (peer_rate × SARF_CpT)`

##### Where the reduction comes from — pricing tier vs compression

| Peer         | Rate factor *(SARF_rate ÷ peer_rate)* | Compression factor *(peer_CpT ÷ SARF_CpT)* | Combined cost ratio | Total reduction |
| ------------ | ------------------------------------: | -----------------------------------------: | ------------------: | --------------: |
| Qwen         |             $1.10 / $1.10 = **1.000** |                  3.129 / 4.004 = **0.781** |               0.781 |          21.9%  |
| GPT-5.4 mini |             $1.10 / $5.25 = **0.210** |                  3.087 / 4.004 = **0.771** |               0.162 |          83.8%  |
| GPT-5.5      |            $1.10 / $35.00 = **0.031** |                  3.087 / 4.004 = **0.771** |               0.024 |          97.6%  |

Reading the columns: against Qwen the full 21.9% reduction comes from compression alone, since the per-token rates are identical. Against the GPT models, most of the reduction is the pricing-tier gap (Qwen-tier inference is ~5× cheaper per token than GPT-5.4 mini and ~32× cheaper than GPT-5.5) and SARFTokenizer's compression contributes a multiplicative ~23% on top. The tokenizer's own contribution is a clean ~22–23% wedge across all three comparisons; the rest is pricing.
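
The decomposition can be reproduced with a short sketch. The `Breakdown` and `cost_reduction` names are hypothetical helpers introduced here; the rates and Arabic CpT values are the ones used in the tables above:

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    rate_factor: float         # SARF_rate / peer_rate
    compression_factor: float  # peer_CpT / SARF_CpT
    combined_ratio: float      # product of the two factors
    reduction: float           # 1 - combined_ratio


def cost_reduction(sarf_rate, peer_rate, sarf_cpt, peer_cpt) -> Breakdown:
    """Split the per-character cost ratio into pricing and compression parts."""
    rate_factor = sarf_rate / peer_rate
    compression_factor = peer_cpt / sarf_cpt
    combined = rate_factor * compression_factor
    return Breakdown(rate_factor, compression_factor, combined, 1.0 - combined)


SARF_RATE, SARF_AR_CPT = 1.10, 4.004  # notional Qwen input + output rate
peers = {
    "Qwen": (1.10, 3.129),
    "GPT-5.4 mini": (5.25, 3.087),
    "GPT-5.5": (35.00, 3.087),
}
for name, (rate, cpt) in peers.items():
    b = cost_reduction(SARF_RATE, rate, SARF_AR_CPT, cpt)
    print(f"{name}: rate {b.rate_factor:.3f} x compression "
          f"{b.compression_factor:.3f} = {b.combined_ratio:.3f} "
          f"({b.reduction:.1%} reduction)")
```

Setting `peer_rate` equal to `sarf_rate` isolates the tokenizer's own contribution, since the rate factor then becomes 1.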

#### Full cost table — all four domains

| Model / Tokenizer                       | API pricing used                          | AR input | AR output | EN input | EN output | Math input | Math output | Code input | Code output |
| --------------------------------------- | ----------------------------------------- | -------: | --------: | -------: | --------: | ---------: | ----------: | ---------: | ----------: |
| SARFTokenizer v0.3.1 *(notional Qwen)*  | $0.15 input / $0.95 output per 1M tokens  |   $0.037 |    $0.237 |   $0.040 |    $0.254 |     $0.035 |      $0.224 |     $0.036 |      $0.226 |
| SARFTokenizer v0.2 *(notional Qwen)*    | $0.15 input / $0.95 output per 1M tokens  |   $0.041 |    $0.258 |   $0.043 |    $0.270 |     $0.038 |      $0.242 |     $0.038 |      $0.243 |
| Qwen3.6-35B-A3B on DeepInfra            | $0.15 input / $0.95 output per 1M tokens  |   $0.048 |    $0.304 |   $0.050 |    $0.318 |     $0.046 |      $0.294 |     $0.044 |      $0.277 |
| GPT-5.4 mini with `o200k_base`          | $0.75 input / $4.50 output per 1M tokens  |   $0.243 |    $1.458 |   $0.220 |    $1.320 |     $0.214 |      $1.284 |     $0.207 |      $1.242 |
| GPT-5.5 with `o200k_base`               | $5.00 input / $30.00 output per 1M tokens |   $1.620 |    $9.718 |   $1.467 |    $8.800 |     $1.427 |      $8.559 |     $1.380 |      $8.283 |

#### SARFTokenizer compression at the same per-token rate — savings vs Qwen on DeepInfra

Same per-token rate as Qwen on DeepInfra; the dollar savings come entirely from SARFTokenizer's higher CpT producing fewer tokens for the same text.

| Tokenizer                |   AR input |  AR output |   EN input |  EN output | Math input | Math output | Code input | Code output |
| ------------------------ | ---------: | ---------: | ---------: | ---------: | ---------: | ----------: | ---------: | ----------: |
| Qwen3.6 (native)         |     $0.048 |     $0.304 |     $0.050 |     $0.318 |     $0.046 |      $0.294 |     $0.044 |      $0.277 |
| **SARF v0.3.1**          |     $0.037 |     $0.237 |     $0.040 |     $0.254 |     $0.035 |      $0.224 |     $0.036 |      $0.226 |
| Savings v0.3.1 vs Qwen   | **−21.9%** | **−21.9%** | **−20.0%** | **−20.0%** | **−23.8%** |  **−23.8%** | **−18.3%** |  **−18.3%** |
| **SARF v0.2**            |     $0.041 |     $0.258 |     $0.043 |     $0.270 |     $0.038 |      $0.242 |     $0.038 |      $0.243 |
| Savings v0.2 vs Qwen     |     −15.0% |     −15.0% |     −15.2% |     −15.2% |     −17.6% |      −17.6% |     −12.3% |      −12.3% |

Savings are computed from the CpT ratio (`1 − Qwen_CpT ÷ SARF_CpT`) rather than from the rounded dollar figures, so input and output savings are identical within a domain.

#### What SARFTokenizer compression means at GPT-style pricing

This is not an API price for SARFTokenizer. It shows the **compression advantage only**: if a model had GPT-style pricing but used SARFTokenizer compression instead of `o200k_base`, the estimated cost per 1M characters would be lower because the same text becomes fewer tokens.

| Pricing scenario     | Tokenizer                        |   AR input |  AR output |   EN input |  EN output | Math input | Math output | Code input | Code output |
| -------------------- | -------------------------------- | ---------: | ---------: | ---------: | ---------: | ---------: | ----------: | ---------: | ----------: |
| GPT-5.4 mini pricing | `o200k_base`                     |     $0.243 |     $1.458 |     $0.220 |     $1.320 |     $0.214 |      $1.284 |     $0.207 |      $1.242 |
| GPT-5.4 mini pricing | SARFTokenizer v0.3.1 compression | **$0.187** | **$1.124** | **$0.201** | **$1.205** | **$0.177** |  **$1.061** | **$0.179** |  **$1.071** |
| GPT-5.4 mini pricing | SARFTokenizer v0.2 compression   |     $0.204 |     $1.222 |     $0.213 |     $1.278 |     $0.191 |      $1.147 |     $0.192 |      $1.150 |
| GPT-5.5 pricing      | `o200k_base`                     |     $1.620 |     $9.718 |     $1.467 |     $8.800 |     $1.427 |      $8.559 |     $1.380 |      $8.283 |
| GPT-5.5 pricing      | SARFTokenizer v0.3.1 compression | **$1.249** | **$7.493** | **$1.339** | **$8.036** | **$1.178** |  **$7.070** | **$1.190** |  **$7.143** |
| GPT-5.5 pricing      | SARFTokenizer v0.2 compression   |     $1.358 |     $8.146 |     $1.420 |     $8.518 |     $1.275 |      $7.649 |     $1.278 |      $7.667 |

### v0.3.1 vs the best peer per domain

| Domain  |    v0.3.1 |                                   Best peer |          Δ |
| ------- | --------: | ------------------------------------------: | ---------: |
| Arabic  | **4.004** |                         Qwen3.6-35B (3.129) | **+28.0%** |
| English | **3.733** | GPT-5.4-mini / GPT-5.5 `o200k_base` (3.409) |  **+9.5%** |
| Math    | **4.243** |                           Kimi-K2.6 (3.520) | **+20.5%** |
| Code    | **4.200** |                           Kimi-K2.6 (3.630) | **+15.7%** |

### v0.3.1 vs prior SARFTokenizer revisions

| Domain  | v0.2 (65k) | v0.3 (80k) | **v0.3.1 (100k)** | Δ vs v0.2 |
| ------- | ---------: | ---------: | ----------------: | --------: |
| Arabic  |      3.683 |      3.192 |         **4.004** | **+8.7%** |
| English |      3.522 |      3.631 |         **3.733** | **+6.0%** |
| Math    |      3.922 |      4.259 |             4.243 | **+8.2%** |
| Code    |      3.913 |      4.224 |             4.200 |     +7.3% |

The 100k vocab gives Arabic ~50,000 effective slots (vs v0.2's 32,500 at 65k), and the 250M-char Arabic training share matches v0.2 exactly — so AR strictly gains from the larger vocab while math/code retain v0.3-class compression.

## Why this matters

* **Arabic-first deployments**: 4.004 AR CpT means ~30% more Arabic context in the same window vs GPT-5.4-mini / GPT-5.5 `o200k_base`, ~9% more vs our own v0.2.
* **Bilingual + technical domains**: math and code now first-class — strong compression on Python, math word problems, and formal reasoning chains.
* **Vocab specialization > vocab size**: at 100k we beat models with 200k–262k vocabularies on every domain.
* **Same infrastructure**: `AutoTokenizer.from_pretrained` without `trust_remote_code`, no Python preprocessing.

## Caveats we want you to know

1. **Lossy Arabic normalization (inherited from v0.2).** Tashkeel, Alef variants, Alef Maksura, and Arabic-Indic digits are normalized at encode time. Not suitable for Qur'anic text or classical poetry with full diacritics.
2. **Math is web-style.** Trained on FineMath-4plus — natural-language math web text, not LaTeX-heavy formal mathematics.
3. **Code is Python-leaning.** Trained on Nemotron-Code, dominated by Python competitive-programming solutions with `<think>` reasoning. Less common languages may fall back to byte-level pieces more often.
4. **Larger embedding table.** 100k × hidden_dim is ~54% bigger than v0.2's 65k-row table. Worth it if you can afford the parameters; if not, see v0.2 (AR/EN only) or v0.3 (4-domain at 80k with AR regression).
5. **Breaking change vs v0.2/v0.3 special tokens.** The old `<s>` / `</s>` / `<unk>` / `<pad>` tokens are no longer present. Pin `revision="v0.2"` (or `revision="v0.3"`) if you depend on the legacy special tokens and their IDs.
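
To put caveat 4 in concrete terms, here is a rough sketch of the embedding-table size at both vocab sizes. The hidden width of 2048 is a hypothetical value picked for illustration, and the count assumes a single untied embedding matrix:

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in one (untied) token-embedding matrix."""
    return vocab_size * hidden_dim


HIDDEN_DIM = 2048  # hypothetical model width, not taken from this README

v02 = embedding_params(65_000, HIDDEN_DIM)
v031 = embedding_params(100_000, HIDDEN_DIM)
growth = v031 / v02 - 1.0  # relative growth, independent of HIDDEN_DIM

print(f"v0.2:   {v02:,} params")
print(f"v0.3.1: {v031:,} params ({growth:.1%} larger)")
```

If the model has a separate (untied) output projection, the same row count applies there too, so the absolute parameter cost doubles while the relative growth stays the same.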

---

## Special tokens

13 atomic special tokens with reserved IDs 0–12 (single-token, never split):

| ID | Token                   | Slot       | Purpose                              |
| -: | ----------------------- | ---------- | ------------------------------------ |
|  0 | `<\|assistant_end\|>`   | additional | end of assistant turn (chat)         |
|  1 | `<\|assistant_start\|>` | additional | start of assistant turn (chat)       |
|  2 | `<\|bos\|>`             | bos_token  | beginning-of-sequence                |
|  3 | `<\|end_of_text\|>`     | eos_token  | end-of-sequence                      |
|  4 | `<\|mask\|>`            | mask_token | mask for FIM / denoising / infilling |
|  5 | `<\|output_end\|>`      | additional | end of tool / exec output block      |
|  6 | `<\|output_start\|>`    | additional | start of tool / exec output block    |
|  7 | `<\|pad\|>`             | pad_token  | padding                              |
|  8 | `<\|python_end\|>`      | additional | end of Python code block             |
|  9 | `<\|python_start\|>`    | additional | start of Python code block           |
| 10 | `<\|unk\|>`             | unk_token  | unknown / byte-fallback signal       |
| 11 | `<\|user_end\|>`        | additional | end of user turn (chat)              |
| 12 | `<\|user_start\|>`      | additional | start of user turn (chat)            |

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.bos_token, tok.eos_token, tok.mask_token)
# → <|bos|> <|end_of_text|> <|mask|>
```

The chat / code / output tokens enable a downstream model to emit:

```
<|user_start|>solve x^2 + 3x = 10<|user_end|>
<|assistant_start|>
<|python_start|>
from sympy import symbols, solve
x = symbols('x')
print(solve(x**2 + 3*x - 10))
<|python_end|>
<|output_start|>
[-5, 2]
<|output_end|>
The roots are x = -5 and x = 2.
<|assistant_end|>
<|end_of_text|>
```

without any markup-tokenization overhead — every boundary is a single token.
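
As a sketch of how such a sequence might be assembled in training or serving code, the helper below joins the boundary tokens around each segment. `render_exchange` is a hypothetical name, and the newline placement simply copies the example above; a real chat template may lay things out differently.

```python
def render_exchange(user: str, code: str, output: str, answer: str) -> str:
    """Assemble one user/assistant exchange that includes a Python tool call.

    The layout (one boundary token per line) mirrors the README example and
    is an assumption, not a documented chat template.
    """
    return "\n".join([
        f"<|user_start|>{user}<|user_end|>",
        "<|assistant_start|>",
        "<|python_start|>",
        code,
        "<|python_end|>",
        "<|output_start|>",
        output,
        "<|output_end|>",
        answer,
        "<|assistant_end|>",
        "<|end_of_text|>",
    ])


text = render_exchange(
    user="solve x^2 + 3x = 10",
    code="from sympy import symbols, solve\n"
         "x = symbols('x')\n"
         "print(solve(x**2 + 3*x - 10))",
    output="[-5, 2]",
    answer="The roots are x = -5 and x = 2.",
)
print(text)
```

Encoding the result with the tokenizer should map each `<|...|>` marker to its single reserved ID from the table above.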

---

## Overview

| Property             | Value                                                                                                 |
| -------------------- | ----------------------------------------------------------------------------------------------------- |
| Vocabulary size      | **100,000**                                                                                           |
| Pre-tokenizer        | Metaspace (`▁` marker, SentencePiece-style)                                                           |
| Normalizer           | Arabic-focused: NFKC → Alef/Ya unification → tashkeel/tatweel/zero-width strip → Arabic-Indic digits to ASCII |
| Special tokens       | 13 (see table above)                                                                                  |
| Domains              | Arabic + English + Math + Code                                                                        |
| Training corpus      | 500M chars (250M AR / 100M EN / 75M math / 75M code)                                                  |
| Training corpus repo | `almaghrabima/deeplatent-labeled`                                                                     |
| Public API           | `AutoTokenizer.from_pretrained` without `trust_remote_code`                                           |

---

## Quick start

```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.vocab_size)  # 100000
```

To pin to a specific revision:

```python
# v0.3.1 (latest, 100k, 4-domain, modern specials, this revision)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3.1")

# v0.3 (80k, 4-domain, legacy <s>/</s> specials — accepts AR regression for smaller vocab)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3")

# v0.2 (65k, AR/EN only, legacy specials — original SOTA-Arabic release)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.2")
```

## Low-level `tokenizers` API

```python
from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")  # main = v0.3.1

print(tok.encode("المعلم يشرح الدرس في الصف اليوم.", add_special_tokens=False).tokens)
print(tok.encode("def fib(n):\n    return n if n<2 else fib(n-1)+fib(n-2)",
                 add_special_tokens=False).tokens)
```

## Reproduce the benchmark

The eval set (300 AR + 300 EN + 300 math + 300 code) is built from:

* **AR/EN**: the `SARFTokenizer-benchmark-eval` dataset.
* **Math**: held-out tail of `HuggingFaceTB/finemath` (`finemath-4plus`).
* **Code**: held-out tail of `saurabh5/nemotron-post-training-dataset-v1-code` with role markers stripped (problem + solution flattened with `\n\n`).

Each doc is capped at 2,000 chars, with no normalization beyond what each tokenizer applies internally.
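
A minimal sketch of the CpT measurement under these settings follows. It assumes CpT is pooled over the whole set (total characters ÷ total tokens) and that characters are counted as Unicode code points of the raw, truncated text; the README does not spell out either detail, and `chars_per_token` is an illustrative name.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(
    docs: Iterable[str],
    encode: Callable[[str], Sequence],
    max_chars: int = 2000,
) -> float:
    """Corpus-level CpT: total characters divided by total tokens.

    Each document is truncated to `max_chars` before encoding. Pooling the
    counts (rather than averaging per-document CpT) is an assumption.
    """
    total_chars = 0
    total_tokens = 0
    for doc in docs:
        text = doc[:max_chars]
        total_chars += len(text)
        total_tokens += len(encode(text))
    return total_chars / total_tokens


# Toy run with whitespace splitting standing in for a real tokenizer.
toy_docs = ["ab cd", "efgh"]
print(chars_per_token(toy_docs, str.split))  # 9 chars / 3 tokens

# With a Hugging Face tokenizer `tok`, the encode callable would be:
#   lambda text: tok.encode(text, add_special_tokens=False)
```

Running it once per domain subset with each tokenizer's `encode` gives one cell of the benchmark table.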

## Normalization (lossy on Arabic, by design)

All Arabic text is normalized at encode time:

* **NFKC** compat normalization
* **Tashkeel** (`U+064B`–`U+0652`, `U+0670`) removed
* **Tatweel** `U+0640` removed
* **Zero-width + BiDi controls** removed
* **Alef variants** (`أ`, `إ`, `آ`, `ٱ`) → bare Alef `ا`
* **Alef Maksura** `ى` → Ya `ي`
* **Arabic-Indic digits** (`٠`–`٩`) → ASCII `0`–`9`

Encoding is lossy on diacritics and Alef-Hamza variants — by design. If your downstream task requires preserving these (classical poetry with full diacritics, Qur'anic text), this tokenizer is not suitable.
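
For readers who want to preview the effect on their own text, the steps above can be approximated in pure Python. This is a sketch, not the normalizer shipped in `tokenizer.json`: the exact set of zero-width and BiDi code points is our assumption, and `normalize_arabic` is an illustrative name.

```python
import unicodedata

# Code points removed outright (mapped to None).
_REMOVE = (
    list(range(0x064B, 0x0653))    # tashkeel U+064B..U+0652
    + [0x0670]                     # superscript Alef
    + [0x0640]                     # tatweel
    + list(range(0x200B, 0x2010))  # zero-width chars, LRM/RLM (assumed set)
    + list(range(0x202A, 0x202F))  # BiDi embeddings/overrides (assumed set)
    + list(range(0x2066, 0x206A))  # BiDi isolates (assumed set)
    + [0xFEFF]                     # zero-width no-break space / BOM
)

_TABLE = {cp: None for cp in _REMOVE}
# Alef with hamza above/below, madda, and wasla -> bare Alef.
_TABLE.update({cp: 0x0627 for cp in (0x0623, 0x0625, 0x0622, 0x0671)})
_TABLE[0x0649] = 0x064A  # Alef Maksura -> Ya
# Arabic-Indic digits U+0660..U+0669 -> ASCII 0..9.
_TABLE.update({0x0660 + i: ord("0") + i for i in range(10)})


def normalize_arabic(text: str) -> str:
    """Approximate the normalization steps listed above."""
    return unicodedata.normalize("NFKC", text).translate(_TABLE)


# Alef-hamza + diacritics + an Arabic-Indic digit, written as escapes.
sample = "\u0623\u064e\u062d\u0652\u0645\u064e\u062f\u064f \u0663"
print(normalize_arabic(sample))
```

Comparing the input and output of this function on a sample of your data is a quick way to judge whether the information loss is acceptable for your task.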

---

## Files

* `tokenizer.json` — HuggingFace-format tokenizer (6.6 MB)
* `tokenizer_config.json` — `PreTrainedTokenizerFast` config
* `special_tokens_map.json` — special tokens map (5 named slots + 13-item additional)
* `BENCHMARK.md` — full results across 15 tokenizers (this README's table)
* `bench_results.json` — raw per-tokenizer per-domain metrics

## Related

* Training corpus: `almaghrabima/deeplatent-labeled` — 4-domain labeled pretraining corpus
* Eval corpus (AR/EN portion): `almaghrabima/SARFTokenizer-benchmark-eval` — 300 AR + 300 EN held-out documents

## Version history

* **v0.3.1** (latest, this revision) — 100k vocab, 4-domain, **13 modern `<|...|>` specials**. SOTA on AR/EN/math/code.
* **v0.3** — 80k vocab, 4-domain, legacy `<s>` / `</s>` / `<unk>` / `<pad>` specials. Math/code SOTA but AR regresses vs v0.2.
* **v0.2** — 65k vocab, AR/EN only, legacy specials. Original release; SOTA Arabic at sub-100k tier.

## License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).