Model Card for KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic

This repository provides an FP8-dynamic quantized variant of
scb10x/typhoon2.5-qwen3-30b-a3b, a Thai–English bilingual instruct large language model built on Qwen3 30B A3B.

⚠️ This model is not further trained or fine-tuned.
It is a post-training FP8 dynamic quantization of the original Typhoon 2.5 Qwen3-30B-A3B weights for more efficient inference.


Model Details

Model Description

  • Base model: scb10x/typhoon2.5-qwen3-30b-a3b
  • Family: Typhoon 2.5 (Thai LLMs by SCB 10X)
  • Architecture: Sparse Mixture-of-Experts (MoE) based on Qwen3 30B A3B
  • Parameters: ~30B parameters, ~3B active per token (MoE)
  • Context length: Up to 256k tokens (inherits from Typhoon 2.5 / Qwen3 A3B)
  • Languages: Thai and English (Thai-first, English-capable)
  • Model type: Instruct conversational causal language model
  • Quantization: FP8 dynamic (weights + activations) for inference
  • License: Apache 2.0 (same as base model)
  • Developer of this variant: KJML
  • Finetuned from: scb10x/typhoon2.5-qwen3-30b-a3b
    (no extra training; quantization only)

Typhoon 2.5 Qwen3-30B-A3B is designed as a high-quality Thai–English bilingual instruct model, optimized for real-world agentic AI workloads with strong performance–cost trade-offs. This FP8-dynamic variant aims to keep that behavior while reducing VRAM usage and improving throughput on FP8-capable hardware.

Model Sources

  • Base model card: scb10x/typhoon2.5-qwen3-30b-a3b on Hugging Face
  • Typhoon project page: opentyphoon.ai (Typhoon 2.5 release)
  • This quantized model: KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic (this repo)

Uses

Direct Use

This FP8-dynamic variant is primarily intended for inference in:

  • Thai-focused and Thai–English chatbots / assistants
  • Thai business and enterprise applications:
    • Customer support and FAQ bots
    • Document Q&A, summarization, and drafting
    • Internal knowledge assistants
  • Agentic AI setups with:
    • Tool calling / function calling
    • Retrieval-Augmented Generation (RAG)
    • Workflow orchestration

Because it inherits the instruct tuning from Typhoon 2.5, it works best with a chat-style prompt that uses the same <|im_start|> / <|im_end|> (ChatML-style) formatting as the base model.
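
For reference, the rendered prompt follows the familiar ChatML-style layout used by Qwen-family models. The strings below are purely illustrative; in practice, let tokenizer.apply_chat_template produce the prompt for you (see the quickstart further down):

<|im_start|>system
You are a helpful Thai–English assistant.<|im_end|>
<|im_start|>user
Please summarize this document for me.<|im_end|>
<|im_start|>assistant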

Downstream Use

You can use this FP8-dynamic model as a drop-in replacement for the base Typhoon 2.5 30B model in:

  • Custom inference backends (Transformers, vLLM, TGI, etc.)
  • Local and on-prem deployments where:
    • VRAM is constrained, or
    • Throughput / cost per token is a priority
  • Multilingual / bilingual applications that need:
    • Strong Thai fluency and tone
    • Good English capability for mixed TH/EN workflows

If you choose to fine-tune this model further, treat it as you would the base model, but be aware that quantization can slightly affect numerical stability for very long or complex generations.

Out-of-Scope Use

This model (and this quantized variant) is not suitable as the sole decision-maker for:

  • Medical, legal, financial, or other high-stakes advice
  • Safety-critical systems (e.g., industrial control, physical robotics)
  • Any scenario where hallucinations or biased outputs can cause harm

It should also not be used to intentionally generate:

  • Hate speech, harassment, or abusive content
  • Disinformation or deceptive content
  • Content that violates applicable laws or platform policies

Always keep a human in the loop for sensitive or impactful use cases.


Bias, Risks, and Limitations

This model inherits the biases and limitations of both:

  1. The Typhoon 2.5 training data and alignment, and
  2. The underlying Qwen3 30B A3B architecture.

Potential issues:

  • Cultural and demographic bias
    Outputs may reflect stereotypes or imbalances present in web-scale data, in both Thai and English contexts.
  • Hallucinations
    The model may generate plausible-sounding but incorrect information or fabricated citations.
  • Overconfidence
    It may respond with high confidence even when it is uncertain or incorrect.
  • Quantization effects
    FP8 dynamic quantization may slightly degrade quality, especially:
    • In very long contexts
    • For edge cases that are numerically sensitive

Recommendations

  • Do not treat the model as an authoritative source of truth.
  • Add safety filters and/or a moderation layer for production use.
  • Use human review on user-facing or high-impact outputs.
  • Evaluate this FP8-dynamic variant on your own tasks (Thai & English) before deployment.

How to Get Started with the Model

Basic usage with 🤗 Transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # will leverage FP8 where supported
    device_map="auto",
)

messages = [
    {"role": "system", "content": "คุณเป็นผู้ช่วย AI ที่พูดไทยได้ลื่นไหลและช่วยอธิบายอย่างเข้าใจง่าย"},
    {"role": "user", "content": "ช่วยอธิบาย Typhoon 2.5 แบบเข้าใจง่ายหน่อยครับ"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=256,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Make sure you are using a recent version of Transformers and a PyTorch build that supports FP8 inference on your GPU.
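
For serving, vLLM can load this repository directly. The snippet below is a minimal sketch, assuming a recent vLLM build with FP8 support on an FP8-capable GPU (and that the checkpoint is stored in a format vLLM understands, such as compressed-tensors); adjust max_model_len and parallelism settings to your hardware:

from vllm import LLM, SamplingParams

# Assumes a recent vLLM with FP8 support on an FP8-capable GPU.
llm = LLM(
    model="KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic",
    max_model_len=8192,   # raise for long-context use if you have the VRAM
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful bilingual Thai–English assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization changes for deployment."},
]

# llm.chat applies the model's own chat template before generating.
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)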


Training Details

Training Data

No new data is introduced in this repository.

  • This repo does not train a model from scratch.
  • It simply quantizes the existing Typhoon 2.5 Qwen3-30B-A3B model.
  • For full details on training data and preprocessing, consult the Typhoon 2.5 and Qwen3 documentation and papers.

Training Procedure

There is no additional gradient-based training for this model.

Steps performed:

  1. Load base weights from scb10x/typhoon2.5-qwen3-30b-a3b.
  2. Apply FP8 dynamic post-training quantization (weights & activations) for inference (a hedged tooling sketch follows this list).
  3. Export the quantized weights in safetensors format under this repository.
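
The exact quantization tooling is not pinned down in this card. As an illustration of step 2 only, a typical FP8-dynamic post-training quantization pass with the llm-compressor library (an assumed choice of tool, not necessarily the script used for this repo) looks roughly like this; FP8_DYNAMIC needs no calibration data because activation scales are computed at runtime:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

base_id = "scb10x/typhoon2.5-qwen3-30b-a3b"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# FP8_DYNAMIC: static per-channel FP8 weight scales, dynamic per-token activation scales.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],   # keep sensitive layers (often also MoE router gates) in higher precision
)

oneshot(model=model, recipe=recipe)

model.save_pretrained("typhoon2.5-qwen3-30b-a3b-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("typhoon2.5-qwen3-30b-a3b-FP8-Dynamic")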

Preprocessing

No extra data preprocessing was performed beyond what the base model already uses. All tokenization and chat formatting follow the original Typhoon 2.5 / Qwen3 template (e.g. <|im_start|> / <|im_end|>).

Training Hyperparameters

  • Training regime in this repo: None (quantization only).
  • Original base model training: See the Typhoon 2.5 and Qwen3 30B A3B model cards / papers for full details on optimizer, schedule, RL, and instruction tuning.

Speeds, Sizes, Times

Exact numbers depend on your GPU, but you can expect:

  • Lower VRAM usage compared to full-precision / BF16 variants.
  • Higher throughput (tokens/sec), especially with batched requests.
  • Quality competitive with 4-bit quantizations such as Q4_K_M, while benefiting from FP8’s finer numeric precision on supported hardware.

You are encouraged to benchmark on your own hardware and share results in issues or discussions.
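
As a starting point for such a benchmark, a rough single-request tokens-per-second measurement reusing the model and tokenizer from the quickstart above (purely illustrative; serving engines such as vLLM report their own throughput numbers) could look like this:

import time
import torch

# Assumes `model` and `tokenizer` are loaded as in the quickstart, on a CUDA device.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a short paragraph about Bangkok."}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(prompt, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - prompt.shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec (single request, greedy decoding)")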


Evaluation

This repository does not include a separate evaluation suite specific to the FP8-dynamic variant.

Testing Data, Factors & Metrics

  • No independent test set is bundled here.
  • It is reasonable to expect similar qualitative performance to the base Typhoon 2.5 30B model, with minor numerical differences due to FP8.

Results

If you evaluate this model on:

  • Thai QA / reasoning benchmarks
  • Thai chat / fluency tests
  • Mixed Thai–English tasks

please consider sharing your results, scripts, or benchmark setups so others can benefit.

Summary

Use KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic when you want:

  • Typhoon 2.5 30B-level Thai fluency and reasoning
  • Reduced VRAM usage and better inference efficiency via FP8-dynamic quantization

Model Examination

No additional interpretability analysis has been performed specifically for this FP8-dynamic variant.

For deeper understanding of:

  • MoE expert routing,
  • Thai fluency tuning, and
  • Agentic alignment techniques,

please refer to the official Typhoon 2.5 paper and model documentation.


Environmental Impact

This repository does not involve training a new 30B model.

  • Only a one-time quantization pass over the base model’s weights was performed.
  • The environmental impact is therefore minimal compared to the original model training.

For training-time emissions and environmental considerations, please consult:

  • The Typhoon 2 / Typhoon 2.5 papers
  • Qwen3 technical reports

Technical Specifications

Model Architecture and Objective

  • Architecture: Qwen3-based sparse Mixture-of-Experts Transformer (~30B total parameters, ~3B active per token)

  • Objective: Causal language modeling (next-token prediction)

  • Capabilities:

    • Thai–English bilingual reasoning
    • Long-context understanding (up to 256k)
    • Instruct / conversational alignment
    • Function calling and tool usage (inherited from Typhoon 2.5)

Compute Infrastructure

Quantization was performed on a modern GPU setup capable of handling the base 30B Typhoon 2.5 model.

Hardware

  • Single or few GPUs with enough VRAM to host the original weights
  • FP8-capable hardware recommended for best inference performance

Software

  • PyTorch (with FP8 support)
  • Hugging Face Transformers
  • Supporting libraries for FP8 dynamic quantization and safetensors export

Citation

If you use this model in your work, please cite at least the Typhoon 2 / Typhoon 2.5 and Qwen3 papers, as well as this repository if relevant.

Example:

@misc{typhoon2.5-2025,
  title        = {Typhoon 2.5: Thai Large Language Models based on Qwen3 30B A3B},
  author       = {SCB 10X and collaborators},
  year         = {2025},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL}
}

@misc{kjml2025typhoon2.5fp8dynamic,
  title        = {KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic: FP8-dynamic Quantized Variant of Typhoon2.5-Qwen3-30B-A3B},
  author       = {KJML},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  url          = {https://huggingface.co/KJML/typhoon2.5-qwen3-30b-a3b-FP8-Dynamic}
}

Glossary

  • Typhoon 2.5: A family of Thai-focused open-source LLMs optimized for real-world Thai & English applications.
  • MoE (Mixture-of-Experts): Architecture where only a subset of experts (layers/blocks) are active per token, reducing compute cost.
  • FP8 dynamic: 8-bit floating-point (FP8) quantization in which weight scales are fixed ahead of time and activation scales are computed dynamically at runtime, improving efficiency while retaining quality (see the numeric sketch after this list).
  • Instruct model: A model tuned to follow natural language instructions and chat-style prompts.
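
To make the FP8 dynamic entry concrete, here is a toy numeric sketch of dynamic per-tensor scaling (assuming PyTorch ≥ 2.1, which provides torch.float8_e4m3fn); real kernels use finer-grained scales, but the idea is the same:

import torch

x = torch.randn(4, 8) * 3.0                       # toy activation tensor

fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448.0 for the E4M3 format
scale = x.abs().max() / fp8_max                   # "dynamic": derived from this tensor at runtime

x_fp8 = (x / scale).to(torch.float8_e4m3fn)       # quantize to 8-bit floats
x_dq = x_fp8.to(torch.float32) * scale            # dequantize for comparison

print("max abs error:", (x - x_dq).abs().max().item())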

More Information

For more details on:

  • Typhoon 2 / 2.5 releases and benchmarks
  • Thai-specific tuning and human-in-the-loop alignment
  • Integration with agentic systems and tools

see the official Typhoon documentation and community channels.


Model Card Authors

  • KJML