1. Introduction
XGen-Q is a domain-adapted large language model for software security and malware analysis.
The model is built on Qwen2.5-Coder-1.5B-Instruct and further pretrained on large-scale malware datasets containing both source code and assembly code. XGen-Q is designed to analyze suspicious code, explain malware behaviors, and produce structured forensic reports that help security analysts understand why a sample may be malicious.
A key feature of XGen-Q is its two-stage reasoning pipeline, which separates detailed behavioral analysis from the final classification decision. This design improves explainability and makes the model suitable for integration into real-world cybersecurity workflows.
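The separation between analysis and classification can be sketched as two chained generation calls. This is only an illustration under stated assumptions: the prompt templates, the `two_stage_analysis` function, and the `generate` callable are hypothetical names introduced here, and the source does not describe how the stages are actually implemented.

```python
from typing import Callable

# Hypothetical prompt templates; the model card does not publish exact wording.
ANALYSIS_PROMPT = (
    "You are a cybersecurity malware analyst.\n"
    "Describe the behavior of the following code step by step.\n\n"
    "Code:\n{code}\n"
)
VERDICT_PROMPT = (
    "Behavioral analysis:\n{analysis}\n\n"
    "Based only on this analysis, answer with one label: "
    "malware, benign, or partially malicious.\n"
)


def two_stage_analysis(code: str, generate: Callable[[str], str]) -> dict:
    """Run behavioral analysis first, then classify from that analysis."""
    # Stage 1: free-form behavioral analysis of the sample.
    analysis = generate(ANALYSIS_PROMPT.format(code=code)).strip()
    # Stage 2: the verdict is conditioned on the stage-1 text, not the raw code.
    verdict = generate(VERDICT_PROMPT.format(analysis=analysis)).strip().lower()
    return {"analysis": analysis, "verdict": verdict}
```

Here `generate` stands for any function that maps a prompt string to generated text, for example a thin wrapper around `model.generate`. Keeping the stage-1 text in the returned dictionary means an analyst can inspect the reasoning that led to the label.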
Domain-adaptive pretraining uses the SBAN dataset, a multi-dimensional malware dataset designed for LLM pretraining in software security.
SBAN Dataset:
https://ieeexplore.ieee.org/document/11392071/
2. Evaluation Results
| Model | Assembly Code Perplexity ↓ | Source Code Perplexity ↓ |
|---|---|---|
| XGen-Q | 1.530 | 1.592 |
| DeepSeek-Coder-1.3B | 9.183 | 3.997 |
| Llama-3.1-8B-Instruct | 9.972 | 5.822 |
| Phi-4-Mini | 16.713 | 7.739 |
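For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the text more predictable. The sketch below shows that standard definition only; the `perplexity` helper is a hypothetical name, and the source does not state its evaluation setup (tokenization, context length, or test split).

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(-mean(log p(token_i | preceding tokens)))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood over all scored tokens (natural log).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
example = perplexity([math.log(0.5)] * 4)
```

Read this way, a perplexity of 1.530 corresponds to a geometric-mean probability of about 0.65 per token.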
3. Citation
@article{jelodar2025xgenq,
  title={XGen-Q: An Explainable Domain-Adaptive LLM Framework with Retrieval-Augmented Generation for Software Security},
  author={Jelodar, Hamed and Meymani, Mohammad and Razavi-Far, Roozbeh and Ghorbani, Ali},
  journal={arXiv preprint arXiv:2510.19006},
  year={2025}
}
4. How to Use
Below is a simple example showing how to run XGen-Q using the Hugging Face Transformers library.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "JeloH/xGenq-qwen2.5-coder-1.5b-instruct-OKI"

# Load the tokenizer and the model weights in half precision.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Sample to analyze (a stub standing in for real suspicious code).
code_snippet = """
void inject_polymorphic_dll(DWORD pid) {
    // suspicious DLL injection example
}
"""

prompt = f"""
You are a cybersecurity malware analyst.
Analyze the following code and produce:
1. Conclusion
2. Reasoning
3. Evidence
4. Suspicious Behavior Explanation
5. Final Judgment (malware / benign / partially malicious)
Code:
{code_snippet}
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=True,  # enable sampling explicitly so that temperature applies
    temperature=0.2
)

# The decoded text contains the prompt followed by the generated report.
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
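To use the verdict in an automated workflow, the label can be pulled out of the generated report. The helper below is a minimal sketch, not part of XGen-Q: `extract_judgment` and `LABELS` are hypothetical names, and it assumes the model repeats a "Final Judgment" heading followed by one of the three labels from the prompt. Actual output may not follow that layout, so treat a `None` result as "needs manual review".

```python
import re
from typing import Optional

# Labels taken from the prompt above.
LABELS = ("partially malicious", "malware", "benign")


def extract_judgment(result: str, prompt: str = "") -> Optional[str]:
    """Return the label after the last 'Final Judgment' heading, or None."""
    # The decoded output echoes the prompt, so drop it when it is a prefix.
    # Assumption: decoding reproduces the prompt text exactly.
    report = result[len(prompt):] if prompt and result.startswith(prompt) else result
    # Keep only the text that follows the last "Final Judgment" heading.
    parts = re.split(r"final judgment", report, flags=re.IGNORECASE)
    if len(parts) < 2:
        return None
    tail = parts[-1].lower()
    # Choose the label that appears earliest after the heading.
    hits = [(tail.find(label), label) for label in LABELS if label in tail]
    return min(hits)[1] if hits else None
```

With the variables from the example above, the call would be `extract_judgment(result, prompt)`.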
