Qwen3.5-9B Humanize DPO Round 2

LoRA adapter for Chinese text humanization, fine-tuned with DPO on Qwen3.5-9B. This is the latest model in the series and the strongest on academic and technical Chinese text.

Training uses a pure self-play approach: the rejected samples are outputs from an intermediate checkpoint of the previous training stage, which teaches the model to consistently improve on its own prior output.
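The pairing scheme can be sketched as follows. This is a minimal illustration, not the author's data pipeline: the function name `build_self_play_pairs`, the record fields `source` and `human`, the `prompt`/`chosen`/`rejected` output layout, and the `generate_fn` callback standing in for the prior checkpoint are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def build_self_play_pairs(
    records: List[Dict[str, str]],
    generate_fn: Callable[[str], str],
    instruction: str,
) -> List[Dict[str, str]]:
    """Build DPO preference pairs whose rejected side is self-generated.

    records: hypothetical items with a "source" text to rewrite and a
             human-written "human" reference for the same content.
    generate_fn: stand-in for the prior-stage checkpoint; maps a prompt
                 to that checkpoint's rewrite.
    """
    pairs = []
    for rec in records:
        # Same prompt layout as the usage snippet in this card.
        prompt = f"{instruction}\n\n原文:{rec['source']}"
        chosen = rec["human"]              # human text is preferred
        rejected = generate_fn(prompt)     # prior checkpoint output is dispreferred
        # Identical texts carry no preference signal, so skip them.
        if rejected.strip() == chosen.strip():
            continue
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Because every rejected sample comes from a single generator, the pairs all express the same contrast: human text versus the previous checkpoint.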

Model Details

| Item | Value |
| --- | --- |
| Base model | unsloth/Qwen3.5-9B |
| Starting point | Intermediate checkpoint from prior DPO stage |
| Fine-tuning method | DPO (self-play rejected) |
| LoRA rank | 16 |
| Training data | 2000 pairs (chosen = CSL human text, rejected = prior checkpoint outputs) |
| Training steps | 250 steps (2 epochs) |
| Final loss | 0.34 |
| Final margin | ~1.5-2.5 |
| Final accuracy | ~93-100% |
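
The pair count, epoch count, and step count above imply an average number of pairs per optimizer step. The arithmetic below is only a consistency check, assuming every step consumed the same number of pairs and none were dropped; the card does not state the per-device batch size or gradient accumulation setting, and the helper name is hypothetical.

```python
def implied_pairs_per_step(num_pairs: int, epochs: int, steps: int) -> float:
    """Average number of preference pairs consumed per optimizer step."""
    return num_pairs * epochs / steps


# 2000 pairs, 2 epochs, 250 steps
print(implied_pairs_per_step(2000, 2, 250))  # 16.0
```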

What It Does

  • Academic papers: preserves all numbers, model names, formulas, and technical terms (verified on 10 academic scenarios, 3 samples each)
  • Technical reports: maintains technical register without sounding robotic
  • Daily text: more concise and natural than Round 1

Usage

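from unsloth import FastLanguageModel
from peft import PeftModel

# Load the base model without 4-bit quantization.
base_model, proc = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-9B", max_seq_length=2048, load_in_4bit=False,
)
# The loader may return a processor that wraps the tokenizer; unwrap it if so.
tokenizer = proc.tokenizer if hasattr(proc, "tokenizer") else proc

# Attach the LoRA adapter for inference only.
model = PeftModel.from_pretrained(
    base_model, "XiangJinYu/Qwen3.5-9B-Humanize-DPO-Round2", is_trainable=False,
)
# Workaround: relabel model_type "qwen3_5" as "qwen3" before enabling inference mode.
if hasattr(model, "config") and getattr(model.config, "model_type", "") == "qwen3_5":
    model.config.model_type = "qwen3"
FastLanguageModel.for_inference(model)

# Instruction (in Chinese): "Rewrite the text below so it reads more like natural
# human writing; keep the original meaning and facts; do not add titles or notes."
instruction = "请将下面文本改写得更像自然人写作,保持原意与事实,不要加标题或说明。"
# Sample input: an academic sentence about a U-Net-based medical image segmentation
# method (Dice 0.923, +4.7 points over the baseline, about 30% faster inference).
text = "本文提出了一种基于U-Net改进的医学影像分割方法,Dice系数达到0.923,较基线方法提升了4.7个百分点,推理速度提升约30%。"
# "原文" means "original text".
messages = [{"role": "user", "content": [{"type": "text", "text": f"{instruction}\n\n原文:{text}"}]}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Recommended: temperature 0.60-0.65 for academic texts
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.65,
                         top_p=0.9, do_sample=True, repetition_penalty=1.1)
# Keep only the newly generated tokens (drop the prompt) before decoding.
gen = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(gen, skip_special_tokens=True))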

Training Details

  • DPO data: 2000 pairs; chosen = CSL human academic text, rejected = prior checkpoint outputs (pure self-play, no casual text mixed in)
  • Hyperparameters: beta = 0.2, learning rate = 5e-7 (conservative, to avoid drift), 2 epochs (250 steps)
  • Rejected reward: stayed negative and kept decreasing throughout all 250 steps, with no instability
  • Key design: using only self-play rejected samples gives a clean, single-distribution gradient signal
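
For readers unfamiliar with the reported quantities (loss, margin, accuracy, rejected reward), the sketch below computes them from sequence log-probabilities using the standard DPO definitions. It assumes the card's metrics follow those definitions, which the card does not state explicitly; the function name and the tuple layout are hypothetical.

```python
import math
from typing import List, Tuple


def dpo_metrics(
    logps: List[Tuple[float, float, float, float]],
    beta: float = 0.2,
) -> Tuple[float, float, float]:
    """Return (mean loss, mean reward margin, accuracy) for a batch.

    Each tuple holds sequence log-probabilities in the order
    (policy_chosen, policy_rejected, reference_chosen, reference_rejected).
    """
    if not logps:
        raise ValueError("logps must contain at least one pair")
    losses, margins = [], []
    for pol_chosen, pol_rejected, ref_chosen, ref_rejected in logps:
        # Implicit rewards: beta-scaled log-ratio of policy to reference.
        chosen_reward = beta * (pol_chosen - ref_chosen)
        rejected_reward = beta * (pol_rejected - ref_rejected)
        margin = chosen_reward - rejected_reward
        # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
        if margin >= 0:
            loss = math.log1p(math.exp(-margin))
        else:
            loss = -margin + math.log1p(math.exp(margin))
        losses.append(loss)
        margins.append(margin)
    n = len(logps)
    # Accuracy: share of pairs where the chosen reward exceeds the rejected one.
    accuracy = sum(m > 0 for m in margins) / n
    return sum(losses) / n, sum(margins) / n, accuracy
```

Under these definitions, a rejected reward that stays negative means the policy assigns the prior checkpoint's outputs a lower log-probability than the reference model does.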

Academic Test Results

Tested on 10 academic scenarios (3 samples each); all key numbers were preserved. Five of the scenarios are listed below:

| Scenario | Numbers verified |
| --- | --- |
| NLP paper abstract | BLEU +3.2%, complexity -15% |
| Medical image segmentation | Dice 0.923, +4.7%, speed +30% |
| Graph neural network | O(n log n), F1 +2.8%, time -40% |
| SPWM inverter | 83.9%, 81.9%, 6~18V, IEC 61000-4-2 |
| Embedded system test | ±0.5LSB, 8ms, 28mW, 0.6mW |

Note: Use temperature 0.60-0.65 for academic texts. At higher temperatures the model occasionally substitutes technical terms.
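
A simple way to spot-check number preservation on your own inputs is to compare the numeric literals in the source and the rewrite. The sketch below is an illustration only and is not the verification procedure used for the results above; the function name is hypothetical, and the check is deliberately strict: a number re-expressed in words or Chinese numerals is reported as missing.

```python
import re
from typing import List

# Integers and decimals, e.g. "30", "0.923", "4.7".
_NUM = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(source: str, rewrite: str) -> List[str]:
    """Return numeric literals from `source` that do not appear in `rewrite`."""
    remaining = _NUM.findall(rewrite)
    missing = []
    for num in _NUM.findall(source):
        if num in remaining:
            remaining.remove(num)  # consume one occurrence so repeats are counted
        else:
            missing.append(num)
    return missing


src = "本文提出了一种基于U-Net改进的医学影像分割方法,Dice系数达到0.923,较基线方法提升了4.7个百分点,推理速度提升约30%。"
# Hypothetical rewrite that expresses "30%" in words instead of digits.
out = "我们在U-Net的基础上做了改进,Dice系数为0.923,比基线高4.7个百分点,推理也快了三成左右。"
print(missing_numbers(src, out))  # ['30']
```

An empty list means every numeric literal in the source also appears in the rewrite; it does not check that units or surrounding terms are unchanged.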

Model Series

| Model | Type | Recommended for |
| --- | --- | --- |
| SFT | SFT | Foundation |
| DPO Round 1 | DPO | General use, balanced |
| This model | DPO | Academic/technical, latest |