Qwen3.5-9B Humanize DPO Round 2

LoRA adapter fine-tuned with DPO on Qwen3.5-9B for Chinese text humanization. This is the latest and most capable version, especially strong on academic and technical Chinese text.

Uses a pure self-play approach: the rejected samples are outputs from an intermediate checkpoint of the previous training stage, teaching the model to consistently surpass its own prior output.
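The pairing scheme described above can be sketched roughly as follows. This is a minimal illustration, not the author's data pipeline: the function name `build_self_play_pairs`, the record fields `source` and `human`, and the `prior_generate` callable are all hypothetical, and reusing the Usage-section instruction as the training prompt is an assumption. The `prompt` / `chosen` / `rejected` keys follow the layout commonly used for preference datasets.

```python
from typing import Callable, Dict, List

# Instruction string copied from the Usage section; whether the same prompt
# was used to build the training pairs is an assumption.
INSTRUCTION = "请将下面文本改写得更像自然人写作，保持原意与事实，不要加标题或说明。"


def build_self_play_pairs(
    records: List[Dict[str, str]],
    prior_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build preference pairs whose rejected side comes from an earlier checkpoint.

    Each record is assumed to carry a "source" text to rewrite and a "human"
    reference text (hypothetical field names).
    """
    pairs = []
    for rec in records:
        prompt = f"{INSTRUCTION}\n\n原文：{rec['source']}"
        rejected = prior_generate(prompt)  # output of the earlier checkpoint
        chosen = rec["human"]              # human-written text
        if rejected.strip() == chosen.strip():
            continue  # an identical pair carries no preference signal
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Stub generator standing in for the prior checkpoint (illustration only).
demo = build_self_play_pairs(
    [{"source": "text A", "human": "human A"}, {"source": "text B", "human": "same"}],
    prior_generate=lambda prompt: "same",
)
print(len(demo))
```

In practice `prior_generate` would wrap a call to the earlier checkpoint's `generate` method; the stub only shows the shape of the data.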

Model Details

Item	Value
Base model	`unsloth/Qwen3.5-9B`
Starting point	Intermediate checkpoint from prior DPO stage
Fine-tuning method	DPO (self-play rejected)
LoRA rank	16
Training data	2000 pairs (chosen = CSL human text, rejected = prior checkpoint outputs)
Training steps	250 steps (2 epochs)
Final loss	0.34
Final margin	~1.5-2.5
Final accuracy	~93-100%
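
The card does not state the batch size, but the figures in the table imply one. A quick back-of-the-envelope check, assuming every pair is seen once per epoch and no partial batch is dropped:

```python
pairs = 2000   # training pairs (from the table)
epochs = 2     # from the table
steps = 250    # optimizer steps (from the table)

samples_seen = pairs * epochs            # total pairs processed
effective_batch = samples_seen / steps   # pairs per optimizer step (inferred)
steps_per_epoch = steps // epochs

print(effective_batch, steps_per_epoch)
```

The inferred effective batch of 16 pairs per step could come from any combination of per-device batch size and gradient accumulation; the card does not say which.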

What It Does

Academic papers: preserves all numbers, model names, formulas, and technical terms — verified on 10 academic scenarios (3 samples each)
Technical reports: maintains technical register without sounding robotic
Daily text: more concise and natural than Round 1

Usage

from unsloth import FastLanguageModel
from peft import PeftModel

base_model, proc = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3.5-9B", max_seq_length=2048, load_in_4bit=False,
)
# The loader may return a processor that wraps the tokenizer; unwrap it if so.
tokenizer = proc.tokenizer if hasattr(proc, "tokenizer") else proc

model = PeftModel.from_pretrained(
    base_model, "XiangJinYu/Qwen3.5-9B-Humanize-DPO-Round2", is_trainable=False,
)
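# Workaround kept from the original snippet: relabel model_type "qwen3_5" as "qwen3" before enabling inference mode.
if hasattr(model, "config") and getattr(model.config, "model_type", "") == "qwen3_5":
    model.config.model_type = "qwen3"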
FastLanguageModel.for_inference(model)

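# English gloss of the instruction: "Rewrite the text below so it reads more like natural human writing; keep the meaning and facts; do not add a title or explanation."
instruction = "请将下面文本改写得更像自然人写作，保持原意与事实，不要加标题或说明。"
# English gloss of the sample input: "This paper proposes an improved U-Net-based medical image segmentation method; the Dice coefficient reaches 0.923, 4.7 percentage points above the baseline, and inference is about 30% faster."
text = "本文提出了一种基于U-Net改进的医学影像分割方法，Dice系数达到0.923，较基线方法提升了4.7个百分点，推理速度提升约30%。"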
messages = [{"role": "user", "content": [{"type": "text", "text": f"{instruction}\n\n原文：{text}"}]}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Recommended: temperature 0.60-0.65 for academic texts
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.65,
                         top_p=0.9, do_sample=True, repetition_penalty=1.1)
gen = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(gen, skip_special_tokens=True))

Training Details

DPO data: 2000 pairs; chosen = CSL human academic text, rejected = prior checkpoint outputs (pure self-play, no casual text mixed in)
Hyperparameters: beta = 0.2, learning rate = 5e-7 (kept low to avoid drift), 2 epochs (250 steps)
Rejected reward: stayed negative and kept decreasing through all 250 steps, with no sign of instability
Key design: using only self-play outputs as rejected samples gives a clean, single-distribution gradient signal
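
For readers unfamiliar with the quantities reported here (loss, margin, accuracy, rejected reward), the sketch below computes them from sequence log-probabilities using the standard DPO formulation. It is an illustration in plain Python, not the training code used for this adapter; the function name `dpo_metrics` and the toy log-probabilities are made up, and only `beta = 0.2` comes from this card.

```python
import math
from typing import Dict, List, Tuple

BETA = 0.2  # value reported in this card


def dpo_metrics(
    batch: List[Tuple[float, float, float, float]],
    beta: float = BETA,
) -> Dict[str, float]:
    """Compute DPO loss and logging metrics for a batch.

    Each item is (policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected).
    """
    losses, margins, rejected_rewards = [], [], []
    for pol_c, pol_r, ref_c, ref_r in batch:
        reward_chosen = beta * (pol_c - ref_c)    # implicit reward, chosen
        reward_rejected = beta * (pol_r - ref_r)  # implicit reward, rejected
        margin = reward_chosen - reward_rejected
        # -log(sigmoid(margin)) written as log(1 + exp(-margin))
        losses.append(math.log1p(math.exp(-margin)))
        margins.append(margin)
        rejected_rewards.append(reward_rejected)
    n = len(batch)
    return {
        "loss": sum(losses) / n,
        "margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,  # share of pairs ranked correctly
        "rejected_reward": sum(rejected_rewards) / n,
    }


# Toy log-probabilities, invented for illustration only.
demo = dpo_metrics([(-40.0, -55.0, -42.0, -50.0), (-30.0, -33.0, -31.0, -30.0)])
print(demo)
```

A negative, falling rejected reward means the policy assigns the prior checkpoint's outputs less probability than the reference model does, which is the behaviour the self-play setup aims for.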

Academic Test Results

Tested on 10 academic scenarios (3 samples each); all key numbers were preserved. The table lists five of them:

Scenario	Numbers verified
NLP paper abstract	BLEU +3.2%, complexity -15%
Medical image segmentation	Dice 0.923, +4.7 percentage points, speed +30%
Graph neural network	O(n log n), F1 +2.8%, time -40%
SPWM inverter	83.9%, 81.9%, 6~18V, IEC 61000-4-2
Embedded system test	±0.5LSB, 8ms, 28mW, 0.6mW
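
A number-preservation check like the one summarized above can be approximated with a few lines of Python. The card does not describe how the verification was actually done, so this is only one possible approach; the helper name `missing_numbers` is hypothetical and the example strings are hand-written, not model outputs.

```python
import re
from typing import List

# Matches integers and decimals such as "30" or "0.923".
_NUM = re.compile(r"\d+(?:\.\d+)?")


def missing_numbers(source: str, rewrite: str) -> List[str]:
    """Return numeric literals from `source` that do not appear in `rewrite`."""
    rewritten = set(_NUM.findall(rewrite))
    return [n for n in _NUM.findall(source) if n not in rewritten]


# Hand-written strings for illustration (not model outputs).
source = "Dice reaches 0.923, up 4.7 points, about 30% faster."
ok = "We get a Dice of 0.923, 4.7 points higher and roughly 30% faster."
bad = "Dice is about 0.92, 4.7 points higher."

print(missing_numbers(source, ok))
print(missing_numbers(source, bad))
```

The comparison is an exact string match on digits, so it flags rounding (0.923 to 0.92) but will not notice a changed unit or a number moved to the wrong quantity; those still need a manual read.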