
CC BY-NC-SA 4.0 — non-commercial research and educational use only.


⚠️ v6.0 is available with a stronger homophone defense: thc1006/cyberpuppy-v6-bilingual (paired with thc1006/cyberpuppy-v6-pinyin-lora). v6 uses LoRA r=64 (vs. v5's r=32) and gains +1.91 pt on HED-COLD, +0.14 pt on TC Homo, and +0.19 pt on COLD at α=0.60. v5.1.1 still holds a slightly higher PCR-ToxiCN F1 (0.7162 vs. 0.7119), so prefer v5 if PCR is your priority.

CyberPuppy v5 — Chinese Cyberbullying & Toxicity Detection (Dual-LoRA)

State-of-the-art Chinese toxicity detection that defends against homophone attacks, number substitution, letter replacement, and creative obfuscation used on real social media platforms.

🏆 Exceeds published SOTA on PCR-ToxiCN (real-world RedNote/小紅書 posts): F1 0.6890 vs prior best 0.672

🛡️ 97.2% homophone invariance — immune to 「勾史」=「狗屎」, 「四調」=「死掉」 style attacks

🌐 Bilingual — handles both Traditional (繁體) and Simplified (简体) Chinese natively

Model Description

CyberPuppy v5 is a dual-LoRA ensemble for Chinese toxic content detection. It uses two specialized LoRA adapters on the same Qwen3-8B backbone:

| Component | Role | Input |
|---|---|---|
| This model (LoRA-A) | Text understanding | Original Chinese text |
| LoRA-B (pinyin) | Phonetic invariance | Toneless pinyin conversion |

The ensemble combines the two branches with weights 0.75 (text) and 0.25 (pinyin). v5.0 blended probabilities linearly as 0.75 × text_probs + 0.25 × pinyin_probs; v5.1 switched to a weighted geometric mean with the same weights (see Quick Start). Pairing semantic understanding with phonetic robustness makes the system highly resistant to homophone-based evasion attacks.
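
As a toy sketch of the two blending rules (pure Python over hand-picked 3-class distributions; not the model's actual code path):

```python
def linear_blend(p_text, p_pinyin, w=0.75):
    # v5.0 rule: convex combination of the two branch distributions
    return [w * t + (1 - w) * p for t, p in zip(p_text, p_pinyin)]

def geometric_blend(p_text, p_pinyin, w=0.75):
    # v5.1 rule: weighted geometric mean, renormalized to sum to 1
    raw = [(t ** w) * (p ** (1 - w)) for t, p in zip(p_text, p_pinyin)]
    z = sum(raw)
    return [r / z for r in raw]

# Toy distributions over (none, toxic, severe)
p_text = [0.10, 0.70, 0.20]
p_pinyin = [0.30, 0.60, 0.10]
print(linear_blend(p_text, p_pinyin))
print(geometric_blend(p_text, p_pinyin))
```

One intuition for the geometric variant: a class must score reasonably under *both* branches to survive renormalization, which rewards cross-branch agreement.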

Tasks (4-head multi-task)

| Task | Labels | Description |
|---|---|---|
| Toxicity | none / toxic / severe | Primary: is this text harmful? |
| Bullying | none / harassment / threat | Type of cyberbullying behavior |
| Role | none / perpetrator / victim / bystander | Speaker's role in bullying |
| Emotion | pos / neu / neg | Emotional valence |

Benchmark Results

vs Published Methods

| Method | COLD F1 | PCR-ToxiCN F1 | TC Homo F1 |
|---|---|---|---|
| COLD baseline (paper) | 0.78 | | |
| PCR-ToxiCN SOTA (paper) | | 0.672 | |
| Qwen3Guard zero-shot | 0.746 | | |
| CyberPuppy v5 (ours) | 0.8336 | 0.6890 | 0.8380 |

Robustness to Evasion Attacks

| Attack Type | Example | Defense |
|---|---|---|
| Homophone substitution | 「勾史」→「狗屎」 | ✅ Pinyin LoRA sees identical input |
| Number substitution | 「4了」→「死了」 | ✅ Bilingual training exposure |
| Letter substitution | 「装X」→「装逼」 | ✅ CNTP adversarial training |
| Creative slang | 「密碼」→「你媽」 | ⚠️ Partially handled |
| English phonetic | "funny mud pee" | ⚠️ Limited coverage |
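
The homophone defense follows directly from the pinyin conversion: an evasion spelling and its toxic target share the same toneless pinyin, so the phonetic branch receives identical input for both. A minimal illustration, using a small hand-written character table as a stand-in for pypinyin:

```python
# Tiny stand-in table for pypinyin's toneless output (illustrative subset only).
TONELESS = {"勾": "gou", "史": "shi", "狗": "gou", "屎": "shi",
            "四": "si", "調": "diao", "死": "si", "掉": "diao"}

def toneless_pinyin(text):
    # Map each Han character to toneless pinyin; pass other characters through.
    return " ".join(TONELESS.get(ch, ch) for ch in text)

# The evasion spelling and the toxic target collapse to the same phonetic string.
assert toneless_pinyin("勾史") == toneless_pinyin("狗屎")  # both "gou shi"
assert toneless_pinyin("四調") == toneless_pinyin("死掉")  # both "si diao"
```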

Quick Start

```python
import torch
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from pypinyin import pinyin, Style
import re

device = torch.device("cuda")
dtype = torch.bfloat16

# Tokenizer (shared by both branches)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")

# LoRA-A (text) — this model
base_a = AutoModel.from_pretrained("Qwen/Qwen3-8B-Base", torch_dtype=dtype, device_map=device)
model_a = PeftModel.from_pretrained(base_a, "thc1006/cyberpuppy-v5-bilingual", subfolder="lora")

# LoRA-B (pinyin) — companion model
base_b = AutoModel.from_pretrained("Qwen/Qwen3-8B-Base", torch_dtype=dtype, device_map=device)
model_b = PeftModel.from_pretrained(base_b, "thc1006/cyberpuppy-v5-pinyin-lora", subfolder="lora")

# Classification heads
heads_a = torch.load(hf_hub_download("thc1006/cyberpuppy-v5-bilingual", "heads.pt"),
                     map_location=device, weights_only=False)
heads_b = torch.load(hf_hub_download("thc1006/cyberpuppy-v5-pinyin-lora", "heads.pt"),
                     map_location=device, weights_only=False)

# Pinyin converter: Han characters → toneless pinyin, everything else passes through
_HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")
def to_pinyin(text):
    return " ".join(
        pinyin(ch, style=Style.NORMAL)[0][0] if _HAN.match(ch)
        else ch for ch in text if ch.strip()
    )

# Inference
text = "你這個笨蛋,滾開!"  # "You idiot, get lost!"
pinyin_text = to_pinyin(text)

enc_t = tok(text, return_tensors="pt", truncation=True, max_length=192).to(device)
enc_p = tok(pinyin_text, return_tensors="pt", truncation=True, max_length=192).to(device)

with torch.inference_mode():
    h_t = model_a(**enc_t).last_hidden_state[:, -1]  # last-token pooling
    h_p = model_b(**enc_p).last_hidden_state[:, -1]
    logits_t = heads_a["heads"]["toxicity"](h_t.float())
    logits_p = heads_b["heads"]["toxicity"](h_p.float())

# Geometric mean ensemble (v5.1 — better than linear blend on ALL benchmarks)
probs = (logits_t.softmax(-1) ** 0.75) * (logits_p.softmax(-1) ** 0.25)
probs = probs / probs.sum(-1, keepdim=True)  # re-normalize
labels = ["none", "toxic", "severe"]
pred = labels[probs.argmax(-1).item()]
print(f"{text} → {pred}")  # toxic
```
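
The Quick Start decodes only the toxicity head, but the checkpoint carries all four tasks. Below is a hedged sketch of decoding every head from one pooled hidden state; the head keys are assumed to match the task names in `heads_a["heads"]`, and the heads here are random pure-Python stand-ins so the example runs without the 8B backbone:

```python
import random

# Label sets from the task table above.
LABELS = {
    "toxicity": ["none", "toxic", "severe"],
    "bullying": ["none", "harassment", "threat"],
    "role": ["none", "perpetrator", "victim", "bystander"],
    "emotion": ["pos", "neu", "neg"],
}

def decode_all_heads(heads, hidden):
    # Run every classification head on the same pooled hidden state
    # and take the argmax label per task.
    preds = {}
    for task, labels in LABELS.items():
        logits = heads[task](hidden)
        preds[task] = labels[max(range(len(logits)), key=logits.__getitem__)]
    return preds

# Stand-in linear heads and hidden state (hypothetical, for illustration only).
random.seed(0)
def make_head(n_out, dim=8):
    w = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_out)]
    return lambda h: [sum(wi * hi for wi, hi in zip(row, h)) for row in w]

demo_heads = {t: make_head(len(l)) for t, l in LABELS.items()}
demo_hidden = [random.gauss(0, 1) for _ in range(8)]
print(decode_all_heads(demo_heads, demo_hidden))
```

In a real deployment you would pass `h_t.float()` (and the loaded `heads_a["heads"]` modules) instead of the stand-ins.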

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | All linear (q/k/v/o/gate/up/down) |
| Training data | 179,186 samples |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 6 (× 6 gradient accumulation) |
| Max length | 192 tokens |
| Precision | bf16 |
| Loss | Focal (γ=2.5) + uncertainty-weighted multi-task + consistency (λ=0.5) |
| Hardware | 1× NVIDIA RTX 5090 (32 GB) |
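
For reference, a minimal pure-Python sketch of the focal term only (the uncertainty-weighted multi-task and consistency terms are omitted; this illustrates the γ=2.5 down-weighting, not the training code itself):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def focal_loss(batch_logits, targets, gamma=2.5):
    # Focal loss: scale each example's cross-entropy by (1 - p_t)^gamma so
    # easy, well-classified examples contribute less and hard ones dominate.
    total = 0.0
    for logits, t in zip(batch_logits, targets):
        pt = softmax(logits)[t]  # probability assigned to the true class
        total += -((1 - pt) ** gamma) * math.log(pt)
    return total / len(targets)

# A confident, correct example vs. a near-uniform, hard one.
print(focal_loss([[4.0, -2.0, -2.0]], [0]))  # easy → tiny loss
print(focal_loss([[0.1, 0.0, -0.1]], [1]))   # hard → much larger loss
```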

Training Data Composition

| Source | Records | Language | Purpose |
|---|---|---|---|
| COLD | 25,659 | Traditional Chinese | Base toxicity corpus |
| SCCD | 28,426 | Traditional Chinese | Session-level context |
| STATE-ToxiCN | 5,781 | Traditional Chinese | Hate slang vocabulary |
| ToxiCloakCN × 3 | 33,012 | Traditional Chinese | Adversarial triplets |
| All above (simplified) | 70,870 | Simplified Chinese | Bilingual coverage |
| CNTP | 15,438 | Mixed | Real perturbation pairs |
| Total | 179,186 | Bilingual | |

Limitations and Bias

Known Limitations

  1. English input: Out of distribution. English-only text will produce unreliable results.
  2. Novel obfuscation: Creative attacks not seen in training (math puzzles like "64.5克黃金", new slang) may evade detection.
  3. Context length: Inputs longer than 192 tokens are truncated. Long-form content may lose critical context.
  4. Annotation bias: Trained primarily on COLD annotation guidelines, which may differ from other cultural contexts of toxicity.
  5. False positives on sarcasm/humor: Ironic or humorous usage of offensive terms may be flagged as toxic.
  6. ToxiCloakCN drop metric: −7.41% relative drop vs. the ≤5% target. Absolute performance (F1 0.8380) is strong, but the relative-robustness target was not met.
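
For limitation 3, one common workaround (sketched here as an assumption, not a supported feature of this model) is to split long inputs into overlapping 192-token windows, classify each, and flag the document by its most toxic window:

```python
def sliding_windows(token_ids, max_len=192, stride=96):
    # Overlapping windows so no span longer than max_len is silently dropped.
    if len(token_ids) <= max_len:
        return [token_ids]
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids) - max_len + stride, stride)]

def aggregate_toxic(window_probs):
    # Flag the document by its most toxic window (max-pooling over windows).
    return max(window_probs)

windows = sliding_windows(list(range(400)), max_len=192, stride=96)
print([(w[0], w[-1]) for w in windows])
# → [(0, 191), (96, 287), (192, 383), (288, 399)]
print(aggregate_toxic([0.1, 0.8, 0.3]))  # → 0.8
```

Max-pooling biases toward recall; averaging over windows would trade some recall for fewer false positives on long, mostly benign texts.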

What This Model Cannot Do

  • Cannot moderate images, audio, or video — text-only
  • Cannot understand conversational context — classifies single messages in isolation
  • Cannot detect implicit bias or microaggressions — focused on explicit toxicity
  • Cannot replace human moderators — designed as an assistive tool, not autonomous censor
  • Cannot handle code-switching with non-CJK languages (e.g., mixed Thai-Chinese)

Ethical Considerations

  • Dual-use risk: Could be misused to generate evasion strategies. Mitigated by CC BY-NC-SA license.
  • Cultural sensitivity: Toxicity norms vary across Chinese-speaking regions (PRC, Taiwan, HK, Singapore). Model trained primarily on Taiwanese/Hong Kong norms.
  • Privacy: Model does not store or transmit input text. Deployment should hash/anonymize user data.
  • Over-censorship: False positives can silence legitimate speech. We recommend human-in-the-loop for final moderation decisions.

Recommended Use

Intended for:

  • Academic research on Chinese online safety
  • Educational tools for cyberbullying awareness
  • Content moderation assistive tools (with human review)
  • Benchmark development for adversarial robustness

Not intended for:

  • Autonomous censorship without human oversight
  • Surveillance of political speech
  • Commercial content moderation (requires Apache 2.0 relicensing)
  • Cross-lingual toxicity detection (Chinese-only)

LLM Cascade (Experimental)

We tested a Qwen3-8B Instruct cascade on disagreement samples (where text and pinyin LoRAs disagree). Results:

| Strategy | COLD | PCR | TC Homo |
|---|---|---|---|
| Ensemble only (recommended) | 0.8336 | 0.6890 | 0.8380 |
| + Full LLM cascade | −0.0185 | +0.0182 | −0.0105 |
| + Asymmetric (toxic-only) | −0.0022 | +0.0150 | −0.0125 |

Conclusion: LLM cascade helps on real-world creative obfuscation (PCR) but hurts clean benchmarks. Not adopted as default. Available as optional deployment flag for social-media-like scenarios.
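
The routing logic behind the asymmetric variant can be sketched as follows; the LLM judge is stubbed out, so this illustrates the strategy rather than the shipped implementation:

```python
def asymmetric_cascade(ensemble_pred, text_pred, pinyin_pred, llm_judge):
    # "Asymmetric (toxic-only)" sketch: consult the LLM only when the two
    # LoRA branches disagree AND the ensemble already leans toxic, so
    # clean-looking inputs never pay the LLM latency cost.
    # llm_judge stands in for a Qwen3-8B Instruct call (not implemented here).
    if text_pred != pinyin_pred and ensemble_pred != "none":
        return llm_judge()
    return ensemble_pred

# Branches disagree on a toxic-leaning input: escalate, LLM verdict wins.
print(asymmetric_cascade("toxic", "toxic", "none", lambda: "none"))   # none
# Branches agree: keep the ensemble verdict, no LLM call.
print(asymmetric_cascade("none", "none", "none", lambda: "toxic")) # none
```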

Citation

```bibtex
@misc{cyberpuppy_v5_2026,
  author       = {Tsai, Hung-Che},
  title        = {CyberPuppy v5: Bilingual Dual-LoRA Ensemble for Chinese Cyberbullying Detection with Homophone Robustness},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thc1006/cyberpuppy-v5-bilingual}},
  note         = {Dual-LoRA ensemble with pinyin branch for adversarial robustness}
}
```

Related Models

| Model | Purpose |
|---|---|
| thc1006/cyberpuppy-v5-pinyin-lora | Companion pinyin LoRA (required for ensemble) |
| thc1006/cyberpuppy-v2.2-adapter | Previous version (deprecated) |

Contact & Takedown
