ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Abstract
Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which a model that refuses a harmful request often complies when the request is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to a targeted jailbreak such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of these tense-vulnerable heads. Finally, we apply the recalibration in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard substantially reduces the attack success rate of the targeted jailbreak while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Our mechanistic analysis further shows how adversarial suffixes suppress the propagation of the refusal-mediating direction. More broadly, our work demonstrates how a deep understanding of model internals can yield practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
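The circuit-analysis step can be illustrated with a toy attribution score. Everything here (the function name, shapes, and the idea of projecting each head's contribution onto a single refusal-mediating direction) is a hypothetical sketch of the kind of causal scoring the abstract describes, not the authors' implementation:

```python
import numpy as np

def head_causal_scores(clean_acts, corrupt_acts, refusal_dir):
    """Toy per-head causal score for a targeted jailbreak.

    clean_acts, corrupt_acts: (n_heads, d_model) per-head output
    contributions on a refused (present-tense) prompt and on its
    past-tense jailbreak variant, respectively.
    refusal_dir: (d_model,) unit vector for the refusal-mediating direction.

    Heads whose projection onto the refusal direction drops most under
    the tense change are flagged as causally linked to the attack.
    """
    clean_proj = clean_acts @ refusal_dir      # (n_heads,)
    corrupt_proj = corrupt_acts @ refusal_dir  # (n_heads,)
    return clean_proj - corrupt_proj           # per-head drop in refusal signal

# Usage: head 2 is artificially made to lose refusal signal under the tense change
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 16))
corrupt = clean.copy()
corrupt[2] -= 5.0               # suppress head 2's contribution in the jailbreak run
direction = np.ones(16) / 4.0   # a unit vector in 16 dimensions
scores = head_causal_scores(clean, corrupt, direction)
print(int(np.argmax(scores)))   # → 2
```

In a real pipeline the per-head contributions would come from activation patching between prompt pairs rather than synthetic perturbations, but the scoring logic is the same.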
Community
ASGuard is a novel mechanistic safety framework that mitigates targeted jailbreak vulnerabilities in LLMs by directly intervening in internal activation dynamics rather than relying solely on data-level alignment.
(1) Background:
Large language models exhibit brittle refusal behavior, where simple linguistic transformations (e.g., tense changes) can bypass safety alignment, revealing a generalization gap in existing alignment methods.
(2) Motivation:
Prior safety approaches lack mechanistic understanding of why jailbreaks succeed, making them ineffective against targeted attacks such as tense-based jailbreaks; this calls for interpretable, circuit-level interventions that preserve utility while improving robustness.
(3) Method:
ASGuard identifies attention heads causally responsible for jailbreak behavior via circuit analysis, learns channel-wise activation scaling to recalibrate these vulnerable components, and integrates this into preventative fine-tuning to enforce robust refusal while maintaining overall model performance.
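The channel-wise scaling step above can be sketched as follows. The shapes, head indices, and function name are illustrative assumptions, not the paper's code; in ASGuard the scaling vectors are learned during recalibration, whereas here they are fixed for demonstration:

```python
import numpy as np

def apply_head_scaling(head_out, vulnerable_heads, scales):
    """Apply per-channel scaling vectors to selected attention heads.

    head_out: (batch, seq, n_heads, d_head) attention-head outputs
    vulnerable_heads: head indices flagged by the circuit analysis
    scales: (len(vulnerable_heads), d_head) channel-wise scaling vectors
            (learned in ASGuard's recalibration step; fixed here for demo)
    """
    out = head_out.copy()
    for i, h in enumerate(vulnerable_heads):
        out[:, :, h, :] *= scales[i]  # rescale each channel of head h
    return out

# Usage: damp every channel of heads 3 and 7 in an 8-head layer
rng = np.random.default_rng(0)
acts = rng.standard_normal((2, 5, 8, 64))
scales = np.full((2, 64), 0.5)  # halve both flagged heads, all channels
guarded = apply_head_scaling(acts, [3, 7], scales)
print(guarded.shape)  # (2, 5, 8, 64)
```

Because the intervention touches only the flagged heads (all other head outputs pass through unchanged), the rest of the model's computation, and hence general capability, is left largely intact.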
The core move here, identifying a small set of heads causally linked to tense jailbreaking and then scaling their activations, is elegant, but I wonder how stable that causal link stays once you push the model with broader prompts. My worry is that the scaling and preventative fine-tuning could drift the heads' role or shift other behaviors, making the defense brittle under distribution shifts or unseen jailbreak variants. An ablation that removes the preventative fine-tuning while keeping the scaling, or testing on non-tense jailbreak prompts, would help confirm whether this pins down a robust causal pathway. As an aside, the arxivlens breakdown helped me parse the method details and shows exactly which heads are touched, which I appreciated. Overall I see the promise, but we need more stress tests across diverse prompts to judge real-world robustness.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation (2026)
- SafeSeek: Universal Attribution of Safety Circuits in Language Models (2026)
- The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs (2026)
- Fail-Closed Alignment for Large Language Models (2026)
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints (2026)
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal (2026)
- Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism (2026)
