ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Abstract
Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which a model that refuses a harmful request often complies when the request is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to a targeted jailbreak such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of these tense-vulnerable heads. Finally, we apply the recalibration in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard substantially reduces the attack success rate of the targeted jailbreak while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Our mechanistic analysis further shows how adversarial suffixes suppress the propagation of the refusal-mediating direction. More broadly, our work demonstrates how a deep understanding of model internals can yield practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
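The circuit-analysis step can be illustrated with a toy attribution score. Everything here (the function name, shapes, and the idea of projecting each head's contribution onto a single refusal-mediating direction) is a hypothetical sketch of the kind of causal scoring the abstract describes, not the authors' implementation:

```python
import numpy as np

def head_causal_scores(clean_acts, corrupt_acts, refusal_dir):
    """Toy per-head causal score for a targeted jailbreak.

    clean_acts, corrupt_acts: (n_heads, d_model) per-head output
    contributions on a refused (present-tense) prompt and on its
    past-tense jailbreak variant, respectively.
    refusal_dir: (d_model,) unit vector for the refusal-mediating direction.

    Heads whose projection onto the refusal direction drops most under
    the tense change are flagged as causally linked to the attack.
    """
    clean_proj = clean_acts @ refusal_dir      # (n_heads,)
    corrupt_proj = corrupt_acts @ refusal_dir  # (n_heads,)
    return clean_proj - corrupt_proj           # per-head drop in refusal signal

# Usage: head 2 is artificially made to lose refusal signal under the tense change
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 16))
corrupt = clean.copy()
corrupt[2] -= 5.0               # suppress head 2's contribution in the jailbreak run
direction = np.ones(16) / 4.0   # a unit vector in 16 dimensions
scores = head_causal_scores(clean, corrupt, direction)
print(int(np.argmax(scores)))   # → 2
```

In a real pipeline the per-head contributions would come from activation patching between prompt pairs rather than synthetic perturbations, but the scoring logic is the same.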
Community
ASGuard is a novel mechanistic safety framework that mitigates targeted jailbreak vulnerabilities in LLMs by directly intervening in internal activation dynamics rather than relying solely on data-level alignment.
(1) Background:
Large language models exhibit brittle refusal behavior, where simple linguistic transformations (e.g., tense changes) can bypass safety alignment, revealing a generalization gap in existing alignment methods.
(2) Motivation:
Prior safety approaches lack mechanistic understanding of why jailbreaks succeed, making them ineffective against targeted attacks such as tense-based jailbreaks; this calls for interpretable, circuit-level interventions that preserve utility while improving robustness.
(3) Method:
ASGuard identifies attention heads causally responsible for jailbreak behavior via circuit analysis, learns channel-wise activation scaling to recalibrate these vulnerable components, and integrates this into preventative fine-tuning to enforce robust refusal while maintaining overall model performance.
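The channel-wise scaling step above can be sketched as follows. The shapes, head indices, and function name are illustrative assumptions, not the paper's code; in ASGuard the scaling vectors are learned during recalibration, whereas here they are fixed for demonstration:

```python
import numpy as np

def apply_head_scaling(head_out, vulnerable_heads, scales):
    """Apply per-channel scaling vectors to selected attention heads.

    head_out: (batch, seq, n_heads, d_head) attention-head outputs
    vulnerable_heads: head indices flagged by the circuit analysis
    scales: (len(vulnerable_heads), d_head) channel-wise scaling vectors
            (learned in ASGuard's recalibration step; fixed here for demo)
    """
    out = head_out.copy()
    for i, h in enumerate(vulnerable_heads):
        out[:, :, h, :] *= scales[i]  # rescale each channel of head h
    return out

# Usage: damp every channel of heads 3 and 7 in an 8-head layer
rng = np.random.default_rng(0)
acts = rng.standard_normal((2, 5, 8, 64))
scales = np.full((2, 64), 0.5)  # halve both flagged heads, all channels
guarded = apply_head_scaling(acts, [3, 7], scales)
print(guarded.shape)  # (2, 5, 8, 64)
```

Because the intervention touches only the flagged heads (all other head outputs pass through unchanged), the rest of the model's computation, and hence general capability, is left largely intact.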
The core move here, identifying a small set of heads causally linked to tense jailbreaking and then scaling their activations, is elegant, but I wonder how stable that causal link stays once you push the model with broader prompts. My worry is that the scaling and preventative fine-tuning could drift the heads' role or shift other behaviors, making the defense brittle under distribution shifts or unseen jailbreak variants. An ablation that removes the preventative fine-tuning while keeping the scaling, or testing on non-tense jailbreak prompts, would help confirm whether this pins down a robust causal pathway. As an aside, the arxivlens breakdown helped me parse the method details and shows exactly which heads are touched, which I appreciated. Overall I see the promise, but we need more stress tests across diverse prompts to judge real-world robustness.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation (2026)
- SafeSeek: Universal Attribution of Safety Circuits in Language Models (2026)
- The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs (2026)
- Fail-Closed Alignment for Large Language Models (2026)
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints (2026)
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal (2026)
- Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism (2026)
