Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Abstract
A novel multi-modal autoregressive model called InstaManip uses in-context learning with text and visual guidance to perform image manipulations, outperforming previous models and benefiting from more diverse exemplars.
Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models struggle with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying -- which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (≥19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.
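The abstract describes the group self-attention mechanism only at a high level, so the following is a minimal sketch of how such a two-stage attention layout could be expressed as an attention mask. The token layout (exemplar, manipulation, and query groups), the function name, and the exact connectivity are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def group_attention_mask(n_exemplar: int, n_manip: int, n_query: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a hypothetical
    two-stage group self-attention layout:
        [exemplar tokens | manipulation tokens | query tokens]
    "Learning" stage: manipulation tokens attend to exemplar tokens,
    distilling the edit. "Applying" stage: query tokens attend to the
    manipulation tokens, but not directly to the exemplar tokens.
    """
    n = n_exemplar + n_manip + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)

    ex = slice(0, n_exemplar)
    mp = slice(n_exemplar, n_exemplar + n_manip)
    qy = slice(n_exemplar + n_manip, n)

    # Exemplar tokens attend among themselves.
    mask[ex, ex] = True
    # Learning group: manipulation tokens read from exemplars and themselves.
    mask[mp, ex] = True
    mask[mp, mp] = True
    # Applying group: query tokens read from manipulation tokens and themselves,
    # and are blocked from attending to the raw exemplar tokens (assumed).
    mask[qy, mp] = True
    mask[qy, qy] = True
    return mask

# Example: 4 exemplar tokens, 2 manipulation tokens, 3 query tokens.
print(group_attention_mask(4, 2, 3).int())
```

Such a mask would be passed to a standard self-attention layer so that the "learning" and "applying" groups interact only through the manipulation tokens, which is one plausible way to realize the two-stage decomposition the abstract names.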
Community
@librarian-bot recommend some image editing papers for the visual prompt editing task (similar to this work).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation (2025)
- SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation (2025)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception (2025)
- SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP (2025)
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers (2025)
- Personalized Vision via Visual In-Context Learning (2025)
- Growing Visual Generative Capacity for Pre-Trained MLLMs (2025)