arxiv:2510.20470

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Published on Oct 23, 2025 · Submitted by kun ouyang on Oct 24, 2025

AI-generated summary

Conan, a framework for evidence-grounded multi-step video reasoning, enhances visual grounding and reasoning accuracy through a multi-stage training strategy and outperforms existing models on various benchmarks.

Abstract

Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces covering frame identification, evidence reasoning, and action decisions, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
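
The abstract describes an iterative identify-reason-act loop: inspect a set of frames, reason over the clues they contain, and either answer or request more evidence. The sketch below only illustrates that control flow; `query_mllm`, the `Step` fields, the sampling heuristic, and the action names are assumptions for illustration, not the paper's actual interface.

```python
# Conceptual sketch of an identify-reason-act loop over video frames.
# Everything here (query_mllm, Step, action names) is an illustrative
# placeholder, not the interface used in the Conan paper.
from dataclasses import dataclass


@dataclass
class Step:
    frames: list[int]        # frame indices the model flagged as evidence
    reasoning: str           # cross-frame reasoning text
    action: str              # "answer" or "explore"
    answer: str | None = None


def query_mllm(question: str, frame_ids: list[int]) -> Step:
    """Placeholder for one call to a video MLLM such as Conan-7B."""
    raise NotImplementedError


def reason_over_video(question: str, num_frames: int, max_rounds: int = 5) -> str | None:
    # Start from a coarse, uniformly sampled set of context frames.
    frame_ids = list(range(0, num_frames, max(1, num_frames // 8)))
    for _ in range(max_rounds):
        step = query_mllm(question, frame_ids)
        if step.action == "answer":
            return step.answer
        # The model asked to explore further: add the frames it flagged as
        # evidence and re-query with the enlarged, finer-grained set.
        frame_ids = sorted(set(frame_ids) | set(step.frames))
    return None  # no confident answer within the round budget
```
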

Community


Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence.
Model: https://huggingface.co/RUBBISHLIKE/Conan-7B
Repo: https://github.com/OuyangKun10/Conan
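
Since Conan builds on Qwen2.5-VL-7B-Instruct, the linked checkpoint can presumably be run with the standard Qwen2.5-VL inference pattern in `transformers`. Below is a minimal sketch under that assumption; the video path and question are placeholders, and the checkpoint is assumed to keep its base model's chat/processor format.

```python
# Minimal video-QA sketch for the released checkpoint, assuming it keeps the
# Qwen2.5-VL chat/processor format of its Qwen2.5-VL-7B-Instruct base.
# pip install transformers accelerate qwen-vl-utils
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "RUBBISHLIKE/Conan-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},  # placeholder path
        {"type": "text", "text": "What happens after the person opens the door?"},
    ],
}]

# Build the chat prompt and the packed video tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
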

