Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

    ICCV 2025 + CVPR 2025


The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other. This approach enhances all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.
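The mutual-conditioning loop can be sketched in a few lines. Everything below is an illustrative stand-in, not the repository's actual API: `detect`, `segment`, and `estimate_pose` are placeholder functions showing only the data flow (detect on unmasked pixels, segment each detection, estimate pose conditioned on the mask, then mask the instance out for the next round).

```python
import numpy as np

def detect(image, ignore_mask):
    # Stand-in detector: one fixed bbox, skipped once its area is masked out.
    if ignore_mask[0:10, 0:10].all():
        return []
    return [(0, 0, 10, 10)]

def segment(image, bbox):
    # Stand-in segmenter: a mask covering the bbox.
    x0, y0, x1, y1 = bbox
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def estimate_pose(image, mask):
    # Stand-in pose estimator conditioned on the instance mask
    # (returns the mask centroid as a fake "pose").
    return np.argwhere(mask).mean(axis=0)

def bmp_loop(image, rounds=2):
    ignore = np.zeros(image.shape[:2], dtype=bool)
    results = []
    for _ in range(rounds):
        for bbox in detect(image, ignore):
            mask = segment(image, bbox)
            pose = estimate_pose(image, mask)
            results.append((bbox, mask, pose))
            ignore |= mask  # already-explained pixels are ignored next round
    return results

results = bmp_loop(np.zeros((32, 32, 3)))
```

In the real pipeline each stand-in is a full model (RTMDet, MaskPose), but the loop structure is the same: each task's output conditions the next.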

Key contributions:

  1. MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
    • Download pre-trained weights below
  2. PMPose: a pose estimation model conditioned on segmentation masks that additionally predicts a full description of each keypoint (presence probability, visibility, expected OKS); a combination of MaskPose and ProbPose (CVPR'25).
  3. BBox-MaskPose (BMP): a method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
    • Try the demo!
  4. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
    • Download pre-trained weights below
  5. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.

arXiv           GitHub repository           Project Website

For more details, see the GitHub repository.

πŸ“ Models List

  1. ViTPose-b multi-dataset
  2. MaskPose
  3. PMPose
  4. fine-tuned RTMDet-l

See details of each model below.


1. ViTPose-B [multi-dataset]

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256)
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 21 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose

Training Details

What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-trained ViTPose. The authors trained the model in a multi-dataset setup before; this release is a reproduction compatible with MMPose 2.0.
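The model outputs one 48x64 heatmap per keypoint for a 192x256 input crop; the peak of each heatmap is mapped back to crop coordinates by rescaling. The sketch below uses plain argmax decoding (MMPose applies additional sub-pixel refinement, which is omitted here):

```python
import numpy as np

def decode_heatmaps(heatmaps, crop_w=192, crop_h=256):
    """Map per-keypoint heatmap maxima back to input-crop coordinates.

    heatmaps: array of shape (K, 64, 48) for a 192x256 crop.
    Returns (K, 2) array of (x, y) crop coordinates.
    """
    k, h, w = heatmaps.shape
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(hm.argmax(), hm.shape)
        # Scale from heatmap resolution to crop resolution.
        coords.append((x * crop_w / w, y * crop_h / h))
    return np.array(coords)

# Toy example: a single heatmap with one peak at (x=24, y=32).
hm = np.zeros((1, 64, 48))
hm[0, 32, 24] = 1.0
coords = decode_heatmaps(hm)
```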


2. MaskPose-1.1.0

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256) + estimated instance segmentation
  • Output: Keypoint coordinates (one 48x64 heatmap per keypoint, 23 keypoints)
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose
  • Size(s): -S, -B, -L, -H

Training Details

What's new? Compared to ViTPose, MaskPose takes instance segmentation as an input and is even better at distinguishing instances in multi-body scenes. There is no computational overhead compared to ViTPose.

V1.0.0 vs. V1.1.0: The previous version (v1.0.0) predicted 21 keypoints and used a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved training recipe with dataset balancing, which improves the metrics.
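One simple way to condition a pose model on an instance mask without adding any parameters is to modulate the input crop itself, attenuating pixels outside the target instance. The sketch below is illustrative only; the blending scheme and the `bg_weight` value are assumptions, not the paper's exact mechanism:

```python
import numpy as np

def apply_mask_conditioning(crop, mask, bg_weight=0.3):
    """Attenuate pixels outside the instance mask so the backbone
    focuses on the target body.

    crop: (H, W, 3) float image; mask: (H, W) bool instance mask.
    bg_weight is an illustrative choice, not the paper's recipe.
    """
    weight = np.where(mask[..., None], 1.0, bg_weight)
    return crop * weight

# Toy 192x256 crop with an instance mask in the center.
crop = np.full((256, 192, 3), 100.0)
mask = np.zeros((256, 192), dtype=bool)
mask[64:192, 48:144] = True
conditioned = apply_mask_conditioning(crop, mask)
```

The key point is that the extra signal enters through the input, so the backbone and decoder are unchanged and the parameter count stays identical to ViTPose.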


3. PMPose-1.0.0

  • Model type: ViT-b backbone with multi-layer decoder
  • Input: RGB images (192x256) + estimated instance segmentation
  • Output: Keypoint coordinates (one 48x64 probability map per keypoint, 23 keypoints), plus presence probability, visibility, and expected OKS for each keypoint
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMPose
  • Size(s): -S, -B, -L, -H

Training Details

What's new? PMPose combines MaskPose-1.1.0 and ProbPose (CVPR'25). It is conditioned on masks and matches MaskPose's superior in-crowd performance, while also predicting presence probabilities and visibilities like ProbPose.
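PMPose predicts an expected OKS per keypoint. For reference, the standard COCO OKS for a keypoint depends on its distance to the ground truth, the object area, and a per-keypoint constant. A minimal sketch of that formula (the sigma values below are placeholders, not COCO's actual per-keypoint constants):

```python
import numpy as np

def oks_per_keypoint(pred, gt, area, sigmas):
    """Standard COCO OKS per keypoint: exp(-d^2 / (2 * s^2 * k^2)),
    where d is the prediction-to-GT distance, s^2 the object area,
    and k the per-keypoint constant (sigma)."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * area * sigmas ** 2))

# Two keypoints: one exact, one 5 px off, on a 100x100 px person.
pred = np.array([[10.0, 10.0], [20.0, 25.0]])
gt = np.array([[10.0, 10.0], [20.0, 20.0]])
sigmas = np.array([0.05, 0.05])  # placeholder constants
ks = oks_per_keypoint(pred, gt, area=10000.0, sigmas=sigmas)
```

A perfectly localized keypoint scores 1.0, and the score decays with distance; PMPose's "expected OKS" head estimates this quantity directly instead of requiring ground truth.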


4. fine-tuned RTMDet-L

  • Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
  • Input: RGB images
  • Output: Detected instances -- bbox, instance mask and class for each
  • Language(s): Not language-dependent (vision model)
  • License: GPL-3.0
  • Framework: MMDetection

Training Details

What's new? RTMDet fine-tuned to ignore masked-out instances, designed for iterative detection. It is especially effective in multi-body scenes where people in the background would otherwise not be detected.
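The iterative outer loop can be sketched as follows: instances found in one pass are masked out (here filled with a constant) before the next pass, so remaining people are no longer hidden behind already-detected ones. The `toy_detector` below is a placeholder that "detects" one bright square blob per call; the real detector is the fine-tuned RTMDet:

```python
import numpy as np

def mask_out(image, mask, fill=114):
    """Fill already-detected instances so the next pass ignores them.
    fill=114 (mid gray) is an illustrative choice."""
    out = image.copy()
    out[mask] = fill
    return out

def iterative_detect(image, detector, max_rounds=3):
    detections = []
    for _ in range(max_rounds):
        new = detector(image)
        if not new:
            break
        detections.extend(new)
        for _, mask in new:
            image = mask_out(image, mask)
    return detections

def toy_detector(image):
    # Placeholder detector: finds the topmost bright square blob.
    ys, xs = np.where(image[..., 0] > 200)
    if len(ys) == 0:
        return []
    y0 = ys.min()
    x0 = xs[ys == y0].min()
    s = 1  # grow while the blob stays bright downward
    while y0 + s < image.shape[0] and image[y0 + s, x0, 0] > 200:
        s += 1
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y0 + s, x0:x0 + s] = True
    return [((x0, y0, x0 + s, y0 + s), mask)]

# Two "people" (bright blobs); the second is found on the second pass.
img = np.zeros((20, 20, 3), dtype=np.uint8)
img[2:5, 2:5] = 255
img[10:14, 10:14] = 255
dets = iterative_detect(img, toy_detector)
```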

πŸ“„ Citation

If you use our work, please cite:

@InProceedings{BMPv2,
    author    = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
    title     = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
    booktitle = {arXiv preprint arXiv:to be added},
    year      = {2026}
}
@InProceedings{Purkrabek2025ICCV,
    author    = {Purkrabek, Miroslav and Matas, Jiri},
    title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025}
}
@InProceedings{Kolomiiets2026CVWW,
    author    = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
    title     = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
    booktitle = {Computer Vision Winter Workshop (CVWW)},
    year      = {2026}
}

πŸ§‘β€πŸ’» Authors

Downloads last month
1,037
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Space using vrg-prague/BBoxMaskPose 1

Papers for vrg-prague/BBoxMaskPose