Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
ICCV 2025 + CVPR 2025
Key contributions:
- MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
- Download pre-trained weights below
- PMPose: a pose estimation model conditioned on segmentation masks AND predicting a full description of each keypoint; a combination of MaskPose and ProbPose (CVPR'25)
- BBox-MaskPose (BMP): a method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation, and pose estimation (see the loop sketch after this list)
- Try the demo!
- Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
- Download pre-trained weights below
- Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
For more details, see the GitHub repository.
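
The BMP loop alternates the three tasks: detect and segment instances, estimate each pose conditioned on its mask, then mask out the already-explained people and detect again so that previously missed bodies can surface. The sketch below only illustrates this control flow; the three callables are placeholders, not the repository's API.

```python
# Illustrative control flow of the BBox-MaskPose (BMP) loop. The three
# callables are placeholders for a detector+segmenter, a mask-conditioned
# pose estimator (MaskPose), and a routine that paints detected instances
# out of the image; none of them is the repository's actual API.

def bmp_loop(image, detect_instances, estimate_pose, mask_out, num_rounds=2):
    """Alternate detection/segmentation and pose estimation over several rounds."""
    all_instances = []
    working_image = image.copy()

    for _ in range(num_rounds):
        # 1) Detect people in the current image: bounding boxes + instance masks.
        instances = detect_instances(working_image)
        if not instances:
            break

        # 2) Estimate each pose conditioned on its instance mask (MaskPose).
        for inst in instances:
            inst["pose"] = estimate_pose(image, inst["bbox"], inst["mask"])
        all_instances.extend(instances)

        # 3) Mask out the already-explained people so the next detection round
        #    can surface bodies that were occluded or suppressed before.
        working_image = mask_out(working_image, [inst["mask"] for inst in instances])

    return all_instances
```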
Models List
- ViTPose-B multi-dataset
- MaskPose
- PMPose
- Fine-tuned RTMDet-L
See details of each model below.
1. ViTPose-B [multi-dataset]
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256)
- Output: Keypoint coordinates (48x64 heatmap for each keypoint, 21 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A100
What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-only ViTPose. The original authors trained the model in a multi-dataset setup before; this release is a reproduction compatible with MMPose 2.0.
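
For orientation, multi-dataset training in MMPose is typically expressed with a CombinedDataset that remaps each source dataset into one unified keypoint space. The snippet below is a minimal sketch of that pattern, not the exact configuration used for this model; paths, pipelines, metainfo, and the mapping indices are placeholders.

```python
# Minimal sketch of multi-dataset training in MMPose via CombinedDataset.
# Paths, the keypoint mapping, the metainfo file and the (empty) pipelines are
# placeholders; see the repository configs for the actual values.
train_pipeline = []  # shared augmentation + target-encoding transforms go here

dataset_coco = dict(
    type='CocoDataset',
    data_root='data/coco/',
    ann_file='annotations/person_keypoints_train2017.json',
    data_prefix=dict(img='train2017/'),
    pipeline=[],  # per-dataset transforms (none needed for the anchor dataset)
)

dataset_aic = dict(
    type='AicDataset',
    data_root='data/aic/',
    ann_file='annotations/aic_train.json',
    data_prefix=dict(img='images/'),
    pipeline=[
        # Remap AIC keypoint indices into the unified keypoint space.
        dict(type='KeypointConverter',
             num_keypoints=21,           # unified keypoint count (placeholder)
             mapping=[(0, 6), (1, 8)]),  # illustrative index pairs only
    ],
)

train_dataset = dict(
    type='CombinedDataset',
    metainfo=dict(from_file='configs/_base_/datasets/coco.py'),  # placeholder metainfo
    datasets=[dataset_coco, dataset_aic],
    pipeline=train_pipeline,
)
```

A train_dataloader would then point its dataset field at train_dataset.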
2. MaskPose-1.1.0
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256) + estimated instance segmentation
- Output: Keypoint coordinates (48x64 heatmap for each keypoint, 23 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
- Size(s): -S, -B, -L, -H
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset + SAM-estimated instance masks
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A100
What's new? Compared to ViTPose, MaskPose takes an instance segmentation mask as an additional input and is even better at distinguishing instances in multi-body scenes, with no computational overhead compared to ViTPose.
V1.0.0 vs. V1.1.0: The previous version (v1.0.0) predicted 21 keypoints and used a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved recipe with dataset balancing, which improves the results.
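
Since the model card lists an instance mask as an extra input but no extra parameters, the conditioning has to enter through the input itself, e.g. by blending the mask into the bounding-box crop before the backbone. The sketch below shows one plausible blending (darkening out-of-mask pixels); the exact MaskPose conditioning may differ, so treat it purely as an illustration.

```python
import numpy as np

def condition_crop_with_mask(crop_rgb: np.ndarray,
                             instance_mask: np.ndarray,
                             background_weight: float = 0.3) -> np.ndarray:
    """Darken pixels outside the target instance mask (illustrative only).

    crop_rgb:      HxWx3 uint8 crop around the person bounding box
    instance_mask: HxW binary mask of the target person, aligned with the crop
    """
    mask = instance_mask.astype(np.float32)[..., None]     # HxWx1
    weights = mask + (1.0 - mask) * background_weight      # 1 inside, <1 outside
    blended = crop_rgb.astype(np.float32) * weights
    return blended.clip(0, 255).astype(np.uint8)
```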
3. PMPose-1.0.0
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256) + estimated instance segmentation
- Output: Keypoint coordinates (48x64 probmap for each keypoint, 23 keypoints), plus presence probability, visibility, and expected OKS for each keypoint
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
- Size(s): -S, -B, -L, -H
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset + SAM-estimated instance masks
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 20
- Batch size: 64
- Learning rate: 5e-5
- Frozen backbone
- Hardware: 4x NVIDIA A100
What's new? PMPose combines MaskPose-1.1.0 and ProbPose (CVPR'25). It is conditioned on masks, matching MaskPose's superior in-crowd performance, and it also predicts keypoint presence probabilities and visibilities like ProbPose.
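
The extra per-keypoint outputs can be consumed roughly as follows: take the probmap peak for the location, and use the presence probability to decide whether to keep the keypoint at all. This is an illustrative decoding sketch with assumed array shapes (64x48 maps, 23 keypoints) and an assumed 0.5 presence threshold; it is not the repository's decoder.

```python
import numpy as np

def decode_pmpose(probmaps: np.ndarray,      # (23, 64, 48) per-keypoint probability maps
                  presence: np.ndarray,       # (23,) probability the keypoint is present
                  visibility: np.ndarray,     # (23,) probability the keypoint is visible
                  expected_oks: np.ndarray,   # (23,) predicted localization quality
                  presence_thr: float = 0.5):
    """Turn PMPose-style outputs into a per-keypoint list (illustrative only)."""
    keypoints = []
    for k in range(probmaps.shape[0]):
        # The probability-map peak gives the location in heatmap space;
        # it would still need to be rescaled to the original image resolution.
        y, x = np.unravel_index(np.argmax(probmaps[k]), probmaps[k].shape)
        keypoints.append({
            "xy_heatmap": (int(x), int(y)),
            "present": bool(presence[k] >= presence_thr),  # drop keypoints predicted absent
            "visible": float(visibility[k]),
            "expected_oks": float(expected_oks[k]),
        })
    return keypoints
```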
4. Fine-tuned RTMDet-L
- Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- Input: RGB images
- Output: Detected instances -- bbox, instance mask and class for each
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMDetection
Training Details
- Training data: COCO Dataset with randomly masked-out instances
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 10
- Batch size: 16
- Learning rate: 2e-2
- Hardware: 4x NVIDIA A100
What's new? RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection. It is especially effective in multi-body scenes where people in the background would otherwise not be detected.
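
One round of iterative detection can be sketched with the standard MMDetection inference API: detect, gray out the detected instances, and detect again on the edited image. The config/checkpoint paths below are placeholders, and painting masked pixels with a constant gray value is an assumption about how the 'holes' are created.

```python
import numpy as np
from mmdet.apis import init_detector, inference_detector

# Placeholder config/checkpoint paths for the fine-tuned RTMDet-Ins model.
detector = init_detector('rtmdet-ins_l_finetuned.py',
                         'rtmdet-ins_l_finetuned.pth',
                         device='cuda:0')

def iterative_detect(image: np.ndarray, rounds: int = 2, score_thr: float = 0.3):
    """Run detection repeatedly, masking out instances found in earlier rounds."""
    detections = []
    working = image.copy()
    for _ in range(rounds):
        result = inference_detector(detector, working).pred_instances
        result = result[result.scores >= score_thr]
        if len(result) == 0:
            break
        detections.append(result)
        # Paint detected instances out with a constant gray value so the next
        # round ignores them; the fine-tuned model tolerates such 'holes'.
        for mask in result.masks.cpu().numpy():
            working[mask.astype(bool)] = 127
    return detections
```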
Citation
If you use our work, please cite:
@InProceedings{BMPv2,
author = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
title = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
booktitle = {arXiv preprint arXiv:to be added},
year = {2026}
}
@InProceedings{Purkrabek2025ICCV,
author = {Purkrabek, Miroslav and Matas, Jiri},
title = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025}
}
@InProceedings{Kolomiiets2026CVWW,
author = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
title = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
booktitle = {Computer Vision Winter Workshop (CVWW)},
year = {2026}
}
Authors
- Miroslav Purkrabek (personal website)
- Constantin Kolomiiets
- Jiri Matas (personal website)