Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
ICCV 2025 + CVPR 2025
Key contributions:
- MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
- Download pre-trained weights below
- PMPose: a pose estimation model conditioned on segmentation masks AND predicting a full description of each keypoint; a combination of MaskPose and ProbPose (CVPR'25)
- BBox-MaskPose (BMP): a method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation, and pose estimation (see the loop sketch after this list)
- Try the demo!
- Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
- Download pre-trained weights below
- Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
For more details, see the GitHub repository.
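
The BMP loop alternates the three tasks: detect and segment instances, estimate each pose conditioned on its mask, then mask out the already-explained people and detect again so that previously missed bodies can surface. The sketch below only illustrates this control flow; the three callables are placeholders, not the repository's API.

```python
# Illustrative control flow of the BBox-MaskPose (BMP) loop. The three
# callables are placeholders for a detector+segmenter, a mask-conditioned
# pose estimator (MaskPose), and a routine that paints detected instances
# out of the image; none of them is the repository's actual API.

def bmp_loop(image, detect_instances, estimate_pose, mask_out, num_rounds=2):
    """Alternate detection/segmentation and pose estimation over several rounds."""
    all_instances = []
    working_image = image.copy()

    for _ in range(num_rounds):
        # 1) Detect people in the current image: bounding boxes + instance masks.
        instances = detect_instances(working_image)
        if not instances:
            break

        # 2) Estimate each pose conditioned on its instance mask (MaskPose).
        for inst in instances:
            inst["pose"] = estimate_pose(image, inst["bbox"], inst["mask"])
        all_instances.extend(instances)

        # 3) Mask out the already-explained people so the next detection round
        #    can surface bodies that were occluded or suppressed before.
        working_image = mask_out(working_image, [inst["mask"] for inst in instances])

    return all_instances
```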
Models List
- ViTPose-B multi-dataset
- MaskPose
- PMPose
- Fine-tuned RTMDet-L
See details of each model below.
1. ViTPose-B [multi-dataset]
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256)
- Output: Keypoint coordinates (48x64 heatmap for each keypoint, 21 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A100
What's new? ViTPose trained on multiple datasets performs much better in multi-body (and crowded) scenarios than COCO-only ViTPose. The original authors trained the model in a multi-dataset setup before; this release is a reproduction compatible with MMPose 2.0.
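
For orientation, multi-dataset training in MMPose is typically expressed with a CombinedDataset that remaps each source dataset into one unified keypoint space. The snippet below is a minimal sketch of that pattern, not the exact configuration used for this model; paths, pipelines, metainfo, and the mapping indices are placeholders.

```python
# Minimal sketch of multi-dataset training in MMPose via CombinedDataset.
# Paths, the keypoint mapping, the metainfo file and the (empty) pipelines are
# placeholders; see the repository configs for the actual values.
train_pipeline = []  # shared augmentation + target-encoding transforms go here

dataset_coco = dict(
    type='CocoDataset',
    data_root='data/coco/',
    ann_file='annotations/person_keypoints_train2017.json',
    data_prefix=dict(img='train2017/'),
    pipeline=[],  # per-dataset transforms (none needed for the anchor dataset)
)

dataset_aic = dict(
    type='AicDataset',
    data_root='data/aic/',
    ann_file='annotations/aic_train.json',
    data_prefix=dict(img='images/'),
    pipeline=[
        # Remap AIC keypoint indices into the unified keypoint space.
        dict(type='KeypointConverter',
             num_keypoints=21,           # unified keypoint count (placeholder)
             mapping=[(0, 6), (1, 8)]),  # illustrative index pairs only
    ],
)

train_dataset = dict(
    type='CombinedDataset',
    metainfo=dict(from_file='configs/_base_/datasets/coco.py'),  # placeholder metainfo
    datasets=[dataset_coco, dataset_aic],
    pipeline=train_pipeline,
)
```

A train_dataloader would then point its dataset field at train_dataset.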
2. MaskPose-1.1.0
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256) + estimated instance segmentation
- Output: Keypoint coordinates (48x64 heatmap for each keypoint, 23 keypoints)
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
- Size(s): -S, -B, -L, -H
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset + SAM-estimated instance masks
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 210
- Batch size: 64
- Learning rate: 5e-5
- Hardware: 4x NVIDIA A100
What's new? Compared to ViTPose, MaskPose takes an instance segmentation mask as an additional input and is even better at distinguishing instances in multi-body scenes, with no computational overhead compared to ViTPose.
V1.0.0 vs. V1.1.0: The previous version (v1.0.0) predicted 21 keypoints and used a different training recipe. V1.1.0 predicts 23 keypoints and uses an improved recipe with dataset balancing, which improves the results.
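
Since the model card lists an instance mask as an extra input but no extra parameters, the conditioning has to enter through the input itself, e.g. by blending the mask into the bounding-box crop before the backbone. The sketch below shows one plausible blending (darkening out-of-mask pixels); the exact MaskPose conditioning may differ, so treat it purely as an illustration.

```python
import numpy as np

def condition_crop_with_mask(crop_rgb: np.ndarray,
                             instance_mask: np.ndarray,
                             background_weight: float = 0.3) -> np.ndarray:
    """Darken pixels outside the target instance mask (illustrative only).

    crop_rgb:      HxWx3 uint8 crop around the person bounding box
    instance_mask: HxW binary mask of the target person, aligned with the crop
    """
    mask = instance_mask.astype(np.float32)[..., None]     # HxWx1
    weights = mask + (1.0 - mask) * background_weight      # 1 inside, <1 outside
    blended = crop_rgb.astype(np.float32) * weights
    return blended.clip(0, 255).astype(np.uint8)
```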
3. PMPose-1.0.0
- Model type: ViT-b backbone with multi-layer decoder
- Input: RGB images (192x256) + estimated instance segmentation
- Output: Keypoint coordinates (48x64 probmap for each keypoint, 23 keypoints), plus presence probability, visibility, and expected OKS for each keypoint
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMPose
- Size(s): -S, -B, -L, -H
Training Details
- Training data: COCO Dataset, MPII Dataset, AIC Dataset + SAM-estimated instance masks
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 20
- Batch size: 64
- Learning rate: 5e-5
- Frozen backbone
- Hardware: 4x NVIDIA A100
What's new? PMPose combines MaskPose-1.1.0 and ProbPose (CVPR'25). It is conditioned on masks, matching MaskPose's superior in-crowd performance, and it also predicts keypoint presence probabilities and visibilities like ProbPose.
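
The extra per-keypoint outputs can be consumed roughly as follows: take the probmap peak for the location, and use the presence probability to decide whether to keep the keypoint at all. This is an illustrative decoding sketch with assumed array shapes (64x48 maps, 23 keypoints) and an assumed 0.5 presence threshold; it is not the repository's decoder.

```python
import numpy as np

def decode_pmpose(probmaps: np.ndarray,      # (23, 64, 48) per-keypoint probability maps
                  presence: np.ndarray,       # (23,) probability the keypoint is present
                  visibility: np.ndarray,     # (23,) probability the keypoint is visible
                  expected_oks: np.ndarray,   # (23,) predicted localization quality
                  presence_thr: float = 0.5):
    """Turn PMPose-style outputs into a per-keypoint list (illustrative only)."""
    keypoints = []
    for k in range(probmaps.shape[0]):
        # The probability-map peak gives the location in heatmap space;
        # it would still need to be rescaled to the original image resolution.
        y, x = np.unravel_index(np.argmax(probmaps[k]), probmaps[k].shape)
        keypoints.append({
            "xy_heatmap": (int(x), int(y)),
            "present": bool(presence[k] >= presence_thr),  # drop keypoints predicted absent
            "visible": float(visibility[k]),
            "expected_oks": float(expected_oks[k]),
        })
    return keypoints
```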
4. Fine-tuned RTMDet-L
- Model type: CSPNeXt-P5 backbone, CSPNeXtPAFPN neck, RTMDetInsSepBN head
- Input: RGB images
- Output: Detected instances -- bbox, instance mask and class for each
- Language(s): Not language-dependent (vision model)
- License: GPL-3.0
- Framework: MMDetection
Training Details
- Training data: COCO Dataset with randomly masked-out instances
- Training script: GitHub - BBoxMaskPose_code
- Epochs: 10
- Batch size: 16
- Learning rate: 2e-2
- Hardware: 4x NVIDIA A100
What's new? RTMDet fine-tuned to ignore masked-out instances is designed for iterative detection. It is especially effective in multi-body scenes where people in the background would otherwise not be detected.
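
One round of iterative detection can be sketched with the standard MMDetection inference API: detect, gray out the detected instances, and detect again on the edited image. The config/checkpoint paths below are placeholders, and painting masked pixels with a constant gray value is an assumption about how the 'holes' are created.

```python
import numpy as np
from mmdet.apis import init_detector, inference_detector

# Placeholder config/checkpoint paths for the fine-tuned RTMDet-Ins model.
detector = init_detector('rtmdet-ins_l_finetuned.py',
                         'rtmdet-ins_l_finetuned.pth',
                         device='cuda:0')

def iterative_detect(image: np.ndarray, rounds: int = 2, score_thr: float = 0.3):
    """Run detection repeatedly, masking out instances found in earlier rounds."""
    detections = []
    working = image.copy()
    for _ in range(rounds):
        result = inference_detector(detector, working).pred_instances
        result = result[result.scores >= score_thr]
        if len(result) == 0:
            break
        detections.append(result)
        # Paint detected instances out with a constant gray value so the next
        # round ignores them; the fine-tuned model tolerates such 'holes'.
        for mask in result.masks.cpu().numpy():
            working[mask.astype(bool)] = 127
    return detections
```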
Citation
If you use our work, please cite:
@InProceedings{BMPv2,
author = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
title = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
booktitle = {arXiv preprint arXiv:to be added},
year = {2026}
}
@InProceedings{Purkrabek2025ICCV,
author = {Purkrabek, Miroslav and Matas, Jiri},
title = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025}
}
@InProceedings{Kolomiiets2026CVWW,
author = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
title = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
booktitle = {Computer Vision Winter Workshop (CVWW)},
year = {2026}
}
Authors
- Miroslav Purkrabek (personal website)
- Constantin Kolomiiets
- Jiri Matas (personal website)