HSR Inverse Dynamics Model (IDM)
An inverse dynamics model (IDM) trained on Toyota HSR robot manipulation episodes. Given two consecutive observations, it predicts the actions that connect them.
Architecture
- Vision encoder: SigLIP-2 (google/siglip2-base-patch16-224, frozen)
- Action head: Flow Matching Transformer (4-layer, hidden_dim=512)
- Input: consecutive frames (frame_t, frame_t+1) from the head and hand cameras (2 frames × 2 cameras = 4 images)
- Output: action_chunk (H=4 future actions, 8-DOF)
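
To make the output shape concrete, here is a minimal sketch of how a flow-matching head is commonly sampled: start from Gaussian noise shaped like the action chunk (H=4, 8-DOF) and integrate a velocity field from t=0 to t=1 with Euler steps. The `toy_velocity` function is a hypothetical stand-in for the trained transformer, which would instead condition on the SigLIP-2 features of the 4 input images. The function names, step count, and target values are assumptions for illustration and are not taken from this repository.

```python
import random

H, DOF = 4, 8  # action chunk: 4 future actions, 8 degrees of freedom each


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained action head.

    Returns the velocity that moves x straight toward `target`, reaching
    it at t=1. The real model would predict a velocity from x, t and the
    image features instead of being handed the target.
    """
    return [[(g - v) / (1.0 - t) for v, g in zip(row, goal)]
            for row, goal in zip(x, target)]


def sample_action_chunk(velocity_fn, target, num_steps=10, seed=0):
    """Euler-integrate dx/dt = velocity from noise (t=0) to actions (t=1)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(DOF)] for _ in range(H)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) never reaches zero
        v = velocity_fn(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


# Assumed "ground truth" chunk, used only to drive the toy velocity field.
target_chunk = [[0.1 * (h + d) for d in range(DOF)] for h in range(H)]
chunk = sample_action_chunk(toy_velocity, target_chunk)
print(len(chunk), len(chunk[0]))
```

The sampler only depends on the velocity function's signature, so swapping in a learned network changes `velocity_fn` and leaves the integration loop as is.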
Training
- Dataset: 44,892 train / 4,987 val frame-action pairs from approved HSR episodes
- 50 epochs, AdamW lr=1e-4, cosine+warmup schedule (per-batch stepping)
- Mixed precision: BF16
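
As an illustration of a cosine schedule with warmup that steps once per batch, the sketch below computes the learning rate as a function of the optimizer step. The base learning rate, epoch count, and training-set size come from the list above. The batch size, the number of warmup steps, the linear warmup shape, and the decay to zero are assumptions, since none of them is stated here.

```python
import math

BASE_LR = 1e-4          # from the training setup above
EPOCHS = 50             # from the training setup above
TRAIN_SAMPLES = 44_892  # from the training setup above
BATCH_SIZE = 256        # assumption: not stated in this document
WARMUP_STEPS = 500      # assumption: not stated in this document

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = EPOCHS * steps_per_epoch


def lr_at_step(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=total_steps):
    """Learning rate for one optimizer step (the schedule advances per batch)."""
    if step < warmup:
        # Linear warmup from base_lr / warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, lr_at_step(step))
```

Because the schedule is indexed by step and not by epoch, the learning rate changes smoothly within each epoch, which is what per-batch stepping refers to.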
Files
- best_model.pt: Full model checkpoint (weights only, no optimizer state)
- action_stats.json: Action normalization statistics (mean, std, min, max)
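
The normalization statistics are needed to map model outputs back to robot units. The sketch below assumes that actions are standardized with the per-dimension mean and std, and that the JSON stores each statistic as a list of 8 values under the keys `mean`, `std`, `min`, and `max`. Both the key layout and the choice of mean/std over min/max scaling are assumptions; check the real file before relying on them.

```python
import json

# Hypothetical contents of action_stats.json for an 8-DOF action space.
# The key layout and values are illustrative, not copied from the real file.
example_json = json.dumps({
    "mean": [0.0] * 8,
    "std": [0.5] * 8,
    "min": [-1.0] * 8,
    "max": [1.0] * 8,
})
# With the real file, read it from disk with json.load instead.
stats = json.loads(example_json)


def normalize(action, stats, eps=1e-8):
    """Standardize one 8-DOF action: (a - mean) / std."""
    return [(a - m) / (s + eps)
            for a, m, s in zip(action, stats["mean"], stats["std"])]


def denormalize(action, stats, eps=1e-8):
    """Invert normalize(): a * std + mean, back to robot units."""
    return [a * (s + eps) + m
            for a, m, s in zip(action, stats["mean"], stats["std"])]


raw_action = [0.25 * i - 1.0 for i in range(8)]
restored = denormalize(normalize(raw_action, stats), stats)
print(restored)
```

Each of the H=4 actions in a predicted chunk would be passed through `denormalize` separately before being sent to the robot.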