UWGZQ / nielsr (HF Staff) committed
Commit bf1f7af · 1 parent: 9d5a73d

Improve model card metadata, author list, and architecture image path (#1)

- Improve model card metadata, author list, and architecture image path (005bc1c6d7489d7218631fd36e7b80faf778678f)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+13 −10)
README.md CHANGED
@@ -1,7 +1,10 @@
 ---
-license: apache-2.0
+base_model: Qwen/Qwen2.5-VL-3B-Instruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 tags:
 - video-scene-graph
 - scene-graph-generation
@@ -9,25 +12,25 @@ tags:
 - trajectory-aware
 - perceiver-resampler
 - qwen2.5-vl
-base_model: Qwen/Qwen2.5-VL-3B-Instruct
-pipeline_tag: video-text-to-text
+datasets:
+- UWGZQ/Synthetic_Visual_Genome2
 ---
 
-# TRASER:
+# TRASER
 
 TRASER is the video scene graph generation model introduced in **Synthetic Visual Genome 2 (SVG2)**. Given a video and per-object segmentation trajectories, it generates a structured spatio-temporal scene graph describing objects, attributes, and their relations across time.
 
-**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/pdf/2602.23543)
+**Paper:** [Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos](https://arxiv.org/abs/2602.23543)
 
 **Website:** [Synthetic Visual Genome 2](https://uwgzq.github.io/papers/SVG2/)
 
-**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
+**Authors:** Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna. (Allen Institute for AI · University of Washington · Woven by Toyota)
 
 ---
 
 ## Model Architecture
 
-![TRASER Architecture](static/model.png)
+![TRASER Architecture](static/image.png)
 
 TRASER extends **Qwen2.5-VL-3B-Instruct** with two trainable Perceiver Resampler modules that implement **Trajectory-Aligned Token Arrangement**:
 
@@ -155,8 +158,8 @@ Then follow the preprocessing steps in `inference.py`: load masks → build obje
 
 TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_Visual_Genome2), a large-scale automatically annotated video scene graph dataset:
 
-- **\~636K videos** with dense panoptic, per-frame annotations
-- **\~6.6M objects · \~52M attributes · \~6.7M relations**
+- **~636K videos** with dense panoptic, per-frame annotations
+- **~6.6M objects · ~52M attributes · ~6.7M relations**
 
 ---
 
@@ -173,4 +176,4 @@ TRASER is trained on [**SVG2**](https://huggingface.co/datasets/UWGZQ/Synthetic_
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2602.23543},
 }
-```
+```
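The substance of this commit is the model card's YAML front matter. As a sanity check, the reorganized metadata block (the "+" side of the first two hunks) can be parsed and inspected with a minimal sketch; `parse_front_matter` is a hypothetical helper written here for illustration, stdlib only, handling just the flat key/value and key/list subset that model cards use.

```python
# Sanity-check the model card front matter as it reads after this commit.
# CARD mirrors the "+" lines of the diff; parse_front_matter is a
# hypothetical, minimal helper (not a full YAML parser).

CARD = """\
---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video-scene-graph
- scene-graph-generation
- trajectory-aware
- perceiver-resampler
- qwen2.5-vl
datasets:
- UWGZQ/Synthetic_Visual_Genome2
---
# TRASER
"""

def parse_front_matter(text):
    """Parse the simple key/value and key/list subset used above."""
    lines = text.splitlines()
    assert lines[0] == "---", "front matter must open with ---"
    end = lines.index("---", 1)          # closing delimiter
    meta, key = {}, None
    for line in lines[1:end]:
        if line.startswith("- "):        # list item under the current key
            meta[key].append(line[2:].strip())
        else:                            # "key: value" or "key:" opening a list
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip() if value.strip() else []
    return meta

meta = parse_front_matter(CARD)
print(meta["base_model"])   # Qwen/Qwen2.5-VL-3B-Instruct
print(meta["datasets"])     # ['UWGZQ/Synthetic_Visual_Genome2']
```

Note that after the change every Hub-recognized key (`base_model`, `license`, `pipeline_tag`, `library_name`, `datasets`) sits in the same block, which is what lets the Hub link the card to the base model and training dataset.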