SreyanG-NVIDIA committed on
Commit 2c92866 · verified · 1 Parent(s): 6601899

Update README.md

Files changed (1)
  1. README.md +23 -15
README.md CHANGED
@@ -2,11 +2,17 @@
2
  license: other
3
  language:
4
  - en
 
5
  tags:
6
  - audio
7
  - reasoning
8
  - audio understanding
9
  - ASR
 
 
 
 
 
10
  ---
11
  # Model Overview
12
 
@@ -66,10 +72,12 @@ Extensive evaluations confirm AF3’s effectiveness, setting new benchmarks on o
66
 
67
  **This model is for non-commercial research purposes only.**
68
 
69
- <center><img src="static/af3_radial-1.png" width="400"></center>
 
70
 
71
- <br>
72
 
 
73
  <center><img src="static/af3_main_diagram-1.png" width="800"></center>
74
 
75
 
@@ -110,21 +118,21 @@ AF3 uses:
110
  **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) <br>
111
 
112
  ## Input:
113
- Input Type: Audio, Text <br>
114
- Input Format: WAV/MP3/FLAC, UTF-8 text <br>
115
- Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
116
- Other Properties Related to Input: <br>
117
- -Max Audio Length: 10 Minutes <br>
118
- -Max Text Length: 16000 tokens<br>
119
 
120
 
121
  ## Output:
122
- Output Type: Text (and optional speech) <br>
123
- Text Format: UTF-8 string <br>
124
- Output Parameters: One-Dimensional (1D)<br>
125
- Other Properties Related to Output: <br>
126
- -Max Text Length: 1024 tokens <br>
127
- -Speech Format: streaming TTS (text-to-speech) waveform<br>
128
 
129
 
130
  Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
@@ -266,4 +274,4 @@ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.
266
  ---
267
 
268
  ## Acknowledgements
269
- Built with Qwen, NVILA and the open audio-ML community.
 
2
  license: other
3
  language:
4
  - en
5
+ arxiv: 2503.03983
6
  tags:
7
  - audio
8
  - reasoning
9
  - audio understanding
10
  - ASR
11
+ datasets:
12
+ - nvidia/AudioSkills
13
+ - nvidia/AF-Chat
14
+ - nvidia/AF-Think
15
+ - nvidia/LongAudio
16
  ---
17
  # Model Overview
18
 
 
72
 
73
  **This model is for non-commercial research purposes only.**
74
 
75
+ ## Model Architecture:
76
+ Audio Flamingo 3 uses AF-Whisper unified audio encoder, MLP-based audio adaptor, Decoder-only LLM backbone (Qwen2.5-7B), and Streaming TTS module (AF3-Chat). Audio Flamingo 3 can take up to 10 minutes of audio inputs.
77
 
78
+ <center><img src="static/af3_radial-1.png" width="400"></center>
79
 
80
+ ## Results:
81
  <center><img src="static/af3_main_diagram-1.png" width="800"></center>
82
 
83
 
 
118
  **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) <br>
119
 
120
  ## Input:
121
+ - Input Type: Audio, Text <br>
122
+ - Input Format: WAV/MP3/FLAC, UTF-8 text <br>
123
+ - Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
124
+ - Other Properties Related to Input: <br>
125
+ - Max Audio Length: 10 Minutes <br>
126
+ - Max Text Length: 16000 tokens<br>
127
 
128
 
129
  ## Output:
130
+ - Output Type: Text (and optional speech) <br>
131
+ - Text Format: UTF-8 string <br>
132
+ - Output Parameters: One-Dimensional (1D)<br>
133
+ - Other Properties Related to Output: <br>
134
+ - Max Text Length: 1024 tokens <br>
135
+ - Speech Format: streaming TTS (text-to-speech) waveform<br>
136
 
137
 
138
  Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
 
274
  ---
275
 
276
  ## Acknowledgements
277
+ Built with Qwen, NVILA and the open audio-ML community.
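
The Input section in this revision lists concrete limits: WAV/MP3/FLAC audio, at most 10 minutes of audio, and at most 16000 text tokens. As a minimal sketch of how a caller might check those limits before sending a request, the helper below is hypothetical and not part of the model's code. The name `check_inputs` and its parameters are assumptions, and the caller is assumed to already know the audio duration and the token count, since how they are measured depends on the audio library and tokenizer in use.

```python
from pathlib import Path

# Limits copied from the README's Input section.
ALLOWED_AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}
MAX_AUDIO_SECONDS = 10 * 60   # "Max Audio Length: 10 Minutes"
MAX_TEXT_TOKENS = 16_000      # "Max Text Length: 16000 tokens"


def check_inputs(audio_path: str, audio_seconds: float, text_tokens: int) -> list[str]:
    """Return a list of limit violations; the list is empty if all checks pass.

    Hypothetical helper (an assumption for illustration): it only inspects
    the file suffix and the numbers passed in, and never opens the file.
    """
    problems = []

    # Format check is based on the file extension only.
    suffix = Path(audio_path).suffix.lower()
    if suffix not in ALLOWED_AUDIO_SUFFIXES:
        problems.append(f"unsupported audio format: {suffix or '(none)'}")

    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(
            f"audio is {audio_seconds:.1f}s, limit is {MAX_AUDIO_SECONDS}s"
        )

    if text_tokens > MAX_TEXT_TOKENS:
        problems.append(
            f"text is {text_tokens} tokens, limit is {MAX_TEXT_TOKENS}"
        )

    return problems
```

For example, `check_inputs("clip.wav", 30.0, 100)` returns an empty list, while a 601-second `.ogg` file with 16001 tokens of text yields three violations. Returning a list instead of raising on the first failure lets the caller report every problem at once.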