nvidia/audio-flamingo-3

@@ -2,11 +2,17 @@
 license: other
 language:
 - en
 tags:
 - audio
 - reasoning
 - audio understanding
 - ASR
 ---
 # Model Overview
@@ -66,10 +72,12 @@ Extensive evaluations confirm AF3’s effectiveness, setting new benchmarks on o
 **This model is for non-commercial research purposes only.**
-<center><img src="static/af3_radial-1.png" width="400"></center>
-<br>
 <center><img src="static/af3_main_diagram-1.png" width="800"></center>
@@ -110,21 +118,21 @@ AF3 uses:
 **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).** <br>
 ## Input:
-Input Type: Audio, Text <br>
-Input Format: WAV/MP3/FLAC, UTF-8 text <br>
-Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
-Other Properties Related to Input: <br>
--Max Audio Length: 10 Minutes <br>
--Max Text Length: 16000 tokens<br>
 ## Output:
-Output Type: Text (and optional speech) <br>
-Text Format: UTF-8 string  <br>
-Output Parameters: One-Dimensional (1D)<br>
-Other Properties Related to Output: <br>
--Max Text Length: 1024 tokens <br>
--Speech Format: streaming TTS (text-to-speech) waveform<br>
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
@@ -266,4 +274,4 @@ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.
 ---
 ## Acknowledgements
-Built with Qwen, NVILA and the open audio-ML community.

 license: other
 language:
 - en
+arxiv: 2503.03983
 tags:
 - audio
 - reasoning
 - audio understanding
 - ASR
+datasets:
+- nvidia/AudioSkills
+- nvidia/AF-Chat
+- nvidia/AF-Think
+- nvidia/LongAudio
 ---
 # Model Overview
 **This model is for non-commercial research purposes only.**
+## Model Architecture:
+Audio Flamingo 3 uses an AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). It accepts audio inputs of up to 10 minutes.
+<center><img src="static/af3_radial-1.png" width="400"></center>
+## Results:
 <center><img src="static/af3_main_diagram-1.png" width="800"></center>
 **This model was developed based on [NVILA](https://github.com/NVlabs/VILA/tree/main/scripts/NVILA-Lite) and [Qwen-2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B).** <br>
 ## Input:
+- Input Type: Audio, Text <br>
+- Input Format: WAV/MP3/FLAC, UTF-8 text <br>
+- Input Parameters: Audio is Two-Dimensional (2D) and Text is One-Dimensional (1D)<br>
+- Other Properties Related to Input: <br>
+- Max Audio Length: 10 Minutes <br>
+- Max Text Length: 16000 tokens<br>
 ## Output:
+- Output Type: Text (and optional speech) <br>
+- Text Format: UTF-8 string  <br>
+- Output Parameters: One-Dimensional (1D)<br>
+- Other Properties Related to Output: <br>
+- Max Text Length: 1024 tokens <br>
+- Speech Format: streaming TTS (text-to-speech) waveform<br>
 Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems (A100/H100). By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
 ---
 ## Acknowledgements
+Built with Qwen, NVILA and the open audio-ML community.
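
The Input section of the card states three limits: WAV/MP3/FLAC audio, at most 10 minutes of audio, and at most 16000 text tokens. A minimal sketch of a pre-flight check against those limits might look like the following. The helper name `check_request` and its arguments are hypothetical and not part of the model's released code; the caller is assumed to measure the audio duration and token count with whatever loader and tokenizer they already use.

```python
from pathlib import Path

# Limits copied from the model card's Input section.
ALLOWED_AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}
MAX_AUDIO_SECONDS = 10 * 60   # "Max Audio Length: 10 Minutes"
MAX_PROMPT_TOKENS = 16000     # "Max Text Length: 16000 tokens"


def check_request(audio_path: str, audio_seconds: float, prompt_tokens: int) -> list[str]:
    """Return a list of problems; an empty list means the request fits the stated limits.

    Hypothetical helper: duration and token count are supplied by the caller,
    because how they are measured depends on the audio loader and tokenizer.
    """
    problems = []

    # The format check is by file extension only; it does not inspect the file.
    suffix = Path(audio_path).suffix.lower()
    if suffix not in ALLOWED_AUDIO_SUFFIXES:
        problems.append(f"unsupported audio format: {suffix or '(none)'}")

    if audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(
            f"audio is {audio_seconds:.0f}s, limit is {MAX_AUDIO_SECONDS}s"
        )

    if prompt_tokens > MAX_PROMPT_TOKENS:
        problems.append(
            f"prompt is {prompt_tokens} tokens, limit is {MAX_PROMPT_TOKENS}"
        )

    return problems
```

Returning a list rather than raising on the first failure lets a caller report every violated limit at once, for example before uploading a long recording.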