SreyanG-NVIDIA committed on
Commit 438ab7b · verified · 1 Parent(s): 8327ed6

Update README.md

Files changed (1): README.md (+281 −0)

**This model is for non-commercial research purposes only.**

## Usage

Audio Flamingo 3 is supported in 🤗 Transformers. To run the model, first install Transformers from source:

```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers
```

> **Note:** AF3 processes audio in 30-second windows with a **10-minute** total cap per sample. Longer inputs are truncated.
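
If your recordings can exceed that cap, you may prefer to trim them yourself so you control which part is kept. A minimal sketch, assuming `librosa` and `soundfile` are installed; the filenames are hypothetical:

```python
import librosa
import soundfile as sf

MAX_SECONDS = 10 * 60  # AF3's 10-minute cap per sample

# Keep only the first 10 minutes instead of relying on silent truncation.
audio, sr = librosa.load("long_recording.wav", sr=None, duration=MAX_SECONDS)
sf.write("trimmed.wav", audio, sr)
# Use "trimmed.wav" as the "path" of an audio content entry below.
```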

### Single-turn: audio + text instruction

```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the input speech."},
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=500)

# Decode only the newly generated tokens (everything after the prompt).
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
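
The `"audio"` entry accepts a URL as above or a local file path (see the local/remote transcription shortcut below). A variant of the same conversation with a hypothetical local file:

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the input speech."},
            {"type": "audio", "path": "speech_sample.wav"},  # hypothetical local file
        ],
    }
]
```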

### Multi-turn chat

```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Instruction: How does the tone of female speech change throughout the audio? Choose the correct option among the options below: (A) Sad to happy (B) Happy to sad (C) Neutral to happy (D) Happy to neutral.",
            },
            {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/000000786159.31.wav"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "(A) Sad to happy"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why do you think so?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```
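
To keep the dialogue going, append the decoded reply as an assistant turn plus a new user turn, then re-apply the chat template. A short sketch; the follow-up question is hypothetical:

```python
# Feed the model's answer back into the history and ask a follow-up.
conversation.append({"role": "assistant", "content": [{"type": "text", "text": decoded_outputs[0]}]})
conversation.append({"role": "user", "content": [{"type": "text", "text": "Summarize your reasoning in one sentence."}]})

inputs = processor.apply_chat_template(
    conversation, tokenize=True, add_generation_prompt=True, return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
```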

### Batch multiple conversations

```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=500)

decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs)
```

### Text-only and audio-only prompts

```python
# Reuses `model` and `processor` loaded in the examples above.

# text-only
conv = [{"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}]
batch = processor.apply_chat_template(conv, tokenize=True, add_generation_prompt=True, return_dict=True).to(model.device)
print(processor.batch_decode(model.generate(**batch)[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)[0])

# audio-only
conv = [{"role": "user", "content": [{"type": "audio", "path": "https://.../sample.wav"}]}]
batch = processor.apply_chat_template(conv, tokenize=True, add_generation_prompt=True, return_dict=True).to(model.device)
print(processor.batch_decode(model.generate(**batch)[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

AF3 transcription checkpoints prepend answers with fixed assistant phrasing such as `The spoken content of the audio is "<text>".`. Passing `strip_prefix=True` to `batch_decode` removes that canned prefix and the surrounding quotes so that only the transcription is kept.
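
For illustration only, with a hypothetical transcription, the decoded string before and after would look roughly like this:

```python
# skip_special_tokens=True only:
#   'The spoken content of the audio is "summer follows spring".'
# with strip_prefix=True as well:
#   'summer follows spring'
```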

### Transcribe a local/remote file (shortcut)

```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

inputs = processor.apply_transcription_request(
    audio="https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/t_837b89f2-26aa-4ee2-bdf6-f73f0dd59b26.wav"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, strip_prefix=True
)

print(decoded_outputs)
```

### Training / Fine-tuning

```python
from transformers import AudioFlamingo3ForConditionalGeneration, AutoProcessor

model_id = "nvidia/audio-flamingo-3-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
model.train()

conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'summer follows spring the days grow longer and the nights are warm'."}],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This track feels really peaceful and introspective. What elements make it feel so calming and meditative?",
                },
                {"type": "audio", "path": "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/FPSbCAANfbJLVSwD.mp3"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "The transcription of the audio is 'some transcription of the audio'."}],
        },
    ],
]

# The assistant turns are part of the inputs here, so no generation prompt is
# added; `output_labels=True` returns the labels used to compute the loss.
inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    return_dict=True,
    output_labels=True,
).to(model.device)

loss = model(**inputs).loss
loss.backward()
```
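
From there, a standard PyTorch update step applies. A minimal sketch; the optimizer choice and learning rate are illustrative, not a recommendation:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative hyperparameters

optimizer.zero_grad()
loss = model(**inputs).loss
loss.backward()
optimizer.step()
```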

### Generation options

Decoding can be tuned in the same way as for other Transformers text-generation models:

```python
generate_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}
# `batch` is any processed input from the examples above.
out = model.generate(**batch, **generate_kwargs)
```

## Additional Speed & Memory Improvements

### Flash Attention 2

If your GPU supports it and you are **not** using `torch.compile`, install Flash Attention 2 and enable it at load time:

```bash
pip install flash-attn --no-build-isolation
```

```python
import torch

device = "cuda"  # Flash Attention 2 runs on CUDA GPUs
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # FA2 requires fp16/bf16; bf16 shown here
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
).to(device)
```

### Torch compile

AF3’s forward pass is compatible with `torch.compile` for significant speed-ups:

```python
import torch

torch.set_float32_matmul_precision("high")

model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 256
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
```
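
Compilation happens on the first calls, which are therefore slow. A brief warm-up pays off before timing anything; a sketch, reusing `inputs` from any example above:

```python
# The first generations trigger compilation with the static cache.
for _ in range(2):
    _ = model.generate(**inputs)

# Later calls with the same shapes reuse the compiled graph and run faster.
outputs = model.generate(**inputs)
```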

> `torch.compile` cannot be combined with Flash Attention 2; when compiling, use the default attention implementation instead.

### PyTorch SDPA

If Flash Attention isn’t available, AF3 uses PyTorch scaled dot-product attention (SDPA) by default on supported PyTorch versions. You can also set it explicitly:

```python
import torch

device = "cuda"
model = AudioFlamingo3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, attn_implementation="sdpa"
).to(device)
```

## Results:
<center><img src="static/af3_radial-1.png" width="400"></center>