Image-Text-to-Text · Transformers · Safetensors · nvidia · VLM · conversational
amalad and Nvidia-NetanelHaber committed
Commit 5d250e2 · verified · 1 parent: 5f7b7fa

Add SGLang inference commands to README (#7)


- Add SGLang inference commands to README (bc8d397781e5ac79e2521ecf4bb1aac099a6db43)


Co-authored-by: Netanel Haber <[email protected]>

Files changed (1):
  1. README.md (+25 -0)
README.md CHANGED
@@ -89,6 +89,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
  **Runtime Engine(s):**
  * [vLLM] <br>
  * [TRT-LLM] <br>
+ * [SGLang] <br>

  **Supported Hardware Microarchitecture Compatibility:** <br>
  * NVIDIA L40S <br>*
@@ -346,6 +347,29 @@ vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 --trust-remote-code --quant
  vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --trust-remote-code --quantization modelopt_fp4 --video-pruning-rate 0
  ```

+ #### Inference with SGLang
+ Support is verified in **main**:
+
+ ```bash
+ pip install "git+https://github.com/sgl-project/sglang.git@main#subdirectory=python"
+ ```
+
+ **BF16**
+ ```bash
+ sglang serve --trust-remote-code --model-path nvidia/Nemotron-Nano-12B-v2-VL-BF16 --max-mamba-cache-size 256 # Adjust '--max-mamba-cache-size' as needed, to fit in memory
+ ```
+
+ **FP8**
+ ```bash
+ sglang serve --trust-remote-code --model-path nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
+ ```
+
+ **FP4**
+ ```bash
+ sglang serve --trust-remote-code --model-path nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --quantization modelopt_fp4
+ ```
+
+
  ## Training, Testing, and Evaluation Datasets:

  ### Training Datasets:
@@ -498,6 +522,7 @@ Evaluation benchmarks scores: <br>
  # Inference: <br>
  **Acceleration Engine:** vLLM <br>
  **Acceleration Engine:** TRT-LLM <br>
+ **Acceleration Engine:** SGLang <br>

  **Test Hardware:** <br>
  * NVIDIA L40S <br>
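
As a quick check once one of the servers above is running: both vLLM and SGLang expose an OpenAI-compatible chat completions endpoint, so a single request can confirm the image-text path works. The sketch below is illustrative and not part of the commit; the host, port (vLLM typically defaults to 8000, SGLang to 30000), model name, and image URL are assumptions to adapt to your deployment.

```bash
# Hypothetical smoke test against a locally running OpenAI-compatible server.
# Adjust host, port, model path, and image URL to match your deployment.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
        ]
      }
    ],
    "max_tokens": 128
  }'
```

The response follows the standard chat completions schema, with the generated text in `choices[0].message.content`.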