Add SGLang inference commands to README (#7)
Commit: bc8d397781e5ac79e2521ecf4bb1aac099a6db43
Co-authored-by: Netanel Haber <[email protected]>
README.md CHANGED

@@ -89,6 +89,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 **Runtime Engine(s):**
 * [vLLM] <br>
 * [TRT-LLM] <br>
+* [SGLang] <br>
 
 **Supported Hardware Microarchitecture Compatibility:** <br>
 * NVIDIA L40S <br>*
@@ -346,6 +347,29 @@ vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 --trust-remote-code --quant
 vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --trust-remote-code --quantization modelopt_fp4 --video-pruning-rate 0
 ```
 
+#### Inference with SGLang
+Support is verified on SGLang **main**:
+
+```bash
+pip install "git+https://github.com/sgl-project/sglang.git@main#subdirectory=python"
+```
+
+**BF16**
+```bash
+sglang serve --trust-remote-code --model-path nvidia/Nemotron-Nano-12B-v2-VL-BF16 --max-mamba-cache-size 256  # adjust '--max-mamba-cache-size' as needed to fit in memory
+```
+
+**FP8**
+```bash
+sglang serve --trust-remote-code --model-path nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
+```
+
+**FP4**
+```bash
+sglang serve --trust-remote-code --model-path nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --quantization modelopt_fp4
+```
+
+
 ## Training, Testing, and Evaluation Datasets:
 
 ### Training Datasets:
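The hunk above installs SGLang from source rather than from a tagged release. As a quick sanity check that the source build actually landed, the installed version can be printed; a minimal sketch, assuming the `sglang` package exposes `__version__` as current releases do:

```bash
# Sanity check after the source install (assumes the sglang package
# exposes __version__, as recent releases do).
python -c "import sglang; print(sglang.__version__)"
```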
@@ -498,6 +522,7 @@ Evaluation benchmarks scores: <br>
 # Inference: <br>
 **Acceleration Engine:** vLLM <br>
 **Acceleration Engine:** TRT-LLM <br>
+**Acceleration Engine:** SGLang <br>
 
 **Test Hardware:** <br>
 * NVIDIA L40S <br>
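Once one of the `sglang serve` commands added by this diff is running, a request can be sent to the server's OpenAI-compatible endpoint. A minimal smoke test, assuming SGLang's default port 30000, the FP8 checkpoint name from the README, and a placeholder image URL (swap in whichever variant you actually launched):

```bash
# Minimal smoke test against the SGLang server's OpenAI-compatible
# chat endpoint. Assumptions: default port 30000, placeholder image
# URL, and the FP8 checkpoint name from the README.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."}
      ]}
    ],
    "max_tokens": 128
  }'
```

The same request shape should also work against the `vllm serve` endpoints shown elsewhere in the README, since vLLM's OpenAI-compatible server defaults to port 8000.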