OpenAI API Compatible Inference
Hello,
first of all, this is a nice model! Thank you!
Now to my question:
Is there a way to deploy this easily (without a NIM) for inference as an OpenAI API compatible endpoint?
When I try to run it with vllm/vllm-openai, it doesn't seem to work properly.
Has anyone got it to work?
Hi! Thank you for your interest. Nemotron-Parse is now also supported in vLLM ToT (besides our fork) - maybe that would work for you?
Hey!
Thank you for the information,
I will try it out with the vLLM nightly and report back!
EDIT:
I might need to wait a bit, as I'm using Docker and the nightly image has not been rebuilt since the merge.
It now starts with the new nightly build,
although it won't work with tensor parallelism.
But now I'm facing the problem that I cannot really use the endpoint, as the model doesn't have a chat template.
Do you maybe have a chat template that you are using in the NIM container?
Or is there a specific endpoint I can use?
I tried /v1/chat/completions (here I get a chat template missing error)
and /v1/completions (here multi_modal_content doesn't work).
Any hints on how I could use it properly via the API?
Hi! Added an example with vllm serve/openai-compatible api and chat template. Let me know if this works for you
Hey @katerynaCh, thank you for your help!
It still doesn't seem to work, though. The image does not seem to be processed properly, as I'm getting this:
<x_0.1641><y_0.1844><tbc>**_S_**- \(\mathbf{u}'=\mathbf{v}^*\mathbf{s}+\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}'\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{1}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{2}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{3}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{4}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{5}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{6}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{7}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{8}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{9}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \
It feels like the output is not based on the image at all.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Parse-v1.1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)
print(resp.choices[0].message.content)
--model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
--uvicorn-log-level info \
--gpu-memory-utilization 0.60 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 8000 \
--chat-template /templates/nemotron.jinja \
--trust-remote-code \
--dtype auto \
--port ${PORT} \
--host 0.0.0.0
I cannot explain why =/
Did you make it work on your end the way you described it in the model card?
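The output quoted above repeats one fragment many times in a row, which is a typical sign of degenerate decoding rather than real page content. A minimal sketch of a check for that symptom follows; the helper names, the ", " separator, and the threshold of 5 are illustrative assumptions and not part of vLLM or the model's tooling.

```python
def longest_repeat_run(text, sep=", "):
    """Return the length of the longest run of identical consecutive segments.

    The separator is an assumption based on the output pasted in this thread.
    """
    segments = [s.strip() for s in text.split(sep) if s.strip()]
    best = 0
    run = 0
    prev = None
    for seg in segments:
        # Extend the current run if the segment repeats, otherwise start a new one.
        run = run + 1 if seg == prev else 1
        best = max(best, run)
        prev = seg
    return best


def looks_degenerate(text, threshold=5):
    """Heuristic flag: True if any segment repeats `threshold` or more times in a row."""
    return longest_repeat_run(text) >= threshold


# Synthetic example: one fragment repeated eight times, then a different one.
sample = ", ".join(["a=b"] * 8 + ["c=d"])
print(longest_repeat_run(sample), looks_degenerate(sample))
```

A check like this could be run on resp.choices[0].message.content to fail fast when the server setup is wrong, instead of inspecting the output by eye.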
It does work for me following these steps:
- vllm/vllm-openai:nightly
- pip install albumentations
- vllm serve like in Readme
- openai example like in readme (setting max_tokens < 9000)
Are you using the same environment?
I'm using this Dockerfile to build vllm/vllm-openai:nemotron
(vLLM nightly pulled today):
FROM vllm/vllm-openai:nightly
RUN pip install open_clip_torch timm albumentations
Installing only albumentations doesn't cut it; open_clip is also needed or the server won't start, which is why I added it:
(APIServer pid=1) Encountered exception while importing open_clip: No module named 'open_clip'
(APIServer pid=1) Traceback (most recent call last):
I'm running the Docker container like this:
docker run --rm \
--gpus 'all' \
--network llama-swap_llama-swap \
--name vllm-${PORT} \
--shm-size 15gb \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-v /home/meganoob1337/.cache/huggingface/hub/:/root/.cache/huggingface/hub/ \
-v /home/meganoob1337/projects/ollama/vllm_cache2:/root/.cache/vllm \
vllm/vllm-openai:nemotron \
--dtype bfloat16 \
--max-num-seqs 8 \
--limit-mm-per-prompt '{"image": 1}' \
--model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
--uvicorn-log-level info \
--gpu-memory-utilization 0.60 \
--trust-remote-code \
--port ${PORT} \
--host 0.0.0.0
with this script:
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nemotron-parse-1-1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)
print(resp.choices[0].message.content)
and I'm still getting gibberish.
I don't really understand why this is happening.
I'm using the chat template from the repository (I also tried mounting it from the filesystem, but that didn't change anything).
I feel like I'm missing something, but I cannot pinpoint it...
Hi, we have found some issues when running vLLM on non-H100 GPUs that were resolved by using the 0.14.1 image and setting export VLLM_ATTENTION_BACKEND=TRITON_ATTN when serving. Possibly this would resolve your issues too?
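For anyone launching the server from a script rather than a shell, the workaround above could be sketched roughly as follows. The function name build_vllm_launch is hypothetical, the default port is an assumption, and the flags are simply the ones already used earlier in this thread; treat it as a starting point, not a verified configuration.

```python
import os
import shlex

MODEL_ID = "nvidia/NVIDIA-Nemotron-Parse-v1.1"


def build_vllm_launch(port=8000, chat_template=None, base_env=None):
    """Return (env, argv) for a `vllm serve` launch using the Triton attention backend.

    Hypothetical helper: it only assembles the environment and command line,
    it does not start anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Workaround reported in this thread for non-H100 GPUs.
    env["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"
    argv = [
        "vllm", "serve", MODEL_ID,
        "--trust-remote-code",
        "--dtype", "bfloat16",
        "--limit-mm-per-prompt", '{"image": 1}',
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if chat_template is not None:
        # Path to a chat template file, e.g. one mounted into the container.
        argv += ["--chat-template", chat_template]
    return env, argv


env, argv = build_vllm_launch(port=8000, base_env={})
print("VLLM_ATTENTION_BACKEND=" + env["VLLM_ATTENTION_BACKEND"], shlex.join(argv))

# To actually start the server (assumes vLLM is installed, e.g. inside the 0.14.1 image):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

With Docker, the equivalent would be passing the variable with -e VLLM_ATTENTION_BACKEND=TRITON_ATTN in the docker run command shown earlier.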
@katerynaCh yes! That was it, thank you very much!!
Great, happy it worked!