AIEI-VL1-2B

A simplified Vision-Language Model (VLM) capable of image captioning and visual question answering (VQA).

Model Description

This model integrates a SigLIP vision encoder with a Qwen3.1-based LLM using a custom projector. It supports:

  • Image Captioning: Generating descriptive text for images.
  • Visual Question Answering: Answering questions based on visual input.
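For context, the projector's job is to map the vision encoder's patch features into the LLM's embedding space so that image tokens can be fed alongside text tokens. The actual projector lives in this model's custom modeling code; the snippet below is only a minimal illustrative sketch of the general idea (a two-layer MLP), with all class names and dimensions hypothetical.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative two-layer MLP projector (hypothetical, not the model's actual code).

    Maps vision features of size vision_dim into the LLM hidden size llm_dim so
    that projected image tokens can be concatenated with text token embeddings.
    """
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns:         (batch, num_patches, llm_dim)
        return self.proj(vision_features)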

Dependencies

To use this model, you need the following Python libraries:

pip install torch transformers pillow requests einops

Note: einops may be required by some vision backbones, depending on the configuration.
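The model card does not pin exact library versions. As a quick sanity check after installation, the imports below should succeed and print the installed versions.

import torch
import transformers
import PIL
import einops

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("pillow:", PIL.__version__)
print("einops:", einops.__version__)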

Inference Example

Below is a simple code snippet to load the model and run inference on an example image.

from transformers import AutoModel, AutoTokenizer
from PIL import Image
import requests
import torch

# 1. Load Model and Tokenizer
model_id = "mano066/AIEI-VL1-2B"
# trust_remote_code=True is required as this model uses custom modeling code.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# 2. Load Example Image
image_url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg?cs=srgb&dl=pexels-chevanon-1108099.jpg&fm=jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Ensure image is RGB
if image.mode != "RGB":
    image = image.convert("RGB")

# 3. Generate Caption
print("Generating Caption...")
captions = model.generate_caption([image], prompt_format="detailed")
print(f"Caption: {captions[0]}")

# 4. Visual Question Answering (VQA)
print("\nPerforming VQA...")
questions = ["What animals are in the image?"]
answers = model.generate_vqa([image], questions)
print(f"Question: {questions[0]}")
print(f"Answer: {answers[0]}")