AIEI-VL1-2B
A simplified Vision-Language Model (VLM) capable of Image Captioning and Visual Question Answering (VQA).
Model Description
This model integrates a SigLIP vision encoder with a Qwen3.1-based LLM using a custom projector. It supports:
- Image Captioning: Generating descriptive text for images.
- Visual Question Answering: Answering questions based on visual input.
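To see how these pieces fit together, the short sketch below loads the model and prints its module tree. This is only an inspection aid, not part of the original instructions; the exact submodule names for the vision encoder, projector, and LLM depend on the custom modeling code shipped with the repository.
from transformers import AutoModel
# Load the custom architecture and print its top-level modules.
# The submodule names shown are whatever the repository's modeling code
# defines; none of them are guaranteed by this card.
model = AutoModel.from_pretrained("mano066/AIEI-VL1-2B", trust_remote_code=True)
print(model)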
Dependencies
To use this model, you need the following Python libraries:
pip install torch transformers pillow requests einops
Note: einops may be required by some vision backbones, depending on the configuration.
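If you want to confirm the environment before downloading the weights, the following sketch (not part of the original instructions) checks that each dependency is importable and prints its version where available:
import importlib
# Verify that the required packages are installed.
for pkg in ("torch", "transformers", "PIL", "requests", "einops"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'installed')}")
    except ImportError:
        print(f"{pkg}: missing (install it with pip)")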
Inference Example
Below is a simple code snippet to load the model and run inference on an example image.
from transformers import AutoModel, AutoTokenizer
from PIL import Image
import requests
import torch
# 1. Load Model and Tokenizer
model_id = "mano066/AIEI-VL1-2B"
# trust_remote_code=True is required as this model uses custom modeling code.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# 2. Load Example Image
image_url = "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg?cs=srgb&dl=pexels-chevanon-1108099.jpg&fm=jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Ensure image is RGB
if image.mode != "RGB":
    image = image.convert("RGB")
# 3. Generate Caption
print("Generating Caption...")
captions = model.generate_caption([image], prompt_format="detailed")
print(f"Caption: {captions[0]}")
# 4. Visual Question Answering (VQA)
print("\nPerforming VQA...")
questions = ["What animals are in the image?"]
answers = model.generate_vqa([image], questions)
print(f"Question: {questions[0]}")
print(f"Answer: {answers[0]}")