FirstSight: Distilled Qwen2-VL-2B for Efficient Egocentric QA
Model Description
FirstSight is a knowledge-distilled vision-language model optimized for efficient egocentric question answering on edge devices. It is distilled from Qwen2-VL-7B-Instruct via logit-based knowledge distillation, achieving 3.75× parameter compression with minimal performance degradation.
Key Highlights
- 5.16× faster inference than the teacher model
- 67.9% VRAM reduction (9.39 GB savings)
- 73.4% smaller model size (2.21B vs. 8.29B parameters)
- Optimized for edge deployment on resource-constrained devices
- Specialized for egocentric (first-person perspective) scenarios
Model Architecture
- Base Model: Qwen2-VL-2B-Instruct
- Teacher Model: Qwen2-VL-7B-Instruct
- Student Parameters: 2.21B
- Precision: BFloat16 mixed precision
- Distillation Method: Logit-based knowledge distillation with KL divergence
Training Details
Training Data
- Dataset: Synthetic egocentric QA dataset with 5,000 training samples (a hypothetical record layout is sketched after this list)
- Validation Set: 1,000 samples
- Question Types: Object recognition, spatial reasoning, action understanding, temporal queries, environment understanding
- Scenarios: Kitchen, living room, office, outdoor, workshop
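For illustration only, one record in such a dataset might look like the sketch below. The field names, paths, and answer text are hypothetical and are not taken from the released training files.

```python
# Hypothetical record layout for one synthetic egocentric QA sample.
# Field names are illustrative only; the actual dataset format may differ.
sample = {
    "image": "frames/kitchen_00042.jpg",    # first-person frame (hypothetical path)
    "scenario": "kitchen",                  # kitchen, living room, office, outdoor, or workshop
    "question_type": "object_recognition",  # object, spatial, action, temporal, or environment
    "question": "What object am I holding in my right hand?",
    "answer": "A wooden cutting board.",    # placeholder answer text
}
```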
Training Procedure
- Framework: PyTorch with Hugging Face Transformers
- Epochs: 10
- Batch Size: 2 per GPU with 4× gradient accumulation (effective batch size 8)
- Learning Rate: 1e-5 (AdamW optimizer)
- Scheduler: Cosine annealing with 100 warmup steps
- Loss Function: Weighted combination of distillation loss (weight α = 0.7) and hard-label cross-entropy (weight 1 - α = 0.3); see the loss sketch after this list
- Temperature: 2.0 for knowledge distillation
- Hardware: NVIDIA Quadro RTX 8000 (48GB)
- Training Time: ~4 hours
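A minimal PyTorch sketch of this combined objective is shown below, assuming teacher and student logits are aligned over the same vocabulary and that ignored positions in the labels are marked with -100. The function name and tensor shapes are illustrative, not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.7):
    """Weighted KD objective: alpha * KL(soft targets) + (1 - alpha) * CE(hard labels).

    Illustrative sketch: assumes logits of shape (batch, seq_len, vocab) aligned
    between teacher and student, and labels with -100 at ignored positions.
    """
    # Soft targets: per-token KL divergence between temperature-scaled distributions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl_per_token = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)

    # Average the KL only over supervised positions; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    mask = (labels != -100).float()
    kd_loss = (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0) * temperature**2

    # Hard-label cross-entropy on the ground-truth answer tokens.
    ce_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

With α = 0.7 the soft distillation term dominates, while the hard-label term keeps the student anchored to the ground-truth answers.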
Training Hyperparameters
```json
{
  "learning_rate": 1e-5,
  "optimizer": "AdamW",
  "weight_decay": 0.01,
  "gradient_accumulation_steps": 4,
  "max_grad_norm": 1.0,
  "warmup_steps": 100,
  "temperature": 2.0,
  "alpha": 0.7,
  "epochs": 10
}
```
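As a rough sketch of how these values fit together, the snippet below wires them into AdamW, a cosine schedule with warmup, gradient accumulation, and gradient clipping. The `student`, `train_loader`, and `compute_loss` names are placeholders, not the project's actual training loop.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholders: `student`, `train_loader`, and `compute_loss` stand in for the
# real model, dataloader, and the distillation loss sketched above.
accum_steps, epochs = 4, 10
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5, weight_decay=0.01)
total_updates = (len(train_loader) // accum_steps) * epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=total_updates
)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        loss = compute_loss(student, batch) / accum_steps  # scale for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(student.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```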
Performance Metrics
Inference Speed
| Metric | Teacher (7B) | Student (2B) | Improvement |
|---|---|---|---|
| Avg Latency | 1.260 s | 0.244 s | 5.16× |
| Throughput (samples/s) | 0.79 | 4.10 | 5.16× |
| Throughput (tokens/s) | 34.33 | 162.55 | 4.73× |
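These speed numbers are self-reported; a measurement of this kind can be approximated with a simple timing loop around `generate`, as sketched below. It assumes a CUDA device, a single-sample batch, and `model`/`inputs` prepared as in the usage example further down; the warm-up and run counts are arbitrary.

```python
import time
import torch

def benchmark(model, inputs, n_warmup=3, n_runs=20, max_new_tokens=128):
    """Rough latency/throughput measurement for one prepared single-sample batch."""
    # Warm-up runs so one-time costs (kernel compilation, caching) do not skew timing.
    for _ in range(n_warmup):
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()

    start = time.perf_counter()
    generated_tokens = 0
    for _ in range(n_runs):
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Count only the newly generated tokens, not the prompt.
        generated_tokens += out.shape[1] - inputs["input_ids"].shape[1]
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return {
        "avg_latency_s": elapsed / n_runs,
        "throughput_samples_per_s": n_runs / elapsed,
        "throughput_tokens_per_s": generated_tokens / elapsed,
    }
```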
Memory Usage
| Metric | Teacher (7B) | Student (2B) | Savings |
|---|---|---|---|
| Model Size | 8.29B params | 2.21B params | 73.4% |
| Peak VRAM | 13.81 GB | 4.43 GB | 9.39 GB (67.9%) |
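Peak VRAM can be tracked with PyTorch's CUDA memory statistics. A minimal sketch, again assuming `model` and `inputs` are prepared as in the usage example below:

```python
import torch

# Peak allocated CUDA memory observed during one generation pass.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM during generation: {peak_gb:.2f} GB")
```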
Model Compression
- Compression Ratio: 3.75×
- Parameter Reduction: 73.4%
- From: 8.29B parameters
- To: 2.21B parameters
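The parameter counts and compression ratio follow directly from the two checkpoints. A small sketch, assuming the student `model` is loaded as in the usage example below; the teacher count is taken from the table above rather than loading the 7B model:

```python
# Assumes the student `model` is loaded as in the usage example below.
def count_params(m):
    return sum(p.numel() for p in m.parameters())

student_params = count_params(model)  # ~2.21e9 for this checkpoint
teacher_params = 8.29e9               # reported size of Qwen2-VL-7B-Instruct
print(f"Compression ratio:   {teacher_params / student_params:.2f}x")
print(f"Parameter reduction: {1 - student_params / teacher_params:.1%}")
```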
Usage
Installation
```bash
pip install transformers torch pillow qwen-vl-utils accelerate
```
Inference Example
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_name = "YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# Prepare image and question
image = Image.open("egocentric_image.jpg")
question = "What object am I holding in my right hand?"

# Create conversation template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Question: {question}\nAnswer concisely:"},
        ],
    },
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

# Decode only the newly generated tokens
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(f"Answer: {response}")
```
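For tighter memory budgets, the checkpoint can in principle also be loaded with 4-bit quantization through bitsandbytes (listed among the framework versions below). This is an untested sketch, not a configuration for which results are reported on this card:

```python
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration
import torch

model_name = "YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16, matching training precision
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
```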
Intended Use
Primary Use Cases
- Egocentric Visual Question Answering: Answer questions about first-person perspective images/videos
- Edge Device Deployment: Run VLM inference on resource-constrained hardware (mobile, IoT, AR/VR)
- Real-time Assistive Systems: Power low-latency visual assistants for wearable cameras
- Smart Glasses Applications: Enable efficient VLM capabilities on AR/VR headsets
Supported Question Types
- Object Recognition: "What object did I just pick up?"
- Spatial Reasoning: "Where is the nearest door?"
- Action Understanding: "What action am I performing?"
- Temporal Queries: "What was I looking at 5 seconds ago?"
- Environment Understanding: "What room am I in?"
- Counting: "How many items are on the table?"
- Attribute Recognition: "What color is the object I'm holding?"
Limitations
- Model is specialized for egocentric scenarios and may perform worse on third-person images
- Trained on synthetic data; real-world performance may vary
- No additional multimodal pretraining: the student is adapted solely through knowledge distillation from the teacher
- May inherit biases from the teacher model (Qwen2-VL-7B)
- Limited to short-form QA; not optimized for long, multi-turn conversations
Ethical Considerations
- Privacy: Egocentric images often contain sensitive personal information. Ensure proper consent and data protection.
- Bias: Model may exhibit biases from training data and teacher model. Evaluate on diverse datasets.
- Misuse: Could be used for unauthorized surveillance. Deploy responsibly with user consent.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{firstsight2024,
  title        = {FirstSight: Efficient Knowledge Distillation for Vision-Language Models on Edge Devices},
  author       = {NYU HPML Project Team},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled}},
  note         = {Distilled from Qwen2-VL-7B-Instruct for egocentric question answering}
}
```
Model Card Authors
NYU High Performance Machine Learning (HPML) Project Team
Model Card Contact
For questions or feedback, please open an issue on the GitHub repository.
- Training Date: December 8-9, 2024
- Evaluation Date: 2025-12-09
- Framework: PyTorch 2.3.0, Transformers 4.57.3, BitsAndBytes 0.48.2