FirstSight: Distilled Qwen2-VL-2B for Efficient Egocentric QA

Model Description

FirstSight is a knowledge-distilled vision-language model optimized for efficient egocentric question answering on edge devices. It is distilled from Qwen2-VL-7B-Instruct via logit-based knowledge distillation, achieving 3.75× parameter compression with minimal performance degradation.

Key Highlights

  • 🚀 5.16× faster inference than the teacher model
  • 💾 67.9% VRAM reduction (9.39 GB savings)
  • 📦 73.4% smaller model size (2.21B vs. 8.29B parameters)
  • ⚡ Optimized for edge deployment on resource-constrained devices
  • 🎯 Specialized for egocentric (first-person perspective) scenarios

Model Architecture

  • Base Model: Qwen2-VL-2B-Instruct
  • Teacher Model: Qwen2-VL-7B-Instruct
  • Student Parameters: 2.21B
  • Precision: BFloat16 mixed precision
  • Distillation Method: Logit-based knowledge distillation with KL divergence

Training Details

Training Data

  • Dataset: Synthetic egocentric QA dataset with 5,000 training samples
  • Validation Set: 1,000 samples
  • Question Types: Object recognition, spatial reasoning, action understanding, temporal queries, environment understanding
  • Scenarios: Kitchen, living room, office, outdoor, workshop
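
The dataset itself is not published with this card; the record below is a hypothetical illustration of the kind of sample described above. The field names are assumptions, not the actual schema.

# Hypothetical record, purely for illustration; field names are assumptions,
# not the dataset's actual schema.
sample = {
    "image": "kitchen_0042.jpg",
    "scenario": "kitchen",                  # kitchen, living room, office, outdoor, workshop
    "question_type": "object_recognition",  # one of the five question types above
    "question": "What object am I holding in my right hand?",
    "answer": "A chef's knife.",
}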

Training Procedure

  • Framework: PyTorch with Hugging Face Transformers
  • Epochs: 10
  • Batch Size: 2 per GPU with 4× gradient accumulation (effective batch size 8)
  • Learning Rate: 1e-5 (AdamW optimizer)
  • Scheduler: Cosine annealing with 100 warmup steps
  • Loss Function: Weighted combination of distillation loss (α = 0.7) and hard-label cross-entropy loss (1 - α = 0.3); a minimal sketch follows this list
  • Temperature: 2.0 for knowledge distillation
  • Hardware: NVIDIA Quadro RTX 8000 (48GB)
  • Training Time: ~4 hours
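
The sketch below mirrors the loss described above: KL divergence between temperature-scaled teacher and student distributions, mixed with hard-label cross-entropy. It is a minimal illustration under those assumptions, not the project's training code (padding and vision-token masking are omitted).

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-label term: standard cross-entropy against ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Weighted combination: alpha on the distillation term, (1 - alpha) on the hard labels.
    return alpha * kd + (1 - alpha) * ce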

Training Hyperparameters

{
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "warmup_steps": 100,
    "temperature": 2.0,
    "alpha": 0.7,
    "epochs": 10
}
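
A sketch of how these values could be wired up with standard PyTorch and Transformers utilities; `student_model` and `num_training_steps` are placeholders, and this is not the project's actual training script.

import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholders: `student_model` and `num_training_steps` come from the actual setup.
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=num_training_steps,
)

# Per micro-batch inside the training loop:
#   loss = loss / 4                                   # gradient_accumulation_steps
#   loss.backward()
# Every 4th micro-batch:
#   torch.nn.utils.clip_grad_norm_(student_model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()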

Performance Metrics

Inference Speed

Metric                   Teacher (7B)   Student (2B)   Improvement
Avg. Latency             1.260 s        0.244 s        5.16×
Throughput (samples/s)   0.79           4.10           5.16×
Throughput (tokens/s)    34.33          162.55         5.16×
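
The exact benchmark harness is not included in this card; a rough timing sketch along the following lines (warmup iterations, greedy decoding, and a single image-question pair per call are assumptions) yields comparable latency and throughput figures.

import time
import torch

# Rough timing sketch (not the exact evaluation harness): measures average
# latency, samples/s, and generated tokens/s for a prepared `model` and
# `inputs` pair, as constructed in the Usage section below.
def benchmark(model, inputs, n_iters=20, warmup=3):
    for _ in range(warmup):
        model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    new_tokens = 0
    for _ in range(n_iters):
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        new_tokens += out.shape[1] - inputs["input_ids"].shape[1]
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return {
        "avg_latency_s": elapsed / n_iters,
        "samples_per_s": n_iters / elapsed,
        "tokens_per_s": new_tokens / elapsed,
    }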

Memory Usage

Metric       Teacher (7B)    Student (2B)   Savings
Model Size   8.29B params    2.21B params   73.4%
Peak VRAM    13.81 GB        4.43 GB        9.39 GB (67.9%)
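
Peak-VRAM figures like these can be reproduced with PyTorch's CUDA memory statistics; the sketch below shows the idea, though the reported numbers may also reflect overhead from the actual evaluation harness.

import torch

# Sketch: measure peak VRAM around model loading and a few generation calls.
torch.cuda.reset_peak_memory_stats()

# ... load the model and run generation as in the Usage section below ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gib:.2f} GB")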

Model Compression

  • Compression Ratio: 3.75×
  • Parameter Reduction: 73.4%
  • From: 8.29B parameters
  • To: 2.21B parameters
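
These figures follow directly from the parameter counts:

teacher_params = 8.29e9
student_params = 2.21e9

compression_ratio = teacher_params / student_params        # ≈ 3.75×
parameter_reduction = 1 - student_params / teacher_params  # ≈ 0.734, i.e. 73.4%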

Usage

Installation

pip install transformers torch pillow qwen-vl-utils

Inference Example

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_name = "YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Prepare image and question
image = Image.open("egocentric_image.jpg")
question = "What object am I holding in my right hand?"

# Create conversation template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Question: {question}\nAnswer concisely:"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False
    )

# Decode response
response = processor.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"Answer: {response}")

Intended Use

Primary Use Cases

  • Egocentric Visual Question Answering: Answer questions about first-person perspective images/videos
  • Edge Device Deployment: Run VLM inference on resource-constrained hardware (mobile, IoT, AR/VR)
  • Real-time Assistive Systems: Power low-latency visual assistants for wearable cameras
  • Smart Glasses Applications: Enable efficient VLM capabilities on AR/VR headsets

Supported Question Types

  1. Object Recognition: "What object did I just pick up?"
  2. Spatial Reasoning: "Where is the nearest door?"
  3. Action Understanding: "What action am I performing?"
  4. Temporal Queries: "What was I looking at 5 seconds ago?"
  5. Environment Understanding: "What room am I in?"
  6. Counting: "How many items are on the table?"
  7. Attribute Recognition: "What color is the object I'm holding?"

Limitations

  • Model is specialized for egocentric scenarios and may perform worse on third-person images
  • Trained on synthetic data; real-world performance may vary
  • No additional multimodal pretraining; the student relies solely on knowledge distillation from the teacher
  • May inherit biases from the teacher model (Qwen2-VL-7B)
  • Limited to short-form QA; not optimized for long conversations

Ethical Considerations

  • Privacy: Egocentric images often contain sensitive personal information. Ensure proper consent and data protection.
  • Bias: Model may exhibit biases from training data and teacher model. Evaluate on diverse datasets.
  • Misuse: Could be used for unauthorized surveillance. Deploy responsibly with user consent.

Citation

If you use this model in your research, please cite:

@misc{firstsight2024,
  title={FirstSight: Efficient Knowledge Distillation for Vision-Language Models on Edge Devices},
  author={NYU HPML Project Team},
  year={2024},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled}},
  note={Distilled from Qwen2-VL-7B-Instruct for egocentric question answering}
}

Model Card Authors

NYU High Performance Machine Learning (HPML) Project Team

Model Card Contact

For questions or feedback, please open an issue on the GitHub repository.


Training Date: December 8-9, 2024
Evaluation Date: December 9, 2025
Framework: PyTorch 2.3.0, Transformers 4.57.3, BitsAndBytes 0.48.2
