Gameplay Vision LLM Adapters

Trained adapters for multimodal gameplay video understanding, designed to work with Qwen3-VL-8B-Instruct.

Contents

File                  Description                                                     Size
projector_weights.pt  Multimodal projectors (SigLIP, VideoMAE, Wav2Vec2 → 4096-dim)   ~120MB
lora_adapter/         LoRA adapter fine-tuned on gameplay Q&A                         ~50MB

Usage

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the base model and its processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

# Apply the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "cjm249/gameplay-vision-llm-adapters",
    subfolder="lora_adapter",
)

# Download and load the projector weights from this repo
projector_path = hf_hub_download("cjm249/gameplay-vision-llm-adapters", "projector_weights.pt")
projector_weights = torch.load(projector_path, map_location="cpu")
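
The checkpoint's exact key layout is not documented in this card. As a minimal sketch, assuming projector_weights.pt stores one state dict per encoder under the keys "siglip", "videomae", and "wav2vec2" (hypothetical names) and that the projector hidden width equals the 4096-dim output, the projectors from the Architecture section could be rebuilt and loaded like this:

import torch.nn as nn

def make_projector(input_dim, hidden_dim=4096, out_dim=4096):
    # Linear -> GELU -> Linear, per the Architecture section
    # (hidden_dim=4096 is an assumption; verify against the checkpoint shapes)
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.GELU(),
        nn.Linear(hidden_dim, out_dim),
    )

projectors = {
    "siglip": make_projector(1152),    # SigLIP features
    "videomae": make_projector(768),   # VideoMAE features
    "wav2vec2": make_projector(1024),  # Wav2Vec2 features
}

# Load each projector's weights (key names here are hypothetical)
for name, module in projectors.items():
    module.load_state_dict(projector_weights[name])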

Architecture

  • Base Model: Qwen3-VL-8B-Instruct
  • LoRA Config: r=16, alpha=32, target_modules=[q_proj, k_proj, v_proj, o_proj] (see the sketch after this list)
  • Projectors: Linear → GELU → Linear (input_dim → 4096)
    • SigLIP: 1152 → 4096
    • VideoMAE: 768 → 4096
    • Wav2Vec2: 1024 → 4096
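
For reference, the LoRA settings above map onto a peft LoraConfig roughly like this (a sketch: lora_dropout and task_type are not stated in this card and are assumptions):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,      # assumption: not specified in this card
    task_type="CAUSAL_LM",  # assumption: standard causal-LM task type
)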

Training

  • Trained on gameplay Q&A pairs from visual novel/RPG footage
  • Projectors trained with generative alignment (frozen LLM; see the sketch after this list)
  • LoRA fine-tuned on instruction-following gameplay questions
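
A minimal sketch of that two-stage recipe, reusing the hypothetical projectors dict from the Usage section (the optimizer and learning rate are assumptions, not values from this card):

# Stage 1: generative alignment. Freeze the LLM and train only the projectors.
for p in model.parameters():
    p.requires_grad = False
projector_params = [p for proj in projectors.values() for p in proj.parameters()]
optimizer = torch.optim.AdamW(projector_params, lr=1e-4)  # lr is an assumption

# Stage 2: LoRA fine-tuning on instruction-following gameplay questions.
# (peft marks only the LoRA weights trainable once the adapter is attached.)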

GitHub Repository

Full codebase: https://github.com/chasemetoyer/gameplay-vision-llm

License

Apache 2.0
