About
This project provides an image captioning model trained on the visual-layer/imagenet-1k-vl-enriched dataset. The architecture combines a ViT backbone (timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k) for image feature extraction with a GPT-2 language model (openai-community/gpt2) for text generation.
A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, so that GPT-2 can condition its caption generation on the image.
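For intuition, the sketch below shows one way such a projection layer can be built, assuming the ViT outputs a single pooled feature vector and GPT-2 uses 768-dimensional token embeddings; the dimensions, prefix length, and layer design here are illustrative assumptions, not the exact configuration used in this repository:

```python
import torch
import torch.nn as nn


class ImageProjection(nn.Module):
    """Maps a pooled ViT feature vector into the GPT-2 embedding space.

    All sizes below are illustrative assumptions, not the repository's
    actual configuration.
    """

    def __init__(self, vit_dim: int = 512, gpt2_dim: int = 768, num_prefix_tokens: int = 8):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.gpt2_dim = gpt2_dim
        # Expand one pooled image vector into a short sequence of "prefix"
        # embeddings that GPT-2 can attend to while generating the caption.
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt2_dim * num_prefix_tokens),
            nn.GELU(),
            nn.Linear(gpt2_dim * num_prefix_tokens, gpt2_dim * num_prefix_tokens),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vit_dim) -> (batch, num_prefix_tokens, gpt2_dim)
        batch = image_features.shape[0]
        return self.proj(image_features).view(batch, self.num_prefix_tokens, self.gpt2_dim)
```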
How to use
To run this app, follow these steps:
Install dependencies
This project uses uv for fast dependency management. To install all dependencies, run:
uv sync
Run inference
To test the model and generate captions, run:
uv run inference.py
This will process your input images and output captions using the trained model.
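For reference, a minimal inference sketch is shown below, assuming the standard timm and transformers APIs and reusing the hypothetical ImageProjection module sketched above; the actual inference.py, checkpoint file names, and helper functions in this repository may differ:

```python
import timm
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Vision backbone as a feature extractor (no classification head).
vit = timm.create_model(
    "vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k",
    pretrained=True,
    num_classes=0,
)
vit.eval()

# Preprocessing that matches the backbone's pretraining configuration.
cfg = timm.data.resolve_model_data_config(vit)
transform = timm.data.create_transform(**cfg, is_training=False)

# Language model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
gpt2.eval()

# The trained projection weights would be loaded here; "projection.pt" is a
# hypothetical file name, not a file known to exist in this repository.
# projection = ImageProjection()
# projection.load_state_dict(torch.load("projection.pt"))

image = Image.open("example.jpg").convert("RGB")  # placeholder input image
with torch.no_grad():
    features = vit(transform(image).unsqueeze(0))      # (1, vit_dim)
    # prefix = projection(features)                    # (1, num_prefix_tokens, gpt2_dim)
    # caption_ids = gpt2.generate(inputs_embeds=prefix, max_new_tokens=32)
    # print(tokenizer.decode(caption_ids[0], skip_special_tokens=True))
```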
Example
Input: (example image, not shown here)
Output: a boy holding a fish in the woods