About

This project provides an image captioning model trained on the visual-layer/imagenet-1k-vl-enriched dataset. The architecture combines a ViT backbone (timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k) for image feature extraction with a GPT-2 language model (openai-community/gpt2) for text generation.

A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, so the two modalities can be combined in a single model.

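For concreteness, here is a minimal sketch of how these pieces can fit together. The class name, the single-linear-layer projection, and the one-prefix-token design are illustrative assumptions; the actual implementation in this repo may differ.

```python
import timm
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class CaptionModel(nn.Module):
    """Illustrative wiring of the ViT backbone, projection, and GPT-2."""

    def __init__(self):
        super().__init__()
        # ViT backbone; num_classes=0 makes timm return pooled features.
        self.backbone = timm.create_model(
            "vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k",
            pretrained=True,
            num_classes=0,
        )
        self.lm = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
        # Custom projection: map ViT features into GPT-2's embedding space.
        self.proj = nn.Linear(self.backbone.num_features, self.lm.config.n_embd)

    def encode_image(self, pixels: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixels)         # (batch, num_features)
        return self.proj(feats).unsqueeze(1)  # (batch, 1, n_embd) prefix embedding
```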
How to use

To use this model, follow these steps:

1. Install dependencies

This project uses uv for fast dependency management. To install all dependencies, run:

uv sync

2. Run inference

To test the model and generate captions, run:

uv run inference.py

This will process your input images and output captions using the trained model.
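As a rough illustration of the inference flow, the sketch below greedily decodes a caption from the projected image features. It reuses the hypothetical CaptionModel from the sketch above and assumes trained weights are already loaded; the real inference.py may preprocess inputs and decode differently.

```python
import timm
import torch
from PIL import Image
from transformers import GPT2Tokenizer

model = CaptionModel()  # hypothetical class from the sketch above
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

# Preprocess with the backbone's own data config (384x384 for this ViT).
cfg = timm.data.resolve_model_data_config(model.backbone)
transform = timm.data.create_transform(**cfg, is_training=False)
pixels = transform(Image.open("test.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    embeds = model.encode_image(pixels)  # (1, 1, n_embd) image prefix
    generated = []
    for _ in range(30):  # greedy decoding, capped at 30 tokens
        logits = model.lm(inputs_embeds=embeds).logits
        next_id = logits[:, -1, :].argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        # Feed the new token back in as its GPT-2 embedding.
        next_embed = model.lm.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)

print(tokenizer.decode(generated))  # e.g. a short caption string
```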

Example

Input

test image

Output

a boy holding a fish in the woods
