SteerViT: Steerable Visual Representations

SteerViT equips pretrained Vision Transformers with steerable visual representations.
Given an image and a natural-language prompt, it conditions the visual encoder through lightweight gated cross-attention to produce:

  • prompt-aware global embeddings
  • prompt-aware dense patch features
  • prompt-conditioned heatmaps
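The conditioning mechanism can be illustrated with a minimal sketch. The code below assumes a Flamingo-style tanh-gated cross-attention in which frozen patch features attend to prompt token embeddings; the function name, single-head formulation, and scalar gate are illustrative assumptions, and SteerViT's exact layer design may differ. A zero-initialized gate makes the layer an identity over the frozen backbone features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(patches, text_tokens, Wq, Wk, Wv, gate):
    # patches:     (N, d) frozen ViT patch features (queries)
    # text_tokens: (T, d) prompt token embeddings (keys/values)
    # gate: learned scalar; tanh(gate) scales the update, so a
    #       zero-initialized gate leaves the frozen features untouched
    q = patches @ Wq
    k = text_tokens @ Wk
    v = text_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return patches + np.tanh(gate) * (attn @ v)
```

Because the update is purely additive and gated, the pretrained backbone's features are preserved at initialization and only gradually steered toward the prompt during training.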

This Hugging Face repository hosts the model checkpoints only.

For full documentation, installation instructions, demos, and updates, please see:

Available checkpoints

  • steervit_dinov2_base.pth — SteerDINOv2-Base
  • steervit_mae_base.pth — SteerMAE-Base

Quick start

```python
import torch
from PIL import Image
from steervit import SteerViT

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SteerViT.from_pretrained("steervit_dinov2_base.pth", device=device)
transform = model.get_transforms()

image = Image.open("path/to/image.jpg").convert("RGB")
image_tensor = transform(image).unsqueeze(0)

prompt = ["the red car"]

global_features = model.get_global_features(image_tensor, texts=prompt)  # pooled image embeddings
dense_features = model.get_dense_features(image_tensor, texts=prompt)  # patch-level visual features
heatmaps = model.get_heatmaps(image_tensor, texts=prompt)  # prompt-conditioned localization heatmaps
attention_heatmaps = model.get_attention_heatmaps(image_tensor, texts=prompt)  # attention-based heatmaps
```
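For visualization, a patch-grid heatmap is typically upsampled to pixel resolution and normalized before overlaying it on the input image. The sketch below is a minimal, self-contained example; the helper name, the assumption that heatmaps come back at the patch grid's resolution, and the patch size of 14 (DINOv2's default) are assumptions, not part of the SteerViT API.

```python
import numpy as np

def upsample_and_normalize(heatmap, patch=14):
    # Nearest-neighbor upsample a (Hp, Wp) patch-grid heatmap to
    # (Hp*patch, Wp*patch) pixels, then min-max normalize to [0, 1]
    # so it can be alpha-blended over the input image.
    up = np.kron(heatmap, np.ones((patch, patch)))
    lo, hi = up.min(), up.max()
    return (up - lo) / (hi - lo + 1e-8)
```

Bilinear interpolation (e.g. `torch.nn.functional.interpolate`) gives smoother overlays; nearest-neighbor is used here only to keep the example dependency-free.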

If texts=None, SteerViT behaves like the underlying frozen ViT backbone and returns query-agnostic features.

