What is the resolution of the input image?

#10

by coincheung - opened Aug 29

Aug 29

I do not have access to this repo, but can you tell me what a good resolution is for extracting image embeddings from this vit-7b model? Is it 224x224?

cipherself

Oct 2

From the model card:

These models follow a ViT architecture, with a patch size of 16. For a 224x224 image, this results in 1 class token + 4 register tokens + 196 patch tokens = 201 tokens (for DINOv2 with registers this resulted in 1 + 4 + 256 = 261 tokens).

The models can accept larger images provided the image shapes are multiples of the patch size (16). If this condition is not verified, the model will crop to the closest smaller multiple of the patch size.
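
To make the arithmetic in the quote concrete, here is a minimal sketch that reproduces the token counts and the "closest smaller multiple" rule. The helper names `cropped_size` and `token_count` are hypothetical and not part of any library. The patch size (16) and the 4 register tokens come from the quoted model card; the patch size of 14 used for the DINOv2 comparison is my assumption, chosen because it matches the 256 patch tokens in the quote. The sketch only computes sizes, since the quote does not say which edge the model crops from.

```python
PATCH_SIZE = 16          # from the quoted model card
NUM_REGISTER_TOKENS = 4  # from the quoted model card


def cropped_size(height, width, patch_size=PATCH_SIZE):
    """Round each side down to the closest multiple of the patch size."""
    return (height // patch_size) * patch_size, (width // patch_size) * patch_size


def token_count(height, width, patch_size=PATCH_SIZE, num_registers=NUM_REGISTER_TOKENS):
    """1 class token + register tokens + one token per patch after cropping."""
    h, w = cropped_size(height, width, patch_size)
    num_patches = (h // patch_size) * (w // patch_size)
    return 1 + num_registers + num_patches


print(token_count(224, 224))                 # 201 = 1 + 4 + 196
print(token_count(224, 224, patch_size=14))  # 261 = 1 + 4 + 256 (assumed DINOv2 patch size)
print(cropped_size(250, 300))                # (240, 288)
print(token_count(250, 300))                 # 275 = 1 + 4 + 15 * 18
```

So 224x224 is a valid input size, and any larger size works as long as both sides are multiples of 16; otherwise the extra pixels are cropped away.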
