What is the resolution of the input image?

#10
by coincheung - opened

I do not have access to this repo, but can you tell me what a good resolution is for extracting image embeddings from this ViT-7B model? Is it 224x224?

From the model card:

These models follow a ViT architecture, with a patch size of 16. For a 224x224 image, this results in 1 class token + 4 register tokens + 196 patch tokens = 201 tokens (for DINOv2 with registers this resulted in 1 + 4 + 256 = 261 tokens).
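To make the arithmetic in that paragraph concrete, here is a minimal sketch that reproduces the token counts. The function name `token_count` and its parameters are made up for illustration and are not part of the model's API; it also assumes the image sides are already multiples of the patch size.

```python
def token_count(height, width, patch_size=16, num_registers=4):
    """Hypothetical helper: 1 class token + register tokens + patch tokens."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    # One patch token per non-overlapping patch_size x patch_size tile.
    patches = (height // patch_size) * (width // patch_size)
    return 1 + num_registers + patches


print(token_count(224, 224))                 # 1 + 4 + 196 = 201
print(token_count(224, 224, patch_size=14))  # 1 + 4 + 256 = 261 (DINOv2 with registers)
```

The second call shows where the 261 comes from: with a patch size of 14, a 224x224 image gives a 16x16 grid of patches instead of 14x14.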

The models can accept larger images provided the image shapes are multiples of the patch size (16). If this condition is not verified, the model will crop to the closest smaller multiple of the patch size.
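If you want to know in advance what size the model would effectively see, a rough sketch of "closest smaller multiple of the patch size" looks like this. `cropped_size` is a hypothetical helper, not a function from the repo, and it only computes the resulting dimensions; the model card excerpt does not say which pixels are dropped (for example center vs. top-left crop), so that part is left out.

```python
def cropped_size(height, width, patch_size=16):
    """Hypothetical helper: round each side down to a multiple of patch_size."""
    new_height = (height // patch_size) * patch_size
    new_width = (width // patch_size) * patch_size
    return new_height, new_width


print(cropped_size(224, 224))  # already a multiple of 16, unchanged: (224, 224)
print(cropped_size(500, 333))  # (496, 320)
```

Resizing or cropping to a multiple of 16 yourself before calling the model avoids relying on this implicit crop.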
