SigLIP or SigLIP2 encoder?

#37

by orrzohar - opened Apr 2, 2025

Discussion

orrzohar

Apr 2, 2025

SigLIP or SigLIP2 encoder?

GopiUppari

Google org Apr 3, 2025

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Could you please refer to this reference?

Thank you.

orrzohar

Apr 3, 2025

Hi @GopiUppari ,
I am familiar with SigLIP.
However, the Gemma3 paper does not state whether SigLIP or SigLIP2 was used. It is impossible to tell from the config, because the architecture is the same and both are defined as siglip_vision_model.
Did Gemma3 utilize the SigLIP2 or SigLIP checkpoints?

Best,
Orr
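To illustrate the point about the config, here is a minimal sketch of reading the vision encoder's model type from a config file. The JSON below is a hand-written stand-in, not the real file: the nesting under `vision_config` and the helper name `vision_model_type` are assumptions, and only the string `siglip_vision_model` comes from this thread.

```python
import json

# Hand-written stand-in for the relevant part of a config.json.
# The "vision_config" nesting is an assumption made for illustration.
config_text = '{"vision_config": {"model_type": "siglip_vision_model"}}'


def vision_model_type(raw_json: str) -> str:
    """Return the vision encoder's model_type, or 'unknown' if it is absent."""
    config = json.loads(raw_json)
    return config.get("vision_config", {}).get("model_type", "unknown")


model_type = vision_model_type(config_text)

# The string names an architecture family only, so it cannot reveal
# which checkpoint (SigLIP or SigLIP 2) the weights came from.
print(model_type)
```

Since the field describes the architecture rather than the origin of the weights, the answer has to come from the authors or from inspecting the weights themselves.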

udaybondi

Apr 3, 2025

I'm also curious whether the siglip_vision_model's embeddings remain general purpose (i.e., frozen during Gemma training) or whether SigLIP has been finetuned to improve Gemma's performance.

orrzohar

Apr 3, 2025

@udaybondi I would be shocked if they kept the encoder frozen; everyone trains it nowadays.

Electric-Sheep

May 6, 2025

•

edited May 6, 2025

According to the Gemma3 paper, they used SigLIP instead of SigLIP 2, and they froze its weights during training for "simplicity". But it's not stated whether the weights they used are the same as those of the public SigLIP model.
https://arxiv.org/pdf/2503.19786
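One way to probe the open question about the weights would be to compare the vision encoder's parameters against a public checkpoint, tensor by tensor. The sketch below shows only the comparison logic on toy values: the tensor names and numbers are hypothetical, tensors are flattened to plain Python lists, and in practice the two sets of weights would first have to be loaded and their parameter names aligned.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def max_abs_diff(a: Weights, b: Weights) -> Dict[str, float]:
    """Largest elementwise absolute difference for each tensor name found in both."""
    report: Dict[str, float] = {}
    for name in sorted(a.keys() & b.keys()):
        if len(a[name]) != len(b[name]):
            # Different sizes: these cannot be the same tensor.
            report[name] = float("inf")
            continue
        report[name] = max(
            (abs(x - y) for x, y in zip(a[name], b[name])), default=0.0
        )
    return report


# Toy values with hypothetical tensor names, not real weights.
public = {"patch_embed.weight": [0.5, -0.25], "pos_embed": [1.0, 2.0]}
extracted = {"patch_embed.weight": [0.5, -0.25], "pos_embed": [1.0, 2.5]}

diffs = max_abs_diff(public, extracted)
identical = all(d == 0.0 for d in diffs.values())
print(diffs, identical)
```

Identical tensors would be consistent with an encoder initialised from that public checkpoint and kept frozen. Differences alone would not say whether a different checkpoint was used or the encoder was trained further.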

orrzohar

May 7, 2025

"We use a vision encoder based on SigLIP (Zhai et al., 2023)." That could mean SigLIP2, SigLIP, or even encoders from PaliGemma or similar...
