---
library_name: transformers
license: apache-2.0
datasets:
- Jialuo21/Science-T2I-Trainset
base_model:
- laion/CLIP-ViT-H-14-laion2B-s32B-b79K
---
<img src="teaser.png" align="center"> |
|
|
|
|
|
# SciScore |
|
|
SciScore is finetuned on the base model [CLIP-H](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) using [Science-T2I](https://huggingface.co/datasets/Jialuo21/Science-T2I-Trainset) dataset. It takes an implicit prompt and a generated image as input and outputs a score that represents the scientific alignment between them. |
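
Concretely, the score is the CLIP-style logit-scaled cosine similarity between the prompt and image embeddings, as computed in the Quick Start below:

$$
\mathrm{score}(t, i) = \exp(\tau) \cdot \frac{e_t \cdot e_i}{\lVert e_t \rVert \, \lVert e_i \rVert}
$$

where $e_t$ and $e_i$ are the text and image embeddings and $\exp(\tau)$ is the model's learned logit scale (`model.logit_scale.exp()`).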

## Resources

- [Website](https://jialuo-li.github.io/Science-T2I-Web/)
- [arXiv: Paper](https://arxiv.org/abs/2504.13129)
- [GitHub: Code](https://github.com/Jialuo-Li/Science-T2I)
- [Huggingface: Science-T2I-S&C Benchmark](https://huggingface.co/collections/Jialuo21/science-t2i-67d3bfe43253da2bc7cfaf06)
- [Huggingface: Science-T2I Trainset](https://huggingface.co/datasets/Jialuo21/Science-T2I-Trainset)

## Features

<img src="exp.png" align="center">
## Quick Start

```python
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

device = "cuda"
processor_name_or_path = "Jialuo21/SciScore"
model_pretrained_name_or_path = "Jialuo21/SciScore"

processor = AutoProcessor.from_pretrained(processor_name_or_path)
model = AutoModel.from_pretrained(model_pretrained_name_or_path).eval().to(device)

def calc_probs(prompt, images):
    # Preprocess the candidate images.
    image_inputs = processor(
        images=images,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)

    # Tokenize the implicit prompt.
    text_inputs = processor(
        text=prompt,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        # Embed and L2-normalize the images and the prompt.
        image_embs = model.get_image_features(**image_inputs)
        image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)

        text_embs = model.get_text_features(**text_inputs)
        text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)

        # Logit-scaled cosine similarities, converted to a distribution over images.
        scores = model.logit_scale.exp() * (text_embs @ image_embs.T)[0]
        probs = torch.softmax(scores, dim=-1)

    return probs.cpu().tolist()

pil_images = [Image.open("./examples/camera_1.png"), Image.open("./examples/camera_2.png")]
prompt = "A camera screen without electricity sits beside the window, realistic."
print(calc_probs(prompt, pil_images))
```
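
Note that `calc_probs` applies a softmax across the input images, so it returns a relative probability per image (the values sum to 1). In the two-image example above, the image with the higher probability is the one SciScore judges more scientifically aligned with the prompt.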

## Citation

```
@misc{li2025sciencet2iaddressingscientificillusions,
  title={Science-T2I: Addressing Scientific Illusions in Image Synthesis},
  author={Jialuo Li and Wenhao Chai and Xingyu Fu and Haiyang Xu and Saining Xie},
  year={2025},
  eprint={2504.13129},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.13129},
}
```