jongwooko/Flex-VL-7B

@@ -1,198 +1,216 @@
 ---
 library_name: transformers
-tags: []
 ---
 # Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
-[Flex-Judge](https://arxiv.org/abs/2505.18601)
-## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
 <!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 # Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
+[Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)
+**Flex-VL-7B** is a vision-language model developed as part of the Flex-Judge framework, designed to perform robust evaluation of multimodal content using primarily text-only reasoning. Despite being trained with minimal supervision, it generalizes effectively to complex image- and video-based evaluation tasks, enabling consistent and interpretable judgments across diverse multimodal inputs.
 ### Model Description
+- We propose **Flex-Judge**, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
+- Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable, multimodal model-as-a-judge.
+### Model Sources
 <!-- Provide the basic links for the model. -->
+- **Repository:** https://github.com/jongwooko/flex-judge
+- **Paper:** [Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+For more comprehensive usage examples and implementation details, please refer to our official repository.
+### Requirements
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+```
+pip install git+https://github.com/huggingface/transformers accelerate
+pip install "qwen-vl-utils[decord]==0.0.8"
+pip install vllm
+pip install datasets
+```
+### Using 🤗 Transformers to Chat
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
+```
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+from datasets import load_dataset
+import torch
+# default: Load the model on the available device(s)
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "jongwooko/Flex-VL-7B", torch_dtype="auto", device_map="auto"
+)
+# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
+# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+#     "jongwooko/Flex-VL-7B",
+#     torch_dtype=torch.bfloat16,
+#     attn_implementation="flash_attention_2",
+#     device_map="auto",
+# )
+# default processor
+processor = AutoProcessor.from_pretrained("jongwooko/Flex-VL-7B")
+# Example
+example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
+question, image = example["query"], example["image"]
+answer1, answer2 = example["response"]
+# System prompt for Flex-Judge
+SYSTEM_PROMPT = (
+    "You are a helpful assistant. The assistant first performs a detailed, "
+    "step-by-step reasoning process in its mind and then provides the user with "
+    "the answer. The reasoning process and answer are enclosed within <think> "
+    "reasoning process here, explaining each step of your evaluation for both "
+    "assistants </think><answer> answer here </answer>. Now the user asks you "
+    "to judge the performance of two AI assistants in response to the question. "
+    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
+    "relevance, accuracy, and level of detail. Avoid order, length, style or "
+    "other bias. After thinking, when you finally reach a conclusion, clearly "
+    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
+    "example, <answer>3</answer><answer>5</answer>"
+)
+messages = [
+    {
+        "role": "system", "content": SYSTEM_PROMPT
+    },
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": image,
+            },
+            # f-string so that question, answer1 and answer2 are actually filled in
+            {"type": "text", "text": f"[Question]\n{question}\n\n[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"},
+        ]
+    },
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text+"\n<think>\n\n"],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=4096)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
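The system prompt asks the judge to put its final scores in `<answer> </answer>` tags, so the generated text still has to be parsed. The following is a minimal sketch of one way to do that, not the authors' evaluation code: the helper names `parse_scores` and `preferred_assistant` are hypothetical, and it assumes the scores are plain numbers that appear after the closing `</think>` tag.

```python
import re

# Matches one <answer>...</answer> tag holding a number (format taken from the system prompt).
ANSWER_TAG = re.compile(r"<answer>\s*(\d+(?:\.\d+)?)\s*</answer>")


def parse_scores(output_text):
    """Return the numeric scores found in <answer> tags after the reasoning block."""
    # Keep only the text after </think> so scores quoted inside the reasoning are skipped.
    final_part = output_text.split("</think>")[-1]
    return [float(score) for score in ANSWER_TAG.findall(final_part)]


def preferred_assistant(output_text):
    """Return 1 or 2 for the higher-scored assistant, or None for a tie or unparsable output."""
    scores = parse_scores(output_text)
    if len(scores) != 2 or scores[0] == scores[1]:
        return None
    return 1 if scores[0] > scores[1] else 2


sample = "<think>\nAssistant 2 is more accurate.\n</think><answer>3</answer><answer>5</answer>"
print(parse_scores(sample))         # [3.0, 5.0]
print(preferred_assistant(sample))  # 2
```

With the snippet above, `output_text` is a list with one string per prompt, so `preferred_assistant(output_text[0])` would give the verdict for the example.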
+### Using vLLM
+We recommend using `vllm` instead of `transformers` for faster inference. The results in our paper are based on the `vllm` library.
+```
+from transformers import AutoProcessor
+from datasets import load_dataset
+from vllm import LLM, SamplingParams
+# default: Load the model on the available device(s)
+llm = LLM(
+    "jongwooko/Flex-VL-7B",
+    tensor_parallel_size=4,  # number of GPUs to shard the model across; adjust to your hardware
+    limit_mm_per_prompt={"image": 1},  # maximum number of images accepted per prompt
+)
+sampling_params = SamplingParams(
+    max_tokens=4096,
+    temperature=0.2,
+    top_p=0.95,
+)
+# default processor
+processor = AutoProcessor.from_pretrained("jongwooko/Flex-VL-7B", use_fast=True)
+# Example
+example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
+question, image = example["query"], example["image"]
+answer1, answer2 = example["response"]
+# System prompt for Flex-Judge
+SYSTEM_PROMPT = (
+    "You are a helpful assistant. The assistant first performs a detailed, "
+    "step-by-step reasoning process in its mind and then provides the user with "
+    "the answer. The reasoning process and answer are enclosed within <think> "
+    "reasoning process here, explaining each step of your evaluation for both "
+    "assistants </think><answer> answer here </answer>. Now the user asks you "
+    "to judge the performance of two AI assistants in response to the question. "
+    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
+    "relevance, accuracy, and level of detail. Avoid order, length, style or "
+    "other bias. After thinking, when you finally reach a conclusion, clearly "
+    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
+    "example, <answer>3</answer><answer>5</answer>"
+)
+messages = [
+    {
+        "role": "system", "content": SYSTEM_PROMPT
+    },
+    {
+        "role": "user",
+        "content": [
+            # f-string so that question, answer1 and answer2 are actually filled in
+            {"type": "text", "text": f"<|vision_start|><|image_pad|><|vision_end|>\n\n[Question]\n{question}\n\n[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"},
+        ]
+    },
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+inputs = {"prompt": text, "multi_modal_data": {"image": [image]}}
+# Inference: Generation of the output
+outputs = llm.generate([inputs], sampling_params=sampling_params)
+output_text = outputs[0].outputs[0].text
+print(output_text)
+```
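The two examples use the same pairwise layout for the user turn. The only difference is that the vLLM example puts the image placeholder tokens in front of the text, while the `transformers` example passes the image as a separate content item. As a rough sketch, the shared layout could be factored into one helper; `build_judge_text`, `PAIRWISE_TEMPLATE` and `IMAGE_PLACEHOLDER` are hypothetical names introduced here for illustration, and the strings are copied from the examples above.

```python
# Hypothetical helper: the template text is copied from the examples in this card.
PAIRWISE_TEMPLATE = (
    "[Question]\n{question}\n\n"
    "[Assistant 1's Answer]\n{answer1}\n\n"
    "[Assistant 2's Answer]\n{answer2}"
)
# Placeholder tokens that the vLLM example puts in front of the text for a single image.
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>\n\n"


def build_judge_text(question, answer1, answer2, for_vllm=False):
    """Fill the pairwise template; prepend the image placeholder for the vLLM prompt."""
    body = PAIRWISE_TEMPLATE.format(question=question, answer1=answer1, answer2=answer2)
    return IMAGE_PLACEHOLDER + body if for_vllm else body


print(build_judge_text("What is shown?", "A cat.", "A dog.", for_vllm=True))
```

The returned string would go into the `"text"` field of the user message, with `for_vllm=True` for the vLLM prompt and the default for the `transformers` one.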
+## Citation
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
+```
+@article{ko2025flex,
+  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
+  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
+  journal={arXiv preprint arXiv:2505.18601},
+  year={2025}
+}
+```