xiangan and nielsr (HF Staff) committed
Commit 9724845 · verified · 1 Parent(s): 1b92d6b

Improve model card: Add paper/code/demo links, sample usage, update title & citations (#1)


- Improve model card: Add paper/code/demo links, sample usage, update title & citations (31c1d620f70613996cb9f15f7c71fbd52cb517dd)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +398 -17
README.md CHANGED
@@ -1,16 +1,41 @@
1
  ---
2
- license: apache-2.0
3
- datasets:
4
- - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
5
  base_model:
6
  - DeepGlint-AI/rice-vit-large-patch14-560
7
  - Qwen/Qwen3-4B-Instruct-2507
8
- pipeline_tag: image-text-to-text
 
9
  library_name: transformers
10
  ---
11
- # LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM Model
12
 
13
- **LLaVA-OneVision1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
14
 
15
  - **Superior Performance**
16
  A family of fully open-source large multimodal models demonstrating
@@ -18,13 +43,12 @@ A family of fully open-source large multimodal models demonstrating
18
  - outperforming **Qwen2.5-VL** in most evaluation tasks.
19
 
20
  - **High-Quality Data at Scale**
21
- Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
22
  - Concept-balanced, highly diverse, high-quality caption data
23
  - Comprehensive instruction fine-tuning data covering a wide range of tasks
24
 
25
  - **Ultra-Efficient Training Framework**: a complete end-to-end training framework designed for maximum efficiency:
26
  - $16,000 total budget for full model training on A100 GPUs ($0.60 per GPU-hour)
27
- - 45% HFU efficiency in 8k context length
28
  - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
29
  - Optimized codebase for cost-effective scaling
30
 
@@ -35,18 +59,375 @@ Meticulously curated **pre-training and SFT data** with rigorous filtering and q
35
  - Training recipes & configurations
36
  - Comprehensive training logs & metrics
37
 
38
  ## Citation
39
 
40
  If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
41
 
42
  ```
43
- @misc{an2025llavaonevision15fullyopenframework,
44
- title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
45
- author={Xiang An and Yin Xie and Kaicheng Yang and Wenkang Zhang and Xiuwei Zhao and Zheng Cheng and Yirui Wang and Songcen Xu and Changrui Chen and Chunsheng Wu and Huajie Tan and Chunyuan Li and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
46
- year={2025},
47
- eprint={2509.23661},
48
- archivePrefix={arXiv},
49
- primaryClass={cs.CV},
50
- url={https://arxiv.org/abs/2509.23661},
51
  }
52
- ```
1
  ---
2
  base_model:
3
  - DeepGlint-AI/rice-vit-large-patch14-560
4
  - Qwen/Qwen3-4B-Instruct-2507
5
+ datasets:
6
+ - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
7
  library_name: transformers
8
+ license: apache-2.0
9
+ pipeline_tag: image-text-to-text
10
+ language: en
11
  ---
 
12
 
13
+ <p align="center">
14
+ <picture>
15
+ <source media="(prefers-color-scheme: dark)" srcset="https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/raw/main/asset/llava_onevision_black.png">
16
+ <source media="(prefers-color-scheme: light)" srcset="https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/raw/main/asset/llava_onevision_white.png">
17
+ <img alt="LLaVA-OneVision 1.5" src="https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/raw/main/asset/llava_onevision_white.png" width="600" style="max-width: 100%;">
18
+ </picture>
19
+ </p>
20
+
21
+ # LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
22
+
23
+ **LLaVA-OneVision1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
24
+
25
+ **Paper**: [LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training](https://huggingface.co/papers/2509.23661)
26
+
27
+ **Code**: [https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)
28
+
29
+ **Demo**: [https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)
30
+
31
+ ---
32
+
33
+ ## NEWS
34
+ - 2025-09-30: Released a comprehensive [Offline Data Pack documentation](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/tree/main/examples_offline_packing).
35
+ - 2025-09-30: Released the LLaVA-OneVision-1.5 [Technical Report](https://arxiv.org/abs/2509.23661).
36
+
37
+ ## Introduction
38
+ **LLaVA-OneVision1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
39
 
40
  - **Superior Performance**
41
  A family of fully open-source large multimodal models demonstrating
 
43
  - outperforming **Qwen2.5-VL** in most evaluation tasks.
44
 
45
  - **High-Quality Data at Scale**
46
+ Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control.
47
  - Concept-balanced, highly diverse, high-quality caption data
48
  - Comprehensive instruction fine-tuning data covering a wide range of tasks
49
 
50
  - **Ultra-Efficient Training Framework**: a complete end-to-end training framework designed for maximum efficiency:
51
  - $16,000 total budget for full model training on A100 GPUs ($0.60 per GPU-hour)
 
52
  - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
53
  - Optimized codebase for cost-effective scaling
54
 
 
59
  - Training recipes & configurations
60
  - Comprehensive training logs & metrics
61
 
62
+
63
+ ## Models
64
+
65
+ | Model | HF Link | Training Log |
66
+ |--------------------------|--------------------------------------------------------------------------------------------------------|-------------|
67
+ | LLaVA-OV-1.5-4B-Instruct | [🤗 HF / 4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard) |
68
+ | LLaVA-OV-1.5-8B-Instruct | [🤗 HF / 8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |
69
+ | LLaVA-OV-1.5-4B-Base | [🤗 HF / 4B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Base) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |
70
+ | LLaVA-OV-1.5-8B-Base | [🤗 HF / 8B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Base) | Uploading… |
71
+ ## Datasets
72
+
73
+ ![Dataset Visualization](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/raw/main/asset/dataset.jpg)
74
+ <p align="left">
75
+ <strong>(a)</strong> The vocabulary coverage proportion in the LLaVA-OneVision-1.5 Mid-Training dataset before and after concept balancing.
76
+ <strong>(b)</strong> Distribution of data sources within the LLaVA-OneVision-1.5 Mid-Training dataset.
77
+ <strong>(c)</strong> Distribution of data sources within the LLaVA-OneVision-1.5 Instruct dataset.
78
+ </p>
79
+
80
+ | Description | Link | Status |
81
+ |--------------------|--------------------------------------------------------------------------------------------------------|-------------|
82
+ | LLaVA-OV-1.5-Mid-Training-85M | [🤗 HF / Mid-Training 85M](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
83
+ | LLaVA-OV-1.5-Instruct | [🤗 HF / Insturct-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |
84
+
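+ Once the upload finishes, the data can presumably be loaded with the standard `datasets` API. A minimal, hedged sketch (streaming to avoid materializing the full corpus locally; the exact field names are not verified here):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the mid-training data instead of downloading everything up front;
+ # the dataset is large and is marked above as still uploading.
+ ds = load_dataset(
+     "lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M",
+     split="train",
+     streaming=True,
+ )
+ sample = next(iter(ds))
+ print(sample.keys())  # inspect the actual schema once the upload completes
+ ```
+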
85
+
86
+ ## Evaluation Results
87
+
88
+
89
+ All evaluations were conducted using lmms_eval.
90
+
91
+ ![](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/raw/main/asset/performance.png)
92
+
93
+
94
+ ## Quick Start with HuggingFace
95
+
96
+ ```python
97
+ from transformers import AutoProcessor, AutoModelForCausalLM
98
+ from qwen_vl_utils import process_vision_info
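+ # qwen_vl_utils is shipped separately: pip install qwen-vl-utils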
99
+ model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
100
+
101
+ # default: Load the model on the available device(s)
102
+ model = AutoModelForCausalLM.from_pretrained(
103
+ model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
104
+ )
105
+
106
+ # default processor
107
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
108
+
109
+ messages = [
110
+ {
111
+ "role": "user",
112
+ "content": [
113
+ {
114
+ "type": "image",
115
+ "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
116
+ },
117
+ {"type": "text", "text": "Describe this image."},
118
+ ],
119
+ }
120
+ ]
121
+
122
+ # Preparation for inference
123
+ text = processor.apply_chat_template(
124
+ messages, tokenize=False, add_generation_prompt=True
125
+ )
126
+ image_inputs, video_inputs = process_vision_info(messages)
127
+ inputs = processor(
128
+ text=[text],
129
+ images=image_inputs,
130
+ videos=video_inputs,
131
+ padding=True,
132
+ return_tensors="pt",
133
+ )
134
+ inputs = inputs.to("cuda")
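+ # A CUDA device is assumed above; adjust the device string for other setups.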
135
+
136
+ # Inference: Generation of the output
137
+ generated_ids = model.generate(**inputs, max_new_tokens=1024)
138
+ generated_ids_trimmed = [
139
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
140
+ ]
141
+ output_text = processor.batch_decode(
142
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
143
+ )
144
+ print(output_text)
145
+
146
+ ```
147
+
148
+ ## Evaluation
149
+ ```bash
150
+ # pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
151
+
152
+ accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
153
+ --model=llava_onevision1_5 \
154
+ --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
155
+ --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
156
+ --batch_size=1
157
+ ```
158
+
159
+ ## Quick Start Guide
160
+
161
+ ### 1. 🐳 Docker (Recommended)
162
+
163
+ We strongly recommend using the Docker environment for a seamless experience. The following instructions are tailored for an A100 80GB GPU environment.
164
+
165
+
166
+ ```bash
167
+ # Clone repository
168
+ git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
169
+ cd LLaVA-OneVision-1.5
170
+
171
+ docker build -t llava_megatron:25.04 .
172
+
173
+ # Run container with -w to set working directory directly to the mounted volume
174
+ docker run -it --gpus all \
175
+ --ipc host --net host --privileged --cap-add IPC_LOCK \
176
+ --ulimit memlock=-1 --ulimit stack=67108864 --rm \
177
+ -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
178
+ -w /workspace/LLaVA-OneVision-1.5 \
179
+ --name "llava_megatron_container" \
180
+ llava_megatron:25.04 /bin/bash
181
+ ```
182
+
183
+ ### 2. Checkpoint and Format Conversion
184
+
185
+ You have two options to get started with LLaVA-OneVision-1.5-stage-0:
186
+
187
+ #### Option 1: Download pre-trained model from HuggingFace
188
+ Download our `LLaVA-OneVision-1.5-4B-stage0` model directly from [HuggingFace](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-stage0).
189
+
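+ For example, with `huggingface_hub` (a minimal sketch; adjust `local_dir` to match the paths used in the commands below):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Fetch the stage-0 initialization weights referenced above.
+ snapshot_download(
+     repo_id="lmms-lab/LLaVA-OneVision-1.5-4B-stage0",
+     local_dir="LLaVA-OneVision-1.5-4B-stage0",
+ )
+ ```
+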
190
+ #### Option 2: Merge initial weights yourself
191
+ Alternatively, you can merge the initial weights from the original ViT and LLM:
192
+ ```bash
193
+ python ds/merge_model.py \
194
+ --vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
195
+ --llm_path Qwen/Qwen3-4B-Instruct-2507 \
196
+ --output LLaVA-OneVision-1.5-4B-stage0
197
+ ```
198
+ Note: When merging weights, the adapter component will be initialized with default values.
199
+
200
+ Convert the model from HuggingFace format to Megatron format:
201
+
202
+ ```bash
203
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 bash examples/llava_ov_1_5/convert/convert_4b_hf_to_mcore.sh \
204
+ LLaVA-OneVision-1.5-4B-stage0 \
205
+ LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
206
+ 1 1
207
+ ```
208
+
209
+ ### 3. Stage 1 Alignment-Training
210
+
211
+ Download the LLaVA-558K alignment data from [LLaVA-558K-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-558K-Webdataset).
212
+
213
+
214
+ ```bash
215
+ # ============================================================
216
+ # Required environment variables:
217
+ # AIAK_TRAINING_PATH Root directory of the AIAK-Training-LLM project
218
+ # DATA_PATH Directory with WebDataset shards (.tar) for pretraining
219
+ # TOKENIZER_PATH Hugging Face tokenizer directory
220
+ # CHECKPOINT_PATH Megatron-formatted checkpoint directory (e.g., mcore TP1/PP1)
221
+ # SAVE_CKPT_PATH Output directory for saving training checkpoints
222
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
223
+ DATA_PATH=LLaVA-558K-Webdataset \
224
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
225
+ CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
226
+ bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh
227
+ ```
228
+
229
+ ### 4. Stage 1.5 Mid-Training
230
+
231
+ Download our lightweight packed subset from [LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Mid-Training-Webdataset-Quick-Start-3M).
232
+
233
+ ```bash
234
+ # ============================================================
235
+ # Convert model to release format
236
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
237
+ stage_1_alignment_llava_ov_4b/iter_0002500/ \
238
+ stage_1_alignment_llava_ov_4b_release 1 1
239
+ # ============================================================
240
+ # Launch
241
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
242
+ DATA_PATH=LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset \
243
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
244
+ CHECKPOINT_PATH=stage_1_alignment_llava_ov_4b_release \
245
+ bash examples/llava_ov_1_5/quick_start/stage_1.5_mid_training_llava_ov_4b.sh
246
+ ```
247
+
248
+
249
+ ### 5. Stage 2 Instruct-Training
250
+
251
+ Download the LLaVA-NeXT-780K webdataset from [LLaVA-NeXT-780K Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-780k-webdataset).
252
+
253
+ ```bash
254
+ # ============================================================
255
+ # Convert model to release format
256
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
257
+ stage_1.5_mid_training_llava_ov_4b/iter_0020000/ \
258
+ stage_1.5_mid_training_llava_ov_4b_release 1 1
259
+ # ============================================================
260
+ # Launch
261
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
262
+ DATA_PATH=LLaVA-NeXT-780k-Webdataset \
263
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
264
+ CHECKPOINT_PATH=stage_1.5_mid_training_llava_ov_4b_release \
265
+ bash examples/llava_ov_1_5/quick_start/stage_2_instruct_llava_ov_4b.sh
266
+ ```
267
+
268
+
269
+ ### 6. Convert mcore Checkpoints to Hugging Face Format
270
+ ```bash
271
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
272
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_hf.sh \
273
+ stage_2_instruct_llava_ov_4b/iter_0003500 \
274
+ LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct \
275
+ 1 1
276
+ # Copy non-model files (e.g., tokenizer config) to the new directory
277
+ find LLaVA-OneVision-1.5-4B-stage0/ -type f -not -iname '*safetensors*' -exec cp {} LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct/ ';'
278
+ ```
279
+
280
+ ### 7. Evaluation
281
+ ```bash
282
+ # pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
283
+ CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch \
284
+ --num_processes=4 --main_process_port 12399 -m lmms_eval --model=llava_onevision1_5 --batch_size=1 --tasks=mme \
285
+ --model_args=pretrained=/workspace/LLaVA-OneVision-1.5/LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct,max_pixels=3240000
286
+ ```
287
+
288
+ ## Full Reproduction Guide
289
+
290
+ > [!TIP]
291
+ > More detailed reproduction steps for the complete process will be provided after the dataset upload is completed.
292
+
293
+
294
+ ### Mid-Training
295
+
296
+ To improve model training efficiency, we implement offline sample packing:
297
+
298
+ 1. Download the [**Mid-Training-85M Dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M)
299
+ 2. Pack the data into webdataset format; refer to [**Offline Packing Examples**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/tree/main/examples_offline_packing) and [**Offline Padding-Free Data Packing**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/tree/main/examples/llava_ov_1_5/sample_packing/README.md). A rough sketch is shown below.
300
+
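+ As a rough illustration of step 2, packing samples into webdataset shards might look like the sketch below (a hedged example using the `webdataset` package; `iter_samples` is a placeholder, and the repo's packing scripts, keys, and shard sizes may differ):
+
+ ```python
+ import json
+ import webdataset as wds
+
+ def iter_samples():
+     # Placeholder: yield (jpeg_bytes, caption) pairs from the downloaded dataset.
+     yield from []
+
+ # Write image/caption pairs into .tar shards that a webdataset loader can stream.
+ with wds.ShardWriter("mid_training-%06d.tar", maxcount=10000) as sink:
+     for idx, (jpeg_bytes, caption) in enumerate(iter_samples()):
+         sink.write({
+             "__key__": f"{idx:09d}",
+             "jpg": jpeg_bytes,                        # raw JPEG bytes
+             "json": json.dumps({"caption": caption}),
+         })
+ ```
+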
301
+
302
+ ### Instruct
303
+ 1. Download the [**LLaVA-OneVision-1.5-Insturct-Data**](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data)
304
+ 2. Convert the data into webdataset format; refer to [**Conversion for Mixed Instruction Data**](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5/blob/main/docs/sft_data_preprocessing.md)
305
+
306
+ ## Roadmap
307
+
308
+ Q4 2025 Key Deliverables:
309
+
310
+ 1. **Ultra-efficient MoE Training**
311
+ 2. **Full Video Input LLM**
312
+
313
+
314
+ ## Contributors
315
+ Thanks so much to all of our amazing contributors!
316
+
317
+ <!-- readme: collaborators,contributors,jiankangdeng/- -start -->
318
+ <table>
319
+ <tbody>
320
+ <tr>
321
+ <td align="center">
322
+ <a href="https://github.com/fdcp">
323
+ <img src="https://avatars.githubusercontent.com/u/15667917?v=4" width="80;" alt="fdcp"/>
324
+ <br />
325
+ <sub><b>fdcp</b></sub>
326
+ </a>
327
+ </td>
328
+ <td align="center">
329
+ <a href="https://github.com/anxiangsir">
330
+ <img src="https://avatars.githubusercontent.com/u/31175974?v=4" width="80;" alt="anxiangsir"/>
331
+ <br />
332
+ <sub><b>anxiangsir</b></sub>
333
+ </a>
334
+ </td>
335
+ <td align="center">
336
+ <a href="https://github.com/yiyexy">
337
+ <img src="https://avatars.githubusercontent.com/u/35927125?v=4" width="80;" alt="yiyexy"/>
338
+ <br />
339
+ <sub><b>yiyexy</b></sub>
340
+ </a>
341
+ </td>
342
+ <td align="center">
343
+ <a href="https://github.com/wideyard">
344
+ <img src="https://avatars.githubusercontent.com/u/101321826?v=4" width="80;" alt="wideyard"/>
345
+ <br />
346
+ <sub><b>wideyard</b></sub>
347
+ </a>
348
+ </td>
349
+ <td align="center">
350
+ <a href="https://github.com/chengzheng345">
351
+ <img src="https://avatars.githubusercontent.com/u/209475443?v=4" width="80;" alt="chengzheng345"/>
352
+ <br />
353
+ <sub><b>chengzheng345</b></sub>
354
+ </a>
355
+ </td>
356
+ <td align="center">
357
+ <a href="https://github.com/killTheHostage">
358
+ <img src="https://avatars.githubusercontent.com/u/16442720?v=4" width="80;" alt="killTheHostage"/>
359
+ <br />
360
+ <sub><b>killTheHostage</b></sub>
361
+ </a>
362
+ </td>
363
+ <td align="center">
364
+ <a href="https://github.com/mathCrazyy">
365
+ <img src="https://avatars.githubusercontent.com/u/20607153?v=4" width="80;" alt="mathCrazyy"/>
366
+ <br />
367
+ <sub><b>mathCrazyy</b></sub>
368
+ </a>
369
+ </td>
370
+ <td align="center">
371
+ <a href="https://github.com/yunglechao">
372
+ <img src="https://avatars.githubusercontent.com/u/7631185?v=4" width="80;" alt="yunglechao"/>
373
+ <br />
374
+ <sub><b>yunglechao</b></sub>
375
+ </a>
376
+ </td>
377
+ </tr>
378
+ <tr>
379
+ <td align="center">
380
+ <a href="https://github.com/RobitYadda">
381
+ <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="80;" alt="RobitYadda"/>
382
+ <br />
383
+ <sub><b>RobitYadda</b></sub>
384
+ </a>
385
+ </td>
386
+ </tr>
387
+ </tbody>
388
+ </table>
389
+ <!-- readme: collaborators,contributors,jiankangdeng/- -end -->
390
+
391
  ## Citation
392
 
393
  If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
394
 
395
  ```
396
+ @inproceedings{LLaVA-OneVision-1.5,
397
+ title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
398
+ author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
399
+ booktitle={arXiv},
400
+ year={2025}
401
+ }
402
+
403
+ @inproceedings{xie2025region,
404
+ title={Region-based Cluster Discrimination for Visual Representation Learning},
405
+ author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
406
+ booktitle={ICCV},
407
+ year={2025}
408
+ }
409
+
410
+ @article{lillava,
411
+ title={LLaVA-OneVision: Easy Visual Task Transfer},
412
+ author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
413
+ journal={Transactions on Machine Learning Research},
414
+ year={2024}
415
  }
416
+ ```
417
+
418
+ ## Acknowledgement
419
+
420
+ We extend our sincere gratitude to the **AIAK team** of the [**Baige AI computing platform**](https://cloud.baidu.com/product/aihc.html) from **Baidu AI Cloud** for providing the exceptional training framework. The outstanding capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training with remarkable efficiency, and these frameworks have been instrumental in achieving our research goals. For full AIAK support, you can contact Baidu Cloud.
421
+
422
+
423
+ We also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:
424
+
425
+ - LLaVA: Large Language-and-Vision Assistant — [LLaVA](https://github.com/haotian-liu/LLaVA)
426
+ - LLaVA-NeXT: Next-generation multi-modal assistant — [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
427
+ - lmms-eval: A standardized evaluation framework for Large Multimodal Models — [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
428
+ - Megatron-LM: Efficient, scalable training for large language models — [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
429
+ - Qwen2.5-VL: Strong vision-language foundation model — [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
430
+ - InternVL: Open-source large-scale vision-language foundation model — [InternVL](https://github.com/OpenGVLab/InternVL)
431
+ - Qwen3: Next-generation Qwen LLM — [Qwen](https://github.com/QwenLM/Qwen)
432
+ - MetaCLIP: Scalable contrastive pretraining — [MetaCLIP](https://github.com/facebookresearch/MetaCLIP)
433
+ - FineVision: Open Data Is All You Need — [FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)